C5.0 Class Probability Shrinkage

(The image above has nothing do to with this post. It does, however, show the prize that my daughter won during a recent vacation to Virginia and how I got it back home).

I was recently asked to explain a potential disconnect in C5.0 between the class probabilities shown in the terminal nodes and the values generated by the prediction code.

Here is an example using the iris data:

  
> library(C50)
> mod <- C5.0(Species ~ ., data = iris)
> summary(mod)
Call:
C5.0.formula(formula = Species ~ ., data = iris)

C5.0 [Release 2.07 GPL Edition]  	Tue Sep  8 12:49:43 2015
-------------------------------

Class specified by attribute `outcome'

Read 150 cases (5 attributes) from undefined.data

Decision tree:

Petal.Length <= 1.9: setosa (50)
Petal.Length > 1.9:
:...Petal.Width > 1.7: virginica (46/1)
    Petal.Width <= 1.7:
    :...Petal.Length <= 4.9: versicolor (48/1)
        Petal.Length > 4.9: virginica (6/2)

Evaluation on training data (150 cases):

	    Decision Tree   
	  ----------------  
	  Size      Errors  

	     4    4( 2.7%)   <<

	   (a)   (b)   (c)    <-classified as
	  ----  ----  ----
	    50                (a): class setosa
	          47     3    (b): class versicolor
	           1    49    (c): class virginica

	Attribute usage:

	100.00%	Petal.Length
	 66.67%	Petal.Width

Time: 0.0 secs

Suppose that we are predicting the sample in row 130 with a petal length of 5.8 and a petal width of 1.6. From this tree, the terminal node shows "virginica (6/2)" which means a predicted class of the virginica species with a probability of 4/6 = 0.66667. However, we get a different predicted probability:

  
    > predict(mod, iris[130,], type = "prob")
        setosa versicolor virginica
130 0.04761905  0.3333333 0.6190476

When we wanted to describe the technical aspects of the C5.0 and cubist models, the main source of information on these models was the raw C source code from the RuleQuest website. For many years, both of these models were proprietary commercial products and we only recently open-sourced. Our intuition is that Quinlan quietly evolved these models from the versions described in the most recent publications to what they are today. For example, it would not be unreasonable to assume that C5.0 uses AdaBoost. From the sources, a similar reweighting scheme is used but it does not appear to be the same.

For classifying new samples, the C sources have

ClassNo PredictTreeClassify(DataRec Case, Tree DecisionTree)
/*      ------------  */
{
    ClassNo    c, C;
    double    Prior;

    /*  Save total leaf count in ClassSum[0]  */
    ForEach(c, 0, MaxClass)
    {
        ClassSum[c] = 0;
    }

    PredictFindLeaf(Case, DecisionTree, Nil, 1.0);
    C = SelectClassGen(DecisionTree->Leaf, (Boolean)(MCost != Nil), ClassSum);

    /*  Set all confidence values in ClassSum  */
    ForEach(c, 1, MaxClass)
    {
        Prior = DecisionTree->ClassDist[c] / DecisionTree->Cases;
        ClassSum[c] = (ClassSum[0] * ClassSum[c] + Prior) / (ClassSum[0] + 1);
    }
    Confidence = ClassSum[C];
    return C;
}

Here:

The predicted probability is the "confidence" value
The prior is the class probabilities from the training set. For the iris data, this value is 1/3 for each of the classes
The array ClassSum is the probabilities of each class in the terminal node although ClassSum[0] is the number of samples in the terminal node (which, if there are missing values, can be fractional).

For sample 130, the virginica values are:

  (ClassSum[0] * ClassSum[c] + Prior) / (ClassSum[0] + 1)
= (          6 *       (4/6) + (1/3)) / (          6 + 1) 
= 0.6190476

Why is it doing this? This will tend to avoid class predictions that are absolute zero or one.

Basically, it can be viewed to be similar to how Bayesian methods operate where the simple probability estimates are "shrunken" towards the prior probabilities. Note that, as the number of samples in the terminal nodes (ClassSum[0]) becomes large, this operation has less effect on the final results. Suppose ClassSum[0] = 10000, then the predicted virginica probability would be 0.6663337, which is closer to the simple estimate.

This is very much related to the Laplace Correction. Traditionally, we would add a value of one to the denominator of the simple estimate and add the number of classes to the bottom, resulting in (4+1)/(6+3) = 0.5555556. C5.0 is substituting the prior probabilities and their sum (always one) into this equation instead.

To be fair, there are well known Bayesian estimates of the sample proportions under different prior distributions for the two class case. For example, if there were two classes, the estimate of the class probability under a uniform prior would be the same as the basic Laplace correction (using the integers and not the fractions). A more flexible Bayesian approach is the Beta-Binomial model, which uses a Beta prior instead of the uniform. The downside here is that two extra parameters need to be estimated (and it only is defined for two classes)

Optimizing Probability Thresholds for Class Imbalances

One of the toughest problems in predictive model occurs when the classes have a severe imbalance. We spend an entire chapter on this subject itself. One consequence of this is that the performance is generally very biased against the class with the smallest frequencies. For example, if the data have a majority of samples belonging to the first class and very few in the second class, most predictive models will maximize accuracy by predicting everything to be the first class. As a result there's usually great sensitivity but poor specificity.

As a demonstration will use a simulation system described here. By default it has about a 50-50 class frequency but we can change this by altering the function argument called intercept:

library(caret)
 
set.seed(442)
training <- twoClassSim(n = 1000, intercept = -16)
testing <- twoClassSim(n = 1000, intercept = -16)

(All the code used here was formatted by Pretty R at inside-R.org.)

In the training set the class frequency breakdown looks like this:

> table(training$Class)

Class1 Class2 
   899    101

There is almost a 9:1 imbalance in these data.

Let's use a standard random forest model with these data using the default value of mtry. We'll also use 10-fold cross validation to get a sense of performance:

> set.seed(949)
> mod0 <- train(Class ~ ., data = training,
+               method = "rf",
+               metric = "ROC",
+               tuneGrid = data.frame(mtry = 3),
+               trControl = trainControl(method = "cv",
+                                        classProbs = TRUE,
+                                        summaryFunction = twoClassSummary))
> getTrainPerf(mod0)
   TrainROC TrainSens TrainSpec method
1 0.9288911      0.99 0.4736364     rf

The area under the ROC curve is very high, indicating that the model has very good predictive power for these data. Here's a test set ROC curve for this model:

The plot shows the default probability cut off value of 50%. The sensitivity and specificity values associated with this point indicate that performance is not that good when an actual call needs to be made on a sample.

One of the most common ways to deal with this is to determine an alternate probability cut off using the ROC curve. But to do this well, another set of data (not the test set) is needed to set the cut off and the test set is used to validate it. We don't have a lot of data this is difficult since we will be spending some of our data just to get a single cut off value.

Alternatively the model can be tuned, using resampling, to determine any model tuning parameters as well as an appropriate cut off for the probabilities.

The latest update to the caret package allows users to define their own modeling and prediction components. This also gives us a huge amount of flexibility for creating your own models or doing some things that were originally intended by the package. This page shows a lot of the details for creating custom models.

Suppose the model has one tuning parameter and we want to look at four candidate values for tuning. Suppose we also want to tune the probability cut off over 20 different thresholds. Now we have to look at 20×4=80 different models (and that is for each resample). One other feature that has been opened up his ability to use sequential parameters: these are tuning parameters that don't require a completely new model fit to produce predictions. In this case, we can fit one random forest model and get it's predicted class probabilities and evaluate the candidate probability cutoffs using these same hold-out samples. Again, there's a lot of details on this page and, without going into them, our code for these analyses can be found here.

Basically, we define a list of model components (such as the fitting code, the prediction code, etc.) and feed this into the train function instead of using a pre-listed model string (such as method = "rf"). For this model and these data, there was an 8% increase in training time to evaluate 20 additional values of the probability cut off.

How do we optimize this model? Normally we might look at the area under the ROC curve as a metric to choose our final values. In this case the ROC curve is independent of the probability threshold so we have to use something else. A common technique to evaluate a candidate threshold is see how close it is to the perfect model where sensitivity and specificity are one. Our code will use the distance between the current model's performance and the best possible performance and then have train minimize this distance when choosing it's parameters. Here is the code that we use to calculate this:

fourStats <- function (data, lev = levels(data$obs), model = NULL) {
  ## This code will get use the area under the ROC curve and the
  ## sensitivity and specificity values using the current candidate
  ## value of the probability threshold.
  out <- c(twoClassSummary(data, lev = levels(data$obs), model = NULL))
 
  ## The best possible model has sensitivity of 1 and specifity of 1. 
  ## How far are we from that value?
  coords <- matrix(c(1, 1, out["Spec"], out["Sens"]), 
                   ncol = 2, 
                   byrow = TRUE)
  colnames(coords) <- c("Spec", "Sens")
  rownames(coords) <- c("Best", "Current")
  c(out, Dist = dist(coords)[1])
}

Now let's run our random forest model and see what it comes up with for the best possible threshold:

set.seed(949)
mod1 <- train(Class ~ ., data = training,
              ## 'modelInfo' is a list object found in the linked
              ## source code
              method = modelInfo,
              ## Minimize the distance to the perfect model
              metric = "Dist",
              maximize = FALSE,
              tuneLength = 20,
              trControl = trainControl(method = "cv",
                                       classProbs = TRUE,
                                       summaryFunction = fourStats))

The resulting model output notes that:

Tuning parameter 'mtry' was held constant at a value of 3
Dist was used to select the optimal model using  the smallest value.
The final values used for the model were mtry = 3 and threshold = 0.887.

Using ggplot(mod1) will show the performance profile. Instead here is a plot of the sensitivity, specificity, and distance to the perfect model:

You can see that as we increase the probability cut off for the first class it takes more and more evidence for a sample to be predicted as the first class. As a result the sensitivity goes down when the threshold becomes very large. The upside is that we can increase specificity in the same way. The blue curve shows the distance to the perfect model. The value of 0.887 was found to be optimal.

Now we can use the test set ROC curve to validate the cut off we chose by resampling. Here the cut off closest to the perfect model is 0.881. We were able to find a good probability cut off value without setting aside another set of data for tuning the cut off.

One great thing about this code is that it will automatically apply the optimized probability threshold when predicting new samples. Here is an example:

  Class1 Class2  Class Note
1  0.874  0.126 Class2    *
2  1.000  0.000 Class1     
3  0.930  0.070 Class1     
4  0.794  0.206 Class2    *
5  0.836  0.164 Class2    *
6  0.988  0.012 Class1

However we should be careful because the probability values are not consistent with our usual notion of a 50-50 cut off.

Calibration Affirmation

In the book, we discuss the notion of a probability model being "well calibrated". There are many different mathematical techniques that classification models use to produce class probabilities. Some of values are "probability-like" in that they are between zero and one and sum to one. This doesn't necessarily mean that the probability estimates are consistent with the true event rate seen with similar samples. Suppose a doctor told you that you have a model predicts that you have a 90% chance of having cancer. Is that number consistent with the actual likelihood of you being sick?

As an example, we can look at the cell image segmentation data use in the text (from this paper). The data are in the caret package. The outcome is whether or not a cell in an image is well segmented (WS) or poorly segments (PS). We try to predict this using measured characteristics of the supposed cell.

We can fit a partial least squares discriminant analysis model to these data. This model can use the basic PLS model for regression (based on binary dummy variables for the classes) and we can post-process the data into class probabilities and class predictions. One method to do this is the softmax function. This simple mathematical formula can used but, in my experience, doesn't lead to very realistic predictions. Let's fit that model:

library(caret)
data(segmentationData)
 
## Retain the original training set
segTrain <- subset(segmentationData, Case == "Train")
## Remove the first three columns (identifier columns)
segTrainX <- segTrain[, -(1:3)]
segTrainClass <- segTrain$Class
 
## Test Data
segTest <- subset(segmentationData, Case != "Train")
segTestX <- segTest[, -(1:3)]
segTestClass <- segTest$Class

set.seed(1)
useSoftmax <- train(segTrainX, segTrainClass, method = "pls",
                    preProc = c("center", "scale"),
                    ## Tune over 15 values of ncomp using 10 fold-CV
                    tuneLength = 15,
                    trControl = trainControl(method = "cv"),
                    probMethod = "softmax")
 
## Based on cross-validation, how well does the model fit?
getTrainPerf(useSoftmax)

##    TrainAccuracy TrainKappa method
##  1        0.8108      0.592    pls

## Predict the test set probabilities
SoftmaxProbs <- predict(useSoftmax, segTestX, type = "prob")[,"PS",]
 
## Put these into a data frame:
testProbs <- data.frame(Class = segTestClass,
                        Softmax = SoftmaxProbs)

(All the code used here was formatted by Pretty R at inside-R.org.)

The accuracy and Kappa statistics reflect a reasonable model.

We are interested in knowing if cells are poorly segmented. What is the distribution of the test set class probabilities for the two classes?

library(ggplot2)
softmaxHist <- ggplot(testProbs, aes(x = Softmax))
softmaxHist <- softmaxHist + geom_histogram(binwidth = 0.02)
softmaxHist+ facet_grid(Class ~ .) + xlab("Pr[Poorly Segmented]")

Looking at the bottom panel, the mode of the distribution is around 40%. Also, very little of the truly well segmented samples are not confidently predicted as such since most probabilities are greater than 30%. The same is true for the poorly segmented cells. Very few values are greater than 80%.

What does the calibration plot look like?

plot(calibration(Class ~ Softmax, data = testProbs), type = "l")

This isn't very close to the 45 degree reference line so we shouldn't expect the probabilities to be very realistic.

Another approach for PLS models is to use Bayes' rule to estimate the class probabilities. This is a little more complicated but at least it is based on probability theory. We can fit the model this way using the option probMethod = "Bayes":

set.seed(1)
useBayes <- train(segTrainX, segTrainClass, method = "pls",
                  preProc = c("center", "scale"),
                  tuneLength = 15,
                  trControl = trainControl(method = "cv"),
                  probMethod = "Bayes")
 
## Compare models:
getTrainPerf(useBayes)

##   TrainAccuracy TrainKappa method
## 1        0.8108     0.5984    pls

getTrainPerf(useSoftmax)

##   TrainAccuracy TrainKappa method
## 1        0.8108      0.592    pls

BayesProbs <- predict(useBayes, segTestX, type = "prob")[,"PS",]
 
## Put these into a data frame as before:
testProbs$Bayes <- BayesProbs

Performance is about the same. The test set ROC curves both have AUC values of 0.87.

What do these probabilities look like?

BayesHist <- ggplot(testProbs, aes(x = Bayes))
BayesHist <- BayesHist + geom_histogram(binwidth = 0.02)
BayesHist + facet_grid(Class ~ .) + xlab("Pr[Poorly Segmented]")

Even though the accuracy and ROC curve results show the models to be very similar, the class probabilities are very different. Here, the most confidently predicted samples have probabilities closer to zero or one. Based on the calibration plot, is this model any better?

Much better! Looking at the right-hand side of the plot, the event rate of samples with Bayesian generated probabilities is about 97% for class probabilities between 91% and 100%. The softmax model has an event rate of zero.

This is a good example because, by most measures,the model performance is almost identical. However, one generates much more realistic values than the other.

C5.0 Class Probability Shrinkage

Optimizing Probability Thresholds for Class Imbalances

Calibration Affirmation

Applied Predictive Modeling