Optimizing Probability Thresholds for Class Imbalances

One of the toughest problems in predictive modeling occurs when the classes have a severe imbalance. We spend an entire chapter on this subject. One consequence is that performance is generally very biased against the class with the smallest frequency. For example, if the data have a majority of samples belonging to the first class and very few in the second class, most predictive models will maximize accuracy by predicting everything to be the first class. As a result, there's usually great sensitivity but poor specificity.

As a demonstration, we will use a simulation system described here. By default it has about a 50-50 class frequency, but we can change this by altering the function argument called intercept:

## twoClassSim() and train() come from the caret package
library(caret)
training <- twoClassSim(n = 1000, intercept = -16)
testing <- twoClassSim(n = 1000, intercept = -16)

(All the code used here was formatted by Pretty R at inside-R.org.)

In the training set the class frequency breakdown looks like this:

> table(training$Class)

Class1 Class2 
   899    101 

There is almost a 9:1 imbalance in these data.
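
To make the problem concrete, here is a small base R sketch of what happens if a model simply predicts the first class for every sample. It uses only the counts from the table above and assumes, as twoClassSummary does, that the first factor level (Class1) is the event of interest.

```r
## Class counts taken from the table above
counts <- c(Class1 = 899, Class2 = 101)

## A "model" that predicts Class1 for every sample
accuracy    <- counts[["Class1"]] / sum(counts)  # right on every Class1 sample
sensitivity <- 1                                 # every Class1 sample is found
specificity <- 0                                 # no Class2 sample is ever found

round(c(Accuracy = accuracy, Sens = sensitivity, Spec = specificity), 3)
```

Accuracy looks respectable even though the minority class is never predicted, which is why accuracy alone is a poor guide for these data.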

Let's fit a standard random forest model to these data using the default value of mtry. We'll also use 10-fold cross-validation to get a sense of performance:

> set.seed(949)
> mod0 <- train(Class ~ ., data = training,
+               method = "rf",
+               metric = "ROC",
+               tuneGrid = data.frame(mtry = 3),
+               trControl = trainControl(method = "cv",
+                                        classProbs = TRUE,
+                                        summaryFunction = twoClassSummary))
> getTrainPerf(mod0)
   TrainROC TrainSens TrainSpec method
1 0.9288911      0.99 0.4736364     rf

The area under the ROC curve is very high, indicating that the model has very good predictive power for these data. Here's a test set ROC curve for this model:


The plot shows the default probability cut off value of 50%. The sensitivity and specificity values associated with this point indicate that performance is not that good when an actual call needs to be made on a sample.

One of the most common ways to deal with this is to determine an alternate probability cut off using the ROC curve. But to do this well, another set of data (not the test set) is needed to set the cut off, and the test set is then used to validate it. If we don't have a lot of data, this is difficult since we will be spending some of our data just to get a single cut off value.

Alternatively the model can be tuned, using resampling, to determine any model tuning parameters as well as an appropriate cut off for the probabilities.

The latest update to the caret package allows users to define their own modeling and prediction components. This gives us a huge amount of flexibility for creating our own models or doing some things that were not originally intended by the package. This page shows a lot of the details for creating custom models.

Suppose the model has one tuning parameter and we want to look at four candidate values for tuning. Suppose we also want to tune the probability cut off over 20 different thresholds. Now we have to look at 20×4=80 different models (and that is for each resample). One other feature that has been opened up is the ability to use sequential parameters: these are tuning parameters that don't require a completely new model fit to produce predictions. In this case, we can fit one random forest model, get its predicted class probabilities, and evaluate the candidate probability cutoffs using these same hold-out samples. Again, there are a lot of details on this page and, without going into them, our code for these analyses can be found here.
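
The idea behind a sequential parameter can be sketched in base R without caret. This is not the code used for the analysis: the hold-out probabilities are simulated, and the names obs, prob_class1, thresholds and profile, along with the range of cutoffs, are made up for the illustration. The point is only that one set of predicted probabilities can be re-used for every candidate cutoff.

```r
## Hypothetical hold-out results: observed classes and the predicted
## probability of Class1 from a single fitted model (simulated here)
set.seed(1)
obs <- factor(sample(c("Class1", "Class2"), 200, replace = TRUE,
                     prob = c(0.9, 0.1)),
              levels = c("Class1", "Class2"))
prob_class1 <- ifelse(obs == "Class1", rbeta(200, 8, 1), rbeta(200, 4, 2))

## 20 candidate cutoffs, evaluated without refitting anything
thresholds <- seq(0.5, 0.99, length.out = 20)

profile <- t(sapply(thresholds, function(th) {
  pred <- ifelse(prob_class1 >= th, "Class1", "Class2")
  c(threshold = th,
    Sens = mean(pred[obs == "Class1"] == "Class1"),
    Spec = mean(pred[obs == "Class2"] == "Class2"))
}))
head(profile)
```

Each row of profile costs only a comparison against the stored probabilities, which is why adding thresholds is so much cheaper than adding values of a parameter that needs a new fit.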

Basically, we define a list of model components (such as the fitting code, the prediction code, etc.) and feed this into the train function instead of using a pre-listed model string (such as method = "rf"). For this model and these data, there was an 8% increase in training time to evaluate 20 additional values of the probability cut off.

How do we optimize this model? Normally we might look at the area under the ROC curve as a metric to choose our final values. In this case the ROC curve is independent of the probability threshold, so we have to use something else. A common technique to evaluate a candidate threshold is to see how close it is to the perfect model where sensitivity and specificity are one. Our code will use the distance between the current model's performance and the best possible performance and then have train minimize this distance when choosing its parameters. Here is the code that we use to calculate this:

fourStats <- function (data, lev = levels(data$obs), model = NULL) {
  ## This code will get the area under the ROC curve and the
  ## sensitivity and specificity values using the current candidate
  ## value of the probability threshold.
  out <- c(twoClassSummary(data, lev = levels(data$obs), model = NULL))
  ## The best possible model has sensitivity of 1 and specificity of 1. 
  ## How far are we from that value?
  coords <- matrix(c(1, 1, out["Spec"], out["Sens"]), 
                   ncol = 2, 
                   byrow = TRUE)
  colnames(coords) <- c("Spec", "Sens")
  rownames(coords) <- c("Best", "Current")
  c(out, Dist = dist(coords)[1])
}
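
As a quick check on what Dist measures, the sketch below plugs in the resampled sensitivity and specificity that getTrainPerf(mod0) reported earlier for the default 50% cut off. It confirms that dist() on the two-row matrix is just the Euclidean distance sqrt((1 - Spec)^2 + (1 - Sens)^2). The variable names by_dist and by_formula are only for this illustration.

```r
## Resampled values reported by getTrainPerf(mod0) earlier in the post
sens <- 0.99
spec <- 0.4736364

## Same layout as the 'coords' matrix in fourStats
coords <- matrix(c(1, 1, spec, sens), ncol = 2, byrow = TRUE)

by_dist    <- as.numeric(dist(coords))          # Euclidean by default
by_formula <- sqrt((1 - spec)^2 + (1 - sens)^2)

c(by_dist, by_formula)   # both are roughly 0.526
```

Almost all of that distance comes from the poor specificity, which is what a better threshold should reduce.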

Now let's run our random forest model and see what it comes up with for the best possible threshold:

mod1 <- train(Class ~ ., data = training,
              ## 'modelInfo' is a list object found in the linked
              ## source code
              method = modelInfo,
              ## Minimize the distance to the perfect model
              metric = "Dist",
              maximize = FALSE,
              tuneLength = 20,
              trControl = trainControl(method = "cv",
                                       classProbs = TRUE,
                                       summaryFunction = fourStats))

The resulting model output notes that:

Tuning parameter 'mtry' was held constant at a value of 3
Dist was used to select the optimal model using  the smallest value.
The final values used for the model were mtry = 3 and threshold = 0.887.

Using ggplot(mod1) will show the performance profile. Instead here is a plot of the sensitivity, specificity, and distance to the perfect model:


You can see that as we increase the probability cut off for the first class it takes more and more evidence for a sample to be predicted as the first class. As a result the sensitivity goes down when the threshold becomes very large. The upside is that we can increase specificity in the same way. The blue curve shows the distance to the perfect model. The value of 0.887 was found to be optimal.

Now we can use the test set ROC curve to validate the cut off we chose by resampling. Here the cut off closest to the perfect model is 0.881. We were able to find a good probability cut off value without setting aside another set of data for tuning the cut off.

One great thing about this code is that it will automatically apply the optimized probability threshold when predicting new samples. Here is an example:

  Class1 Class2  Class Note
1  0.874  0.126 Class2    *
2  1.000  0.000 Class1     
3  0.930  0.070 Class1     
4  0.794  0.206 Class2    *
5  0.836  0.164 Class2    *
6  0.988  0.012 Class1     

However, we should be careful because the probability values are not consistent with our usual notion of a 50-50 cut off.
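
The table above can be reproduced by hand, which makes the inconsistency easier to see. This sketch assumes that a sample is called Class1 when its Class1 probability is at or above the threshold, and that the asterisk marks calls that differ from the usual 50% rule; both are our reading of the output rather than something taken from the package code.

```r
## Class1 probabilities copied from the table above
prob_class1 <- c(0.874, 1.000, 0.930, 0.794, 0.836, 0.988)
threshold   <- 0.887   # value chosen by resampling

tuned   <- ifelse(prob_class1 >= threshold, "Class1", "Class2")
default <- ifelse(prob_class1 >= 0.5,       "Class1", "Class2")

data.frame(Class1 = prob_class1,
           Class  = tuned,
           Note   = ifelse(tuned != default, "*", ""))
```

A sample with an 87% probability of being Class1 is still called Class2, so the probabilities and the class calls should be reported together with the threshold.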

Down-Sampling Using Random Forests

We discuss dealing with large class imbalances in Chapter 16. One approach is to sample the training set to coerce a more balanced class distribution. We discuss:

  • down-sampling: the majority class is sampled to make its frequency closer to that of the rarest class.
  • up-sampling: the minority class is resampled to increase its frequency.
  • hybrid approaches: some methodologies do a little of both and possibly impute synthetic data for the minority class. One such example is the SMOTE procedure.

Here is an image from the book that shows the results of sampling a simulated data set:


The down-side to down-sampling is that information in the majority classes is being thrown away and this situation becomes more acute as the class imbalance becomes more severe.

Random forest models have the ability to use down-sampling without data loss. Recall that a random forest is a tree ensemble method. A large number of bootstrap samples are taken from the training data and a separate unpruned tree is created for each data set. This model contains another feature that randomly samples a subset of predictors at each split to encourage diversity of the resulting trees. When predicting a new sample, a prediction is produced by every tree in the forest and these results are combined to generate a single prediction for an individual sample.

Random forests (and bagging) use bootstrap sampling. This means that if there are n training set instances, the resulting sample will select n samples with replacement. As a consequence, some training set samples will be selected more than once.
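
A bootstrap sample is easy to mimic in base R. The sketch below uses an arbitrary n of 1000 and draws row indices with replacement; the names boot_idx and frac_in_bag are ours.

```r
set.seed(1)
n <- 1000
boot_idx <- sample(n, size = n, replace = TRUE)  # one bootstrap sample of rows

any(duplicated(boot_idx))                        # some rows are drawn repeatedly
frac_in_bag <- mean(seq_len(n) %in% boot_idx)    # rows that appear at least once
frac_in_bag
```

On average a little under two thirds of the rows show up in any one bootstrap sample, and the rest are left out for that tree.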

To incorporate down-sampling, random forest can take a random sample of size c*nmin, where c is the number of classes and nmin is the number of samples in the minority class. Since we usually take a large number of samples (at least 1000) to create the random forest model, we get many looks at the data in the majority class. This can be very effective.
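
Here is a rough base R picture of that sampling scheme. It is only an illustration, not randomForest's internal code: the class counts (428 and 72) are illustrative, and draw_balanced is a hypothetical helper that takes nmin rows from each class with replacement.

```r
set.seed(1)
classes <- factor(rep(c("Class1", "Class2"), times = c(428, 72)))
nmin <- min(table(classes))

## One down-sampled bootstrap sample: nmin rows per class, with replacement
draw_balanced <- function(classes, nmin) {
  unlist(lapply(split(seq_along(classes), classes),
                function(idx) sample(idx, size = nmin, replace = TRUE)),
         use.names = FALSE)
}

one_draw <- draw_balanced(classes, nmin)
table(classes[one_draw])          # nmin rows per class, c * nmin in total

## Across many trees, how much of the majority class gets used at all?
used <- unique(unlist(replicate(1500, draw_balanced(classes, nmin),
                                simplify = FALSE)))
coverage <- mean(which(classes == "Class1") %in% used)
coverage
```

Each tree sees a balanced sample, but over 1500 trees essentially every majority-class row is used somewhere, which is the sense in which nothing is thrown away.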

The R package for the book contains scripts to reproduce almost all of the analyses in the text. We mistakenly left out the code to down-sample random forests. I'll demonstrate it here with a simulated data set and then show code for the caravan policy data used in the chapter.

Let's create simulated training and test sets using this method:

> ## Simulate data sets with a small event rate
> set.seed(1)
> training <- twoClassSim(500, intercept = -13)
> testing <- twoClassSim(5000, intercept = -13)
> table(training$Class)
Class1 Class2 
   428     72 
> nmin <- sum(training$Class == "Class2")
> nmin
[1] 72

Now we will train two random forest models: one using down-sampling and another with the standard sampling procedure. The area under the ROC curve will be used to quantify the effectiveness of each procedure for these data.

> ctrl <- trainControl(method = "cv",
+                      classProbs = TRUE,
+                      summaryFunction = twoClassSummary)
> set.seed(2)
> rfDownsampled <- train(Class ~ ., data = training,
+                        method = "rf",
+                        ntree = 1500,
+                        tuneLength = 5,
+                        metric = "ROC",
+                        trControl = ctrl,
+                        ## Tell randomForest to sample by strata. Here, 
+                        ## that means within each class
+                        strata = training$Class,
+                        ## Now specify that the number of samples selected
+                        ## within each class should be the same
+                        sampsize = rep(nmin, 2))
> set.seed(2)
> rfUnbalanced <- train(Class ~ ., data = training,
+                       method = "rf",
+                       ntree = 1500,
+                       tuneLength = 5,
+                       metric = "ROC",
+                       trControl = ctrl)

Now we can compute the test set ROC curves for both procedures:

> library(pROC)  ## provides roc() and auc()
> downProbs <- predict(rfDownsampled, testing, type = "prob")[,1]
> downsampledROC <- roc(response = testing$Class, 
+                       predictor = downProbs,
+                       levels = rev(levels(testing$Class)))
> unbalProbs <- predict(rfUnbalanced, testing, type = "prob")[,1]
> unbalROC <- roc(response = testing$Class, 
+                 predictor = unbalProbs,
+                 levels = rev(levels(testing$Class)))

And finally, we can plot the curves and determine the area under each curve:

> plot(downsampledROC, col = rgb(1, 0, 0, .5), lwd = 2)
roc.default(response = testing$Class, predictor = downProbs, 
   levels = rev(levels(testing$Class)))

Data: downProbs in 701 controls (testing$Class Class2) < 4299 cases (testing$Class Class1).
Area under the curve: 0.9503
> plot(unbalROC, col = rgb(0, 0, 1, .5), lwd = 2, add = TRUE)
roc.default(response = testing$Class, predictor = unbalProbs,
levels = rev(levels(testing$Class)))

Data: unbalProbs in 701 controls (testing$Class Class2) < 4299 cases (testing$Class Class1).
Area under the curve: 0.9242
> legend(.4, .4,
+        c("Down-Sampled", "Normal"),
+        lwd = rep(2, 1),
+        col = c(rgb(1, 0, 0, .5), rgb(0, 0, 1, .5)))

This demonstrates an improvement using the alternative sampling procedure.
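
For readers who want to see what the area under the ROC curve is measuring, it can also be computed from ranks: it is the proportion of (event, non-event) pairs in which the event gets the higher probability, with ties counted as one half. The sketch below is a base R illustration with made-up numbers, not a replacement for pROC; auc_by_rank is a hypothetical helper name.

```r
## Rank-based (Mann-Whitney) estimate of the area under the ROC curve
auc_by_rank <- function(prob_event, is_event) {
  r  <- rank(prob_event)           # average ranks handle ties
  n1 <- sum(is_event)
  n0 <- sum(!is_event)
  (sum(r[is_event]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

## Tiny hypothetical example: 3 events and 3 non-events
prob_event <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)
is_event   <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
auc_value  <- auc_by_rank(prob_event, is_event)
auc_value   # 8 of the 9 pairs are ordered correctly
```

Because it depends only on how the samples are ranked, this statistic does not change with the probability cut off, so it compares the two sampling schemes without committing to a threshold.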

One last note about this analysis. The cross-validation procedure used to tune the down-sampled random forest model is likely to give biased results. If a single down-sampled data set is fed to the cross-validation procedure, the resampled performance estimates will probably be optimistic (since the imbalance was not present). In the analysis shown here, where the down-sampling happens inside each random forest fit, the bias went the other way: the resampled area under the ROC curve was overly pessimistic compared with the test set value:

> getTrainPerf(rfDownsampled)
   TrainROC TrainSens  TrainSpec method
1 0.8984348         1 0.07142857     rf
> auc(downsampledROC)
Area under the curve: 0.9503

For the caravan data in Chapter 16, this code can be used to fit the same model:

rfDownInt <- train(CARAVAN ~ ., data = trainingInd,
                   method = "rf",
                   ntree = 1500,
                   tuneLength = 5,
                   strata = training$CARAVAN,
                   sampsize = rep(sum(training$CARAVAN == "insurance"), 2),
                   metric = "ROC",
                   trControl = ctrl)
evalResults$RFdownInt <- predict(rfDownInt, evaluationInd, type = "prob")[,1]
testResults$RFdownInt <- predict(rfDownInt, testingInd, type = "prob")[,1]
rfDownIntRoc <- roc(evalResults$CARAVAN,
                    ## the predictor is the class probability stored above
                    evalResults$RFdownInt,
                    levels = rev(levels(training$CARAVAN)))

For this entry, the code formatting was created by Pretty R at inside-R.org

One Statistician’s View of Big Data

Recently I've had several questions about using machine learning models with large data sets. Here is a talk I gave at Yale's Big Data Symposium on the subject.

I believe that, with a few exceptions, less data is more. Once you get beyond some "large enough" number of samples, most models don't really change that much and the additional computational burden is likely to cause practical problems with model fitting.

Off the top of my head, the exceptions that I can think of are:

  • class imbalances
  • poor variability in measured predictors
  • exploring new "spaces" or customer segments

Big Data may be great as long as you are adding something of value (instead of more of what you already have). The last bullet above is a good example. I work a lot with computational chemistry and we are constantly moving into new areas of "chemical space", making new compounds that have qualities that had not been previously investigated. Models that ignore this space are not as good as ones that do include it.

Also, new measurements or characteristics of your samples can make all the difference. Anthony Goldbloom of Kaggle has a great example from a competition for predicting the value of used cars:

The results included for instance that orange cars were generally more reliable - and that colour was a very significant predictor of the reliability of a used car.
"The intuition here is that if you are the first buyer of an orange car, orange is an unusual colour you're probably going to be someone who really cares about the car and so you looked after it better than somebody who bought a silver car," said Goldbloom.
"The data doesn't lie - the data unearthed that correlation. It was something that they had not taken into account before when purchasing vehicles."

My presentation has other examples of adding new information to increase the dimensionality of the data. The final quote sums it up:

The availability of Big Data should be a trigger to really re-evaluate what we are trying to solve and why this will help.