caret webinar on Feb 25

I"ll be doing a webinar with the Orange County R User Group on the caret package on Tue, Feb 25, 2014 1:00 PM - 2:00 PM EST.

Here is the url in case you are interested: https://www3.gotomeeting.com/register/673845982

Thanks to Ray DiGiacomo for setting this up.

Calibration Affirmation

In the book, we discuss the notion of a probability model being "well calibrated". There are many different mathematical techniques that classification models use to produce class probabilities. Some of values are "probability-like" in that they are between zero and one and sum to one. This doesn't necessarily mean that the probability estimates are consistent with the true event rate seen with similar samples. Suppose a doctor told you that you have a model predicts that you have a 90% chance of having cancer. Is that number consistent with the actual likelihood of you being sick?

As an example, we can look at the cell image segmentation data use in the text (from this paper). The data are in the caret package. The outcome is whether or not a cell in an image is well segmented (WS) or poorly segments (PS). We try to predict this using measured characteristics of the supposed cell.

We can fit a partial least squares discriminant analysis model to these data. This model can use the basic PLS model for regression (based on binary dummy variables for the classes) and we can post-process the data into class probabilities and class predictions. One method to do this is the softmax function. This simple mathematical formula can used but, in my experience, doesn't lead to very realistic predictions. Let's fit that model:

library(caret)
data(segmentationData)
 
## Retain the original training set
segTrain <- subset(segmentationData, Case == "Train")
## Remove the first three columns (identifier columns)
segTrainX <- segTrain[, -(1:3)]
segTrainClass <- segTrain$Class
 
## Test Data
segTest <- subset(segmentationData, Case != "Train")
segTestX <- segTest[, -(1:3)]
segTestClass <- segTest$Class

set.seed(1)
useSoftmax <- train(segTrainX, segTrainClass, method = "pls",
                    preProc = c("center", "scale"),
                    ## Tune over 15 values of ncomp using 10 fold-CV
                    tuneLength = 15,
                    trControl = trainControl(method = "cv"),
                    probMethod = "softmax")
 
## Based on cross-validation, how well does the model fit?
getTrainPerf(useSoftmax)

##    TrainAccuracy TrainKappa method
##  1        0.8108      0.592    pls

## Predict the test set probabilities
SoftmaxProbs <- predict(useSoftmax, segTestX, type = "prob")[,"PS",]
 
## Put these into a data frame:
testProbs <- data.frame(Class = segTestClass,
                        Softmax = SoftmaxProbs)

(All the code used here was formatted by Pretty R at inside-R.org.)

The accuracy and Kappa statistics reflect a reasonable model.

We are interested in knowing if cells are poorly segmented. What is the distribution of the test set class probabilities for the two classes?

library(ggplot2)
softmaxHist <- ggplot(testProbs, aes(x = Softmax))
softmaxHist <- softmaxHist + geom_histogram(binwidth = 0.02)
softmaxHist+ facet_grid(Class ~ .) + xlab("Pr[Poorly Segmented]")

Looking at the bottom panel, the mode of the distribution is around 40%. Also, very little of the truly well segmented samples are not confidently predicted as such since most probabilities are greater than 30%. The same is true for the poorly segmented cells. Very few values are greater than 80%.

What does the calibration plot look like?

plot(calibration(Class ~ Softmax, data = testProbs), type = "l")

This isn't very close to the 45 degree reference line so we shouldn't expect the probabilities to be very realistic.

Another approach for PLS models is to use Bayes' rule to estimate the class probabilities. This is a little more complicated but at least it is based on probability theory. We can fit the model this way using the option probMethod = "Bayes":

set.seed(1)
useBayes <- train(segTrainX, segTrainClass, method = "pls",
                  preProc = c("center", "scale"),
                  tuneLength = 15,
                  trControl = trainControl(method = "cv"),
                  probMethod = "Bayes")
 
## Compare models:
getTrainPerf(useBayes)

##   TrainAccuracy TrainKappa method
## 1        0.8108     0.5984    pls

getTrainPerf(useSoftmax)

##   TrainAccuracy TrainKappa method
## 1        0.8108      0.592    pls

BayesProbs <- predict(useBayes, segTestX, type = "prob")[,"PS",]
 
## Put these into a data frame as before:
testProbs$Bayes <- BayesProbs

Performance is about the same. The test set ROC curves both have AUC values of 0.87.

What do these probabilities look like?

BayesHist <- ggplot(testProbs, aes(x = Bayes))
BayesHist <- BayesHist + geom_histogram(binwidth = 0.02)
BayesHist + facet_grid(Class ~ .) + xlab("Pr[Poorly Segmented]")

Even though the accuracy and ROC curve results show the models to be very similar, the class probabilities are very different. Here, the most confidently predicted samples have probabilities closer to zero or one. Based on the calibration plot, is this model any better?

Much better! Looking at the right-hand side of the plot, the event rate of samples with Bayesian generated probabilities is about 97% for class probabilities between 91% and 100%. The softmax model has an event rate of zero.

This is a good example because, by most measures,the model performance is almost identical. However, one generates much more realistic values than the other.

Down-Sampling Using Random Forests

We discuss dealing with large class imbalances in Chapter 16. One approach is to sample the training set to coerce a more balanced class distribution. We discuss

down-sampling: sample the majority class to make their frequencies closer to the rarest class.
up-sampling: the minority class is resampled to increase the corresponding frequencies
hybrid approaches: some methodologies do a little of both and possibly impute synthetic data for the minority class. One such example is the SMOTE procedure.

Here is an image from the book that shows the results of sampling a simulated data set:

The down-side to down-sampling is that information in the majority classes is being thrown away and this situation becomes more acute as the class imbalance becomes more severe.

Random forest models have the ability to use down-sampling without data loss. Recall that random forests is a tree ensemble method. A large number of bootstrap samples are taken form the training data and a separate unpruned tree is created for each data set. This model contains another feature that randomly samples a subset of predictors at each split to encourage diversity of the resulting trees. When predicting a new sample, a prediction is produced by every tree in the forest and these results are combined to generate a single prediction for an individual sample.

Random forests (and bagging) use bootstrap sampling. This means that if there are n training set instances, the resulting sample will select n samples with replacement. As a consequence, some training set samples will be selected more than once.

To incorporate down-sampling, random forest can take a random sample of size c*nmin, where c is the number of classes and nmin is the number of samples in the minority class. Since we usually take a large number of samples (at least 1000) to create the random forest model, we get many looks at the data in the majority class. This can be very effective.

The R package for the book contains scripts to reproduce almost of the analyses in the text. We mistakenly left out the code to down-sample random forests. I'll demonstrate it here with a simulated data set and then show code for the caravan policy data use din the chapter.

Let's create simulated training and test sets using this method:

> ## Simulate data sets with a small event rate
> set.seed(1)
> training <- twoClassSim(500, intercept = -13)
> testing <- twoClassSim(5000, intercept = -13)
> 
> table(training$Class)
 
Class1 Class2 
   428     72 
> 
> nmin <- sum(training$Class == "Class2")
> nmin
[1] 72

Now we will train two random forest models: one using down-sampling and another with the standard sampling procedure. The area under the ROC curve will be used to quantify the effectiveness of each procedure for these data.

> ctrl <- trainControl(method = "cv",
+                      classProbs = TRUE,
+                      summaryFunction = twoClassSummary)
> 
> set.seed(2)
> rfDownsampled <- train(Class ~ ., data = training,
+                        method = "rf",
+                        ntree = 1500,
+                        tuneLength = 5,
+                        metric = "ROC",
+                        trControl = ctrl,
+                        ## Tell randomForest to sample by strata. Here, 
+                        ## that means within each class
+                        strata = training$Class,
+                        ## Now specify that the number of samples selected
+                        ## within each class should be the same
+                        sampsize = rep(nmin, 2))
> 
> 
> set.seed(2)
> rfUnbalanced <- train(Class ~ ., data = training,
+                       method = "rf",
+                       ntree = 1500,
+                       tuneLength = 5,
+                       metric = "ROC",
+                       trControl = ctrl)

Now we can compute the test set ROC curves for both procedures:

> downProbs <- predict(rfDownsampled, testing, type = "prob")[,1]
> downsampledROC <- roc(response = testing$Class, 
+                       predictor = downProbs,
+                       levels = rev(levels(testing$Class)))
> 
> unbalProbs <- predict(rfUnbalanced, testing, type = "prob")[,1]
> unbalROC <- roc(response = testing$Class, 
+                 predictor = unbalProbs,
+                 levels = rev(levels(testing$Class)))

And finally, we can plot the curves and determine the area under each curve:

> plot(downsampledROC, col = rgb(1, 0, 0, .5), lwd = 2)
Call:
roc.default(response = testing$Class, predictor = downProbs, 
   levels = rev(levels(testing$Class)))

Data: downProbs in 701 controls (testing$Class Class2) < 4299 cases (testing$Class Class1).
Area under the curve: 0.9503

> plot(unbalROC, col = rgb(0, 0, 1, .5), lwd = 2, add = TRUE)
Call:
roc.default(response = testing$Class, predictor = unbalProbs,
levels = rev(levels(testing$Class)))

Data: unbalProbs in 701 controls (testing$Class Class2) < 4299 cases (testing$Class Class1).
Area under the curve: 0.9242

> legend(.4, .4,
+        c("Down-Sampled", "Normal"),
+        lwd = rep(2, 1), 
+        col = c(rgb(1, 0, 0, .5), rgb(0, 0, 1, .5)))

This demonstrates an improvement using the alternative sampling procedure.

One last note about this analysis. The cross-validation procedure used to tune the down-sampled random forest model is likely to give biased results. If a single down-sampled data set is fed to the cross-validation procedure, the resampled performance estimates will probably be optimistic (since the unbalance was not present). In the analysis shown here, the resampled area under the ROC curve was overly pessimistic:

> getTrainPerf(rfDownsampled)
   TrainROC TrainSens  TrainSpec method
1 0.8984348         1 0.07142857     rf

> auc(downsampledROC)
Area under the curve: 0.9503

For the caravan data in Chapter 16, this code can be used to fit the same model:

set.seed(1401)
rfDownInt <- train(CARAVAN ~ ., data = trainingInd,
                   method = "rf",
                   ntree = 1500,
                   tuneLength = 5,
                   strata = training$CARAVAN,
                   sampsize = rep(sum(training$CARAVAN == "insurance"), 2),
                   metric = "ROC",
                   trControl = ctrl)
evalResults$RFdownInt <- predict(rfDownInt, evaluationInd, type = "prob")[,1]
testResults$RFdownInt <- predict(rfDownInt, testingInd, type = "prob")[,1]
rfDownIntRoc <- roc(evalResults$CARAVAN,
                    evalResults$RFdownInt,
                    levels = rev(levels(training$CARAVAN)))

For this entry, the code formatting was created by Pretty R at inside-R.org

caret webinar on Feb 25

Calibration Affirmation

Down-Sampling Using Random Forests

Applied Predictive Modeling