To demonstrate, let's take a set of data and see how a support vector machine performs:
library(caret)

set.seed(468)
training <- twoClassSim(  300, noiseVars = 100,
                        corrVar = 100, corrValue = 0.75)
testing  <- twoClassSim(  300, noiseVars = 100,
                        corrVar = 100, corrValue = 0.75)
large    <- twoClassSim(10000, noiseVars = 100,
                        corrVar = 100, corrValue = 0.75)
By default, the number of informative linear predictors is 10, and the default intercept of -5 keeps the class frequencies fairly balanced:
table(large$Class)/nrow(large)
##
## Class1 Class2
## 0.5457 0.4543
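As a quick aside (not part of the original analysis), the intercept argument is what controls this balance, so moving it away from the default changes the class frequencies; a minimal sketch, assuming caret's twoClassSim():

## Not in the original analysis: compare the class balance at the default
## intercept with a hypothetical, more extreme value.
set.seed(468)
defaultSim <- twoClassSim(10000)                   # intercept = -5 (the default)
shiftedSim <- twoClassSim(10000, intercept = -12)  # hypothetical alternative
table(defaultSim$Class)/nrow(defaultSim)
table(shiftedSim$Class)/nrow(shiftedSim)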
We'll use the train() function to tune and train the model:
library(caret)
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 3, classProbs = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(1254)
fullModel <- train(Class ~ ., data = training,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 8,
                   metric = "ROC",
                   trControl = ctrl)
fullModel
## 300 samples
## 215 predictors
## 2 classes: 'Class1', 'Class2'
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validation (10 fold, repeated 3 times)
##
## Summary of sample sizes: 270, 270, 270, 270, 270, 270, ...
##
## Resampling results across tuning parameters:
##
##   C      ROC    Sens   Spec     ROC SD  Sens SD  Spec SD
##    0.25  0.636  1      0        0.0915  0        0
##    0.5   0.635  1      0.00238  0.0918  0        0.013
##    1     0.644  0.719  0.438    0.0929  0.0981   0.134
##    2     0.68   0.671  0.574    0.0863  0.0898   0.118
##    4     0.69   0.673  0.579    0.0904  0.0967   0.11
##    8     0.69   0.673  0.579    0.0904  0.0967   0.11
##   16     0.69   0.673  0.579    0.0904  0.0967   0.11
##   32     0.69   0.673  0.579    0.0904  0.0967   0.11
##
## Tuning parameter 'sigma' was held constant at a value of 0.00353
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were C = 4 and sigma = 0.00353.
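As an aside, the resampled performance for the final tuning parameters can also be pulled out programmatically rather than read off the printout; a small sketch (not shown in the original output) using caret's getTrainPerf():

## Extract the resampled ROC/Sens/Spec for the chosen C and sigma.
getTrainPerf(fullModel)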
Cross-validation estimates the best area under the ROC curve to be 0.69. Is this an accurate estimate? We can check it against the test set:
library(pROC)
fullTest <- roc(testing$Class,
                predict(fullModel, testing, type = "prob")[, 1],
                levels = rev(levels(testing$Class)))
fullTest
##
## Call:
## roc.default(response = testing$Class, predictor = predict(fullModel, testing, type = "prob")[, 1], levels = rev(levels(testing$Class)))
##
## Data: predict(fullModel, testing, type = "prob")[, 1] in 140 controls (testing$Class Class2) < 160 cases (testing$Class Class1).
## Area under the curve: 0.78
For this small test set, the estimate is 0.09 larger than the resampled version. How do both of these compare to our approximation of the truth?
fullLarge <- roc(large$Class,
                 predict(fullModel, large, type = "prob")[, 1],
                 levels = rev(levels(testing$Class)))
fullLarge
##
## Call:
## roc.default(response = large$Class, predictor = predict(fullModel, large, type = "prob")[, 1], levels = rev(levels(testing$Class)))
##
## Data: predict(fullModel, large, type = "prob")[, 1] in 4543 controls (large$Class Class2) < 5457 cases (large$Class Class1).
## Area under the curve: 0.733
How much did the presence of the non-informative predictors affect this model? We know which predictors are truly informative, so we can fit a model using only those and evaluate it in the same way:
## Keep only the informative predictors (drop the Noise* and Corr* columns)
realVars <- names(training)
realVars <- realVars[!grepl("(Corr)|(Noise)", realVars)]

set.seed(1254)
trueModel <- train(Class ~ .,
                   data = training[, realVars],
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 8,
                   metric = "ROC",
                   trControl = ctrl)
trueModel
## 300 samples
## 15 predictors
## 2 classes: 'Class1', 'Class2'
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validation (10 fold, repeated 3 times)
##
## Summary of sample sizes: 270, 270, 270, 270, 270, 270, ...
##
## Resampling results across tuning parameters:
##
##   C      ROC    Sens   Spec   ROC SD  Sens SD  Spec SD
##    0.25  0.901  0.873  0.733  0.0468  0.0876   0.136
##    0.5   0.925  0.873  0.8    0.0391  0.0891   0.11
##    1     0.936  0.871  0.826  0.0354  0.105    0.104
##    2     0.94   0.881  0.852  0.0356  0.0976   0.0918
##    4     0.936  0.875  0.857  0.0379  0.0985   0.0796
##    8     0.927  0.835  0.852  0.0371  0.0978   0.0858
##   16     0.917  0.821  0.843  0.0387  0.11     0.0847
##   32     0.915  0.821  0.843  0.0389  0.11     0.0888
##
## Tuning parameter 'sigma' was held constant at a value of 0.0573
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were C = 2 and sigma = 0.0573.
Much higher! Is this verified by the other estimates?
trueTest <- roc(testing$Class,
                predict(trueModel, testing, type = "prob")[, 1],
                levels = rev(levels(testing$Class)))
trueTest
##
## Call:
## roc.default(response = testing$Class, predictor = predict(trueModel, testing, type = "prob")[, 1], levels = rev(levels(testing$Class)))
##
## Data: predict(trueModel, testing, type = "prob")[, 1] in 140 controls (testing$Class Class2) < 160 cases (testing$Class Class1).
## Area under the curve: 0.923
trueLarge <- roc(large$Class,
                 predict(trueModel, large, type = "prob")[, 1],
                 levels = rev(levels(testing$Class)))
trueLarge
##
## Call:
## roc.default(response = large$Class, predictor = predict(trueModel, large, type = "prob")[, 1], levels = rev(levels(testing$Class)))
##
## Data: predict(trueModel, large, type = "prob")[, 1] in 4543 controls (large$Class Class2) < 5457 cases (large$Class Class1).
## Area under the curve: 0.926
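To line the estimates up side by side, here is a small convenience step (not part of the original code) that assumes the objects created above are still in the workspace:

## Gather the area-under-the-curve estimates computed above into one table.
data.frame(model     = c("full model (215 predictors)", "true model (15 predictors)"),
           resampled = c(max(fullModel$results$ROC), max(trueModel$results$ROC)),
           test_set  = c(as.numeric(auc(fullTest)),  as.numeric(auc(trueTest))),
           large_set = c(as.numeric(auc(fullLarge)), as.numeric(auc(trueLarge))))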
At this point, we might want to see what would happen if, for example, all 200 non-informative predictors were uncorrelated. At least we now have a testing framework for making objective statements; a sketch of that variation follows.
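A minimal sketch of that follow-up, reusing the same seeds and the ctrl object from above (the object names here are hypothetical and the code is not run in this post):

## Hypothetical follow-up: make all 200 extra predictors plain uncorrelated
## noise and repeat the fit/evaluate cycle.
set.seed(468)
trainingUncorr <- twoClassSim(300, noiseVars = 200)
testingUncorr  <- twoClassSim(300, noiseVars = 200)

set.seed(1254)
uncorrModel <- train(Class ~ ., data = trainingUncorr,
                     method = "svmRadial",
                     preProc = c("center", "scale"),
                     tuneLength = 8,
                     metric = "ROC",
                     trControl = ctrl)

roc(testingUncorr$Class,
    predict(uncorrModel, testingUncorr, type = "prob")[, 1],
    levels = rev(levels(testingUncorr$Class)))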