Do Resampling Estimates Have Low Correlation to the Truth? The Answer May Shock You.

One criticism that is often leveled against using resampling methods (such as cross-validation) to measure model performance is that there is no correlation between the CV results and the true error rate.

Let's look at this with some simulated data. While this assertion is often correct, there are a few reasons why you shouldn't care.

The Setup

First, I simulated some 2-class data using this simulation system. There are 15 predictors in the data set. Many nonlinear classification models can achieve an area under the ROC curve in the low 0.90's on these data. The training set contained 500 samples and a 125 sample test set was also simulated.

I used a radial basis function support vector machine to model the data with a single estimate of the kernel parameter sigma and 10 values of the SVM cost parameter (on the log2 scale). The code for this set of simulations can be found here so that you can reproduce the results.

Models were fit for each of the 10 submodels and five repeats of 10-fold cross-validation were used to measure the areas under the ROC curve. The test set results were also calculated as well as a large sample test set that approximates the truth (and is labeled as such below). All the results were calculated for all of the 10 SVM submodels (over cost). This simulation was conducted 50 times. Here is one example of how the cost parameter relates to the area under the ROC curve:


The Bad News

When you look at the results, there is little to no correlation between the resampling ROC estimates and the true area under that curve:

The correlations were highest (0.54) when the cost values were low (which is also where the model performed poorly). Under the best cost value, the correlation was even worse (0.01).

However, note that the 125 sample test set estimates do not appear to have a high fidelity to the true values either:

The Good News

We really shouldn't care about this, or at least we are measuring the effectiveness in the wrong way. High correlation would be nice but could result in a strong relationship that does not reflect accuracy of the resampling procedure. This is basically the same argument that we make against using R2.

Let's look at the root mean squared error (RMSE) instead. The RMSE can be decomposed into two quantities:

  • the bias reflects how far the resampling estimate is from the true value (which we can measure in our simulations).
  • the variance of the resampling estimate

RMSE is mostly the squared bias plus the variance.

Two things can be seem in the bias graph below. First, the bias is getting better as cost increases. This shouldn't be a surprise since increasing the cost value coerces the SVM model to be more adaptive to the (training) data. Second, the bias scale is exceedingly small (since the area under the ROC curve is typically between 0.50 and 1.00). This is true even at its worst.

The standard deviation curve below shows that the model noise is minimized when performance is best and resembles an inverted version of the curve shown in the Bad News section. This is because the SVM model is pushing against the best performance. As Tolstoy said, "all good models resemble one another, each crappy model is crappy in its own way." (actually, he did not say this). However, note the scale again. These are not large numbers.

Looking at the RMSE of the model, which is the in the same units as the AUC values, the curve movies around a lot but the magnitude of the values are very low. This can obviously be affected by the size of the training set, but 500 samples is not massive for this particular simulation system.


So the results here indicate that:

  1. yes the correlation is low but
  2. the overall RMSE is very good.

Accuracy is arguably a much better quality to have relative to correlation.

Feature Selection Strikes Back (Part 1)

In the feature selection chapter, we describe several search procedures ("wrappers") that can be used to optimize the number of predictors. Some techniques were described in more detail than others. Although we do describe genetic algorithms and how they can be used for reducing the dimensions of the data, this is the first of series of blog posts that look at them in practice using simulated data described in a previous post.

Genetic algorithms are optimization tools that search for the best solution by mimicking the evolution of a population. A set of predictor subsets are evaluated in terms of their model performance and the best sets combine randomly to form new subsets. In the GA lingo, each iteration has a collection of subsets (i.e. population of chromosomes) with a corresponding model performance value (i.e. their fitness values). At each step of reproduction, there is some probability of random mutations. This has the effect of randomly turning some predictors off or on in each subset. The algorithm continues for a set number of generations.

One question is how to evaluate the fitness function. There are a few options:

  • For each subset, employ the same cross-validation approach used with the full data set. We know from the literature that this will not be a good estimate of future performance because of over-fitting to the predictor set. However, can it be used to differentiate good subsets from bad subsets?
  • Use the test set. This is a poor choice since the data can no longer provide an unbiased view of model performance. Single test sets usually have more noise than resampled estimates.
  • Set aside some data from the training set for calculating model performance for a given subset. We may eventually end up over-fitting to these data, so should we randomly select a different hold-out set each time? This will add some noise to the fitness values.

In the literature, how is this usually handled? From what I've found, internal cross-validated accuracy is used, meaning that the model is cross-validated within the feature selection. For this reason, there is a high likelihood that the estimate of the model's performance will be optimistic (since it does not reflect the uncertainty induced by the search procedure).

For this example, we'll simulate 500 training set samples and add a total of 200 non-informative predictors. For the extra predictors, 100 will be a set of uncorrelated standard normal random variables while 100 will be multivariate normal with a pre-defined correlation structure. The correlated set will have variances of 1 and an auto-regressive structure (AR1). While the is no time component to this model, using this structure will simulate predictors with various levels of correlation. To do this, a function called twoClassSim is used. The code for this document is here, so you can see and use this function yourself.

Three sets of data were simulated: a training set of 500 samples, a test set of 200 samples and a very large set that will help us approximate the true error rate.

training <- twoClassSim(500, noiseVars = 100, 
                        corrVar = 100, corrValue = 0.75)
testing <- twoClassSim(200, noiseVars = 100, 
                       corrVar = 100, corrValue = 0.75)
large <- twoClassSim(10000, noiseVars = 100, 
                     corrVar = 100, corrValue = 0.75)

## Get the names of the truly active predictors
realVars <- names(training)
realVars <- realVars[!grepl("(Corr)|(Noise)", realVars)]

## We will use cross-validation later, so we setup the folds here so we
## can make sure all of the model fits use the same data (not really
## required, but helpful)
cvIndex <- createMultiFolds(training$Class, times = 2)
ctrl <- trainControl(method = "repeatedcv", 
                     repeats = 2, 
                     classProbs = TRUE, 
                     summaryFunction = twoClassSummary, 
                     allowParallel = FALSE, 
                     index = cvIndex)

For these data, one model that has fairly good performance is quadratic discriminant analysis (QDA). This model can generate non-linear class boundaries (i.e. quadratic patterns) and models the covariance matrix of the predictors differently for each class. This means that for two classes and p predictors, a total of p (p+1) parameters are estimated (and these are just the variance parameters). With a large number of non-informative predictors, this can negatively affect model performance.

If we knew the true predictor set, what would we expect in terms of performance? Although QDA has no tuning parameters, I'll use two repeats of 10-fold cross-validation to estimate the area under the ROC curve. Additionally, the AUC will also be derived from the test set and using the large sample set too.

The resampled area under the curve was the area under the ROC curve was 0.91 and the test set estimate was 0.946. There some difference there and the large sample estimate of the AUC is 0.928. Overall, performance is pretty good.

Now, what happens when you use all the available predictors?

The estimates of performance for the AUC were: 0.505 (resampling), 0.492 (test set) and 0.515 (large-sample). All of these indicate that QDA tanked because of the excess variables.

The ROC curves for the large-sample predictions are:


Another nasty side-effect when you have a large number of predictors (215) with a relatively small set of training set data (500) with several of the classical discriminant models is that the class probabilities become extremely polarized. We demonstrate this in the book and the same issue occurs here. The plot below shows histograms of the Class 1 probability for the large-sample set for each model. There are panels for each of the true classes.


The model with the true predictors has fairly well-calibrated class probabilities while the probabilities for the model with all the predictors are concentrated near zero and one. This can occur for both linear and quadratic discriminant analysis as well as naive Bayes models. In the book, we also show that this is independent of the amount of signal in the data.

Clearly, there is a need for some feature selection. To start, I used a genetic algorithm to maximize performance of the QDA model. The search procedure for these data used most of the defaults from the GA package:

  • 400 generations
  • an initial population of 50 chromosomes
  • a mutation probability of 10%
  • a cross-over probability of 80%
  • elitism of 2
  • the area under the ROC curve was estimated using two repeats of 10-fold cross-validation was used

In total, about 400 * 50 * 20 = 400,000 QDA models were evaluated. I also ran the models within each generation in parallel, which makes a good dent in the computation time.

To do this, I used a slightly modified version of the GA R package. Our changes enabled us to save the chromosome value for the best result per generation (for further analysis) and use parallel processing. The package maintainer (Luca Scrucca) has been testing these additions.

I have to define a fitness function that defines what should be maximized. I'll use caret to tune the model and return the resampled estimate of the area under the ROC curve:

## 'ind' will be a vector of 0/1 data denoting which predictors are being
## evaluated.
ROCcv <- function(ind, x, y, cntrl) {
    ind <- which(ind == 1)
    ## In case no predictors are selected:
    if (length(ind) == 0) return(0)
    out <- train(x[, ind, drop = FALSE], y, 
                 method = "qda", 
                 metric = "ROC", 
                 trControl = cntrl)
    ## this function returns the resampled ROC estimate
    caret:::getTrainPerf(out)[, "TrainROC"]

Now, to run the algorithm with the GA package:

ga_resampled <- ga(type = "binary", 
                   fitness = ROCcv, 
                   min = 0, 
                   max = 1, 
                   maxiter = 400, 
                   nBits = ncol(training) - 1, 
                   names = names(training)[-ncol(training)], 
                   ## These options are passed through the ga funciton
                   ## and into the ROCcv function
                   x = training[, -ncol(training)], 
                   y = training$Class, 
                   cntrl = ctrl, 
                   ## These two options are not yet in the GA package.
                   keepBest = TRUE, 
                   parallel = TRUE)

The results are summarized in the image below, where the three estimates of the area under the ROC curve is shown. The size of the points is indicative of the number of predictors used in the best chromosome of each generation.


The resampled estimate has no idea that it is embedded inside of a feature selection routine, so it does not factor in that variability. The search steadily increases the AUC until it converges. However, the test set AUC (as well as the large-sample version) initially increase but then converge to a much smaller value that the cross-validation results would lead one to believe. These two pessimistic estimates of the AUC appear to be in-line with one another although the large-sample results are slightly lower in many generations. It looks at though the model is over-fitting to the predictors and the resampling procedure is not picking up on this.

As previously mentioned, another tactic is to utilize a separate test set to measure the area under the ROC curve. If we have a lot of data, it may be a good idea to have an "evaluation" set of samples to measure performance during feature selection and keep a different (true) test set to only use at the end.

Let's sacrifice our test set for the genetic algorithm and see if this helps. The new objective funciotn is:

## An added 'test' argument...
ROCtest <- function(ind, x, y, cntrl, test) {
    ind <- which(ind == 1)
    if (length(ind) == 0) return(0)
    modFit <- qda(x[, ind], y)
    testROC <- roc(test$Class, 
                           test[, ind, drop = FALSE])$posterior[, 1],
                   levels = rev(levels(y)))

The updated call to the ga function is:

ga_test <- ga(type = "binary", 
              fitness = ROCtest, 
              min = 0, 
              max = 1, 
              maxiter = 1000, 
              nBits = ncol(training) - 1, 
              names = names(training)[-ncol(training)], 
              x = training[,-ncol(training)], 
              y = training$Class, 
              cntrl = ctrl, 
              test = testing, 
              keepBest = TRUE, 
              parallel = TRUE)

Here are the results:


If this were not a simulated data set, we would only see the green curve. There is a similar pattern between the resampled ROC and the results from our evaluation set (formerly known as the test set). However, the GA is still selecting too many predictors and, as a consequence, the true performance is still pretty poor. Basically, the evaluation set is not showing the degradation of performance due to the non-informative predictors (i.e. we are over-fitting to the evaluation set).

The genetic algorithm converged on a subset size of 97 predictors. This included 7 of the 10 linear predictors, 1 of the non-linear terms and both of the terms that have an interaction effect in the model. Looking across the generations, we can see the frequency that each type of non-informative predictor was retained:


Let's now fit a QDA model based on these predictors and see what the large-sample ROC curve looks like:

## The bestBinary item is a list of the best chromosomes from
## each generation. We will look at the last one and fit a QDA
## model.
finalVars <- ga_test@bestBinary[[length(ga_test@bestBinary)]]
finalFit <- qda(training[, finalVars], training$Class)
## Get the large sample results:
finalLarge <- roc(large$Class, 
                          large[, finalVars])$posterior[, 1], 
                  levels = rev(levels(large$Class)))

## Call:
## roc.default(response = large$Class, predictor = predict(finalFit,     large[, finalVars])$posterior[, 1], levels = rev(levels(large$Class)))
## Data: predict(finalFit, large[, finalVars])$posterior[, 1] in 4684 controls (large$Class Class2) < 5316 cases (large$Class Class1).
## Area under the curve: 0.622

The large-sample estimate of the area under the ROC curve is 0.622, which is not as good as the true model (0.928) but better than the worst-case scenario (0.515). The ROC curves are:


In the next blog post, I'll look at other ways of improving the genetic algorithm. Before we do that though, let's make a comparison to another feature selection procedure: recursive feature elimination (RFE). RFE is basically a backwards selection procedure that uses a some variable importance metric to rank the predictors. If will use the area under the ROC curve for each predictor to quantify its relative importance.

Here, all subset sizes were evaluated and the procedure was cross-validated using the same two repeats of ten-fold cross-validation. The QDA models were trained in the same manner:

## caret includes some pre-defined code for RFE, including code to do
## linear discriminant analysis (LDA). The code for LDA and QDA are
## almost identical, so we can recycle the LDA code and only change
## the model fit function (to use QDA) and the function that computes
## the model summary statistics (so that we can get the area under the 
## ROC curve):
qdaFuncs <- ldaFuncs
qdaFuncs$fit <- function(x, y, first, last, ...) {
    qda(x, y, ...)
qdaFuncs$summary <- twoClassSummary

qdaRfe <- rfe(x = training[, -ncol(training)], 
              y = training$Class, 
              sizes = 1:(ncol(training) - 1), 
              metric = "ROC", 
              rfeControl = rfeControl(method = "repeatedcv", 
                                      repeats = 2, 
                                      index = cvIndex, 
                                      functions = qdaFuncs))

Here, there are a maximum of 215 * 20 = 4,300 models being evaluated. In comparison to the genetic algorithm, not only is this fewer models, but many of them have smaller subset sizes than those shown in the figures above.

The potential down-side to this feature selection technique is that it is greedy; once a predictor is discarded, it is never considered again in a subset. For this reason, RFE may achieve a local optimum where as the genetic algorithm has the ability to find a global optimum.

This approach filtered far more predictors. The profile of the area under the ROC curve (with all three estimates of the area under the ROC curve):


The RFE algorithm fit the final QDA model to 11 predictors, including 5 linear effects and both predictors associated with the interaction. However, it did not capture any of the non-linear terms and picked up 4 non-informative predictors. Of the noise predictors, it was not biased towards the correlated set (only 1 of the 4 were correlated). The estimates were consistent with one another and the area under the ROC curve was estimated as 0.907 using resampling, 0.893 using the test set and 0.885 using the large-sample holdout. The consistency of these values is probably due to the RFE procedure constructing external resampling, meaning that each cross-validation did a separate feature elimination sequence and used held-out data to estimate performance. This prevented the type of overly optimistic estimates that were seen earlier.

The large-sample ROC curves are:


So far, RFE is more effective than the genetic algorithm at sub-setting features for these data.

The next blog post will look at modifications of the genetic algorithm that will improve performance.

The R packages loaded at the time of the analysis were: base (2.15.2), caret (5.15-87), cluster (1.14.3), datasets (2.15.2), doMC (1.2.5), foreach (1.4.0), GA (1.3), ggplot2 (, graphics (2.15.2), grDevices (2.15.2), iterators (1.0.6), knitr (0.8), lattice (0.20-10), latticeExtra (0.6-24), MASS (7.3-22), methods (2.15.2), multicore (0.1-7), nlme (3.1-105), plyr (1.7.1), pROC (1.5.4), RColorBrewer (1.0-5), reshape2 (1.2.1), stats (2.15.2) and utils (2.15.2)

Benchmarking Machine Learning Models Using Simulation

What is the objective of most data analysis? One way I think about it is that we are trying to discover or approximate what is really going on in our data (and in general, nature). However, I occasionally run into people think that if one model fulfills our expectations (e.g. higher number of significant p-values or accuracy) than it must be better than any other model that does not. For most data sets, we don't know what the truth is, so this is a problem.

Computational biology/bioinformatics are particularly bad in this way. In many cases, the cost, time and complexity of high dimensional biology experiments prevents a solid, methodical validation of analysis of the initial data set. This is has be verified by a number of different publications.

I was talking to someone recently who was describing their research with ~150 samples and ~50,000 predictors. They used the same sample set to do feature selection and then build predictive models. The results was a random forest model based on about 150 predictors. The validation was based on running some of the same samples using a different technology. When I asked if there there would be any external validation, their response was "we're never going to get a 5,000 sample clinical trial to check the results." While true (and a bit dramatic), it is not an excuse to throw out good methodology. In fact, you would think that a lack of a clear path to validation would make people be more dogmatic about methodology...

When I'm trying to evaluate any sort of statistic method, I try to use a good simulation system so that I can produce results where I know the truth. Examples are the "Friedman" simulations systems for regression modeling, such as the 'Friedman 3' model. This is a non-linear regression function of four real predictors:

y = atan ((x2 x3 - (1/(x2 x4)))/x1) + error

The the mlbench package has this in R code as well as other simulation systems.

I've been looking for a system that can be used to test a few different aspects of classification models:

  • class imbalances
  • non-informative predictors
  • correlation amoung the predictors
  • linear and nonlinear signals

I spent a few hours developing one. It models the log-odds of a binary event as a function of true signals using several additive "sets" of a few different types. First, there are two main effects and an interaction:

intercept - 4A + 4B + 2AB 

(A and B are the predictors) The intercept is a parameter for the simulation and can be used to control the amount of class imbalance.

The second set of effects are linear with coefficients that alternate signs and have values between 2.5 and 0.025. For example, if there were six predictors in this set, their contribution to the log-odds would be

-2.50C + 2.05D -1.60E + 1.15F -0.70G + 0.25H

The third set is a nonlinear function of a single predictor ranging between [0, 1] called J here:

(J^3) + 2exp(-6(J-0.3)^2) 

I saw this in one of Radford Neal's presentations but I couldn't find an exact reference for it. The equation produces an interesting trend:


The fourth set of informative predictors are copied from one of Friedman's systems and use two more predictors (K and L):


All of these effects are added up to model the log-odds. This is used to calculate the probability of a sample being in the first class and a random uniform number is used to actually make the assignment of the actual class.

We can also add non-informative predictors to the data. These are random standard normal predictors and can be optionally added to the data in two ways: a specified number of independent predictors or a set number of predictors that follow a particular correlation structure. The only two correlation structure that I've implemented are

  • compound-symmetry (aka exchangeable) where there is a constant correlation between all the predictors

  • auto-regressive 1 [AR(1)]. While there is no time component to these data, we can use this structure to add predictors of varying levels of correlation. For example, if there were 4 predictors and r (for rho) was the correlation parameter, the between predictor correlaiton matrix would be

      | 1             sym   |
      | r    1              |
      | r^2  r    1         |
      | r^3  r^2  r    1    |
      | r^4  r^3  r^2  r  1 |

For AR(1), correlations decrease as the predictors are "father away" from each other (in order). Simulating ten predictors (named Corr01 - Corr10) with a correlation parameter of 0.75 yields the following between-predictor correlation structure:


To demonstrate, let's take a set of data and see how a support vector machine performs:

training <- twoClassSim(  300, noiseVars = 100, 
                        corrVar = 100, corrValue = 0.75)
testing  <- twoClassSim(  300, noiseVars = 100, 
                        corrVar = 100, corrValue = 0.75)
large    <- twoClassSim(10000, noiseVars = 100, 
                        corrVar = 100, corrValue = 0.75)

The default for the number of informative linear predictors is 10 and the default intercept of -5 makes the class frequencies fairly balanced:


## Class1 Class2 
## 0.5457 0.4543

We'll use the train function to tune and train the model:


ctrl <- trainControl(method = "repeatedcv", 
                     repeats = 3, classProbs = TRUE, 
                     summaryFunction = twoClassSummary)

fullModel <- train(Class ~ ., data = training, 
                   method = "svmRadial", 
                   preProc = c("center", "scale"), 
                   tuneLength = 8, 
                   metric = "ROC", 
                   trControl = ctrl)


## 300 samples
## 215 predictors
##   2 classes: 'Class1', 'Class2' 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validation (10 fold, repeated 3 times) 
## Summary of sample sizes: 270, 270, 270, 270, 270, 270, ... 
## Resampling results across tuning parameters:
##   C     ROC    Sens   Spec     ROC SD  Sens SD  Spec SD
##   0.25  0.636  1      0        0.0915  0        0      
##   0.5   0.635  1      0.00238  0.0918  0        0.013  
##   1     0.644  0.719  0.438    0.0929  0.0981   0.134  
##   2     0.68   0.671  0.574    0.0863  0.0898   0.118  
##   4     0.69   0.673  0.579    0.0904  0.0967   0.11   
##   8     0.69   0.673  0.579    0.0904  0.0967   0.11   
##   16    0.69   0.673  0.579    0.0904  0.0967   0.11   
##   32    0.69   0.673  0.579    0.0904  0.0967   0.11   
## Tuning parameter 'sigma' was held constant at a value of 0.00353
## ROC was used to select the optimal model using  the largest value.
## The final values used for the model were C = 4 and sigma = 0.00353.

Cross-validation estimates the best area under the ROC curve to be 0.69. Is this an accurate estimate? The test set has:

fullTest <- roc(testing$Class, 
                predict(fullModel, testing, type = "prob")[,1], 
                levels = rev(levels(testing$Class)))

## Call:
## roc.default(response = testing$Class, predictor = predict(fullModel,     testing, type = "prob")[, 1], levels = rev(levels(testing$Class)))
## Data: predict(fullModel, testing, type = "prob")[, 1] in 140 controls (testing$Class Class2) < 160 cases (testing$Class Class1).
## Area under the curve: 0.78

For this small test set, the estimate is 0.09 larger than the resampled version. How do both of these compare to our approximation of the truth?

fullLarge <- roc(large$Class, 
                 predict(fullModel, large, type = "prob")[, 1], 
                 levels = rev(levels(testing$Class)))

## Call:
## roc.default(response = large$Class, predictor = predict(fullModel,     large, type = "prob")[, 1], levels = rev(levels(testing$Class)))
## Data: predict(fullModel, large, type = "prob")[, 1] in 4543 controls (large$Class Class2) < 5457 cases (large$Class Class1).
## Area under the curve: 0.733

How much did the presence of the non-informative predictors affect this model? We know the true model, so we can fit that and evaluate it in the same way:

realVars <- names(training)
realVars <- realVars[!grepl("(Corr)|(Noise)", realVars)]

trueModel <- train(Class ~ ., 
                   data = training[, realVars], 
                   method = "svmRadial", 
                   preProc = c("center", "scale"), 
                   tuneLength = 8, 
                   metric = "ROC", 
                   trControl = ctrl)

## 300 samples
##  15 predictors
##   2 classes: 'Class1', 'Class2' 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validation (10 fold, repeated 3 times) 
## Summary of sample sizes: 270, 270, 270, 270, 270, 270, ... 
## Resampling results across tuning parameters:
##   C     ROC    Sens   Spec   ROC SD  Sens SD  Spec SD
##   0.25  0.901  0.873  0.733  0.0468  0.0876   0.136  
##   0.5   0.925  0.873  0.8    0.0391  0.0891   0.11   
##   1     0.936  0.871  0.826  0.0354  0.105    0.104  
##   2     0.94   0.881  0.852  0.0356  0.0976   0.0918 
##   4     0.936  0.875  0.857  0.0379  0.0985   0.0796 
##   8     0.927  0.835  0.852  0.0371  0.0978   0.0858 
##   16    0.917  0.821  0.843  0.0387  0.11     0.0847 
##   32    0.915  0.821  0.843  0.0389  0.11     0.0888 
## Tuning parameter 'sigma' was held constant at a value of 0.0573
## ROC was used to select the optimal model using  the largest value.
## The final values used for the model were C = 2 and sigma = 0.0573.

Much higher! Is this verified by the other estimates?

trueTest <- roc(testing$Class, 
                predict(trueModel, testing, type = "prob")[, 1], 
                levels = rev(levels(testing$Class)))

## Call:
## roc.default(response = testing$Class, predictor = predict(trueModel,     testing, type = "prob")[, 1], levels = rev(levels(testing$Class)))
## Data: predict(trueModel, testing, type = "prob")[, 1] in 140 controls (testing$Class Class2) < 160 cases (testing$Class Class1).
## Area under the curve: 0.923

trueLarge <- roc(large$Class, 
                 predict(trueModel, large, type = "prob")[, 1], 
                 levels = rev(levels(testing$Class)))

## Call:
## roc.default(response = large$Class, predictor = predict(trueModel,     large, type = "prob")[, 1], levels = rev(levels(testing$Class)))
## Data: predict(trueModel, large, type = "prob")[, 1] in 4543 controls (large$Class Class2) < 5457 cases (large$Class Class1).
## Area under the curve: 0.926

At this point, we might want to look and see what would happen if all 200 non-informative predictors were uncorrelated etc. At least we have a testing tool to make objective statements.

Code to create this can be found here and will end up making its way into the caret  package.

Any suggestions for simulation systems?