Simulated Annealing Feature Selection

As previously mentioned, caret has two new feature selection routines based on genetic algorithms (GA) and simulated annealing (SA). The help pages for the two new functions give a detailed account of the options, syntax etc.

The package already has functions to conduct feature selection using simple filters as well as recursive feature elimination (RFE). RFE can be very effective. It initially ranks the features, then removes them in sequence, starting with the least important variables. It is a greedy method since it never backtracks to reevaluate a subset. Basically, it points itself in a single direction and proceeds. The main question is how far to go in that direction.

These two new functions, gafs and safs, conduct a global search over the feature set. This means that the direction of the search is constantly changing and the algorithm may re-evaluate solutions that are similar to (or the same as) previously evaluated feature sets. This is good and bad. If a feature set is not easily "rankable" then it should do better than RFE. However, it will require a larger number of model fits and has a greater chance of overfitting the features to the training set.

I won't go over the details of GAs or SA. There are a lot of references on these and you can always check out Chapter 19 of our book. Also, since there were two previous blog posts about genetic algorithms, I'll focus on simulated annealing in this one.
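For orientation only, the core SA loop for feature selection fits in a few lines. The following is just a schematic sketch (it is not how safs is implemented internally): perturb the current subset, keep better candidates, and occasionally keep worse ones with a probability that shrinks as the search "cools". The fitness function stands in for whatever performance measure is being maximized:

sa_sketch <- function(fitness, n_vars, iters = 100, start_frac = 0.2) {
  ## random initial subset with roughly 20% of the variables "on"
  current <- rbinom(n_vars, 1, start_frac)
  current_fit <- fitness(current)
  best <- current
  best_fit <- current_fit
  for (i in seq_len(iters)) {
    ## perturb: toggle a single randomly chosen variable
    candidate <- current
    flip <- sample(n_vars, 1)
    candidate[flip] <- 1 - candidate[flip]
    cand_fit <- fitness(candidate)
    ## accept better subsets; accept worse ones with a probability that
    ## decreases as the temperature drops
    temp <- 1 / i
    if (cand_fit > current_fit ||
        runif(1) < exp((cand_fit - current_fit) / temp)) {
      current <- candidate
      current_fit <- cand_fit
    }
    if (current_fit > best_fit) {
      best <- current
      best_fit <- current_fit
    }
  }
  best
}

In safs, the fitness value is a resampled performance estimate and the perturbation, cooling and restart rules are more involved; the sketch is only meant to show the shape of the algorithm.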

To demonstrate I'll use the Alzheimer's disease data from Section 19.6 (page 502). It contains biomarker and patient demographic data that are being used to predict who will eventually have cognitive impairment. The original data contained instances for 333 subjects and information on 133 predictors.

In the book, we fit a number of different models to these data and conducted feature selection. Using RFE, linear discriminant analysis, generalized linear models and naive Bayes seemed to benefit from removing non-informative predictors. However, K-nearest neighbors and support vector machines had worse performance as RFE proceeded to remove predictors.

With KNN, there is no model-based method for measuring predictor importance. In these cases, rfe will use the area under the ROC curve for each individual predictor for ranking. Using that approach, here is the plot of KNN performance over subset sizes:

The values shown on the plot are the average area under the ROC curve measured using 5 repeats of 10-fold cross-validation. This implies that the full set of predictors is required. I should also note that, even with the entire predictor set, the KNN model did worse than LDA and random forests.
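For reference, an RFE search like the one above could be set up roughly as follows, using the training data frame and predVars vector described below. This is only a sketch, not the exact code from the book: the subset sizes are placeholders and, for brevity, the number of neighbors is tuned on accuracy within each subset. caretFuncs lets rfe use train to fit the KNN model, and twoClassSummary is swapped in so the external estimates are ROC, sensitivity and specificity:

library(caret)

knnFuncs <- caretFuncs
knnFuncs$summary <- twoClassSummary

knn_rfe <- rfe(x = training[, predVars],
               y = training$Class,
               sizes = c(10, 20, 50, 75, 100),   # placeholder subset sizes
               metric = "ROC",
               rfeControl = rfeControl(functions = knnFuncs,
                                       method = "repeatedcv",
                                       repeats = 5),
               ## these options are passed on to train():
               method = "knn",
               tuneLength = 20,
               preProc = c("center", "scale"),
               trControl = trainControl(method = "cv"))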

While RFE couldn't find a better subset of predictors, that does not mean that one doesn't exist. Can simulated annealing do better?

The code to load and split the data is in the AppliedPredictiveModeling package and you can find the markdown for this blog post linked at the bottom. We have a data frame called training that has all the data used to fit the models. The outcome is a factor called Class and the predictor names are in a character vector called predVars. The resampling folds and the control object for the search are defined below.

Let's define the control object that will specify many of the important options for the search. There are a lot of details to spell out, such as how to train the model, how new samples are predicted etc. There is a pre-made list called caretSA that contains a starting template.

names(caretSA)
[1] "fit"            "pred"           "fitness_intern" "fitness_extern"
[5] "initial"        "perturb"        "prob"           "selectIter"    

These functions are detailed on the package's web page. We will make a copy of this and change the method of measuring performance using hold-out samples:

knnSA <- caretSA
knnSA$fitness_extern <- twoClassSummary

This will compute the area under the ROC curve and the sensitivity and specificity using the default 50% probability cutoff. These functions will be passed into the safs function (see below).
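To see what that summary function expects and returns, here is a small illustration on made-up predictions (the data frame needs obs, pred, and one probability column per class level):

library(caret)
set.seed(1)
fake <- data.frame(obs  = factor(sample(c("Impaired", "Control"), 20, replace = TRUE),
                                 levels = c("Impaired", "Control")),
                   pred = factor(sample(c("Impaired", "Control"), 20, replace = TRUE),
                                 levels = c("Impaired", "Control")),
                   Impaired = runif(20))
fake$Control <- 1 - fake$Impaired
## returns a named vector with ROC, Sens and Spec:
twoClassSummary(fake, lev = levels(fake$obs))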

safs will conduct the SA search inside a resampling wrapper defined by the index object that we create below (50 total resamples). For each subset, it computes an out-of-sample (i.e. "external") performance estimate to make sure that we are not overfitting to the feature set. This code should reproduce the same folds as those used in the book:

library(caret)
set.seed(104)
index <- createMultiFolds(training$Class, times = 5)

Here is the control object that we will use:

ctrl <- safsControl(functions = knnSA,
                    method = "repeatedcv",
                    repeats = 5,
                    ## Here are the exact folds to use:
                    index = index,
                    ## What should we optimize? 
                    metric = c(internal = "ROC",
                               external = "ROC"),
                    maximize = c(internal = TRUE,
                                 external = TRUE),
                    improve = 25,
                    allowParallel = TRUE)

The option improve specifies how many SA iterations can occur without improvement. Here, the search restarts at the last known improvement after 25 iterations where the internal fitness value has not improved. The allowParallel option tells the package that it can parallelize the external resampling loops (if parallel processing is available).

The metric argument above is a little different than it would be for train or rfe. SA needs an internal estimate of performance to help guide the search. To that end, we will also use cross-validation to tune the KNN model and measure performance. Normally, we would use the train function to do this. For example, using the full set of predictors, this code might be used to tune the model:

train(x = training[, predVars],
      y = training$Class,
      method = "knn",
      metric = "ROC",
      tuneLength = 20,
      preProc = c("center", "scale"),
      trControl = trainControl(method = "repeatedcv",
                               ## Produce class prob predictions
                               classProbs = TRUE,
                               summaryFunction = twoClassSummary))

The beauty of using the caretSA functions is that the ... argument is available.

A short explanation of why you should care...

R has this great feature where you can seamlessly pass arguments between functions. Suppose we have this function that computes the means of the columns in a matrix or data frame:

mean_by_column <- function(x, ...) {
  results <- rep(NA, ncol(x))
  for(i in 1:ncol(x)) results[i] <- mean(x[, i], ...)
  results
  }

(aside: don't loop over columns with for; see ?apply or ?colMeans instead)

There might be some options to the mean function that we want to pass in but those options might change for different applications. Rather than making different versions of the function for different option combinations, any argument that we pass to mean_by_column that is not one of its arguments (x, in this case) is passed to wherever the three dots appear inside the function. For the function above, they go to mean. Suppose there are missing values in x:

example <- matrix(runif(100), ncol = 5)
example[1, 5] <- NA
example[16, 1] <- NA
mean_by_column(example)
[1]     NA 0.4922 0.4704 0.5381     NA

mean has an option called na.rm that will remove the missing values before computing the mean. Now, we can pass this into mean_by_column even though it is not one of its arguments.

mean_by_column(example, na.rm = TRUE)
[1] 0.5584 0.4922 0.4704 0.5381 0.4658

Here's why this is relevant. caretSA$fit uses train to fit and possibly tune the model.

caretSA$fit
function (x, y, lev = NULL, last = FALSE, ...) 
train(x, y, ...)

Any options that we pass to safs that are not x, y, iters, differences, or safsControl will be passed to train.

(Even further, any option passed to safs that isn't an option to train gets passed down one more time to the underlying fit function. A good example of this is using importance = TRUE with random forest models. )

So, putting it all together:

set.seed(721)
knn_sa <- safs(x = training[, predVars],
               y = training$Class,
               iters = 500,
               safsControl = ctrl,

               ## Now we pass options to `train` via "knnSA":               
               method = "knn",
               metric = "ROC",
               tuneLength = 20,
               preProc = c("center", "scale"),
               trControl = trainControl(method = "repeatedcv",
                                        repeats = 2,
                                        classProbs = TRUE,
                                        summaryFunction = twoClassSummary,
                                        allowParallel = FALSE))

To recap:

  • the SA is conducted many times inside of resampling to get an external estimate of performance.
  • inside of this external resampling loop, the KNN model is tuned using another, internal resampling procedure.
  • the area under the ROC curve is used to guide the search (internally) and to know if the SA has overfit to the features (externally)
  • in the code above, when safs is called with other options (e.g. method = "knn"),
    • safs passes the method, metric, tuneLength, preProc, and trControl options to caretSA$fit
      • caretSA$fit passes these options to train

After external resampling, the optimal number of search iterations is determined and one last SA is run using all of the training data.

Needless to say, this executes a lot of KNN models. When I ran this, I used parallel processing to speed things up using the doMC package.
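For anyone following along, registering a parallel backend before calling safs is all that is needed; for example (assuming a Unix-alike with the doMC package installed):

library(doMC)
registerDoMC(cores = 4)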

Here are the results of the search:

knn_sa
Simulated Annealing Feature Selection

267 samples
132 predictors
2 classes: 'Impaired', 'Control' 

Maximum search iterations: 500 
Restart after 25 iterations without improvement (15.6 restarts on average)

Internal performance values: ROC, Sens, Spec
Subset selection driven to maximize internal ROC 

External performance values: ROC, Sens, Spec
Best iteration chose by maximizing external ROC 
External resampling method: Cross-Validated (10 fold, repeated 5 times) 

During resampling:
  * the top 5 selected variables (out of a possible 132):
    Ab_42 (96%), tau (92%), Cystatin_C (82%), NT_proBNP (82%), VEGF (82%)
  * on average, 60 variables were selected (min = 45, max = 75)

In the final search using the entire training set:
   * 59 features selected at iteration 488 including:
     Alpha_1_Antitrypsin, Alpha_1_Microglobulin, Alpha_2_Macroglobulin, Angiopoietin_2_ANG_2, Apolipoprotein_E ... 
   * external performance at this iteration is

       ROC       Sens       Spec 
     0.852      0.198      0.987 

The algorithm automatically chose the subset created at iteration 488 of the SA (based on the external ROC) which contained 59 out of 133 predictors.

We can also plot the performance values over iterations using the plot function. By default, this uses the ggplot2 package, so we can add a theme at the end of the call:
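(The exact theme used for the figure isn't shown in the post; theme_bw below is just an example.)

library(ggplot2)
plot(knn_sa) + theme_bw()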

Each of the data points for the external fitness is an average of the 50 resampled ROC values. The most improvement was found in the first 200 iterations. The internal estimate is generally more optimistic than the external estimate. It also tends to increase while the external estimate is relatively flat, indicating some overfitting. The plot above indicates that fewer iterations would probably give us equivalent performance. Instead of repeating the SA with fewer iterations, the update function can be used to pick a different subset size.
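For example, something along these lines would adopt the subset from an earlier iteration (a sketch; the iteration number here is arbitrary):

knn_sa_300 <- update(knn_sa, iter = 300,
                     x = training[, predVars],
                     y = training$Class)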

Let's compare the RFE and SA profiles. For RFE and SA, the ROC values are averaged over the 50 resamples. In the case of SA, the number of predictors is also an average. For example, at iteration 100 of the SA, here is the subset size distribution across the 50 resamples:

Here are superimposed smoothed trend lines for the resampling profiles of each search method:

Recall that the same cross-validation folds were used for SA and RFE, so this is an apples-to-apples comparison. SA searched a smaller range of subset sizes over the iterations in comparison to RFE. The code here starts the initial subset with a random 20% of the possible features; the subset size tended to increase as the search continued and then stabilized at about 55 predictors.

How does this predictor subset perform on the test data?

library(pROC)
roc(testing$Class,
    predict(knn_sa, testing)$Impaired,
    levels = rev(levels(testing$Class)))
Call:
roc.default(response = testing$Class, 
            predictor = predict(knn_sa, testing)$Impaired, 
            levels = rev(levels(testing$Class)))

Data: predict(knn_sa, testing)$Impaired in 48 controls 
      (testing$Class Control) <  18 cases (testing$Class 
      Impaired).
      
Area under the curve: 0.848

This is better than the test set results for the RFE procedure. Note that the test set AUC is much more in-line with the external estimate of the area under the ROC curve. The bad news is that we evaluated many more models than the RFE procedure and the SA process was slightly more than 11-fold slower than RFE to complete. Good things take time. Here is a parallel-coordinate plot of the individual resampling results, matched by fold:

The difference is statistically significant too:

summary(diff(rs, metric = "ROC"))
Call:
summary.diff.resamples(object = diff(rs, metric = "ROC"))

p-value adjustment: bonferroni 
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

ROC 
    RFE      SA     
RFE          -0.0504
SA  8.02e-06        

The genetic algorithm code in gafs has very similar syntax to safs and also has pre-made functions.

The knitr file for these analyses can be found here.

Feature Selection 2 - Genetic Boogaloo

Previously, I talked about genetic algorithms (GA) for feature selection and illustrated the algorithm using a modified version of the GA R package and simulated data. The data were simulated with 200 non-informative predictors and 12 linear effects and three non-linear effects. Quadratic discriminant analysis (QDA) was used to model the data. The last set of analyses showed, for these data, that:

  • The performance of QDA suffered from including the irrelevant terms.
  • Using a resampled estimate of the area under the ROC curve resulted in selection bias and severely over-fit to the predictor set.
  • Using a single test set to estimate the fitness value did marginally better, with an area under the ROC curve of 0.622.
  • Recursive feature elimination (RFE) was also used and found a much better predictor subset with good performance (an AUC of 0.885 using the large-sample data set).

For the genetic algorithm, I used the default parameters (e.g. crossover rate, mutation probability etc). Can we do better with GAs?

One characteristic seen in the last set of analyses is that, as the number of irrelevant predictors increases, there is only a gradual decrease in the fitness value (even though the true performance gets much worse). The initial population of chromosomes for the GA is based on simple random sampling, which means that each predictor has about a 50% chance of being included. Since our true subset is much smaller, the GA should favor smaller sets and move towards cases with fewer predictors... except that it didn't. I think that this didn't happen for two reasons:

  1. The increase in performance caused by removing a single predictor is small. Since there is no "performance cliff" for this model, the GA isn't able to see improvements in performance unless a subset is tested with a very low number of predictors.

  2. Clearly, using the evaluation set and resampling to measure the fitness value did not penalize larger models enough. The plots of these estimates versus the large sample estimates showed that they were not effective measurements. Note that the RFE procedure did not have this issue since the feature selection was conducted within the resampling process (and reduced the effect of selection bias).

Would resampling the entire GA process help? It might but I doubt it. It only affects how performance is measured and does not help the selection of features, which is the underlying issue. I think it would result in another over-fit model and all that the resampling would do would be to accurately tell when the model begins to over-fit. I might test this hypothesis in another post.

There are two approaches that I'll try here to improve the effectiveness of the GA. The first is to modify the initial population. If we have a large number of irrelevant predictors in the model, the GA has difficulty driving the number of predictors down. However, would the converse be true? We are doing feature selection. Why not start the GA with a set of chromosomes that have small predictor sets? Would the algorithm drive up the number of predictors or would it see the loss of performance and keep the number small?

To test this, I simulated the same data sets:

library(caret)

set.seed(468)
training <- twoClassSim(500, noiseVars = 100, 
                        corrVar = 100, corrValue = 0.75)
testing <- twoClassSim(500, noiseVars = 100, 
                       corrVar = 100, corrValue = 0.75)
large <- twoClassSim(10000, noiseVars = 100, 
                     corrVar = 100, corrValue = 0.75)
realVars <- names(training)
realVars <- realVars[!grepl("(Corr)|(Noise)", realVars)]
cvIndex <- createMultiFolds(training$Class, times = 2)
ctrl <- trainControl(method = "repeatedcv", 
                     repeats = 2, 
                     classProbs = TRUE, 
                     summaryFunction = twoClassSummary,
                     allowParallel = FALSE, 
                     index = cvIndex)

The ga function has a parameter for creating the initial population. I'll use one that produces chromosomes with a random 10% of the predictors.

initialSmall <- function(object, ...) 
{
    population <- sample(0:1, 
                         replace = TRUE, 
                         size = object@nBits * object@popSize, 
                         prob = c(0.9, 0.1))
    population <- matrix(population, 
                         nrow = object@popSize, 
                         ncol = object@nBits)
    return(population)
}

Then I'll run the GA again (see the last post for more explanation of this code).

ROCcv <- function(ind, x, y, cntrl) 
{
    library(caret)
    library(MASS)
    ind <- which(ind == 1)
    ## In case no predictors are selected:
    if (length(ind) == 0)  return(0)
    out <- train(x[, ind, drop = FALSE], y, 
                 method = "qda", 
                 metric = "ROC", 
                 trControl = cntrl)
    ## Return the resampled ROC value
    caret:::getTrainPerf(out)[, "TrainROC"]
}

set.seed(137)
ga_small_cv <- ga(type = "binary", 
                  fitness = ROCcv, 
                  min = 0, max = 1, 
                  maxiter = 400, 
                  population = initialSmall, 
                  nBits = ncol(training) - 1, 
                  names = names(training)[-ncol(training)],
                  x = training[, -ncol(training)], 
                  y = training$Class, 
                  cntrl = ctrl, 
                  keepBest = TRUE, 
                  parallel = FALSE)

The genetic algorithm converged on a subset size of 14 predictors. This included 5 of the 10 linear predictors, 0 of the non-linear terms and both of the terms that have an interaction effect in the model. The trends were:

plot_small_cv.png

The resampling and the large sample results have the same pattern until about 75 iterations, after which point they agree in some areas and disagree in others. However, the genetic algorithm begins to over-fit to the predictors. The resampling results do not reflect this but the test set would have picked up the issue.

Applying the resulting model to the large-sample set, the ROC curve is shown below along with the curves from the previous analysis.

rocPlotSmall.png

Starting from small subset sizes appeared to have helped. Would using the evaluation set to estimate the fitness have had the same results?

ROCtest <- function(ind, x, y, cntrl, test) 
{
    library(MASS)
    library(pROC)
    ind <- which(ind == 1)
    if (length(ind) == 0) 
        return(0)
    modFit <- qda(x[, ind], y)
    testROC <- roc(test$Class, 
                   predict(modFit, 
                           test[, ind, drop = FALSE])$posterior[,1], 
                   levels = rev(levels(y)))
    as.vector(auc(testROC))
}
set.seed(137)
ga_small <- ga(type = "binary", 
               fitness = ROCtest, 
               min = 0, max = 1, 
               maxiter = 400, 
               population = initialSmall, 
               nBits = ncol(training) - 1, 
               names = names(training)[-ncol(training)], 
               x = training[, -ncol(training)], 
               y = training$Class, 
               cntrl = ctrl, 
               test = testing, 
               keepBest = TRUE, 
               parallel = FALSE)

This GA showed:

plot_small.png

The pattern is very similar to the previous GA run (where resampling was used).

Given that the resampling process is susceptible to over-fitting, how can we penalize the results based on the size of the subset? The idea of penalization/regularization is pretty common in statistics and machine learning. The most commonly known measure is the Akaike information criterion (AIC), which takes the objective function that is being optimized (e.g. RMSE or likelihood) and penalizes it based on the sample size and number of parameters. That's not very straight-forward here. First, it is very model dependent. In many cases, the number of parameters is not a useful concept. What would be used for tree-based models? In ensemble models, there may be more parameters than data points. What is the objective function? I've been trying to optimize the area under the ROC curve, but there is no theoretical connection between the ROC curve and QDA. QDA can be motivated by Bayes' Rule and, if a probability distribution is specified, it is for the predictor data.

One less theoretical solution uses desirability functions. This technique is used to blend several outcomes into a single measure of desirability. In our case, I want to maximize the area under the ROC curve but minimize the number of predictors in the model. The desirability approach first defines curves that translate both of these characteristics to a [0, 1] scale where zero is unacceptable and one is desirable. Any curve can do. I'll use the parameterization created by Derringer and Suich that defines high and low thresholds for each characteristic and linearizes the desirability in-between those values:

library(desirability)
## The best ROC value is one, 0.5 is the worst
dROC <- dMax(low = 0.5, high = 1)
## The 'best' possible model would have a single 
## predictor and the worst would have everything
dPreds <- dMin(low = 1, high = ncol(training) - 1)
## Show them:
par(mfrow = c(1, 2))
plot(dROC, nonInform = FALSE)
title("Area Under the ROC Curve")
plot(dPreds, nonInform = FALSE)
title("Subset Size")
Dplot.png

The ROC desirability curve is maximized when the area under the curve is high while the subset size curve is most desirable for small values. I can adjust the thresholds in several different ways. For example, I may only want subset sizes less than 100. If I move the high value to 100, any solution with fewer than 100 predictors would be equally acceptable. Another modification is to allow the curves to bend to make each characteristic easier or more difficult to satisfy. The SAS application JMP has a nice tool for configuring desirability functions (obviously, I used an R package above to create them).

The overall desirability is the geometric mean of the individual desirabilities. For two inputs, I multiply them together and take the square root. Since I've multiplied the values together, one consequence is that the overall desirability is unacceptable if any one of the individual values is unacceptable (i.e. a value of zero). This can be avoided by adjusting the individual curve to always be higher than a desirability of zero.
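As a small numeric check of that geometric mean, here is a sketch using the two curves defined above and the desirability package's predict methods (215 is the number of predictors in the simulated data; the ROC value and subset size are arbitrary):

library(desirability)
dROC   <- dMax(low = 0.5, high = 1)
dPreds <- dMin(low = 1, high = 215)
both   <- dOverall(dROC, dPreds)
## e.g. an ROC of 0.9 using 30 predictors:
predict(both, data.frame(ROC = 0.9, NumPred = 30))
## the same value computed by hand:
sqrt(predict(dROC, 0.9) * predict(dPreds, 30))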

We can then maximize the overall desirability and hope that the GA can find a balance between performance and sparsity. I'll define a new fitness function that uses overall desirability as measured with the resampled estimate of the area under the ROC curve:

Dcv <- function(ind, x, y, cntrl) 
{
    library(caret)
    library(MASS)
    library(desirability)
    ind <- which(ind == 1)
    if (length(ind) == 0) return(0)
    out <- train(x[, ind], y, 
                 method = "qda", 
                 metric = "ROC", 
                 trControl = cntrl)
    rocVal <- caret:::getTrainPerf(out)[, "TrainROC"]
    dROC <- dMax(0.5, 1)
    dPreds <- dMin(1, ncol(x))
    ## Combine the two with a geometric mean
    allD <- dOverall(dROC, dPreds)
    ## Calculate the overall desirability value
    predict(allD, data.frame(ROC = rocVal, NumPred = length(ind)))
}

And I will once again use the GA package to search the predictor space:

set.seed(137)
ga_D <- ga(type = "binary", 
           fitness = Dcv, 
           min = 0, max = 1, 
           maxiter = 500, 
           population = initialSmall, 
           nBits = ncol(training) - 1, 
           names = names(training)[-ncol(training)], 
           x = training[, -ncol(training)], 
           y = training$Class, 
           cntrl = ctrl, 
           keepBest = TRUE, 
           parallel = FALSE)

Here are the profiles for the three estimates of desirability (symbols sizes again indicate the subset size):

plotD.png

The first pattern to note is that all three estimates are strongly correlated. There seems to be negligible evidence of selection bias creeping in as before. The genetic algorithm converged on a fairly small subset size of 8 predictors. This included 5 of the 10 linear predictors, none of the non-linear terms, both of the terms that have an interaction effect in the model and 1 irrelevant predictor.

plotROC.png

In terms of the area under the ROC curve, the GA was able to produce pretty competitive performance:

finalDVars <- ga_D@bestBinary[[length(ga_D@bestBinary)]]
finalDFit <- qda(training[, finalDVars], training$Class)
finalDLarge <- roc(large$Class, 
                   predict(finalDFit, 
                           large[, finalDVars])$posterior[, 1],
                   levels = rev(levels(large$Class)))
finalDLarge

## 
## Call:
## roc.default(response = large$Class, predictor = predict(finalDFit,
##     large[, finalDVars])$posterior[, 1], levels = rev(levels(large$Class)))
## 
## Data: predict(finalDFit, large[, finalDVars])$posterior[, 1] in 4640 controls (large$Class Class2) < 5360 cases (large$Class Class1).
## Area under the curve: 0.93

The resampling results are slightly optimistic and the test set is slightly pessimistic. The large-sample estimate of the area under the ROC curve is 0.93, which is nearly as good as the true model (0.931) and much better than the worst-case scenario (0.52). The ROC curves are:

rocPlotFinal.png

So this approach is yielding near-optimal results.

As I did in the last post, I'm compelled to throw a wet blanket on all of this. Last time, I showed that, for these data, the RFE procedure was more effective than the GA. With the two adjustments I made to the GA, it has the performance edge. What if I were to use a classification model with built-in feature selection? One such approach is flexible discriminant analysis (FDA). FDA is a generalization of linear discriminant analysis that can produce non-linear class boundaries. It does this using a framework that generalizes the basis functions that can be used. One approach is to use Multivariate Adaptive Regression Splines (MARS) to fit the model. MARS is a regression model that has one quality in common with tree-based models: MARS chooses one or more predictors to "split" on. Multiple splits are used to model different predictors, but if a predictor was never used in a split, the class boundaries are functionally independent of that predictor. So, FDA simultaneously builds a classification model while conducting feature selection. This is computationally advantageous (fewer models are being fit), less likely to over-fit and ties the feature selection process directly to model performance.

I fit an FDA model with MARS basis functions. There are two tuning parameters. First, the degree of the model indicates the maximum number of predictors that can be used in a split. I'll evaluate only first or second degree models. The other parameter is the number of retained terms. MARS does a forward stage of splitting then, like tree models, prunes the model terms. The nprune parameter controls how far MARS can reduce the complexity of the model.

I'll use the caret package's train function again with the same cross-validation scheme:

fdaModel <- train(Class ~ ., 
                  data = training, 
                  method = "fda", 
                  tuneGrid = expand.grid(.nprune = 2:30,.degree = 1:2), 
                  metric = "ROC", 
                  trControl = ctrl)
fdaTest <- roc(testing$Class, 
               predict(fdaModel, testing, type = "prob")[, 1], 
               levels = rev(levels(testing$Class)))
fdaTest

## 
## Call:
## roc.default(response = testing$Class, predictor = predict(fdaModel,
##     testing, type = "prob")[, 1], levels = rev(levels(testing$Class)))
## 
## Data: predict(fdaModel, testing, type = "prob")[, 1] in 219 controls (testing$Class Class2) < 281 cases (testing$Class Class1).
## Area under the curve: 0.947

fdaLarge <- roc(large$Class, 
                predict(fdaModel, large, type = "prob")[, 1],
                levels = rev(levels(testing$Class)))
fdaLarge

## 
## Call:
## roc.default(response = large$Class, predictor = predict(fdaModel,
##     large, type = "prob")[, 1], levels = rev(levels(testing$Class)))
## 
## Data: predict(fdaModel, large, type = "prob")[, 1] in 4640 controls (large$Class Class2) < 5360 cases (large$Class Class1).
## Area under the curve: 0.945

The resampling profile for this model was:

fda_profile.png

The model used additive functions of the predictor data (which may result in an interpretable model). The ROC curves:

rocPlotFda.png

The results indicate that a single FDA model does better than the best possible QDA model and the model fitting process was much simpler and more straightforward for these data. This may not always be true. This simulation system has non-linear terms that QDA should not be able to model (and FDA/MARS can), so it is not a completely fair comparison.

The code for these analyses can be found here.

The next post in this series looks at another wrapper-based feature selection algorithm: particle swarm optimization.

Feature Selection Strikes Back (Part 1)

In the feature selection chapter, we describe several search procedures ("wrappers") that can be used to optimize the number of predictors. Some techniques were described in more detail than others. Although we do describe genetic algorithms and how they can be used for reducing the dimensions of the data, this is the first of a series of blog posts that look at them in practice using simulated data described in a previous post.

Genetic algorithms are optimization tools that search for the best solution by mimicking the evolution of a population. A set of predictor subsets are evaluated in terms of their model performance and the best sets combine randomly to form new subsets. In the GA lingo, each iteration has a collection of subsets (i.e. population of chromosomes) with a corresponding model performance value (i.e. their fitness values). At each step of reproduction, there is some probability of random mutations. This has the effect of randomly turning some predictors off or on in each subset. The algorithm continues for a set number of generations.
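To make that concrete, here is a bare-bones schematic of the generational loop. This is only an illustration (the GA package's implementation is more sophisticated and, for example, includes elitism); the fitness function stands in for whatever resampled performance measure is used:

ga_sketch <- function(fitness, n_bits, pop_size = 50, generations = 100,
                      p_crossover = 0.8, p_mutation = 0.1) {
  ## random initial population of 0/1 chromosomes
  pop <- matrix(rbinom(pop_size * n_bits, 1, 0.5), nrow = pop_size)
  for (gen in seq_len(generations)) {
    fit <- apply(pop, 1, fitness)
    ## fitness-proportional selection of parents
    parents <- pop[sample(pop_size, pop_size, replace = TRUE,
                          prob = fit - min(fit) + 1e-6), , drop = FALSE]
    ## single-point crossover between consecutive pairs of parents
    for (i in seq(1, pop_size - 1, by = 2)) {
      if (runif(1) < p_crossover) {
        cut <- sample(n_bits - 1, 1)
        tail1 <- parents[i, (cut + 1):n_bits]
        parents[i, (cut + 1):n_bits] <- parents[i + 1, (cut + 1):n_bits]
        parents[i + 1, (cut + 1):n_bits] <- tail1
      }
    }
    ## random mutations flip a small fraction of the bits
    flip <- matrix(runif(pop_size * n_bits) < p_mutation, nrow = pop_size)
    parents[flip] <- 1 - parents[flip]
    pop <- parents
  }
  ## return the best chromosome in the final generation
  pop[which.max(apply(pop, 1, fitness)), ]
}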

One question is how to evaluate the fitness function. There are a few options:

  • For each subset, employ the same cross-validation approach used with the full data set. We know from the literature that this will not be a good estimate of future performance because of over-fitting to the predictor set. However, can it be used to differentiate good subsets from bad subsets?
  • Use the test set. This is a poor choice since the data can no longer provide an unbiased view of model performance. Single test sets usually have more noise than resampled estimates.
  • Set aside some data from the training set for calculating model performance for a given subset. We may eventually end up over-fitting to these data, so should we randomly select a different hold-out set each time? This will add some noise to the fitness values.

In the literature, how is this usually handled? From what I've found, internal cross-validated accuracy is used, meaning that the model is cross-validated within the feature selection. For this reason, there is a high likelihood that the estimate of the model's performance will be optimistic (since it does not reflect the uncertainty induced by the search procedure).

For this example, we'll simulate 500 training set samples and add a total of 200 non-informative predictors. For the extra predictors, 100 will be a set of uncorrelated standard normal random variables while 100 will be multivariate normal with a pre-defined correlation structure. The correlated set will have variances of 1 and an auto-regressive structure (AR1). While there is no time component to this model, using this structure will simulate predictors with various levels of correlation. To do this, a function called twoClassSim is used. The code for this document is here, so you can see and use this function yourself.

Three sets of data were simulated: a training set of 500 samples, a test set of 200 samples and a very large set that will help us approximate the true error rate.

library(caret)

set.seed(468)
training <- twoClassSim(500, noiseVars = 100, 
                        corrVar = 100, corrValue = 0.75)
testing <- twoClassSim(200, noiseVars = 100, 
                       corrVar = 100, corrValue = 0.75)
large <- twoClassSim(10000, noiseVars = 100, 
                     corrVar = 100, corrValue = 0.75)

## Get the names of the truly active predictors
realVars <- names(training)
realVars <- realVars[!grepl("(Corr)|(Noise)", realVars)]

## We will use cross-validation later, so we setup the folds here so we
## can make sure all of the model fits use the same data (not really
## required, but helpful)
cvIndex <- createMultiFolds(training$Class, times = 2)
ctrl <- trainControl(method = "repeatedcv", 
                     repeats = 2, 
                     classProbs = TRUE, 
                     summaryFunction = twoClassSummary, 
                     allowParallel = FALSE, 
                     index = cvIndex)

For these data, one model that has fairly good performance is quadratic discriminant analysis (QDA). This model can generate non-linear class boundaries (i.e. quadratic patterns) and models the covariance matrix of the predictors differently for each class. This means that for two classes and p predictors, a total of p (p+1) parameters are estimated (and these are just the variance parameters). For the 215 predictors used below, that is 215 * 216 = 46,440 parameters estimated from only 500 training set samples. With a large number of non-informative predictors, this can negatively affect model performance.

If we knew the true predictor set, what would we expect in terms of performance? Although QDA has no tuning parameters, I'll use two repeats of 10-fold cross-validation to estimate the area under the ROC curve. Additionally, the AUC will also be derived from the test set and using the large sample set too.
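The code for this step isn't shown in the post, but a sketch of the "true model" fit using the objects defined above would look something like the following (the test-set version is analogous):

trueVars <- setdiff(realVars, "Class")

qdaTrue <- train(x = training[, trueVars],
                 y = training$Class,
                 method = "qda",
                 metric = "ROC",
                 trControl = ctrl)

## large-sample estimate of the AUC:
library(pROC)
roc(large$Class,
    predict(qdaTrue, large[, trueVars], type = "prob")[, 1],
    levels = rev(levels(large$Class)))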

The resampled area under the ROC curve was 0.91 and the test set estimate was 0.946. There is some difference there and the large-sample estimate of the AUC is 0.928. Overall, performance is pretty good.

Now, what happens when you use all the available predictors?

The estimates of performance for the AUC were: 0.505 (resampling), 0.492 (test set) and 0.515 (large-sample). All of these indicate that QDA tanked because of the excess variables.

The ROC curves for the large-sample predictions are:

rocPlot.png

Another nasty side-effect of having a large number of predictors (215) and a relatively small training set (500 samples) with several of the classical discriminant models is that the class probabilities become extremely polarized. We demonstrate this in the book and the same issue occurs here. The plot below shows histograms of the Class 1 probability for the large-sample set for each model. There are panels for each of the true classes.

histPlot1.png

The model with the true predictors has fairly well-calibrated class probabilities while the probabilities for the model with all the predictors are concentrated near zero and one. This can occur for both linear and quadratic discriminant analysis as well as naive Bayes models. In the book, we also show that this is independent of the amount of signal in the data.

Clearly, there is a need for some feature selection. To start, I used a genetic algorithm to maximize performance of the QDA model. The search procedure for these data used most of the defaults from the GA package:

  • 400 generations
  • an initial population of 50 chromosomes
  • a mutation probability of 10%
  • a cross-over probability of 80%
  • elitism of 2
  • the area under the ROC curve was estimated using two repeats of 10-fold cross-validation

In total, about 400 * 50 * 20 = 400,000 QDA models were evaluated. I also ran the models within each generation in parallel, which makes a good dent in the computation time.

To do this, I used a slightly modified version of the GA R package. Our changes enabled us to save the chromosome value for the best result per generation (for further analysis) and use parallel processing. The package maintainer (Luca Scrucca) has been testing these additions.

I have to define a fitness function that defines what should be maximized. I'll use caret to tune the model and return the resampled estimate of the area under the ROC curve:

## 'ind' will be a vector of 0/1 data denoting which predictors are being
## evaluated.
ROCcv <- function(ind, x, y, cntrl) {
    library(caret)
    library(MASS)
    ind <- which(ind == 1)
    ## In case no predictors are selected:
    if (length(ind) == 0) return(0)
    out <- train(x[, ind, drop = FALSE], y, 
                 method = "qda", 
                 metric = "ROC", 
                 trControl = cntrl)
    ## this function returns the resampled ROC estimate
    caret:::getTrainPerf(out)[, "TrainROC"]
}

Now, to run the algorithm with the GA package:

library(GA)
set.seed(137)
ga_resampled <- ga(type = "binary", 
                   fitness = ROCcv, 
                   min = 0, 
                   max = 1, 
                   maxiter = 400, 
                   nBits = ncol(training) - 1, 
                   names = names(training)[-ncol(training)], 
                   ## These options are passed through the ga function
                   ## and into the ROCcv function
                   x = training[, -ncol(training)], 
                   y = training$Class, 
                   cntrl = ctrl, 
                   ## These two options are not yet in the GA package.
                   keepBest = TRUE, 
                   parallel = TRUE)

The results are summarized in the image below, where the three estimates of the area under the ROC curve are shown. The size of the points is indicative of the number of predictors used in the best chromosome of each generation.

plot.png

The resampled estimate has no idea that it is embedded inside of a feature selection routine, so it does not factor in that variability. The search steadily increases the AUC until it converges. However, the test set AUC (as well as the large-sample version) initially increases but then converges to a much smaller value than the cross-validation results would lead one to believe. These two pessimistic estimates of the AUC appear to be in-line with one another although the large-sample results are slightly lower in many generations. It looks as though the model is over-fitting to the predictors and the resampling procedure is not picking up on this.

As previously mentioned, another tactic is to utilize a separate test set to measure the area under the ROC curve. If we have a lot of data, it may be a good idea to have an "evaluation" set of samples to measure performance during feature selection and keep a different (true) test set to only use at the end.

Let's sacrifice our test set for the genetic algorithm and see if this helps. The new objective function is:

## An added 'test' argument...
ROCtest <- function(ind, x, y, cntrl, test) {
    library(MASS)
    library(pROC)
    ind <- which(ind == 1)
    if (length(ind) == 0) return(0)
    modFit <- qda(x[, ind], y)
    testROC <- roc(test$Class, 
                   predict(modFit, 
                           test[, ind, drop = FALSE])$posterior[, 1],
                   levels = rev(levels(y)))
    as.vector(auc(testROC))
}

The updated call to the ga function is:

set.seed(137)
ga_test <- ga(type = "binary", 
              fitness = ROCtest, 
              min = 0, 
              max = 1, 
              maxiter = 1000, 
              nBits = ncol(training) - 1, 
              names = names(training)[-ncol(training)], 
              x = training[,-ncol(training)], 
              y = training$Class, 
              cntrl = ctrl, 
              test = testing, 
              keepBest = TRUE, 
              parallel = TRUE)

Here are the results:

plot_test.png

If this were not a simulated data set, we would only see the green curve. There is a similar pattern between the resampled ROC and the results from our evaluation set (formerly known as the test set). However, the GA is still selecting too many predictors and, as a consequence, the true performance is still pretty poor. Basically, the evaluation set is not showing the degradation of performance due to the non-informative predictors (i.e. we are over-fitting to the evaluation set).

The genetic algorithm converged on a subset size of 97 predictors. This included 7 of the 10 linear predictors, 1 of the non-linear terms and both of the terms that have an interaction effect in the model. Looking across the generations, we can see the frequency that each type of non-informative predictor was retained:

noiseVarPlot.png

Let's now fit a QDA model based on these predictors and see what the large-sample ROC curve looks like:

## The bestBinary item is a list of the best chromosomes from
## each generation. We will look at the last one and fit a QDA
## model.
finalVars <- ga_test@bestBinary[[length(ga_test@bestBinary)]]
finalFit <- qda(training[, finalVars], training$Class)
## Get the large sample results:
finalLarge <- roc(large$Class, 
                  predict(finalFit, 
                          large[, finalVars])$posterior[, 1], 
                  levels = rev(levels(large$Class)))
finalLarge

## 
## Call:
## roc.default(response = large$Class, predictor = predict(finalFit,     large[, finalVars])$posterior[, 1], levels = rev(levels(large$Class)))
## 
## Data: predict(finalFit, large[, finalVars])$posterior[, 1] in 4684 controls (large$Class Class2) < 5316 cases (large$Class Class1).
## Area under the curve: 0.622

The large-sample estimate of the area under the ROC curve is 0.622, which is not as good as the true model (0.928) but better than the worst-case scenario (0.515). The ROC curves are:

rocPlotFinal.png

In the next blog post, I'll look at other ways of improving the genetic algorithm. Before we do that though, let's make a comparison to another feature selection procedure: recursive feature elimination (RFE). RFE is basically a backwards selection procedure that uses some variable importance metric to rank the predictors. It will use the area under the ROC curve for each predictor to quantify its relative importance.
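That per-predictor AUC ranking can be computed directly with caret's filterVarImp function; for example (a quick check, not code from the post):

rocImp <- filterVarImp(x = training[, -ncol(training)],
                       y = training$Class)
## one column of AUC-based importance per class; show the top predictors:
head(rocImp[order(rocImp$Class1, decreasing = TRUE), , drop = FALSE])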

Here, all subset sizes were evaluated and the procedure was cross-validated using the same two repeats of ten-fold cross-validation. The QDA models were trained in the same manner:

## caret includes some pre-defined code for RFE, including code to do
## linear discriminant analysis (LDA). The code for LDA and QDA are
## almost identical, so we can recycle the LDA code and only change
## the model fit function (to use QDA) and the function that computes
## the model summary statistics (so that we can get the area under the 
## ROC curve):
qdaFuncs <- ldaFuncs
qdaFuncs$fit <- function(x, y, first, last, ...) {
    library(MASS)
    qda(x, y, ...)
}
qdaFuncs$summary <- twoClassSummary

qdaRfe <- rfe(x = training[, -ncol(training)], 
              y = training$Class, 
              sizes = 1:(ncol(training) - 1), 
              metric = "ROC", 
              rfeControl = rfeControl(method = "repeatedcv", 
                                      repeats = 2, 
                                      index = cvIndex, 
                                      functions = qdaFuncs))

Here, there are a maximum of 215 * 20 = 4,300 models being evaluated. In comparison to the genetic algorithm, not only is this fewer models, but many of them have smaller subset sizes than those shown in the figures above.

The potential down-side to this feature selection technique is that it is greedy; once a predictor is discarded, it is never considered again in a subset. For this reason, RFE may achieve a local optimum whereas the genetic algorithm has the ability to find a global optimum.

This approach filtered far more predictors. The profile of the area under the ROC curve (with all three estimates):

rfePlot.png

The RFE algorithm fit the final QDA model to 11 predictors, including 5 linear effects and both predictors associated with the interaction. However, it did not capture any of the non-linear terms and picked up 4 non-informative predictors. Of the noise predictors, it was not biased towards the correlated set (only 1 of the 4 were correlated). The estimates were consistent with one another and the area under the ROC curve was estimated as 0.907 using resampling, 0.893 using the test set and 0.885 using the large-sample holdout. The consistency of these values is probably due to the RFE procedure conducting the feature elimination inside of external resampling, meaning that each resample ran a separate feature elimination sequence and used held-out data to estimate performance. This prevented the type of overly optimistic estimates that were seen earlier.

The large-sample ROC curves are:

rocPlotRfe.png

So far, RFE is more effective than the genetic algorithm at sub-setting features for these data.

The next blog post will look at modifications of the genetic algorithm that will improve performance.

The R packages loaded at the time of the analysis were: base (2.15.2), caret (5.15-87), cluster (1.14.3), datasets (2.15.2), doMC (1.2.5), foreach (1.4.0), GA (1.3), ggplot2 (0.9.2.1), graphics (2.15.2), grDevices (2.15.2), iterators (1.0.6), knitr (0.8), lattice (0.20-10), latticeExtra (0.6-24), MASS (7.3-22), methods (2.15.2), multicore (0.1-7), nlme (3.1-105), plyr (1.7.1), pROC (1.5.4), RColorBrewer (1.0-5), reshape2 (1.2.1), stats (2.15.2) and utils (2.15.2)