Early draft of our "Feature Engineering and Selection" book

Kjell and I are writing another book on predictive modeling, this time focused on all the things that you can do with predictors. It's about 60% done and we'd love to get feedback. You can take a look at http://feat.engineering and provide feedback at https://github.com/topepo/FES/issues.

The current TOC is:

Introduction
Illustrative Example: Predicting Risk of Ischemic Stroke
A Review of the Predictive Modeling Process
Exploratory Visualizations
Encoding Categorical Predictors
Engineering Numeric Predictors
Detecting Interaction Effects (these later chapters are not finished yet)
Flattening Profile Data
Handling Missing Data
Feature Engineering Without Overfitting
Feature Selection

Simulated Annealing Feature Selection

As previously mentioned, caret has two new feature selection routines based on genetic algorithms (GA) and simulated annealing (SA). The help pages for the two new functions give a detailed account of the options, syntax etc.

The package already has functions to conduct feature selection using simple filters as well as recursive feature elimination (RFE). RFE can be very effective. It initially ranks the features, then removes them in sequence, starting with the least important variables. It is a greedy method since it never backtracks to reevaluate a subset. Basically, it points itself in a single direction and proceeds. The main question is how far to go in that direction.
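To make the greedy idea concrete, here is a minimal base-R sketch of backward elimination on simulated data. It is not caret's rfe: the data, the correlation-based ranking, and the single hold-out split are all assumptions made for illustration.

```r
# Toy backward elimination on simulated data (base R only).
set.seed(1)
n <- 200
x <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(x) <- paste0("x", 1:6)
# Only x1 and x2 carry signal; the rest are noise.
y <- 2 * x$x1 - 1.5 * x$x2 + rnorm(n)

# Hold out the last 50 rows to score each subset.
in_train <- seq_len(n) <= 150

# Rank once, up front, by absolute correlation with the outcome.
importance <- sapply(x[in_train, ], function(col) abs(cor(col, y[in_train])))
ranking <- names(sort(importance, decreasing = TRUE))

# Drop the least important predictor at each step and never revisit a choice.
sizes <- rev(seq_along(ranking))
rmse_by_size <- sapply(sizes, function(size) {
  keep <- ranking[seq_len(size)]
  dat <- data.frame(y = y, x[, keep, drop = FALSE])
  fit <- lm(y ~ ., data = dat[in_train, ])
  pred <- predict(fit, newdata = dat[!in_train, ])
  sqrt(mean((dat$y[!in_train] - pred)^2))
})
names(rmse_by_size) <- sizes
rmse_by_size
```

The only decision left after the ranking is where to stop along that one path, which is the "how far to go" question.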

These two new functions, gafs and safs, conduct a global search over the feature set. This means that the direction of the search is constantly changing and may re-evaluate solutions that are similar to (or the same as) previously evaluated feature sets. This is good and bad. If a feature set is not easily "rankable" then it should do better than RFE. However, it will require a larger number of model fits and has a greater chance of overfitting the features to the training set.

I won't go over the details of GAs or SA. There are a lot of references on these and you can always check out Chapter 19 of our book. Also, since there were two previous blog posts about genetic algorithms, I'll focus on simulated annealing in this one.

To demonstrate I'll use the Alzheimer's disease data from Section 19.6 (page 502). It contains biomarker and patient demographic data that are being used to predict who will eventually have cognitive impairment. The original data contained instances for 333 subjects and information on 133 predictors.

In the book, we fit a number of different models to these data and conducted feature selection. Using RFE, linear discriminant analysis, generalized linear models and naive Bayes seemed to benefit from removing non-informative predictors. However, K-nearest neighbors and support vector machines had worse performance as RFE proceeded to remove predictors.

With KNN, there is no model-based method for measuring predictor importance. In these cases, rfe will use the area under the ROC curve for each individual predictor for ranking. Using that approach, here is the plot of KNN performance over subset sizes:

The values shown on the plot are the average area under the ROC curve measured using 5 repeats of 10-fold cross-validation. This implies that the full set of predictors is required. I should also note that, even with the entire predictor set, the KNN model did worse than LDA and random forests.

While RFE couldn't find a better subset of predictors, that does not mean that one doesn't exist. Can simulated annealing do better?

The code to load and split the data is in the AppliedPredictiveModeling package and you can find the markdown for this blog post linked at the bottom of this post. We have a data frame called training that has all the data used to fit the models. The outcome is a factor called Class and the predictor names are in a character vector called predVars. First, let's define the resampling folds.

Let's define the control object that will specify many of the important options for the search. There are a lot of details to spell out, such as how to train the model, how new samples are predicted etc. There is a pre-made list called caretSA that contains a starting template.

names(caretSA)

[1] "fit"            "pred"           "fitness_intern" "fitness_extern"
[5] "initial"        "perturb"        "prob"           "selectIter"

These functions are detailed on the package's web page. We will make a copy of this and change the method of measuring performance using hold-out samples:

knnSA <- caretSA
knnSA$fitness_extern <- twoClassSummary

This will compute the area under the ROC curve and the sensitivity and specificity using the default 50% probability cutoff. These functions will be passed into the safs function (see below).

safs will conduct the SA search inside a resampling wrapper as defined by the index object created below (50 total times). For each subset, it computes an out-of-sample (i.e. "external") performance estimate to make sure that we are not overfitting to the feature set. This code should reproduce the same folds as those used in the book:

library(caret)
set.seed(104)
index <- createMultiFolds(training$Class, times = 5)
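For readers wondering where the "50 total times" comes from, here is a rough base-R sketch of 10-fold cross-validation repeated 5 times. The function make_repeated_folds is hypothetical and, unlike createMultiFolds, does not stratify by class; it only shows the shape of the result (a named list of row indices used for fitting).

```r
# Hypothetical helper: repeated k-fold indices without stratification.
make_repeated_folds <- function(n, k = 10, times = 5) {
  folds <- list()
  for (r in seq_len(times)) {
    # Randomly assign each row to one of k folds.
    assignment <- sample(rep(seq_len(k), length.out = n))
    for (f in seq_len(k)) {
      # Store the rows used for fitting; the remaining rows are held out.
      folds[[sprintf("Fold%02d.Rep%d", f, r)]] <- which(assignment != f)
    }
  }
  folds
}

set.seed(104)
toy_index <- make_repeated_folds(n = 267)
length(toy_index)
```

Each of the 50 list elements defines one external resample, so the whole SA search is repeated once per element.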

Here is the control object that we will use:

ctrl <- safsControl(functions = knnSA,
                    method = "repeatedcv",
                    repeats = 5,
                    ## Here are the exact folds to use:
                    index = index,
                    ## What should we optimize? 
                    metric = c(internal = "ROC",
                               external = "ROC"),
                    maximize = c(internal = TRUE,
                                 external = TRUE),
                    improve = 25,
                    allowParallel = TRUE)

The option improve specifies how many SA iterations can occur without improvement. Here, the search restarts at the last known improvement after 25 iterations where the internal fitness value has not improved. The allowParallel option tells the package that it can parallelize the external resampling loops (if parallel processing is available).
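The restart rule is easier to see in a stripped-down search. The following base-R sketch is not the caret implementation: the fitness function, the one-flip perturbation, and the acceptance probability are all invented for illustration. Only the idea of restarting from the best known subset after 25 iterations without improvement mirrors the improve option.

```r
# Toy simulated annealing over feature subsets with a restart rule.
# The made-up fitness rewards the first 5 of 20 features and lightly
# penalizes subset size.
set.seed(2)
n_features <- 20
toy_fitness <- function(subset) sum(subset[1:5]) - 0.1 * sum(subset)

# Start from a random 20% of the features.
current <- rep(FALSE, n_features)
current[sample(n_features, size = 4)] <- TRUE
current_fit <- toy_fitness(current)
initial_fit <- current_fit
best <- current
best_fit <- current_fit
improve <- 25        # iterations allowed without a new best
since_improved <- 0
restarts <- 0

for (iter in 1:500) {
  # Perturb: flip the in/out status of one randomly chosen feature.
  candidate <- current
  flip <- sample(n_features, 1)
  candidate[flip] <- !candidate[flip]
  candidate_fit <- toy_fitness(candidate)

  # Always accept improvements; accept worse subsets with a probability
  # that shrinks as the iteration count grows (an assumed rule).
  accept_prob <- if (candidate_fit >= current_fit) 1 else
    exp((candidate_fit - current_fit) * iter / 10)
  if (runif(1) < accept_prob) {
    current <- candidate
    current_fit <- candidate_fit
  }

  if (current_fit > best_fit) {
    best <- current
    best_fit <- current_fit
    since_improved <- 0
  } else {
    since_improved <- since_improved + 1
  }

  # Restart from the best subset seen so far after a long dry spell.
  if (since_improved >= improve) {
    current <- best
    current_fit <- best_fit
    since_improved <- 0
    restarts <- restarts + 1
  }
}

which(best)
best_fit
restarts
```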

The metric argument above is a little different than it would be for train or rfe. SA needs an internal estimate of performance to help guide the search. To that end, we will also use cross-validation to tune the KNN model and measure performance. Normally, we would use the train function to do this. For example, using the full set of predictors, this might work to tune the model:

train(x = training[, predVars],
      y = training$Class,
      method = "knn",
      metric = "ROC",
      tuneLength = 20,
      preProc = c("center", "scale"),
      trControl = trainControl(method = "repeatedcv",
                               ## Produce class prob predictions
                               classProbs = TRUE,
                               summaryFunction = twoClassSummary))

The beauty of using the caretSA functions is that the ... are available.

A short explanation of why you should care...

R has this great feature where you can seamlessly pass arguments between functions. Suppose we have this function that computes the means of the columns in a matrix or data frame:

mean_by_column <- function(x, ...) {
  results <- rep(NA, ncol(x))
  for(i in 1:ncol(x)) results[i] <- mean(x[, i], ...)
  results
  }

(aside: don't loop over columns with for. See ?apply, or ?colMeans, instead)

There might be some options to the mean function that we want to pass in but those options might change for different applications. Rather than making different versions of the function for different option combinations, any argument that we pass to mean_by_column that is not one of its arguments (x, in this case) is passed to wherever the three dots appear inside the function. For the function above, they go to mean. Suppose there are missing values in x:

example <- matrix(runif(100), ncol = 5)
example[1, 5] <- NA
example[16, 1] <- NA
mean_by_column(example)

[1]     NA 0.4922 0.4704 0.5381     NA

mean has an option called na.rm that will compute the mean on the complete data. Now, we can pass this into mean_by_column even though this is not one of its options.

mean_by_column(example, na.rm = TRUE)

[1] 0.5584 0.4922 0.4704 0.5381 0.4658
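For completeness, the loop-free alternatives mentioned in the aside forward na.rm in the same way. This small sketch rebuilds a similar matrix (with its own seed, so the numbers will differ from those above) and checks that apply and colMeans agree.

```r
set.seed(3)
example <- matrix(runif(100), ncol = 5)
example[1, 5] <- NA
example[16, 1] <- NA

# apply() hands na.rm to mean() through its own `...`;
# colMeans() takes na.rm as a regular argument.
via_apply <- apply(example, 2, mean, na.rm = TRUE)
via_colMeans <- colMeans(example, na.rm = TRUE)
all.equal(via_apply, via_colMeans)
```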

Here's why this is relevant. caretSA$fit uses train to fit and possibly tune the model.

caretSA$fit

function (x, y, lev = NULL, last = FALSE, ...) 
train(x, y, ...)

Any options that we pass to safs that are not x, y, iters, differences, or safsControl will be passed to train.

(Even further, any option passed to safs that isn't an option to train gets passed down one more time to the underlying fit function. A good example of this is using importance = TRUE with random forest models. )
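The same mechanism can be chained through several layers. In this sketch the three functions are hypothetical stand-ins for safs, train, and the underlying model function; each one consumes the arguments it recognizes and hands the rest down through the dots.

```r
# Hypothetical stand-in for the underlying fit function.
inner_model <- function(x, trim = 0, ...) mean(x, trim = trim)

# Hypothetical stand-in for train: `method` is consumed here and
# everything else travels further down.
middle_train <- function(x, method = "mean", ...) {
  inner_model(x, ...)
}

# Hypothetical stand-in for safs: `iters` is consumed here while
# `method` and `trim` are passed along.
outer_search <- function(x, iters = 1, ...) {
  middle_train(x, ...)
}

values <- c(1, 2, 3, 4, 100)
outer_search(values)
outer_search(values, iters = 10, method = "mean", trim = 0.2)
```

The trim argument is never named by the two outer functions, yet it reaches mean and changes the answer.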

So, putting it all together:

set.seed(721)
knn_sa <- safs(x = training[, predVars],
               y = training$Class,
               iters = 500,
               safsControl = ctrl,

               ## Now we pass options to `train` via "knnSA":               
               method = "knn",
               metric = "ROC",
               tuneLength = 20,
               preProc = c("center", "scale"),
               trControl = trainControl(method = "repeatedcv",
                                        repeats = 2,
                                        classProbs = TRUE,
                                        summaryFunction = twoClassSummary,
                                        allowParallel = FALSE))

To recap:

the SA is conducted many times inside of resampling to get an external estimate of performance.
inside of this external resampling loop, the KNN model is tuned using another, internal resampling procedure.
the area under the ROC curve is used to guide the search (internally) and to know if the SA has overfit to the features (externally)
in the code above, when safs is called with other options (e.g. method = "knn"),
- safs passes the method, metric, tuneLength, preProc, and trControl options to caretSA$fit

After external resampling, the optimal number of search iterations is determined and one last SA is run using all of the training data.

Needless to say, this executes a lot of KNN models. When I ran this, I used parallel processing to speed things up using the doMC package.
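A back-of-the-envelope count shows why. The arithmetic below is a rough sketch that assumes every SA iteration calls train once, that train fits each of the 20 tuning values on each internal resample (10 folds, the trainControl default, repeated twice) and then refits once. It ignores the initial subset and the final search on the full training set.

```r
external_resamples <- 10 * 5   # 10-fold CV repeated 5 times
sa_iterations <- 500           # iters = 500
internal_resamples <- 10 * 2   # 10 folds x 2 repeats inside train
tuning_values <- 20            # tuneLength = 20

# Fits per call to train: every tuning value on every internal
# resample, plus one final fit with the chosen number of neighbors.
fits_per_train_call <- internal_resamples * tuning_values + 1

external_fits <- external_resamples * sa_iterations * fits_per_train_call
format(external_fits, big.mark = ",")
```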

Here are the results of the search:

knn_sa

Simulated Annealing Feature Selection

267 samples
132 predictors
2 classes: 'Impaired', 'Control' 

Maximum search iterations: 500 
Restart after 25 iterations without improvement (15.6 restarts on average)

Internal performance values: ROC, Sens, Spec
Subset selection driven to maximize internal ROC 

External performance values: ROC, Sens, Spec
Best iteration chose by maximizing external ROC 
External resampling method: Cross-Validated (10 fold, repeated 5 times) 

During resampling:
  * the top 5 selected variables (out of a possible 132):
    Ab_42 (96%), tau (92%), Cystatin_C (82%), NT_proBNP (82%), VEGF (82%)
  * on average, 60 variables were selected (min = 45, max = 75)

In the final search using the entire training set:
   * 59 features selected at iteration 488 including:
     Alpha_1_Antitrypsin, Alpha_1_Microglobulin, Alpha_2_Macroglobulin, Angiopoietin_2_ANG_2, Apolipoprotein_E ... 
   * external performance at this iteration is

       ROC       Sens       Spec 
     0.852      0.198      0.987

The algorithm automatically chose the subset created at iteration 488 of the SA (based on the external ROC) which contained 59 out of 132 predictors.

We can also plot the performance values over iterations using the plot function. By default, this uses the ggplot2 package, so we can add a theme at the end of the call:

Each of the data points for the external fitness is an average of the 50 resampled ROC values. The most improvement was found in the first 200 iterations. The internal estimate is generally more optimistic than the external estimate. It also tends to increase while the external estimate is relatively flat, indicating some overfitting. The plot above indicates that fewer iterations would probably give us equivalent performance. Instead of repeating the SA with fewer iterations, the update function can be used to pick a different subset size.

Let's compare the RFE and SA profiles. For RFE and SA, the ROC values are averaged over the 50 resamples. In the case of SA, the number of predictors is also an average. For example, at iteration 100 of the SA, here is the subset size distribution across the 50 resamples:

Here are superimposed smoothed trend lines for the resampling profiles of each search method:

Recall that the same cross-validation folds were used for SA and RFE, so this is an apples-to-apples comparison. SA searched a smaller range of subset sizes over the iterations in comparison to RFE. The search here starts with a random 20% of the possible features as the initial subset; the subset size tended to increase as the search continued and then stabilized at about 55.

How does this predictor subset perform on the test data?

library(pROC)
roc(testing$Class,
    predict(knn_sa, testing)$Impaired,
    levels = rev(levels(testing$Class)))

Call:
roc.default(response = testing$Class, 
            predictor = predict(knn_sa, testing)$Impaired, 
            levels = rev(levels(testing$Class)))

Data: predict(knn_sa, testing)$Impaired in 48 controls 
      (testing$Class Control) <  18 cases (testing$Class 
      Impaired).
      
Area under the curve: 0.848

This is better than the test set results for the RFE procedure. Note that the test set AUC is much more in line with the external estimate of the area under the ROC curve. The bad news is that we evaluated many more models than the RFE procedure and the SA process was slightly more than 11-fold slower than RFE to complete. Good things take time. Here is a parallel-coordinate plot of the individual resampling results, matched by fold:

The difference is statistically significant too:

summary(diff(rs, metric = "ROC"))

Call:
summary.diff.resamples(object = diff(rs, metric = "ROC"))

p-value adjustment: bonferroni 
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

ROC 
    RFE      SA     
RFE          -0.0504
SA  8.02e-06
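Because both methods were resampled on identical folds, the comparison is a paired one. The sketch below illustrates that idea with base R's t.test on made-up AUC values; the numbers are simulated for illustration only and are not the results above.

```r
# Simulated (not real) resampled AUC values on the same 50 folds.
set.seed(4)
fold_effect <- rnorm(50, sd = 0.03)   # shared fold-to-fold variation
rfe_roc <- 0.80 + fold_effect + rnorm(50, sd = 0.02)
sa_roc  <- 0.85 + fold_effect + rnorm(50, sd = 0.02)

# Matching by fold removes the shared variation, so test the
# per-fold differences rather than the two groups separately.
paired <- t.test(rfe_roc, sa_roc, paired = TRUE)
paired$estimate   # mean of (RFE - SA)
paired$p.value
```

Pairing matters because the fold-to-fold variation is common to both methods and cancels in the differences.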

The genetic algorithm code in gafs has very similar syntax to safs and also has pre-made functions.

The knitr file for these analyses can be found here.

Feature Selection 3 - Swarm Mentality

"Bees don't swarm in a mango grove for nothing. Where can you see a wisp of smoke without a fire?" - Hāla Sātavāhana

In the last two posts, genetic algorithms were used as feature wrappers to search for more effective subsets of predictors. Here, I will do the same with another type of search algorithm: particle swarm optimization.

Like genetic algorithms, this search procedure is motivated by a natural phenomenon, such as the movements of bird flocks. An excellent reference for this technique is Poli et al (2007). The methodology was originally developed for optimizing real valued parameters, but was later adapted for discrete optimization by Kennedy and Eberhart (1997).

The optimization is initiated with multiple configurations (i.e. multiple particles). In our case, the particles will be different predictor subsets. For now, let's stick with the parameters being real-valued variables. A particular value of a particle is taken to be its position. In addition to a position, each particle has an associated velocity. For the first iteration, these are based on random numbers.

Each particle produces a fitness value. As with genetic algorithms, this is some measure of model fit. The next candidate set of predictors that a particle evaluates is based on its last position and its current velocity.

A swarm of particles is evaluated at once and the location of the best particle is determined. As the velocity of each particle is updated, the update is a function of the:

previous velocity,
last position and
the position of the best particle

There are other parameters of the search procedure, such as the number of particles or how much relative weight the positions of the individual and best particle receive when determining the next candidate point, but this is the basic algorithm in a nutshell.

As an example, consider optimizing the Rosenbrock function with two real-valued variables (A and B):

fitness = 100*(B - A^2)^2 + (A - 1)^2

The best value is at (A = 1, B = 1). The movie below shows a particle swarm optimization using 100 iterations. The predicted best (solid white dot) is consistently in the neighborhood of the optimum value at around 50 iterations. You may need to refresh your browser to re-start the animation.
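For readers who want to experiment, here is a minimal base-R sketch of a real-valued particle swarm on this function. It is not the pso package code: the swarm size, starting ranges, and the three weights are assumed values, and the update also tracks each particle's own best position, which is a common variant.

```r
rosenbrock <- function(p) 100 * (p[2] - p[1]^2)^2 + (p[1] - 1)^2

set.seed(5)
n_particles <- 30
# Positions (A, B) and velocities start from random numbers.
pos <- matrix(runif(n_particles * 2, min = -2, max = 2), ncol = 2)
vel <- matrix(runif(n_particles * 2, min = -1, max = 1), ncol = 2)
fit <- apply(pos, 1, rosenbrock)

own_best <- pos              # best position each particle has visited
own_best_fit <- fit
swarm_best <- pos[which.min(fit), ]
swarm_best_fit <- min(fit)
history <- numeric(100)

# Assumed weights: inertia, pull toward own best, pull toward swarm best.
w <- 0.7
c_own <- 1.5
c_swarm <- 1.5

for (iter in 1:100) {
  r_own <- matrix(runif(n_particles * 2), ncol = 2)
  r_swarm <- matrix(runif(n_particles * 2), ncol = 2)
  swarm_mat <- matrix(swarm_best, nrow = n_particles, ncol = 2, byrow = TRUE)
  # The new velocity depends on the previous velocity, the particle's
  # position, and the position of the best particle.
  vel <- w * vel + c_own * r_own * (own_best - pos) +
    c_swarm * r_swarm * (swarm_mat - pos)
  pos <- pos + vel
  fit <- apply(pos, 1, rosenbrock)

  # Update each particle's personal best, then the swarm's best.
  improved <- fit < own_best_fit
  own_best[improved, ] <- pos[improved, ]
  own_best_fit[improved] <- fit[improved]
  if (min(fit) < swarm_best_fit) {
    swarm_best <- pos[which.min(fit), ]
    swarm_best_fit <- min(fit)
  }
  history[iter] <- swarm_best_fit
}

swarm_best
swarm_best_fit
```

Plotting history against the iteration number gives a quick view of how fast the swarm's best value settles.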

When searching for subsets, the quantities that we search over are binary (i.e. the predictor is used or excluded from the model). The description above implies that the position is a real-valued quantity. If the positions are centered around zero, Kennedy and Eberhart (1997) suggested using a sigmoidal function to translate this value to be between zero and one. A uniform random number is used to determine the binary version of the position that is evaluated. Other strategies have been proposed, including the application of a simple threshold to the translated position (i.e. if the translated position is above 0.5, include the predictor).
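Both conversion rules take only a few lines of base R. The position values in this sketch are arbitrary, and plogis stands in for the sigmoidal function.

```r
set.seed(6)
position <- c(-2, -0.5, 0.1, 3)    # real-valued, centered around zero
squashed <- plogis(position)       # logistic transform into (0, 1)

# Random rule: include a predictor when a uniform draw falls below
# its squashed value.
random_bits <- as.integer(runif(length(position)) < squashed)

# Threshold rule: include a predictor when the squashed value
# exceeds 0.5.
threshold_bits <- as.integer(squashed > 0.5)

random_bits
threshold_bits
```

The threshold rule is deterministic for a given position, while the random rule can include or exclude the same predictor on different evaluations.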

R has the pso package that implements this algorithm. It does not work for the discrete optimization that we need for feature selection. Since it's licensed under the GPL, I took the code and removed the parts specific to real-valued optimization. That code is linked at the bottom of the page. I structured it to be similar to the R code for genetic algorithms. One input into the modified pso function is a list that has modules for fitting the model, generating predictions, evaluating the fitness function and so on. I've made some changes so that each particle can return multiple values and will treat the first as the fitness function. I'll fit the same QDA model as before to the same simulated data set. First, here are the QDA functions:

qda_pso <- list(
  fit = function(x, y, ...)
    {
    ## Check to see if the subset has no members
    if(ncol(x) > 0)
      {
      mod <- train(x, y, "qda", 
                   metric = "ROC",
                   trControl = trainControl(method = "repeatedcv", 
                                            repeats = 1,
                                            summaryFunction = twoClassSummary,
                                            classProbs = TRUE))
      } else mod <- nullModel(y = y) ## A model with no predictors 
    mod
    },
  fitness = function(object, x, y)
    {
    if(ncol(x) > 0)
      {
      testROC <- roc(y, predict(object, x, type = "prob")[,1], 
                     levels = rev(levels(y)))
      largeROC <- roc(large$Class, 
                      predict(object, 
                              large[,names(x),drop = FALSE], 
                              type = "prob")[,1], 
                      levels = rev(levels(y)))  
      out <- c(Resampling = caret:::getTrainPerf(object)[, "TrainROC"],
               Test = as.vector(auc(testROC)), 
               Large_Sample = as.vector(auc(largeROC)),
               Size = ncol(x))
      } else {
        out <- c(Resampling = .5,
                 Test = .5, 
                 Large_Sample = .5,
                 Size = ncol(x))
        print(out)
        }
    out
    },
  predict = function(object, x)
    {
    library(caret)
    predict(object, newdata = x)
    }
  )

Here is the familiar code to generate the simulated data:

set.seed(468)
training <- twoClassSim(  500, noiseVars = 100, 
                        corrVar = 100, corrValue = .75)
testing  <- twoClassSim(  500, noiseVars = 100, 
                        corrVar = 100, corrValue = .75)
large    <- twoClassSim(10000, noiseVars = 100, 
                        corrVar = 100, corrValue = .75)
realVars <- names(training)
realVars <- realVars[!grepl("(Corr)|(Noise)", realVars)]
cvIndex <- createMultiFolds(training$Class, times = 2)
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 2,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     ## We will parallel process within each PSO
                     ## iteration, so don't double-up the number
                     ## of sub-processes
                     allowParallel = FALSE,
                     index = cvIndex)

To run the optimization, the code will be similar to the GA code used in the last two posts:

set.seed(235)
psoModel <- psofs(x = training[,-ncol(training)],
                  y = training$Class,
                  iterations = 200,
                  functions = qda_pso,
                  verbose = FALSE,
                  ## The PSO code uses foreach to parallelize
                  parallel = TRUE,
                  ## These are passed to the fitness function
                  tx = testing[,-ncol(testing)],
                  ty = testing$Class)

Since this is simulated data, we can evaluate how well the search went using estimates of the fitness (the area under the ROC curve) calculated using different data sets: resampling, a test set of 500 samples and a large set of 10,000 samples that we use to approximate the truth.

The swarm did not consistently move to smaller subsets and, as with the original GA, it overfits to the predictors. This is demonstrated by the increase in the resampled fitness estimates and mediocre test/large sample estimates:

One tactic that helped the GA was to bias the algorithm towards smaller subsets. For PSO, this can be accomplished during the conversion from real valued positions to binary encodings. The previous code used a value of 1 for a predictor if the "squashed" version (i.e. after applying a sigmoidal function) was greater than 0.5. We can bias the subsets by increasing the threshold. This should start the process with smaller subsets and, since we raise the criteria for activating a predictor, only increase the subset size if there is a considerable increase in the fitness function. Here is the code for that conversion and another run of the PSO:

smallerSubsets <- function(x)
{  
  ## 'x' is a matrix of positions centered around zero. The
  ## code below uses a logistic function to "squash" them to
  ## be between (0, 1).
  binary <- binomial()$linkinv(x)
  ## We could use random numbers to translate between the
  ## squashed version of the position and the binary version.
  ## In the last application of PSO, I used a simple threshold of
  ## 0.5. Now, increase the threshold a little to force the
  ## subsets to be smaller.
  binary <- ifelse(binary >= .7, 1, 0)
  ## 'x' has particles in columns and predictors in rows,
  ## so use apply() to find the selected predictors per particle
  apply(binary, 2, function(x) which(x == 1))
}
set.seed(235)
psoSmallModel <- psofs(x = training[,-ncol(training)],
                       y = training$Class,
                       iterations = 200,
                       convert = smallerSubsets,
                       functions = qda_pso,
                       verbose = FALSE,
                       parallel = TRUE,
                       tx = testing[,-ncol(testing)],
                       ty = testing$Class)

The results are much better:

The large-sample and test set fitness values agree with the resampled versions. A smoothed version of the number of predictors over iterations shows that the search is driving down the number of predictors and keeping them low:

And here are the large-sample ROC curves so far:

For the simulated data, the GA and PSO procedures effectively reduced the number of predictors. After reading the last few posts, one could easily remark that I was only able to do this since I knew what the answers should be. If the optimal subset size was not small, would these approaches have been effective? The next (and final) post in this series will apply these methods to a real data set.

The code for these analyses is here and the modified PSO code is here. Thanks to Claus Bendtsen for the original pso code and for answering my email.

Applied Predictive Modeling