Feature Engineering versus Feature Extraction: Game On!

"Feature engineering" is a fancy term for making sure that your predictors are encoded in the model in a manner that makes it as easy as possible for the model to achieve good performance. For example, if your have a date field as a predictor and there are larger differences in response for the weekends versus the weekdays, then encoding the date in this way makes it easier to achieve good results.

However, this depends on a lot of things.

First, it is model-dependent. For example, trees might have trouble with a classification data set if the class boundary is a diagonal line since their class boundaries are made using orthogonal slices of the data (oblique trees excepted).

Second, the process of predictor encoding benefits the most from subject-specific knowledge of the problem. In my example above, you need to know the patterns of your data to improve the format of the predictor. Feature engineering is very different in image processing, information retrieval, RNA expression profiling, etc. You need to know something about the problem and your particular data set to do it well.

Here is some training set data where two predictors are used to model a two-class system (I'll unblind the data at the end):

There is also a corresponding test set that we will use below.

There are some observations that we can make:

  • The data are highly correlated (correlation = 0.85)
  • Each predictor appears to be fairly right-skewed
  • They appear to be informative in the sense that you might be able to draw a diagonal line to differentiate the classes

Depending on which model we choose, the between-predictor correlation might bother us. Also, we should look to see if the individual predictors are important. To measure this, we'll use the area under the ROC curve on the predictor data directly.
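
caret's filterVarImp function computes this type of ROC-based score for each predictor; a quick sketch, assuming the example_train data frame shown below:

library(caret)
## area under the ROC curve for each predictor, taken one at a time
filterVarImp(x = example_train[, 1:2], y = example_train$Class)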

Here are univariate box-plots of each predictor (on the log scale):

There is some mild differentiation between the classes but a significant amount of overlap in the boxes. The area under the ROC curves for predictor A and B are 0.61 and 0.59, respectively. Not so fantastic.

What can we do? Principal component analysis (PCA) is a pre-processing method that does a rotation of the predictor data in a manner that creates new synthetic predictors (i.e. the principal components or PC's). This is conducted in a way where the first component accounts for the majority of the (linear) variation or information in the predictor data. The second component does the same for any information in the data that remains after extracting the first component and so on. For these data, there are two possible components (since there are only two predictors). Using PCA in this manner is typically called feature extraction.

Let's compute the components:

> library(caret)
> head(example_train)
   PredictorA PredictorB Class
2    3278.726  154.89876   One
3    1727.410   84.56460   Two
4    1194.932  101.09107   One
12   1027.222   68.71062   Two
15   1035.608   73.40559   One
16   1433.918   79.47569   One
> pca_pp <- preProcess(example_train[, 1:2],
+                      method = c("center", "scale", "pca"))
> pca_pp

Call:
preProcess.default(x = example_train[, 1:2], method = c("center",
 "scale", "pca"))

Created from 1009 samples and 2 variables
Pre-processing: centered, scaled, principal component signal extraction 

PCA needed 2 components to capture 95 percent of the variance
> train_pc <- predict(pca_pp, example_train[, 1:2])
> test_pc <- predict(pca_pp, example_test[, 1:2])
> head(test_pc, 4)
        PC1         PC2
1 0.8420447  0.07284802
5 0.2189168  0.04568417
6 1.2074404 -0.21040558
7 1.1794578 -0.20980371

Note that we computed all the necessary information from the training set and applied those calculations to the test set. What do the test set data look like?

These are the test set predictors simply rotated.

PCA is unsupervised, meaning that the outcome classes are not considered when the calculations are done. Here, the areas under the ROC curves are 0.5 for the first component and 0.81 for the second. These results jibe with the plot above; the first component shows a random mixture of the classes while the second seems to separate the classes well. Box plots of the two components reflect the same thing:

There is much more separation in the second component.

This is interesting. First, despite PCA being unsupervised, it managed to find a new predictor that differentiates the classes. Second, it is the last component that is most important to the classes but the least important to the predictors. It is often said that PCA doesn't guarantee that any of the components will be predictive, and this is true. Here, we got lucky and it did produce something good.

However, imagine that there are hundreds of predictors. We may only need to use the first X components to capture the majority of the information in the predictors and, in doing so, discard the later components. In this example, the first component accounts for 92.4% of the variation in the predictors; a similar strategy here would probably have discarded the most effective new predictor.
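
As an aside, the split of the variation can be checked directly with base R's prcomp, which should give essentially the same rotation as the centered-and-scaled preProcess step shown above:

## a sketch: proportion of variance captured by each component
pca_obj <- prcomp(example_train[, 1:2], center = TRUE, scale. = TRUE)
summary(pca_obj)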

How does the idea of feature engineering come into play here? Given these two predictors and seeing the first scatterplot shown above, one of the first things that occurs to me is "there are two correlated, positive, skewed predictors that appear to act in tandem to differentiate the classes". The second thing that occurs to me is "take the ratio". What do those data look like?

The corresponding area under the ROC curve is 0.8, which is nearly as good as the second component. A simple transformation based on visually exploring the data can do just as good a job as an unbiased empirical algorithm.
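
A sketch of that transformation and its AUC, using the pROC package (the ratio_train object name is just a placeholder):

library(pROC)
## engineer the ratio feature and estimate its area under the ROC curve
ratio_train <- example_train$PredictorA / example_train$PredictorB
roc(example_train$Class, ratio_train)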

These data are from the cell segmentation experiment of Hill et al., and predictor A is the "surface of a sphere created by rotating the equivalent circle about its diameter" (labeled as EqSphereAreaCh1 in the data) and predictor B is the perimeter of the cell nucleus (PerimCh1). A specialist in high content screening might naturally take the ratio of these two features of cells because it makes good scientific sense (I am not that person). In the context of the problem, their intuition should drive the feature engineering process.

However, in defense of an algorithm such as PCA, the machine has some benefit. In total, there are almost sixty predictors in these data whose features are just as arcane as EqSphereAreaCh1. My personal favorite is the "Haralick texture measurement of the spatial arrangement of pixels based on the co-occurrence matrix". Look that one up some time. The point is that there are often too many features to engineer and they might be completely unintuitive from the start.

Another plus for feature extraction is related to correlation. The predictors in this particular data set tend to have high between-predictor correlations and for good reasons. For example, there are many different ways to quantify the eccentricity of a cell (i.e. how elongated it is). Also, the size of a cell's nucleus is probably correlated with the size of the overall cell and so on. PCA can mitigate the effect of these correlations in one fell swoop. An approach of manually taking ratios of many predictors seems less likely to be effective and would take more time.

Last year, in one of the R&D groups that I support, there was a bit of a war being waged between the scientists who focused on biased analysis (i.e. we model what we know) versus the unbiased crowd (i.e. just let the machine figure it out). I fit somewhere in-between and believe that there is a feedback loop between the two. The machine can flag potentially new and interesting features that, once explored, become part of the standard book of "known stuff".

Simulated Annealing Feature Selection

As previously mentioned, caret has two new feature selection routines based on genetic algorithms (GA) and simulated annealing (SA). The help pages for the two new functions give a detailed account of the options, syntax etc.

The package already has functions to conduct feature selection using simple filters as well as recursive feature elimination (RFE). RFE can be very effective. It initially ranks the features, then removes them in sequence, starting with the least important variables. It is a greedy method since it never backtracks to reevaluate a subset. Basically, it points itself in a single direction and proceeds. The main question is how far to go in that direction.

These two new functions, gafs and safs, conduct a global search over the feature set. This means that the direction of the search is constantly changing and may re-evaluate solutions that are similar to (or the same as) previously evaluated feature sets. This is good and bad. If a feature set is not easily "rankable" then it should do better than RFE. However, it will require a larger number of model fits and has a greater chance of overfitting the features to the training set.

I won't go over the details of GAs or SA. There are a lot of references on these and you can always check out Chapter 19 of our book. Also, since there were two previous blog posts about genetic algorithms, I'll focus on simulated annealing in this one.

To demonstrate I'll use the Alzheimer's disease data from Section 19.6 (page 502). It contains biomarker and patient demographic data that are being used to predict who will eventually have cognitive impairment. The original data contained instances for 333 subjects and information on 133 predictors.

In the book, we fit a number of different models to these data and conducted feature selection. Using RFE, linear discriminant analysis, generalized linear models and naive Bayes seemed to benefit from removing non-informative predictors. However, K-nearest neighbors and support vector machines had worse performance as RFE proceeded to remove predictors.

With KNN, there is no model-based method for measuring predictor importance. In these cases, rfe will use the area under the ROC curve for each individual predictor for ranking. Using that approach, here is the plot of KNN performance over subset sizes:

The values shown on the plot are the average area under the ROC curve measured using 5 repeats of 10-fold cross-validation. This implies that the full set of predictors is required. I should also note that, even with the entire predictor set, the KNN model did worse than LDA and random forests.
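
For reference, a hedged sketch of the kind of rfe call behind that plot; the subset sizes are assumptions, and training, predVars, and index are the objects described below:

## copy the generic RFE helper functions and summarize hold-outs with
## ROC/sensitivity/specificity instead of accuracy
caretFuncsROC <- caretFuncs
caretFuncsROC$summary <- twoClassSummary

knn_rfe <- rfe(x = training[, predVars],
               y = training$Class,
               sizes = c(10, 20, 50, 100),
               metric = "ROC",
               rfeControl = rfeControl(functions = caretFuncsROC,
                                       method = "repeatedcv",
                                       repeats = 5,
                                       index = index),
               ## options passed through to train:
               method = "knn",
               preProc = c("center", "scale"),
               tuneLength = 20,
               trControl = trainControl(method = "repeatedcv",
                                        repeats = 2,
                                        classProbs = TRUE,
                                        summaryFunction = twoClassSummary,
                                        allowParallel = FALSE))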

While RFE couldn't find a better subset of predictors, that does not mean that one doesn't exist. Can simulated annealing do better?

The code to load and split the data is in the AppliedPredictiveModeling package and you can find the markdown for this blog post linked at the bottom of this post. We have a data frame called training that has all the data used to fit the models. The outcome is a factor called Class and the predictor names are in a character vector called predVars. We will define the resampling folds and the search options below.

Let's define the control object that will specify many of the important options for the search. There are a lot of details to spell out, such as how to train the model, how new samples are predicted, and so on. There is a pre-made list called caretSA that contains a starting template.

names(caretSA)
[1] "fit"            "pred"           "fitness_intern" "fitness_extern"
[5] "initial"        "perturb"        "prob"           "selectIter"    

These functions are detailed on the package's web page. We will make a copy of this list and change how performance is measured on hold-out samples:

knnSA <- caretSA
knnSA$fitness_extern <- twoClassSummary

This will compute the area under the ROC curve and the sensitivity and specificity using the default 50% probability cutoff. These functions will be passed into the safs function (see below).

safs will conduct the SA search inside a resampling wrapper defined by the index object created below (50 total resamples). For each subset, it computes an out-of-sample (i.e. "external") performance estimate to make sure that we are not overfitting to the feature set. This code should reproduce the same folds as those used in the book:

library(caret)
set.seed(104)
index <- createMultiFolds(training$Class, times = 5)

Here is the control object that we will use:

ctrl <- safsControl(functions = knnSA,
                    method = "repeatedcv",
                    repeats = 5,
                    ## Here are the exact folds to use:
                    index = index,
                    ## What should we optimize? 
                    metric = c(internal = "ROC",
                               external = "ROC"),
                    maximize = c(internal = TRUE,
                                 external = TRUE),
                    improve = 25,
                    allowParallel = TRUE)

The option improve specifies how many SA iterations can occur without improvement. Here, the search restarts at the last known improvement after 25 iterations where the internal fitness value has not improved. The allowParallel option tells the package that it can parallelize the external resampling loops (if parallel processing is available).

The metric argument above is a little different than it would be for train or rfe. SA needs an internal estimate of performance to help guide the search. To that end, we will also use cross-validation to tune the KNN model and measure performance. Normally, we would use the train function to do this. For example, using the full set of predictors, this code might work to tune the model:

train(x = training[, predVars],
      y = training$Class,
      method = "knn",
      metric = "ROC",
      tuneLength = 20,
      preProc = c("center", "scale"),
      trControl = trainControl(method = "repeatedcv",
                               ## Produce class prob predictions
                               classProbs = TRUE,
                               summaryFunction = twoClassSummary))

The beauty of using the caretSA functions is that the ... are available.

A short explanation of why you should care...

R has this great feature where you can seamlessly pass arguments between functions. Suppose we have this function that computes the means of the columns in a matrix or data frame:

mean_by_column <- function(x, ...) {
  results <- rep(NA, ncol(x))
  for(i in 1:ncol(x)) results[i] <- mean(x[, i], ...)
  results
  }

(aside: don't loop over columns with for; see ?apply or ?colMeans instead)

There might be some options to the mean function that we want to pass in but those options might change for different applications. Rather than making different versions of the function for different option combinations, any argument that we pass to mean_by_column that is not one of its named arguments (x, in this case) is passed to wherever the three dots appear inside the function. For the function above, they go to mean. Suppose there are missing values in x:

example <- matrix(runif(100), ncol = 5)
example[1, 5] <- NA
example[16, 1] <- NA
mean_by_column(example)
[1]     NA 0.4922 0.4704 0.5381     NA

mean has an option called na.rm that will compute the mean after removing missing values. Now, we can pass this into mean_by_column even though it is not one of that function's arguments.

mean_by_column(example, na.rm = TRUE)
[1] 0.5584 0.4922 0.4704 0.5381 0.4658

Here's why this is relevant. caretSA$fit uses train to fit and possibly tune the model.

caretSA$fit
function (x, y, lev = NULL, last = FALSE, ...) 
train(x, y, ...)

Any options that we pass to safs that are not x, y, iters, differences, or safsControl will be passed to train.

(Even further, any option passed to safs that isn't an option to train gets passed down one more time to the underlying fit function. A good example of this is using importance = TRUE with random forest models.)
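
For example, in a hypothetical call like this one (not run here), importance = TRUE is not an argument of train, so it falls through train's dots to the underlying randomForest call:

## importance = TRUE ends up in randomForest(), two levels down
train(x = training[, predVars],
      y = training$Class,
      method = "rf",
      importance = TRUE)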

So, putting it all together:

set.seed(721)
knn_sa <- safs(x = training[, predVars],
               y = training$Class,
               iters = 500,
               safsControl = ctrl,

               ## Now we pass options to `train` via "knnSA":               
               method = "knn",
               metric = "ROC",
               tuneLength = 20,
               preProc = c("center", "scale"),
               trControl = trainControl(method = "repeatedcv",
                                        repeats = 2,
                                        classProbs = TRUE,
                                        summaryFunction = twoClassSummary,
                                        allowParallel = FALSE))

To recap:

  • the SA is conducted many times inside of resampling to get an external estimate of performance.
  • inside of this external resampling loop, the KNN model is tuned using another, internal resampling procedure.
  • the area under the ROC curve is used to guide the search (internally) and to know if the SA has overfit to the features (externally)
  • in the code above, when safs is called with other options (e.g. method = "knn"),
    • safs passes the method, metric, tuneLength, preProc, and trControl options to caretSA$fit
      • caretSA$fit passes these options to train

After external resampling, the optimal number of search iterations is determined and one last SA is run using all of the training data.

Needless to say, this executes a lot of KNN models. When I ran this, I used parallel processing to speed things up using the doMC package.
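
If you want to do the same, register a parallel backend before calling safs; a short sketch (the number of cores is arbitrary):

library(doMC)
## the external resampling loops run in parallel because we set
## allowParallel = TRUE in safsControl
registerDoMC(cores = 4)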

Here are the results of the search:

knn_sa
Simulated Annealing Feature Selection

267 samples
132 predictors
2 classes: 'Impaired', 'Control' 

Maximum search iterations: 500 
Restart after 25 iterations without improvement (15.6 restarts on average)

Internal performance values: ROC, Sens, Spec
Subset selection driven to maximize internal ROC 

External performance values: ROC, Sens, Spec
Best iteration chose by maximizing external ROC 
External resampling method: Cross-Validated (10 fold, repeated 5 times) 

During resampling:
  * the top 5 selected variables (out of a possible 132):
    Ab_42 (96%), tau (92%), Cystatin_C (82%), NT_proBNP (82%), VEGF (82%)
  * on average, 60 variables were selected (min = 45, max = 75)

In the final search using the entire training set:
   * 59 features selected at iteration 488 including:
     Alpha_1_Antitrypsin, Alpha_1_Microglobulin, Alpha_2_Macroglobulin, Angiopoietin_2_ANG_2, Apolipoprotein_E ... 
   * external performance at this iteration is

       ROC       Sens       Spec 
     0.852      0.198      0.987 

The algorithm automatically chose the subset created at iteration 488 of the SA (based on the external ROC), which contained 59 of the 132 predictors.

We can also plot the performance values over iterations using the plot function. By default, this uses the ggplot2 package, so we can add a theme at the end of the call:
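
For example, something along these lines (the theme choice is arbitrary):

library(ggplot2)
## internal and external fitness values across the SA iterations
plot(knn_sa) + theme_bw()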

Each of the data points for the external fitness is an average of the 50 resampled ROC values. The most improvement was found in the first 200 iterations. The internal estimate is generally more optimistic than the external estimate. It also tends to keep increasing while the external estimate is relatively flat, indicating some overfitting to the features. The plot above suggests that fewer iterations would probably give us equivalent performance. Instead of repeating the SA with fewer iterations, the update function can be used to pick the subset from a different iteration.
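
A sketch of that, assuming update is given the iteration number along with the original data so the final model can be refit:

## hypothetical: keep the feature subset found at iteration 200 instead
knn_sa_200 <- update(knn_sa, iter = 200,
                     x = training[, predVars],
                     y = training$Class)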

Let's compare the RFE and SA profiles. For RFE and SA, the ROC values are averaged over the 50 resamples. In the case of SA, the number of predictors is also an average. For example, at iteration 100 of the SA, here is the subset size distribution across the 50 resamples:

Here are superimposed smoothed trend lines for the resampling profiles of each search method:

Recall that the same cross-validation folds were used for SA and RFE, so this is an apples-to-apples comparison. SA searched a smaller range of subset sizes over the iterations in comparison to RFE. The code here starts the initial subset with a random 20% of the possible features; the subset size tended to increase as the search continued and then stabilized at about 55 predictors.

How does this predictor subset perform on the test data?

library(pROC)
roc(testing$Class,
    predict(knn_sa, testing)$Impaired,
    levels = rev(levels(testing$Class)))
Call:
roc.default(response = testing$Class, 
            predictor = predict(knn_sa, testing)$Impaired, 
            levels = rev(levels(testing$Class)))

Data: predict(knn_sa, testing)$Impaired in 48 controls 
      (testing$Class Control) <  18 cases (testing$Class 
      Impaired).
      
Area under the curve: 0.848

This is better than the test set results for the RFE procedure. Note that the test set AUC is much more in line with the external estimate of the area under the ROC curve. The bad news is that we evaluated many more models than the RFE procedure did, and the SA process took slightly more than 11-fold longer than RFE to complete. Good things take time. Here is a parallel-coordinate plot of the individual resampling results, matched by fold:

The difference is statistically significant too:

summary(diff(rs, metric = "ROC"))
Call:
summary.diff.resamples(object = diff(rs, metric = "ROC"))

p-value adjustment: bonferroni 
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

ROC 
    RFE      SA     
RFE          -0.0504
SA  8.02e-06        

The genetic algorithm code in gafs has very similar syntax to safs and also has pre-made functions.
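
For example, a hedged sketch of the analogous control object (the details here are assumptions based on the safs setup above):

ga_ctrl <- gafsControl(functions = caretGA,
                       method = "repeatedcv",
                       repeats = 5,
                       index = index,
                       metric = c(internal = "ROC", external = "ROC"),
                       maximize = c(internal = TRUE, external = TRUE),
                       allowParallel = TRUE)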

The knitr file for these analyses can be found here.