Measuring Associations

In Chapter 18, we discuss a relatively new method for measuring predictor importance called the maximal information coefficient (MIC). The original paper is by Reshef et al. (2011).

Summaries of the initial reactions to the MIC include Speed and Tibshirani (and others can be found here). My (minor) beef with it is the lack of a probabilistic motivation. The authors have some general criteria (generality and equitability) and created an algorithm that seems to do well in terms of these properties. The MIC clearly works, but what is it really optimizing and why?

It reminds me of partial least squares. The algorithm made intuitive sense and obviously worked, but it was a while before anyone actually figured out what mathematical problem it was solving. A similar statement could be made about boosting when it was first developed.

Murrell et al. (2013) (or Murrell3?) propose a similar generalized measure of association between continuous variables. Theirs is based on a generalized notion of R2 that I'd never heard of. At first glance, it has a lot of attractive properties. One is that it has a probabilistic genesis. Another nice quality is that the association can be computed while controlling for other data. That is a big deal for me, since we often have experiments where we need to control for nuisance factors. For example, if you were trying to measure the relationship between the selling price of a house and the acreage of the lot, you might want to control for other factors, such as the type of terrain or geography (e.g. urban, rural, etc.).

Despite the probabilistic motivation, they take a randomization approach to making formal statistical inferences on the significance of the measure. The same could be done for the MIC measure (and in the book, we used the same idea for Relief scores). I think a confidence interval would be better for everyone since it directly tells you both the uncertainty and the size of the association (but that's another post for another time on a topic that has been discussed quite a bit).

Let's look at some data. I like the blood-brain barrier data a lot. It has measurements on 208 drugs to see how much (if at all) they enter the brain. The predictors are molecular descriptors (similar to the solubility example in the book). To get the data:

library(caret)
data(BloodBrain)
## remove zero variance columns
isZV <- apply(bbbDescr, 2, function(x) length(unique(x)) == 1)
bbbDescr <- bbbDescr[, !isZV]
ncol(bbbDescr)

|| [1] 134

First, I'll measure association using the A measure discussed in Murrell3:

library(matie)
## Compute the associations across all columns of the predictors
Avalues <- apply(bbbDescr, 2, function(x, y) ma(cbind(x, y))$A, y = logBBB)
Avalues <- sort(Avalues, decreasing = TRUE)
head(Avalues)

||      fnsa3   psa_npsa polar_area       tpsa      scaa3       tcnp 
||     0.4386     0.4126     0.4116     0.3924     0.3771     0.3672

So the best predictor only explains 43.9% of the variation in the outcome. Most of the predictors shown above are related to surface area, which makes sense: the larger the molecule the less likely it is to physically fit through the barrier.

What does MIC tell us?

library(minerva)
## There are three predictors whose scales have very low variances. We
## need to reset the threshold that checks for this
mic <- mine(bbbDescr, logBBB, var.thr = 1e-10)$MIC
mic <- mic[order(mic[, "Y"], decreasing = TRUE), ]
head(mic)

||    fnsa3     rpcg    fpsa3    scaa1    scaa3 psa_npsa 
||   0.5249   0.4806   0.4762   0.4759   0.4712   0.4650

There are some differences, but the top predictor from the A measure is still at the top. The MIC value is a correlation-like measure and our best value was 0.52.
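As a quick check on how the two rankings line up, the scores can be combined by predictor name. This is just a sketch using the Avalues and mic objects computed above:

## Both objects are named by the columns of bbbDescr, so match on names
compare <- data.frame(A = Avalues, MIC = mic[names(Avalues)])
head(compare)
## A rank-based correlation between the two measures
cor(compare$A, compare$MIC, method = "spearman")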

I also have a measure of importance that is based on scatterplot smoothers. A loess smoother is fit between the outcome and the predictor and the R2 statistic is calculated for this model against the intercept-only null model. I don't claim that there is any formal justification for this (which is why I've never published it), but it has worked for me in the past. This is similar to my statements about MIC and PLS: I still use them because they tend to work, but I have no theoretical leg to stand on.

## A measure based on regression smoothers
gamImp <- filterVarImp(bbbDescr, logBBB, nonpara = TRUE)
gamImp <- gamImp[order(gamImp$Overall, decreasing = TRUE), , drop = FALSE]
head(gamImp)

||              Overall
||   fnsa3       0.3970
||   psa_npsa    0.3879
||   polar_area  0.3806
||   tcnp        0.3681
||   tpsa        0.3584
||   tpsa.1      0.3475

Finally, I'll compute the RRelief scores. We discuss this in the book and a good reference is here. It uses a nearest-neighbor approach and measures the importance of the predictors simultaneously (all of the other methods shown here measure each association in isolation).

library(CORElearn)
## The function only uses the formula interface
bbbData <- bbbDescr
bbbData$logBBB <- logBBB
set.seed(10)
RRelief <- attrEval(logBBB ~ ., data = bbbData, estimator = "RReliefFbestK", 
    ReliefIterations = 100)
RRelief <- RRelief[order(RRelief, decreasing = TRUE)]
head(RRelief)

||      smr_vsa7      smr_vsa1 dipole_moment    peoe_vsa.0           pol 
||        0.2622        0.2577        0.2525        0.2452        0.2366 
||      smr_vsa6 
||        0.2278

This score ranks the variables differently than the other methods. This is most likely due to the difference in philosophy between this method and the others in measuring association, as well as the high degree of correlation between the predictors in these data. Overall, do these metrics correlate?
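Here is a sketch of how such a comparison might be assembled, using the score objects computed above and lattice's splom for the plot:

library(lattice)
## Collect the four sets of scores, matched by predictor name
scores <- data.frame(A = Avalues[rownames(gamImp)],
                     MIC = mic[rownames(gamImp)],
                     smoother = gamImp$Overall,
                     RRelief = RRelief[rownames(gamImp)])
splom(~scores, pch = 16, cex = .5)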

[Figure splom.png: scatterplot matrix comparing the association scores]

If you ignore RRelief, there is some association between the measures of association. Interestingly, there are a few predictors that have zero association using the A measure but non-zero values using the MIC. The variables, and their MIC values, are: peoe_vsa.2 (0.22), slogp_vsa4 (0.2), peoe_vsa.0 (0.19), smr_vsa1 (0.18), a_acid (0.18), peoe_vsa.2.1 (0.14) and negative (0.04). What do these look like? Here are scatterplots between these predictors and the outcome (with scatterplot smoother fits):
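One way to generate such plots is caret's featurePlot function. Here is a sketch; the zeroA vector is just a hypothetical name holding the predictors listed above:

## Predictors with A = 0 but non-zero MIC (names taken from the text)
zeroA <- c("peoe_vsa.2", "slogp_vsa4", "peoe_vsa.0", "smr_vsa1",
           "a_acid", "peoe_vsa.2.1", "negative")
featurePlot(bbbDescr[, zeroA], logBBB,
            plot = "scatter",
            type = c("p", "smooth"))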

[Figure scat.png: scatterplots of these predictors versus the outcome, with smoother fits]

Several of these are "near zero variance" predictors and probably shouldn't be evaluated. From this image, it is difficult for me to see the association between the response and peoe_vsa.0 (MIC = 0.19) or smr_vsa1 (MIC = 0.18).
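caret's nearZeroVar function can be used to check which of these are flagged; a quick sketch, reusing the zeroA names from the previous block:

## The 'nzv' column of the output flags near-zero variance predictors
nearZeroVar(bbbDescr[, zeroA], saveMetrics = TRUE)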

type = "what?"

One great thing about R is that it has a wide diversity of packages written by many different people with many different viewpoints on how software should be designed. However, this does tend to bite us periodically.

When I teach newcomers about R and predictive modeling, I have a slide that illustrates one of the weaknesses of this system: heterogeneous interfaces. If you are building a classification model and want to generate class probabilities for new samples, the syntax can be... diverse. Here is a sample of syntax for different models:
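To make that concrete, here are a few typical patterns (the fitted objects, such as lda_fit and gbm_fit, are hypothetical placeholders):

## Each model class has its own way of asking for class probabilities
predict(lda_fit, newdata)$posterior                         ## MASS::lda
predict(glm_fit, newdata, type = "response")                ## stats::glm
predict(rpart_fit, newdata, type = "prob")                  ## rpart
predict(gbm_fit, newdata, type = "response", n.trees = 100) ## gbm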

[Figure table.png: sample predict() syntax for generating class probabilities across models]

That's a lot of minutiae to remember. I did a quick and dirty census of all the classification models used by caret to quantify the variability in this particular syntax. The train function utilizes 64 different models that can produce class probabilities. Of these, many were from the same package. For example, both nnet and multinom are in the nnet package and probably should not count twice since the latter is a wrapper for the former. As another example, the RWeka package has at least six functions that all use probability as the value for type.

For this reason, I cooked the numbers down to one value of type per package (using majority vote if there was more than one). There were 40 different packages once these redundancies were eliminated. Here is a histogram of the type values for calculating probabilities:

[Figure freqs.png: histogram of the type values used for class probabilities]

The most frequent situation is no type value at all. For example, the lda function automatically generates predicted classes and posterior probabilities without requiring the user to specify anything. There were a handful of cases where the object's class did not have a predict method to generate class probabilities (e.g. party and pamr) and these also counted as "none".

For those of us who use R to create predictive models on a day-to-day basis, this is a lot of detail to remember (especially if we want to try different models). This is one of the reasons I created caret; it has a unified interface to models that eliminates the need to remember the name of the function, the value of type and any other arguments. In case you are wondering, I chose type = "prob".
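For example, once a model is fit with train, class probabilities for new samples always use the same call (a sketch; fit and new_samples are placeholder names):

## The same syntax works for any probability-capable model in caret
probs <- predict(fit, newdata = new_samples, type = "prob")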

Feature Selection 3 - Swarm Mentality

"Bees don't swarm in a mango grove for nothing. Where can you see a wisp of smoke without a fire?" - Hla Stavhana

In the last two posts, genetic algorithms were used as feature wrappers to search for more effective subsets of predictors. Here, I will do the same with another type of search algorithm: particle swarm optimization.

Like genetic algorithms, this search procedure is motivated by natural phenomena, such as the movements of bird flocks. An excellent reference for this technique is Poli et al (2007). The methodology was originally developed for optimizing real-valued parameters, but was later adapted for discrete optimization by Kennedy and Eberhart (1997).

The optimization is initiated with multiple configurations (i.e. multiple particles). In our case, the particles will be different predictor subsets. For now, let's stick with the parameters being real-valued variables. A particular value of a particle is taken to be its position. In addition to a position, each particle has an associated velocity. For the first iteration, these are based on random numbers.

Each particle produces a fitness value. As with genetic algorithms, this is some measure of model fit. The next candidate set of predictors that a particle evaluates is based on its last position and its current velocity.

A swarm of particles is evaluated at once and the location of the best particle is determined. The velocity of each particle is then updated as a function of the:

  1. previous velocity,
  2. last position and
  3. the position of the best particle

There are other parameters of the search procedure, such as the number of particles or the relative weights given to the individual particle's and the best particle's positions when determining the next candidate point, but this is the basic algorithm in a nutshell.
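To make the update concrete, here is a minimal sketch of one velocity/position update for a single particle. The inertia and acceleration weights (w, c1 and c2) are commonly used values, not the settings of any particular package:

## The new velocity blends the old velocity with attraction toward the
## particle's own best position and the swarm's best position; the
## position then moves by the new velocity
update_particle <- function(position, velocity,
                            personal_best, swarm_best,
                            w = 0.72, c1 = 1.49, c2 = 1.49) {
  velocity <- w * velocity +
    c1 * runif(length(position)) * (personal_best - position) +
    c2 * runif(length(position)) * (swarm_best - position)
  list(position = position + velocity, velocity = velocity)
}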

As an example, consider optimizing the Rosenbrock function with two real-valued variables (A and B):

fitness = 100*(B - A^2)^2 + (A - 1)^2

The best value is at (A = 1, B = 1). The movie below shows a particle swarm optimization using 100 iterations. The predicted best (solid white dot) is consistently in the neighborhood of the optimum value at around 50 iterations. You may need to refresh your browser to re-start the animation.
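For reference, a run like this can be sketched with the pso package (discussed more below); the bounds and number of iterations here are illustrative:

library(pso)
## The Rosenbrock function from above, with A = x[1] and B = x[2]
rosenbrock <- function(x) 100 * (x[2] - x[1]^2)^2 + (x[1] - 1)^2
set.seed(1)
psoFit <- psoptim(par = rep(NA, 2), fn = rosenbrock,
                  lower = c(-2, -2), upper = c(2, 2),
                  control = list(maxit = 100))
psoFit$par  ## should be near c(1, 1)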

[Animation pso.gif: particle swarm optimization of the Rosenbrock function]

When searching for subsets, the quantities that we search over are binary (i.e. the predictor is either used in or excluded from the model). The description above implies that the position is a real-valued quantity. If the positions are centered around zero, Kennedy and Eberhart (1997) suggested using a sigmoidal function to translate this value to be between zero and one. A uniform random number is then used to determine the binary version of the position that is evaluated. Other strategies have been proposed, including applying a simple threshold to the translated position (i.e. if the translated position is above 0.5, include the predictor).
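A minimal sketch of the Kennedy and Eberhart (1997) conversion (the function name is mine):

## Squash a real-valued position to (0, 1) with a logistic function,
## then compare to uniform random numbers to get the binary encoding
positionToBinary <- function(position) {
  prob <- binomial()$linkinv(position)
  ifelse(runif(length(position)) < prob, 1, 0)
}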

R has the pso package that implements this algorithm. It does not handle the discrete optimization that we need for feature selection. Since it is licensed under the GPL, I took the code and removed the parts specific to real-valued optimization. That code is linked at the bottom of the page. I structured it to be similar to the R code for genetic algorithms. One input into the modified pso function is a list that has modules for fitting the model, generating predictions, evaluating the fitness function and so on. I've made some changes so that each particle can return multiple values; the first is treated as the fitness value. I'll fit the same QDA model as before to the same simulated data set. First, here are the QDA functions:

qda_pso <- list(
  fit = function(x, y, ...)
    {
    ## Check to see if the subset has no members
    if(ncol(x) > 0)
      {
      mod <- train(x, y, "qda", 
                   metric = "ROC",
                   trControl = trainControl(method = "repeatedcv", 
                                            repeats = 1,
                                            summaryFunction = twoClassSummary,
                                            classProbs = TRUE))
      } else mod <- nullModel(y = y) ## A model with no predictors 
    mod
    },
  fitness = function(object, x, y)
    {
    if(ncol(x) > 0)
      {
      testROC <- roc(y, predict(object, x, type = "prob")[,1], 
                     levels = rev(levels(y)))
      largeROC <- roc(large$Class, 
                      predict(object, 
                              large[,names(x),drop = FALSE], 
                              type = "prob")[,1], 
                      levels = rev(levels(y)))  
      out <- c(Resampling = caret:::getTrainPerf(object)[, "TrainROC"],
               Test = as.vector(auc(testROC)), 
               Large_Sample = as.vector(auc(largeROC)),
               Size = ncol(x))
      } else {
        out <- c(Resampling = .5,
                 Test = .5, 
                 Large_Sample = .5,
                 Size = ncol(x))
        print(out)
        }
    out
    },
  predict = function(object, x)
    {
    library(caret)
    predict(object, newdata = x)
    }
  )

Here is the familiar code to generate the simulated data:

set.seed(468)
training <- twoClassSim(  500, noiseVars = 100, 
                        corrVar = 100, corrValue = .75)
testing  <- twoClassSim(  500, noiseVars = 100, 
                        corrVar = 100, corrValue = .75)
large    <- twoClassSim(10000, noiseVars = 100, 
                        corrVar = 100, corrValue = .75)
realVars <- names(training)
realVars <- realVars[!grepl("(Corr)|(Noise)", realVars)]
cvIndex <- createMultiFolds(training$Class, times = 2)
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 2,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     ## We will parallel process within each PSO
                     ## iteration, so don't double-up the number
                     ## of sub-processes
                     allowParallel = FALSE,
                     index = cvIndex)

To run the optimization, the code will be similar to the GA code used in the last two posts:

set.seed(235)
psoModel <- psofs(x = training[,-ncol(training)],
                  y = training$Class,
                  iterations = 200,
                  functions = qda_pso,
                  verbose = FALSE,
                  ## The PSO code uses foreach to parallelize
                  parallel = TRUE,
                  ## These are passed to the fitness function
                  tx = testing[,-ncol(testing)],
                  ty = testing$Class)

Since this is simulated data, we can evaluate how well the search went using estimates of the fitness (the area under the ROC curve) calculated using different data sets: resampling, a test set of 500 samples and a large set of 10,000 samples that we use to approximate the truth.

The swarm did not consistently move to smaller subsets and, as with the original GA, it overfit to the predictors. This is demonstrated by the increase in the resampled fitness estimates and the mediocre test/large-sample estimates:

[Figure pso_ROC.png: fitness (area under the ROC curve) estimates across PSO iterations]

One tactic that helped the GA was to bias the algorithm towards smaller subsets. For PSO, this can be accomplished during the conversion from real-valued positions to binary encodings. The previous code used a value of 1 for a predictor if the "squashed" version (i.e. after applying a sigmoidal function) was greater than 0.5. We can bias the subsets by increasing that threshold. This should start the process with smaller subsets and, since we raise the criterion for activating a predictor, only increase the subset size if there is a considerable increase in the fitness function. Here is the code for that conversion and another run of the PSO:

smallerSubsets <- function(x)
{  
  ## 'x' is a matrix of positions centered around zero. The
  ## code below uses a logistic function to "squash" them to
  ## be between (0, 1). 
  binary <- binomial()$linkinv(x)
  ## We could use random numbers to translate between the 
  ## squashed version of the position and the binary version.
  ## In the last application of PSO, I used a simple threshold of
  ## 0.5. Now, increase the threshold a little to force the 
  ## subsets to be smaller.
  binary <- ifelse(binary >= .7, 1, 0)
  ## 'x' has particles in columns and predictors in rows, so use
  ## apply() to get the selected predictor indices for each particle
  apply(binary, 2, function(x) which(x == 1))
}
set.seed(235)
psoSmallModel <- psofs(x = training[,-ncol(training)],
                       y = training$Class,
                       iterations = 200,
                       convert = smallerSubsets,
                       functions = qda_pso,
                       verbose = FALSE,
                       parallel = TRUE,
                       tx = testing[,-ncol(testing)],
                       ty = testing$Class)

The results are much better:

[Figure pso_small_ROC.png: fitness estimates across iterations for the PSO biased towards smaller subsets]

The large-sample and test set fitness values agree with the resampled versions. A smoothed version of the number of predictors over iterations shows that the search is driving down the number of predictors and keeping them low:

[Figure pso_small_Size.png: smoothed subset size across iterations]

And here are the large-sample ROC curves so far:

[Figure rocPlotSmall.png: large-sample ROC curves for the models so far]

For the simulated data, the GA and PSO procedures effectively reduced the number of predictors. After reading the last few posts, one could easily remark that I was only able to do this since I knew what the answers should be. If the optimal subset size was not small, would these approaches have been effective? The next (and final) post in this series will apply these methods to a real data set.

The code for these analyses is here and the modified PSO code is here. Thanks to Claus Bendtsen for the original pso code and for answering my email.