The Basics of Encoding Categorical Data for Predictive Models

Thomas Yokota asked a very straightforward question about encodings for categorical predictors: "Is it bad to feed it non-numerical data such as factors?" As usual, I will try to make my answer as complex as possible.

(I've heard the old wives' tale that Eskimos have 180 different words in their language for snow. I'm starting to think that statisticians have at least as many ways of saying "it depends".)

BTW, we cover this in Sections 3.6, 14.1 and 14.7 of the book.

My answer: it depends on the model. Some models can work with categorical predictors in their natural, non-numeric encoding. Trees, for example, can usually partition the predictor into distinct groups. For a predictor X with levels a through e, a split might look like

if X in {a, c, d} then class = 1
else class = 2

Rule-based models operate similarly.

Naive Bayes models are another method that does not need to re-encode the data. In the above example, the frequency distribution of the predictor is computed overall as well as within each of the classes (a good example of this is in Table 13.1 for those of you that are following along).
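For example, here is a minimal sketch with made-up data (X and Class are just illustrative names): the frequencies that naive Bayes needs are simple cross-tabulations of the predictor, overall and within each class.

set.seed(135)
X     <- factor(sample(letters[1:5], 100, replace = TRUE))
Class <- factor(sample(c("Class1", "Class2"), 100, replace = TRUE))

table(X)           ## overall frequency distribution of the predictor
table(X, Class)    ## frequency distribution within each class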

However, these are the exceptions; most models require the predictors to be in some sort of numeric encoding to be used. For example, linear regression requires numbers so that it can assign slopes to each of the predictors.

The most common encoding is to make simple dummy variables. If there are C distinct values of the predictor (or levels of the factor, in R terminology), a set of C - 1 numeric predictors is created to identify which value each data point had.

These are called dummy variables. Why C - 1 and not C? First, if you know the values of the first C - 1 dummy variables, you know the last one too. It is more economical to use C - 1. Secondly, if the model has slopes and intercepts (e.g. linear regression), the sum of all C dummy variables would exactly match the intercept column (usually encoded as a "1") and that linear dependency is bad for the math involved.

In R, a simple demonstration for the example above is:

> pred1 <- factor(letters[1:5])
> pred1

[1] a b c d e
Levels: a b c d e

The R function model.matrix is a good way to show the encodings:

> model.matrix(~pred1)

  (Intercept) pred1b pred1c pred1d pred1e
1           1      0      0      0      0
2           1      1      0      0      0
3           1      0      1      0      0
4           1      0      0      1      0
5           1      0      0      0      1
attr(,"assign")
[1] 0 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$pred1
[1] "contr.treatment"

No column is created for the factor level a since R's default treatment contrast excludes the first level of the factor. This approach goes by the name of a "full-rank" encoding since the dummy variables do not always add up to 1 and are therefore not linearly dependent on the intercept column.
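As a quick illustration of the alternative, dropping the intercept from the formula produces one indicator column per level:

## All five indicator columns; with an intercept in the model these
## would add up to the intercept column (the dependency noted above)
model.matrix(~ pred1 - 1)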

We discuss different encodings for predictors in a few places but fairly extensively in Section 12.1. In that example, we have a predictor that is a date. Do we encode that as the day of the year (1 to 365) and include it as a numeric predictor? We could also add in predictors for the day of the week, the month, the season, etc. There are a lot of options. This question of feature engineering is important. You want to find the encoding that captures the important patterns in the data. If there is a seasonal effect, the encoding should capture that information. Exploratory visualizations (perhaps with lattice or ggplot2) can go a long way to figuring out good ways to represent these data.
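As a small sketch with made-up dates, base R can generate several of these encodings from a single date predictor:

## A few possible encodings for a (hypothetical) date predictor
dates <- as.Date(c("2013-01-15", "2013-06-01", "2013-11-28"))

as.numeric(format(dates, "%j"))   ## day of the year (1 to 365)
weekdays(dates)                   ## day of the week
months(dates)                     ## month
quarters(dates)                   ## a rough stand-in for the season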

Some of these options result in ordered encodings, such as the day of the week. It is possible that the trends in the data are best exposed if the ordering is preserved. R does have a way for dealing with this:

> pred2 <- ordered(letters[1:5])
> pred2

[1] a b c d e
Levels: a < b < c < d < e

Simple enough, right? Maybe not. If we need a numeric encoding here, what do we do?

There are a few options. One simple way is to assign "scores" to each of the levels. We might assign a value of 1 to a and think that b should be twice that and c should be four times that and so on. It is arbitrary but there are whole branches of statistics dedicated to modeling data with (made up) scores. Trend tests are one example.
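For the ordered predictor defined above, a minimal sketch of that idea (the doubling pattern of the scores is completely made up) would be:

## Arbitrary scores for the levels a < b < c < d < e
scores <- c(a = 1, b = 2, c = 4, d = 8, e = 16)
scores[as.character(pred2)]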

If the data are ordered, one technique is to create a set of new variables similar to dummy variables. However, their values are not 0/1 but are created to reflect the possible trends that can be estimated.

For example, if the predictor has two ordered levels, we can't fit anything more sophisticated than a straight line. However, if there are three ordered levels, we could fit a linear effect as well as a quadratic effect and so on. There are some smart ways to do this (google "orthogonal polynomials" if you are bored).

For each ordered factor in a model, R will create a set of polynomial scores (we could use the fancy label of "a set of basis functions" here). For example:

> model.matrix(~pred2)

  (Intercept) pred2.L pred2.Q    pred2.C pred2^4
1           1 -0.6325  0.5345 -3.162e-01  0.1195
2           1 -0.3162 -0.2673  6.325e-01 -0.4781
3           1  0.0000 -0.5345 -4.096e-16  0.7171
4           1  0.3162 -0.2673 -6.325e-01 -0.4781
5           1  0.6325  0.5345  3.162e-01  0.1195
attr(,"assign")
[1] 0 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$pred2
[1] "contr.poly"

Here, "L" is for linear, "Q" is quadratic and "C" is cubic and so on. There are five levels of this factor and we can create four new encodings. Here is a plot of what those encodings look like:

[Figure: the linear, quadratic, cubic and quartic polynomial scores plotted against the five factor levels]

The nice thing here is that, if the underlying relationship between the ordered factor and the outcome is cubic, we have a feature in the data that can capture that trend.
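If you would like to draw a similar plot to the one above, here is a minimal sketch (the original figure may have been produced differently):

## Plot the linear, quadratic, cubic and quartic scores by factor level
polyScores <- contr.poly(5)
matplot(1:5, polyScores, type = "b", pch = 16, lty = 1,
        xlab = "Factor Level", ylab = "Score")
legend("top", c("Linear", "Quadratic", "Cubic", "Quartic"),
       col = 1:4, lty = 1, pch = 16, ncol = 2)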

One other way of encoding ordered factors is to treat them as unordered. Again, depending on the model and the data, this could work just as well.
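In R, one way to do this is to simply drop the ordering; model.matrix then falls back to the 0/1 treatment contrasts shown earlier:

## Convert the ordered factor back to an unordered one
pred2Unordered <- factor(pred2, ordered = FALSE)
model.matrix(~ pred2Unordered)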

Exercises and Solutions

Kjell and I are putting together solutions to the exercise sections. It is a lot of work, so it may take some time.  

If you are teaching a class and are interested in using the book, please drop me an email (mxkuhn@gmail.com). We would like to hear more about the specifics. For example, we would like feedback on the format (e.g. PDF or web page), availability (e.g. a password-protected website?) or any other aspect.

Equivocal Zones

In Chapter 11, equivocal zones were briefly discussed. The idea is that many classification errors occur close to the probability boundary (i.e. 50% for two-class outcomes). If this is the case, we can create a zone around the boundary where samples are predicted as "equivocal" or "indeterminate" instead of one of the class levels. This only works if the model does not incorrectly classify samples with complete (but wrong) confidence.
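As a toy illustration (the probabilities and the +/- 0.10 buffer are made-up numbers), the rule is just a band around 50%:

## Probabilities near 0.50 are labeled equivocal instead of a class
probClass1 <- c(0.97, 0.62, 0.55, 0.44, 0.08)
ifelse(abs(probClass1 - 0.5) <= 0.10,
       "equivocal",
       ifelse(probClass1 > 0.5, "Class1", "Class2"))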

In molecular diagnostics, many assays are required to have these zones for FDA approval and great care goes into their determination. If the assay returns an equivocal result, a recommendation would be made to repeat the assay or to obtain another sample. 

Does this actually work with a classification model? How would we do it in R? 

To illustrate this, I will use the two-class simulation system outlined here. For example:

library(caret)
 
set.seed(933)
training <- twoClassSim(n = 1000)
testing  <- twoClassSim(n = 1000)

Let's fit a random forest model using the default tuning parameter value to get a sense of the baseline performance. I'll calculate a set of different classification metrics: the area under the ROC curve, accuracy, Kappa, sensitivity and specificity.

p <- ncol(training) - 1
 
fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))
 
ctrl <- trainControl(method = "cv",
                     summaryFunction = fiveStats,
                     classProbs = TRUE)
 
set.seed(721)
rfFit <- train(Class ~ ., data = training,
                method = "rf",
                metric = "ROC",
                tuneGrid = data.frame(.mtry = floor(sqrt(p))),
                ntree = 1000,
                trControl = ctrl)

> rfFit
1000 samples
  15 predictors
   2 classes: 'Class1', 'Class2' 

No pre-processing
Resampling: Cross-Validation (10 fold) 

Summary of sample sizes: 900, 900, 900, 900, 900, 900, ... 

Resampling results

  ROC    Sens   Spec   Accuracy  Kappa  ROC SD  Sens SD  Spec SD  Accuracy SD
  0.924  0.856  0.817  0.837     0.673  0.0329  0.0731   0.0561   0.0536     
  Kappa SD
  0.107   

Tuning parameter 'mtry' was held constant at a value of 3

We will fit the same model, but performance is only measured using the samples that fall outside the equivocal zone:

eZoned <- function(...)
{
  ## Flag the samples whose Class1 probability falls within 0.50 +/- 0.10
  buffer <- .10
  ## The first argument is the data frame of observed classes and
  ## class probability predictions
  preds <- list(...)[[1]]
  inZone <- preds$Class1 > (.5 - buffer) & preds$Class1 < (.5 + buffer)
  ## Compute the usual statistics using only the samples outside the zone
  preds2 <- preds[!inZone,]
  c(twoClassSummary(preds2, lev = levels(preds2$obs)), 
    defaultSummary(preds2),
    ## Also track the rate at which we are able to make 
    ## a prediction
    Reportable = mean(!inZone))
}
 
ctrlWithZone <- trainControl(method = "cv",
                             summaryFunction = eZoned,
                             classProbs = TRUE)
set.seed(721)
rfEZ <- train(Class ~ ., data = training,
              method = "rf",
              metric = "ROC",
              tuneGrid = data.frame(.mtry = floor(sqrt(p))),
              ntree = 1000,
              trControl = ctrlWithZone)

> rfEZ
1000 samples
  15 predictors
   2 classes: 'Class1', 'Class2' 

No pre-processing
Resampling: Cross-Validation (10 fold) 

Summary of sample sizes: 900, 900, 900, 900, 900, 900, ... 

Resampling results

  ROC   Sens   Spec   Accuracy  Kappa  Reportable  ROC SD  Sens SD  Spec SD
  0.96  0.917  0.891  0.905     0.808  0.767       0.024   0.0483   0.0668 
  Accuracy SD  Kappa SD  Reportable SD
  0.0507       0.102     0.0655       

Tuning parameter 'mtry' was held constant at a value of 3

So by failing to predict about 23% of the samples, we were able to achieve a good boost in performance. What would happen if we change the zone size? The same procedure was used with zones up to +/- 0.14.
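One way to vary the zone size (a sketch, and not necessarily how the figure below was generated; makeZoneStats and ctrl14 are just illustrative names) is to generate the summary function from the buffer value:

## Build a summary function for a given zone half-width ("buffer"),
## then refit the model for each width of interest
makeZoneStats <- function(buffer)
  function(data, lev = NULL, model = NULL)
  {
    inZone <- data$Class1 > (.5 - buffer) & data$Class1 < (.5 + buffer)
    outOfZone <- data[!inZone,]
    c(twoClassSummary(outOfZone, lev = lev),
      defaultSummary(outOfZone),
      Reportable = mean(!inZone))
  }

## For example, the widest zone considered:
ctrl14 <- trainControl(method = "cv",
                       summaryFunction = makeZoneStats(0.14),
                       classProbs = TRUE)

Here are the results for the various metrics across the different zone sizes: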

[Figure: resampled performance metrics and the reportable rate versus the size of the equivocal zone]

There is an improvement in each of the measures as long as we are willing to accept an increasing proportion of indeterminate results. Does this replicate in the test set?

rfPred <- predict(rfFit, testing)
rfProb <- predict(rfFit, testing, type = "prob")[, "Class1"]
rfPred <- data.frame(obs = testing$Class,
                     pred = rfPred,
                     Class1 = rfProb)
 
fiveStats(rfPred, lev = levels(rfPred$obs))
eZoned(rfPred, lev = levels(rfPred$obs))

> fiveStats(rfPred, lev = levels(rfPred$obs))
      ROC      Sens      Spec  Accuracy     Kappa 
0.9378583 0.8518519 0.8543478 0.8530000 0.7047244 
> eZoned(rfPred, lev = levels(rfPred$obs))
       ROC       Sens       Spec   Accuracy      Kappa Reportable 
 0.9678054  0.9361702  0.9239437  0.9305913  0.8601139  0.7780000