Lots of Package News

I've sent a slew of packages to CRAN recently (thanks to Swetlana and Uwe). There are updates to:

caret was primarily updated to deal with an issue introduced in the last version. It is a good idea to avoid fully loading the underlying modeling packages so that name conflicts do not occur. We made that change in the last version and it ended up being more complex than expected. A quirk in the regression tests missed it too, but the whole thing can be avoided by loading the modeling package. (news file)
rsample had some modest additions including bridges to caret and recipes. The website added more application examples (time series and survival analysis). (news file)
recipes had more substantial enhancements including six new steps, a better interface for creating interactions (using selectors), and the ability to save the processed data in a sparse format. (news file)
Cubist and C50 have been updated and brought into the age of roxygen and pkgdown. C5.0 now has a nice plot method a la partykit and now has a vignette. I'll be adding a few features to each over time.

Two new packages:

yardstick contains many of the performance measurement methods in caret but in a format that is easier to use with dplyr syntax and functional programming.
tidyposterior is a Bayesian version of caret's resamples function. It can be used to take the resampling results from multiple models and do more formal statistical comparisons. It is similar in spirit to Benavoli et al (2017). We are looking for nominations for the hex logo so please offer your suggestions (but keep it clean).

C5.0 Class Probability Shrinkage

(The image above has nothing to do with this post. It does, however, show the prize that my daughter won during a recent vacation to Virginia and how I got it back home).

I was recently asked to explain a potential disconnect in C5.0 between the class probabilities shown in the terminal nodes and the values generated by the prediction code.

Here is an example using the iris data:

> library(C50)
> mod <- C5.0(Species ~ ., data = iris)
> summary(mod)
Call:
C5.0.formula(formula = Species ~ ., data = iris)


C5.0 [Release 2.07 GPL Edition]  	Tue Sep  8 12:49:43 2015
-------------------------------

Class specified by attribute `outcome'

Read 150 cases (5 attributes) from undefined.data

Decision tree:

Petal.Length <= 1.9: setosa (50)
Petal.Length > 1.9:
:...Petal.Width > 1.7: virginica (46/1)
    Petal.Width <= 1.7:
    :...Petal.Length <= 4.9: versicolor (48/1)
        Petal.Length > 4.9: virginica (6/2)


Evaluation on training data (150 cases):

	    Decision Tree   
	  ----------------  
	  Size      Errors  

	     4    4( 2.7%)   <<


	   (a)   (b)   (c)    <-classified as
	  ----  ----  ----
	    50                (a): class setosa
	          47     3    (b): class versicolor
	           1    49    (c): class virginica


	Attribute usage:

	100.00%	Petal.Length
	 66.67%	Petal.Width


Time: 0.0 secs

Suppose that we are predicting the sample in row 130 with a petal length of 5.8 and a petal width of 1.6. From this tree, the terminal node shows "virginica (6/2)" which means a predicted class of the virginica species with a probability of 4/6 = 0.66667. However, we get a different predicted probability:

> predict(mod, iris[130,], type = "prob")
        setosa versicolor virginica
130 0.04761905  0.3333333 0.6190476

When we wanted to describe the technical aspects of the C5.0 and Cubist models, the main source of information on these models was the raw C source code from the RuleQuest website. For many years, both of these models were proprietary commercial products and were only recently open-sourced. Our intuition is that Quinlan quietly evolved these models from the versions described in the most recent publications to what they are today. For example, it would not be unreasonable to assume that C5.0 uses AdaBoost. From the sources, a similar reweighting scheme is used but it does not appear to be the same.

For classifying new samples, the C sources have

ClassNo PredictTreeClassify(DataRec Case, Tree DecisionTree)
/*      ------------  */
{
    ClassNo    c, C;
    double    Prior;

    /*  Save total leaf count in ClassSum[0]  */
    ForEach(c, 0, MaxClass)
    {
        ClassSum[c] = 0;
    }

    PredictFindLeaf(Case, DecisionTree, Nil, 1.0);
    C = SelectClassGen(DecisionTree->Leaf, (Boolean)(MCost != Nil), ClassSum);

    /*  Set all confidence values in ClassSum  */
    ForEach(c, 1, MaxClass)
    {
        Prior = DecisionTree->ClassDist[c] / DecisionTree->Cases;
        ClassSum[c] = (ClassSum[0] * ClassSum[c] + Prior) / (ClassSum[0] + 1);
    }
    Confidence = ClassSum[C];
    return C;
}

Here:

The predicted probability is the "confidence" value.
The prior is the set of class probabilities from the training set. For the iris data, this value is 1/3 for each of the classes.
The array ClassSum holds the probabilities of each class in the terminal node, except that ClassSum[0] is the number of samples in the terminal node (which, if there are missing values, can be fractional).

For sample 130, the virginica values are:

  (ClassSum[0] * ClassSum[c] + Prior) / (ClassSum[0] + 1)
= (          6 *       (4/6) + (1/3)) / (          6 + 1) 
= 0.6190476

Why is it doing this? The adjustment tends to avoid predicted class probabilities that are exactly zero or one.

Basically, it can be viewed as similar to how Bayesian methods operate, where the simple probability estimates are "shrunken" towards the prior probabilities. Note that, as the number of samples in the terminal node (ClassSum[0]) becomes large, this operation has less effect on the final results. Suppose ClassSum[0] = 10000; then the predicted virginica probability would be 0.6666333, which is much closer to the simple estimate of 0.6666667.
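The calculation above can be reproduced in a few lines of R. This is only a sketch of the arithmetic in the C code, not the package's own implementation; the function name shrink_prob is made up for illustration, and the node composition (2 versicolor, 4 virginica) is inferred from the "virginica (6/2)" label and the confusion matrix.

```r
# Shrink terminal-node class proportions toward the training-set prior,
# mirroring the ClassSum update in the C code shown earlier.
#   n     : number of samples in the terminal node (ClassSum[0])
#   p     : class proportions in the terminal node
#   prior : class proportions in the training set
# (shrink_prob is a hypothetical helper, not part of the C50 package.)
shrink_prob <- function(n, p, prior) {
  (n * p + prior) / (n + 1)
}

# Node "virginica (6/2)": assumed to hold 4 virginica and 2 versicolor samples
node_prop <- c(setosa = 0, versicolor = 2/6, virginica = 4/6)
prior     <- c(setosa = 1/3, versicolor = 1/3, virginica = 1/3)

small_node <- shrink_prob(6, node_prop, prior)
large_node <- shrink_prob(10000, node_prop, prior)

print(small_node)  # should agree with predict(mod, iris[130,], type = "prob")
print(large_node)  # nearly identical to the raw node proportions
```

With six samples in the node the prior has a visible effect; with 10,000 it is negligible.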

This is very much related to the Laplace Correction. Traditionally, we would add a value of one to the numerator of the simple estimate and add the number of classes to the denominator, resulting in (4+1)/(6+3) = 0.5555556. C5.0 is substituting the prior probabilities and their sum (always one) into this equation instead.
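To see the two corrections side by side, here is a small sketch using the counts from the "virginica (6/2)" node. The assumption that the two errors in that node are versicolor samples comes from the confusion matrix; the variable names are illustrative only.

```r
# Class counts in the terminal node (assumed: 0 setosa, 2 versicolor, 4 virginica)
counts <- c(setosa = 0, versicolor = 2, virginica = 4)
n <- sum(counts)     # samples in the node
k <- length(counts)  # number of classes

# Traditional Laplace correction: add one to each count and k to the total
laplace <- (counts + 1) / (n + k)

# C5.0-style variant: add the prior to each count and the sum of the
# priors (always one) to the total
prior <- rep(1 / k, k)
c50_style <- (counts + prior) / (n + 1)

print(round(rbind(laplace, c50_style), 7))
```

Both versions pull the estimates away from zero and one, but the Laplace correction shrinks more strongly because it adds k pseudo-samples rather than one.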

To be fair, there are well-known Bayesian estimates of the sample proportions under different prior distributions for the two-class case. For example, if there were two classes, the estimate of the class probability under a uniform prior would be the same as the basic Laplace correction (using the integers and not the fractions). A more flexible Bayesian approach is the Beta-Binomial model, which uses a Beta prior instead of the uniform. The downside here is that two extra parameters need to be estimated (and it is only defined for two classes).
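For the two-class case, the posterior mean under a Beta(a, b) prior has a simple closed form, sketched below. The helper name beta_post_mean is hypothetical, and the prior parameters are simply supplied here rather than estimated from data as a full Beta-Binomial analysis would do.

```r
# Posterior mean of a class probability under a Beta(a, b) prior after
# observing x events in n trials (two classes only).
# (beta_post_mean is an illustrative helper, not from any package.)
beta_post_mean <- function(x, n, a = 1, b = 1) {
  (x + a) / (n + a + b)
}

# Uniform prior, Beta(1, 1): same as the two-class Laplace correction,
# (4 + 1) / (6 + 2)
uniform_est <- beta_post_mean(4, 6)

# A hypothetical, more informative prior centered on 0.5
informative_est <- beta_post_mean(4, 6, a = 5, b = 5)

print(c(uniform = uniform_est, informative = informative_est))
```

A stronger prior (larger a + b) pulls the estimate further from the raw proportion of 4/6.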

Sample Mislabeling and Boosted Trees

A while back, I saw this post on StackExchange/Crossvalidated: "Does anyone know how well C5.0 boosting performs in the presence of mislabeled data?" I did some simulations in order to make a comparison with gradient boosting machines (GBM).

Some publications simulate mislabeling by sampling from distinct multivariate distributions for each class. I simulated two-class data based on this post. Each simulated sample has an associated probability of being in class #1. A random uniform number is generated to assign each sample an observed class label. To mislabel X% of the data, a random set of samples is selected and the probability of being in class #1 is reversed.

Data sets were simulated with training set sizes between 100 and 1000. The amount of mislabeled data also varied (at 5%, 10%, 15% and 20%). For each mislabeled data set, there is a matched training set (from the same random number seed) with no intentional mislabeling. For each of these configurations, 500 simulations were conducted.
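The mislabeling scheme described above might look something like the following sketch. The predictors and the logistic form of the class probability are made up for illustration and are not the simulation system used for these results; reusing the same uniform draws for the clean and mislabeled labels is also my assumption about one way to build a matched pair.

```r
set.seed(1)  # arbitrary seed

n <- 1000
mislabel_pct <- 0.10

# Hypothetical true probabilities of class #1 (not the system used in the post)
x1 <- rnorm(n)
x2 <- rnorm(n)
prob_class1 <- plogis(1 + 2 * x1 - x2)

# Reverse the class #1 probability for a random subset of the samples
flip <- sample(n, size = round(mislabel_pct * n))
prob_used <- prob_class1
prob_used[flip] <- 1 - prob_class1[flip]

# The same uniform draws produce the matched clean and mislabeled labels
u <- runif(n)
lvls <- c("Class1", "Class2")
clean <- factor(ifelse(u <= prob_class1, "Class1", "Class2"), levels = lvls)
noisy <- factor(ifelse(u <= prob_used, "Class1", "Class2"), levels = lvls)

# Fraction of observed labels that actually changed
print(mean(clean != noisy))
```

Note that reversing a probability does not guarantee that the observed label changes, so under this sketch the fraction of changed labels is at most the nominal mislabeling rate.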

Model performance was assessed using the area under the ROC curve. A test set of 10,000 with no mislabeling was used to evaluate performance.

For the C5.0 and GBM models, models were tuned using cross-validation. For each technique, a model was trained on the mislabeled data and another on the correctly labeled data. In this way, we have a "control" model that reflects how well the model does for each data set if there were no added noise due to mislabeling. As a result, for C5.0 and GBM, a percentage of performance loss can be calculated against the correctly labelled control set:

pct = (mislabeled - correct)/correct * 100
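As a quick illustration of this calculation, here it is as an R function. The AUC values are made up and the function name pct_loss is hypothetical.

```r
# Percent change in performance relative to the correctly labeled control.
# Negative values mean that the mislabeled model did worse.
pct_loss <- function(mislabeled, correct) {
  (mislabeled - correct) / correct * 100
}

# Hypothetical test set AUC values for one simulated data set
example_loss <- pct_loss(mislabeled = 0.85, correct = 0.90)
print(example_loss)
```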

This image contains the distributions of the percent loss across the configurations:

Some other observations:

When there is no mislabeling, the results are almost identical
Small amounts of mislabeling do not hurt either model very much
The loss of performance decreases with training set size
With gross amounts of mislabeling, the gradient boosting machine is not affected as much as C5.0
The effect of mislabeling on C5.0 also impacts the variation in the results. If you compare the columns above, note that the C5.0 distribution does not simply shift to the left with the same level of variation.

I contacted Ross Quinlan about this and his response was:

"I agree with your conclusions for the function that you studied. My experiments with noise and AdaBoost suggested that the effect of noise (mislabeling) varies markedly with different applications. There are some summary results in the first part of the following:

http://rulequest.com/Personal/q.alt96.ps

I have only some vague ideas about why this might be. For those applications where the classes are well-separated in the attribute space, mislabeling does not seem to alter the class clusters much. Alternatively, for applications where there is a tight boundary between two classes, mislabeling could markedly affect the perceived class divide."

Applied Predictive Modeling