Reproducible Research at ENAR

I gave a talk at the Spring ENAR meetings this morning on some of the technical aspects of creating the book. The session was on reproducible research and the slides are here.

I was dinged for not using Git for version control (we used Dropbox for simplicity), but overall the comments were good. There was a small panel at the end for answering questions, most of which related to proprietary systems (e.g., SAS).

I was also approached by an editor for Computational Statistics about writing all of this up, which I will do when I get a free moment.

Confidence in Prediction

A few colleagues have just published a paper on measuring confidence in predictions from regression models ("Interpretable, Probability-Based Confidence Metric for Continuous Quantitative Structure-Activity Relationship Models"). The idea is related to applicability domains: the region of the predictor space where the model can make reliable predictions.

Historically, the primary method for computing the applicability domain was to judge the similarity of new samples to the training set in order to determine whether the model would need to extrapolate for those samples. That approach doesn't take into account the training set outcomes or, more importantly, any regions inside the training set space where the model does not fit the data well.
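
To make the similarity idea concrete, here is a minimal sketch in R (my own illustration; the k value and the 95th percentile cutoff are arbitrary choices, not from any particular paper):

```r
## Minimal sketch of a similarity-based applicability domain check.
## train_x and new_x are numeric predictor matrices; k and the 95th
## percentile cutoff are arbitrary, illustrative choices.
mean_knn_dist <- function(ref, x, k, skip_self = FALSE) {
  apply(x, 1, function(pt) {
    d <- sort(sqrt(colSums((t(ref) - pt)^2)))
    if (skip_self) d <- d[-1]   # drop the zero self-distance
    mean(d[1:k])
  })
}

is_in_domain <- function(train_x, new_x, k = 5) {
  train_d <- mean_knn_dist(train_x, train_x, k, skip_self = TRUE)
  new_d   <- mean_knn_dist(train_x, new_x, k)
  ## flag new samples no farther out than 95% of the training data
  new_d <= quantile(train_d, 0.95)
}
```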

The approach this paper takes is to compute a local root mean squared error that is weighted by the distance of the new sample to its nearest training set neighbors (inspired by the approach taken by Quinlan (1993) for instance-based corrections in Cubist).
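
Roughly, the idea looks something like the following sketch. This is my own simplification, not the published metric; the inverse-distance weights and the choice of k are illustrative:

```r
## Sketch of a distance-weighted local RMSE for a single new sample.
## train_x: training predictors; resid: training residuals
## (observed - predicted); k and the weighting scheme are illustrative.
local_rmse <- function(train_x, resid, new_x, k = 10) {
  d  <- sqrt(colSums((t(train_x) - new_x)^2))  # distance to every training point
  nn <- order(d)[1:k]                          # indices of the k nearest neighbors
  w  <- 1 / (d[nn] + .Machine$double.eps)      # closer neighbors count more
  w  <- w / sum(w)
  sqrt(sum(w * resid[nn]^2))                   # weighted RMSE of neighbor residuals
}
```

A large local RMSE would signal a region where the model's predictions deserve less trust, even if the new sample is nominally inside the training set space.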

We examine methods for assessing confidence in predictions in our book, in the section "When Should You Trust Your Model's Prediction?"

What Was Left Out

There were a few topics that just couldn't be added to the book due to time and, especially, space. For a project like this, the old saying applies: "you're never done, you just stop working on it."

First, the generalized additive model is one of my favorites. It is simple, effective, and has the added bonus of giving the user an idea of the functional form of the relationship between each predictor and the outcome. We describe the same ideas for cubic smoothing splines, but GAMs are probably more flexible.
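
For instance, a quick sketch with the mgcv package on simulated data (the package choice and the data are mine; smoothing parameters are left at their defaults):

```r
## Fit a GAM with one smooth term per predictor; the plot shows the
## estimated functional form of each predictor's effect on the outcome.
library(mgcv)

set.seed(1)
dat <- data.frame(x1 = runif(200), x2 = runif(200))
dat$y <- sin(2 * pi * dat$x1) + 2 * dat$x2 + rnorm(200, sd = 0.3)

fit <- gam(y ~ s(x1) + s(x2), data = dat)
plot(fit, pages = 1)   # the smooths suggest the shape of each relationship
```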

Time series models were also excluded since they are fundamentally different from the models we describe (where the training set points are assumed to be independent). There is a strong literature on using neural networks for time series; I would recommend Rob Hyndman's book on the subject.

Finally, ensembles of different models were not included. For example, for a particular data set, a random forest, a support vector machine, and a naive Bayes model might each be fit to the training data, and their individual predictions could be combined into a single prediction per sample. We do discuss other ensemble methods (e.g., boosting, bagging, etc.). Good resources on this topic are Seni and Elder (2010) and Kuncheva (2004).
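
As an illustration of the idea (not an example from the book), here is a sketch that averages class probabilities from three unrelated models; the packages, data, and equal weights are my own choices:

```r
## Sketch: combine three unrelated classifiers by averaging their
## class probabilities into a single prediction per sample.
library(randomForest)
library(kernlab)
library(klaR)

rf_fit  <- randomForest(Species ~ ., data = iris)
svm_fit <- ksvm(Species ~ ., data = iris, prob.model = TRUE)
nb_fit  <- NaiveBayes(Species ~ ., data = iris)

p_rf  <- predict(rf_fit, iris, type = "prob")
p_svm <- predict(svm_fit, iris, type = "probabilities")
p_nb  <- predict(nb_fit, iris)$posterior

## simple unweighted average of the three probability matrices
p_ens <- (p_rf + p_svm + p_nb) / 3
pred  <- factor(colnames(p_ens)[max.col(p_ens)], levels = levels(iris$Species))
```

In practice the weights would be estimated (e.g., via resampling) rather than set equal, but the mechanics are the same.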

At least this gives us some topics for a potential second edition.