New Version of caret on CRAN

A new version of caret is on CRAN.

[Figure: ga_rf_profile.png]

Some recent features/changes:

  • The license was changed to GPL >= 2 to accommodate new code from the GA package.
  • New feature selection functions gafs and safs, along with helper functions and objects, were added. The package HTML documentation was updated to say more about feature selection. I'll talk more about these functions in an upcoming blog post.
  • A reworked version of nearZeroVar, based on code from Michael Benesty, was added; it uses less memory and can be run in parallel. The old version is now called nzv.
  • sbfControl now has a multivariate option where all the predictors are exposed to the scoring function at once.
  • Several regression simulation functions were added: SLC14_1, SLC14_2, LPH07_1, and LPH07_2 (a short usage sketch follows this list).
  • For the input data x to train, the class of the input is now respected so that other data types (such as sparse matrices) can be used.
  • A function update.rfe was added.
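
As a rough sketch of two of the items above, the new simulation functions produce a regression data set that can be fed to sbf, and the multivariate option lets a score function see every predictor at once. The score and filter functions below are my own inventions (not from the release notes), and I'm assuming the simulated outcome column is named y:

    library(caret)

    set.seed(1)
    train_dat <- SLC14_1(n = 200)   # simulated regression data; outcome column assumed to be 'y'

    ## a made-up multivariate score: rank predictors by absolute correlation with y,
    ## then keep the top quarter
    multiFuncs <- rfSBF
    multiFuncs$score  <- function(x, y) apply(x, 2, function(p) abs(cor(p, y)))
    multiFuncs$filter <- function(score, x, y) score >= quantile(score, 0.75)

    ctrl <- sbfControl(functions = multiFuncs, method = "cv", multivariate = TRUE)

    set.seed(2)
    filtered_rf <- sbf(y ~ ., data = train_dat, sbfControl = ctrl)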

Recently added models:

  • From the adabag package, two new models were added: AdaBag and AdaBoost.M1.
  • Weighted subspace random forests from the wsrf package were added.
  • Additional bagged FDA and MARS models (model codes bagFDAGCV and bagEarthGCV) were added that use the GCV statistic to prune the model. This reduces memory use during training (a brief train example follows this list).
  • Brenton Kenkel added ordered logistic or probit regression to train using method = "polr" from MASS.
  • The adaptive mixture discriminant model from the adaptDA package was added.
  • A robust mixture discriminant model from the robustDA package was added.
  • The multi-class discriminant model using binary predictors in the binda package was added.
  • Ensembles of partial least squares models (via the enpls package) were added.
  • plsRglm was added.
  • From the kernlab package, SVM models using string kernels were added: svmBoundrangeString, svmExpoString, and svmSpectrumString.
  • The model code for ada had a bug fix applied and the code was adapted to use the "sub-model trick" so it should train faster.
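
To make the new model codes concrete, here is a hedged example of fitting the bagged MARS model mentioned above with train. The data, tuning grid, and resampling scheme are arbitrary choices for illustration, and the earth package must be installed:

    library(caret)

    set.seed(3)
    dat <- SLC14_1(n = 200)

    set.seed(4)
    bagged_mars <- train(y ~ ., data = dat,
                         method = "bagEarthGCV",   # bagged MARS, pruned via the GCV statistic
                         tuneGrid = data.frame(degree = 1:2),
                         trControl = trainControl(method = "cv", number = 10))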

My Research Tools

Pfizer has an excellent group of librarians and they recently contacted people, including a few statisticians, about how we find and organize articles. I've spent considerable time thinking about this over the years and have wanted to start a discussion about the topic for a while, since I can't believe that someone isn't doing this better. Comments here or via email are enthusiastically welcome.

For finding journal articles, I do a few different things.

RSS feeds for journals.

RSS feeds are pretty straightforward to use. Most journals have RSS feeds of various types (e.g. current issue, just accepted, articles ASAP, etc.). In some cases, like PLOSone, you can create RSS feeds for specific search terms within that journal (see the examples at the bottom of this post). I haven't figured out how to filter RSS feeds based on whether the manuscript has supplemental materials (e.g. data).

RSS isn't perfect. For example, some of the ASA journals have mucked up their XML and I see a lot of repeats of articles on the same day. An edited list of what I keep tabs on is at the end of this post.

(As an aside, RSS feeds are also great for monitoring specific topics on Stack Overflow and Crossvalidated.)

I have tried myriad RSS readers to aggregate and monitor my feeds. I'm currently using Feedly.

Also, this is only for content that you have identified as interesting. There could be something else out there that you have missed completely. That leads me to...

Google Alerts

I have about 30 different alerts. Some are related to general topics (e.g. ["training set" "test set" -microarray -SNP -QSAR -proteom -RNA -biomarker -biomarkers]) and others look for anything citing specific manuscripts (e.g. [Documents citing "The design and analysis of benchmark experiments"]). See this page for examples of how to create effective alerts. There are other uses for alerts too.

Alerts are very effective. I usually get emails with the alerts in batches of 20 or so at a time. I haven't quite figured out what the trigger is; in some cases I get two batches in a single day.

One thing I would put on the wish list is some sort of smart aggregation. If I have alerts for [ "simulated annealing" "feature selection" ]; Articles excluding patents and [ "genetic algorithm" "feature selection" ]; Articles excluding patents, there is a lot of redundancy since many feature selection articles mention both search algorithms.

Keep in mind that the alerts may not be new articles but items that are new to Google. This isn't really an issue for me but it is worth mentioning.

Google Scholar

I love Google Scholar. Search on a title and you will always be able to find the manuscript, links to different sources for obtaining it, plus a list of articles that reference it. Subject-based searches are just as effective.

(Our librarians were surprised to find that we could get access to articles that our institution did not have licenses for via Google. For example, the scholar page for an article will list multiple versions of the reference. Some of these may correspond to the home page of one of the authors where he/she has a local copy of the PDF.)

Google has good tips on searching. This presentation is excellent with some tricks that I didn't know.

So once I've found articles, how do I manage them?

Papers

Papers... I have equal parts love and hate for this program. I'll list the pros and cons below. I should say that I have been using this since the original version and have become increasingly frustrated. I'm not using the most recent version and I have tried a lot of different alternatives (e.g. Mendeley, BibDesk, Bookends, Endnote, Sente, Zotero). Unfortunately, for someone with thousands of PDFs, Papers (version 2) has some features that the others haven't mastered yet. I would love to move away from Papers.

What is good:

  • Importing articles is easy. In many cases, just dropping them into the window will find the metadata and automatically annotate the reference. Weirdly, drag-and-drop works better than the "Match" feature in the article window. There are "Open in Papers" bookmarks for most browsers: once you find a journal article, the bookmark starts Papers and opens the page there. Often, the application automatically reads the citation information from the web page and imports it. Clicking on the PDF link within the article's web page imports that file.
  • Articles within Papers can collect supplementary files easily. One minor issue is that plain text files are not automatically imported, as PDF, CSV, and other file formats are.
  • Papers does a great job of organizing the PDFs locally. I sync to Dropbox and have the same repository across different computers.
  • The bibtex export works well. This was invaluable when we were writing the book.
  • Their apps for tablets/mobile are easy to use and low maintenance. Syncing has not been an issue for me so far.

The bad news

  • Slooooow. It is really slow.
  • The search facilities for your PDF repository are not very powerful. This seems like it is a pretty low bar to jump over.
  • Keywords work but are manually added and the interface is pretty kludgy. I would love for this feature to work better. Hell, it wouldn't be difficult to automatically figure this out based on content (I wrote some rudimentary R/SQL code to do it on a really long plane flight once).
  • I have a small percentage of papers whose PDFs have gone missing.
  • I might accidentally import an article twice. In most cases, Papers doesn't tell me that I'm doing it until it is too late. Although they have a method for merging entries, I'd like to avoid this process beforehand.
  • They release versions with little to no testing. This is amazing but basic functionality in new major versions simply does not work. It is remarkable in the worst way.
  • While they did win the Apple Design Award for the first version, the interface seems to be getting worse with every new release. The color scheme for Papers 3 makes me depressed. Literal 50 shades of gray.

The last two issues have driven me crazy. I don't see myself upgrading any time soon.

Typesetting

I use LaTeX for almost all articles that I write. It is a pain when working with others who have never used it (or heard of it) but it is worth it. Also, the power you get when using LaTeX with Sweave or knitr simply cannot be overstated. Apart from exporting bibtex from Papers, the other tools I use are:

  • Sublime Text: This is a great, lightweight editor with some nice add-ons for typesetting that integrate with Skim. Skim is pretty nice, but I would really like Sublime to work with OS X's Preview.
  • texpad is another good editor for OS X but, given the price, it might be difficult to argue that it is better than Sublime. It does hide a lot of the LaTeX junk that goes into typesetting a tex file but this is really a minor perk.

I gave a talk at ENAR last year related to this. We've since moved the book version control to github and have translated all of our Sweave code to knitr.
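
As a small, made-up illustration of that Sweave/knitr workflow (the chunk name, table, and file name are arbitrary), a code chunk in a .Rnw file keeps the analysis and the manuscript in one document:

    <<resample-table, echo=FALSE, results='asis'>>=
    library(xtable)
    print(xtable(head(mtcars)), include.rownames = FALSE)
    @

    There are \Sexpr{nrow(mtcars)} rows in the data set.

Running knitr::knit("paper.Rnw") evaluates the chunks and writes paper.tex for typesetting.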

My Journal Feeds

In no particular order:

Comparing the Bootstrap and Cross-Validation

This is the second of two posts about the performance characteristics of resampling methods. The first post focused on the cross-validation techniques and this post mostly concerns the bootstrap.

Recall from the last post: we have some simulations to evaluate the precision and bias of these methods. I simulated some regression data (so that I know the real answers and can compute the true RMSE). The model was random forest with 1000 trees in the forest and the default value of the tuning parameter. I simulated 100 different data sets with 500 training set instances. For each data set, I also ran each of the resampling methods under study 25 times using different random number seeds. In the end, we can compute the precision and average bias of each of these resampling methods.
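
For reference, here is a rough sketch of that scheme using caret. The SLC14_1 simulator, the 25 bootstrap resamples, and the large test set used to approximate the true RMSE are my own assumptions; this is not the code behind the actual simulations:

    library(caret)

    one_data_set <- function(seed, reps = 25) {
      set.seed(seed)
      train_dat <- SLC14_1(n = 500)             # 500 training set instances
      big_test  <- SLC14_1(n = 10^5)            # large test set to approximate the truth
      mtry_def  <- floor((ncol(train_dat) - 1) / 3)   # randomForest default for regression

      sapply(seq_len(reps), function(i) {
        set.seed(1000 + i)                      # a different seed for each resampling run
        fit <- train(y ~ ., data = train_dat, method = "rf",
                     ntree = 1000, tuneGrid = data.frame(mtry = mtry_def),
                     trControl = trainControl(method = "boot", number = 25))
        c(resampled = getTrainPerf(fit)$TrainRMSE,
          truth     = RMSE(predict(fit, big_test), big_test$y))
      })
    }

    ## e.g. results for the first of the 100 simulated data sets:
    ## res_1 <- one_data_set(101)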

Question 3: How do the variance and bias change in the bootstrap?

First, let’s look at how the precision changes with the number of bootstrap resamples. We use the variance of the resampling estimates to measure precision.

Again, there shouldn't be any real surprise that the variance is decreasing as the number of bootstrap samples increases.

It is worth noting that, compared to the original motivation for the bootstrap, which was to create confidence intervals for some unknown parameter, this application doesn't require a large number of replicates. Originally, the bootstrap was used to estimate the tail probabilities of the bootstrap distribution of some parameter. For example, to get a 95% bootstrap confidence interval, one simple approach is to accurately measure the 2.5% and 97.5% quantiles. Since these are very extreme values, traditional bootstrapping requires a large number of bootstrap samples (at least 1,000). For our purposes, we want a fairly good estimate of the mean of the bootstrap distribution, and this shouldn't require hundreds of resamples.
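
A toy illustration of that point, with invented numbers unrelated to the simulation: the bootstrap estimate of a mean settles down with far fewer resamples than an extreme quantile does.

    set.seed(10)
    x <- rnorm(100)

    ## repeat the whole bootstrap 200 times and see how variable the summary statistic is
    boot_summary <- function(B, f)
      replicate(200, f(replicate(B, mean(sample(x, replace = TRUE)))))

    sd(boot_summary(25,   mean))                              # stable with few resamples
    sd(boot_summary(25,   function(z) quantile(z, 0.975)))    # noisy
    sd(boot_summary(1000, function(z) quantile(z, 0.975)))    # needs ~1,000 resamples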

For the bias, it is fairly well-known that the naive bootstrap produces biased estimates. Each bootstrap sample contains, on average, about 63.2% of the unique training set instances, so the hold-out rate is about 36.8%. Although this is a random quantity in practice, the mean hold-out percentage is not affected by the number of resamples. Our simulation confirms a large bias that doesn't move around very much (the y-axis scale here is very narrow when compared to the previous post):

Again, no surprises.
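
For reference, the hold-out fraction quoted above follows directly from sampling with replacement; for a training set of n = 500:

    n <- 500
    (1 - 1/n)^n        # probability a given row is held out: ~0.368, i.e. exp(-1)
    1 - (1 - 1/n)^n    # ~0.632 of the unique rows appear in each bootstrap sample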

Question 4: How does this compare to repeated 10-fold CV?

We can compare the simple bootstrap to repeated 10-fold CV. For each method, we have relatively constant hold-out rates and matching configurations for the total number of resamples.
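
In caret terms, a matched pair of configurations might look like the sketch below; the specific numbers (50 total resamples each) are my own example rather than the exact settings used in the simulation:

    library(caret)

    cv_ctrl   <- trainControl(method = "repeatedcv", number = 10, repeats = 5)  # 50 model fits
    boot_ctrl <- trainControl(method = "boot", number = 50)                     # 50 model fits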

My initial thought was that the naive bootstrap would be more precise than repeated 10-fold CV but would have much worse bias. Let's look at the bias first this time. As predicted, CV is much less biased in comparison:

Now for the variance:

To me, this is very unexpected. Based on these simulations, repeated 10-fold CV is superior in both bias and variance.

Question 5: What about the OOB RMSE estimate?

Since we used random forests as our learner, we have access to yet another resampled estimate of performance. Recall that random forests build a large number of trees, each based on a separate bootstrap sample. As a consequence, each tree has an associated "out-of-bag" (OOB) set of instances that were not used to build that tree. The RMSE can be calculated for each tree's out-of-bag samples and averaged to get another bootstrap estimate. Recall from the first post that we used 1,000 trees in the forest, so the effective number of resamples is 1,000.
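
For a regression forest, this estimate comes along with the fitted model; here is a small sketch using randomForest directly (the simulated data and object names are hypothetical):

    library(caret)          # only for the SLC14_1 simulator
    library(randomForest)

    set.seed(5)
    train_dat <- SLC14_1(n = 500)

    rf_fit   <- randomForest(y ~ ., data = train_dat, ntree = 1000)
    oob_rmse <- sqrt(tail(rf_fit$mse, 1))   # 'mse' holds the OOB MSE after each tree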

Wait - didn't I say above that we don't need very many bootstrap samples? Yes, but that is a different situation. Random forests require a large number of bootstrap samples because they randomly sample from the predictor set at each split, so a lot of resamples are needed to get stable predictions. Also, if you have a large number of predictors and a small to medium number of training set instances, the number of resamples should be really large to make sure that each predictor gets a chance to influence the model sufficiently.

Let's look at the last two plots and add a line for the value of the OOB estimate. For variance:

and bias:

the results look pretty good (especially the bias). The great thing is that you get this estimate for free and it works pretty well. This is consistent with my other experiences too. For example, Figure 8.18 shows that the CV and OOB error rates for the usual random forest model track very closely for the solubility data.

The bad news is that:

  • This estimate may not work very well for other types of models. Figure 8.18 does not show nearly identical performance estimates for random forests based on conditional inference trees.
  • There are a limited number of performance metrics for which the OOB estimate can be computed. This is mostly due to software limitations (as opposed to theoretical concerns). For the caret package, we can compute RMSE and R2 for regression and accuracy and the Kappa statistic for classification, but this mostly relies on the output that the randomForest function provides. If you want the area under the ROC curve or some other measure, you are out of luck.
  • If you are comparing random forest’s OOB error rate with the CV error rate from another model, it may not be a very fair comparison.