My Research Tools

Pfizer has an excellent group of librarians and they recently contacted people, including a few statisticians, about how we find and organize articles. I've spent considerable time thinking about this over the years. I've wanted to start a discussion about this topic for a while since I can't believe that someone isn't doing this better. Comments here or via email are enthusiastically welcome.

For finding journal articles, I do a few different things.

RSS feeds for journals.

RSS feeds are pretty straightforward to use. Most journals offer RSS feeds of various types (e.g. current issue, just accepted, articles ASAP, etc.). In some cases, like PLOSone, you can create RSS feeds for specific search terms within that journal (see the examples at the bottom of this post). I haven't figured out how to filter RSS feeds based on whether the manuscript has supplemental materials (e.g. data).

RSS isn't perfect. For example, some of the ASA journals have mucked up their XML and I see a lot of repeats of articles on the same day. An edited list of what I keep tabs on is at the end of this post.
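One way to cope with the repeated entries is to de-duplicate a feed before reading it. The following is only a minimal sketch in Python, assuming a plain RSS 2.0 layout with `item`, `title` and `link` elements; the sample feed and the `unique_items` helper are made up for illustration and are not tied to any particular journal or reader.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content, invented for illustration.
# A real feed would be downloaded from the journal's site.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Journal</title>
    <item><title>Paper A</title><link>http://example.org/a</link></item>
    <item><title>Paper B</title><link>http://example.org/b</link></item>
    <item><title>Paper A</title><link>http://example.org/a</link></item>
  </channel>
</rss>"""


def unique_items(feed_xml):
    """Return (title, link) pairs from an RSS 2.0 feed, dropping repeats."""
    root = ET.fromstring(feed_xml)
    seen = set()
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        key = (title.lower(), link)  # same title and link counts as a repeat
        if key not in seen:
            seen.add(key)
            items.append((title, link))
    return items


deduped = unique_items(SAMPLE_FEED)
for title, link in deduped:
    print(title, link)
```

Keying on both title and link is a deliberate choice here: some feeds reuse a generic title (e.g. "Editorial"), so the title alone could drop articles that are actually different.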

(As an aside, RSS feeds are also great for monitoring specific topics on Stack Overflow and Cross Validated.)

I have tried myriad RSS readers to aggregate and monitor my feeds. I'm currently using Feedly.

Also, this is only for content that you have identified as interesting. There could be something else out there that you have missed completely. That leads me to...

Google Alerts

I have about 30 different alerts. Some are related to general topics (e.g. ["training set" "test set" -microarray -SNP -QSAR -proteom -RNA -biomarker -biomarkers]) and others look for anything citing specific manuscripts (e.g. [Documents citing "The design and analysis of benchmark experiments"]). See this page for examples of how to create effective alerts. There are other uses for alerts too.

Alerts are very effective. I usually get emails with the alerts in batches of 20 or so at a time. I haven't quite figured out what the trigger is; in some cases I get two batches in a single day.

One thing I would put on the wish list is to do some sort of smart aggregation. If I have alerts for [ "simulated annealing" "feature selection" ]; Articles excluding patents and [ "genetic algorithm" "feature selection" ]; Articles excluding patents, this results in abundant redundancy since many feature selection articles mention both search algorithms.
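To make the aggregation idea concrete, here is a rough sketch in Python. It assumes the alert results have already been collected as plain lists of titles (Google does not hand them over in this form; the `results` dictionary, the titles, and the `merge_alerts` helper are all hypothetical). Each distinct article is listed once, along with the alerts that reported it.

```python
import re
from collections import defaultdict


def normalize(title):
    """Lowercase a title and collapse punctuation and whitespace so near-identical copies match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_alerts(alert_results):
    """Map each distinct article title to the sorted list of alerts that reported it.

    alert_results: dict of alert query -> list of article titles.
    """
    merged = defaultdict(set)
    first_seen = {}  # normalized title -> first spelling encountered
    for query, titles in alert_results.items():
        for title in titles:
            key = normalize(title)
            first_seen.setdefault(key, title)
            merged[key].add(query)
    return {first_seen[key]: sorted(queries) for key, queries in merged.items()}


# Made-up titles, for illustration only.
results = {
    '"simulated annealing" "feature selection"': [
        "Wrapper Feature Selection: A Comparison",
        "Annealing Schedules for Subset Search",
    ],
    '"genetic algorithm" "feature selection"': [
        "Wrapper feature selection - a comparison",
        "Evolving Predictor Subsets",
    ],
}

merged = merge_alerts(results)
for title, queries in merged.items():
    print(len(queries), title)
```

Matching on a normalized title is crude (it will miss retitled preprints), but it would already remove the most obvious overlap between two related alerts.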

Keep in mind that the alerts may not be new articles but items that are new to Google. This isn't really an issue for me but it is worth mentioning.

Google Scholar

I love Google Scholar. Search on a title and you will almost always be able to find the manuscript, links to different sources for obtaining it, plus a list of articles which reference it. Subject-based searches are just as effective.

(Our librarians were surprised to find that we could get access to articles that our institution did not have licenses for via Google. For example, the scholar page for an article will list multiple versions of the reference. Some of these may correspond to the home page of one of the authors where he/she has a local copy of the PDF)

Google has good tips on searching. This presentation is excellent with some tricks that I didn't know.

So once I've found articles, how do I manage them?

Papers

Papers... I have equal parts love and hate for this program. I'll list the pros and cons below. I should say that I have been using this since the original version and have become increasingly frustrated. I'm not using the most recent version and I have tried a lot of different alternatives (e.g. Mendeley, BibDesk, Bookends, Endnote, Sente, Zotero). Unfortunately, for someone with thousands of PDFs, Papers (version 2) has some features that the others haven't mastered yet. I would love to move away from Papers.

What is good:

  • Importing articles is easy. In many cases, just dropping them into the window will find the metadata and automatically annotate the reference. Weirdly, drag-and-drop works better than the "Match" feature in the article window. There are Open In Papers bookmarks for most browsers. Once you find a journal article, use this link to start Papers and open the link. Often, the application automatically reads the citation information from the webpage and imports it. Clicking on the PDF link within the article's web page imports that file.
  • Articles within Papers can collect supplementary files easily. One minor issue is that plain text files are not automatically imported, whereas PDF, CSV and other file formats are.
  • Papers does a great job of organizing the PDFs locally. I sync to Dropbox and have the same repository across different computers.
  • The bibtex export works well. This was invaluable when we were writing the book.
  • Their apps for tablets/mobile are easy to use and low maintenance. Syncing has not been an issue for me so far.

The bad news

  • Slooooow. It is really slow.
  • The search facilities for your PDF repository are not very powerful. This seems like it is a pretty low bar to jump over.
  • Keywords work but are manually added and the interface is pretty kludgy. I would love for this feature to work better. Hell, it wouldn't be difficult to automatically figure this out based on content (I wrote some rudimentary R/SQL code to do it on a really long plane flight once).
  • I have a small percentage of papers whose PDFs have gone missing.
  • I might accidentally import an article twice. In most cases, Papers doesn't tell me that I'm doing it until it is too late. Although they have a method for merging entries, I'd like to avoid this process beforehand.
  • They release versions with little to no testing. It is amazing, but basic functionality in new major versions simply does not work. It is remarkable in the worst way.
  • While they did win the Apple Design Award for the first version, the interface seems to be getting worse with every new release. The color scheme for Papers 3 makes me depressed. Literal 50 shades of gray.

The last two issues have driven me crazy. I don't see myself upgrading any time soon.
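As for the automatic keyword idea in the list above, the simplest version is just counting words. The sketch below is in Python and is not the R/SQL code mentioned earlier; the stop-word list, the sample abstract, and the `suggest_keywords` helper are all assumptions made for illustration.

```python
import re
from collections import Counter

# A tiny, assumed stop-word list; a realistic one would be much longer.
STOP_WORDS = {
    "the", "a", "an", "of", "and", "in", "for", "to",
    "is", "on", "with", "we", "that",
}


def suggest_keywords(text, n=3):
    """Return the n most frequent non-stop-words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]


# Invented abstract text, for illustration only.
abstract = (
    "We compare resampling methods for random forest models. "
    "Resampling with the bootstrap and resampling with cross-validation "
    "are compared for bias; the bootstrap shows more bias."
)

keywords = suggest_keywords(abstract, n=3)
print(keywords)
```

Raw counts favor long documents and common jargon, so a weighting scheme such as TF-IDF across the whole PDF library would probably be the next step.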

Typesetting

I use LaTeX for almost all articles that I write. It is a pain when working with others who have never used it (or heard of it) but it is worth it. Also, the power you get when using LaTeX with Sweave or knitr simply cannot be overstated. Apart from exporting bibtex from Papers, the other tools I use are:

  • Sublime Text: This is a great, lightweight editor that has some great add-ons for typesetting that integrate with Skim. Skim is pretty nice, but I would really like Sublime to work with OS X's Preview.
  • texpad is another good editor for OS X but, given the price, it might be difficult to argue that it is better than Sublime. It does hide a lot of the LaTeX junk that goes into typesetting a tex file but this is really a minor perk.

I gave a talk at ENAR last year related to this. We've since moved the book version control to github and have translated all of our Sweave code to knitr.

My Journal Feeds

In no particular order:

Comparing the Bootstrap and Cross-Validation

This is the second of two posts about the performance characteristics of resampling methods. The first post focused on the cross-validation techniques and this post mostly concerns the bootstrap.

Recall from the last post: we have some simulations to evaluate the precision and bias of these methods. I simulated some regression data (so that I know the real answers and compute the true estimate of RMSE). The model that I used was random forest with 1000 trees in the forest and the default value of the tuning parameter. I simulated 100 different data sets with 500 training set instances. For each data set, I also used each of the resampling methods listed above 25 times using different random number seeds. In the end, we can compute the precision and average bias of each of these resampling methods.

Question 3: How do the variance and bias change in the bootstrap?

First, let’s look at how the precision changes over the amount of data held-out and the training set size. We use the variance of the resampling estimates to measure precision.

Again, there shouldn't be any real surprise that the variance is decreasing as the number of bootstrap samples increases.

It is worth noting that compared to the original motivation for the bootstrap, which was to create confidence intervals for some unknown parameter, this application doesn't require a large number of replicates. Originally, the bootstrap was used to estimate the tail probabilities of the bootstrap distribution of some parameters. For example, if we want to get a 95% bootstrap confidence interval, one simple approach is to accurately measure the 2.5% and 97.5% quantiles. Since these are very extreme values, traditional bootstrapping requires a large number of bootstrap samples (at least 1,000). For our purposes, we want a fairly good estimate of the mean of the bootstrap distribution and this shouldn't require hundreds of resamples.
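The difference between estimating the center and estimating the tails can be illustrated with a small sketch. This is Python using only the standard library, with simulated normal data and the sample mean as the statistic; both are stand-ins chosen for brevity, not the simulation from this post, and the helper names are hypothetical.

```python
import random
import statistics


def bootstrap_statistics(data, n_boot, seed):
    """Return the sample mean computed on n_boot bootstrap resamples of data."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_boot)]


def percentile_interval(values, level=0.95):
    """Simple percentile interval from the sorted bootstrap statistics."""
    ordered = sorted(values)
    lo_idx = int(((1 - level) / 2) * (len(ordered) - 1))
    hi_idx = int(((1 + level) / 2) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


rng = random.Random(1)
data = [rng.gauss(10, 2) for _ in range(200)]  # simulated data, for illustration

few = bootstrap_statistics(data, n_boot=25, seed=2)
many = bootstrap_statistics(data, n_boot=2000, seed=3)

center_few = statistics.fmean(few)    # mean of the bootstrap distribution, 25 resamples
center_many = statistics.fmean(many)  # same quantity, 2,000 resamples
lower, upper = percentile_interval(many)  # the tails are where many resamples are needed

print(center_few, center_many, lower, upper)
```

With only 25 resamples, the 2.5% and 97.5% quantiles would each rest on roughly one observation, while the mean already averages over all 25.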

For the bias, it is fairly well-known that the naive bootstrap produces biased estimates. On average, a bootstrap sample contains about 63.2% of the distinct training set points, so the hold-out rate is about 36.8%. This is a random value in practice, but the mean hold-out percentage is not affected by the number of resamples. Our simulation confirms the large bias that doesn't move around very much (the y-axis scale here is very narrow when compared to the previous post):
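The 63.2% figure comes from a short calculation: with n training points, the chance that a given point is never drawn in n draws with replacement is (1 - 1/n)^n, which approaches exp(-1), or about 0.368, so roughly 63.2% of the distinct points end up in the bootstrap sample. A quick Python check, with function names that are just illustrative:

```python
import math
import random


def expected_holdout_fraction(n):
    """Probability that a given point is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n


def simulated_holdout_fraction(n, n_boot, seed):
    """Average fraction of points left out over n_boot simulated bootstrap samples."""
    rng = random.Random(seed)
    total_left_out = 0
    for _ in range(n_boot):
        in_bag = {rng.randrange(n) for _ in range(n)}  # distinct indices drawn
        total_left_out += n - len(in_bag)
    return total_left_out / (n * n_boot)


expected = expected_holdout_fraction(500)  # 500 matches the training set size used here
simulated = simulated_holdout_fraction(500, n_boot=200, seed=42)
limit = math.exp(-1)

print(expected, simulated, limit)
```

The number of resamples only changes how noisy the simulated average is; the expected fraction depends on n alone.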

Again, no surprises.

Question 4: How does this compare to repeated 10-fold CV?

We can compare the simple bootstrap to repeated 10-fold CV. For each method, we have relatively constant hold-out rates and matching configurations for the number of total resamples.

My initial thoughts would be that the naive bootstrap is probably more precise than repeated 10-fold CV but has much worse bias. Let's look at the bias first this time. As predicted, CV is much less biased in comparison:

Now for the variance:

To me, this is very unexpected. Based on these simulations, repeated 10-fold CV is superior in both bias and variance.

Question 5: What about the OOB RMSE estimate?

Since we used random forests as our learner we have access to yet another resampled estimate of performance. Recall that random forests build a large number of trees and each tree is based on a separate bootstrap sample. As a consequence, each tree has an associated "out-of-bag" (OOB) set of instances that were not used in the tree. The RMSE can be calculated for these trees and averaged to get another bootstrap estimate. Recall from the first post that we used 1,000 trees in the forest so the effective number of resamples is 1,000.

Wait - didn't I say above that we don't need very many bootstrap samples? Yes, but that is a different situation. Random forests require a large number of bootstrap samples. The reason is that random forests randomly sample from the predictor set at each split. For this reason, you need a lot of resamples to get stable prediction values. Also, if you have a large number of predictors and a small to medium number of training set instances, the number of resamples should be really large to make sure that each predictor gets a chance to influence the model sufficiently.
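The out-of-bag bookkeeping can be sketched without a forest at all. The Python below is only an illustration of the resampling mechanics: it swaps in a one-predictor least-squares line for the trees, uses simulated data, and averages the per-resample hold-out RMSE. It is not how the randomForest function computes its OOB error, and every name in it is hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (a stand-in learner, not a tree)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def oob_rmse(xs, ys, n_boot, seed):
    """Average the out-of-bag RMSE over n_boot bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    rmses = []
    for _ in range(n_boot):
        in_bag = [rng.randrange(n) for _ in range(n)]
        out_of_bag = sorted(set(range(n)) - set(in_bag))
        if not out_of_bag:
            continue  # extremely unlikely, but nothing to predict in that case
        a, b = fit_line([xs[i] for i in in_bag], [ys[i] for i in in_bag])
        sq_err = [(ys[i] - (a + b * xs[i])) ** 2 for i in out_of_bag]
        rmses.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(rmses) / len(rmses)


# Simulated data with noise standard deviation 1, so the RMSE should be near 1.
rng = random.Random(7)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1.0) for x in xs]

estimate = oob_rmse(xs, ys, n_boot=50, seed=8)
print(estimate)
```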

Let's look at the last two plots and add a line for the value of the OOB estimate. For variance:

and bias:

the results look pretty good (especially the bias). The great thing about this is that you get this estimate for free and it works pretty well. This is consistent with my other experiences too. For example, Figure 8.18 shows that the CV and OOB error rates for the usual random forest model track very closely for the solubility data.

The bad news is that:

  • This estimate may not work very well for other types of models. Figure 8.18 does not show nearly identical performance estimates for random forests based on conditional inference trees.
  • There are a limited number of performance estimates where the OOB estimate can be computed. This is mostly due to software limitations (as opposed to theoretical concerns). For the caret package, we can compute RMSE and R² for regression and accuracy and the Kappa statistic for classification but this is mostly by taking the output that the randomForest function provides. If you want to get the area under the ROC curve or some other measure, you are out of luck.
  • If you are comparing random forest’s OOB error rate with the CV error rate from another model, it may not be a very fair comparison.

Comparing Different Species of Cross-Validation

This is the first of two posts about the performance characteristics of resampling methods. I just had major shoulder surgery, but I've pre-seeded a few blog posts. More will come as I get better at one-handed typing.

First, a review:

  • Resampling methods, such as cross-validation (CV) and the bootstrap, can be used with predictive models to get estimates of model performance using the training set.
  • These estimates can be made to tune the model or to get a good sense of how the model works without touching the test set.

There are quite a few methods for resampling. Here is a short summary (more in Chapter 4 of the book):

  • k-fold cross-validation randomly divides the data into k blocks of roughly equal size. Each of the blocks is left out in turn and the other k-1 blocks are used to train the model. The held out block is predicted and these predictions are summarized into some type of performance measure (e.g. accuracy, root mean squared error (RMSE), etc.). The k estimates of performance are averaged to get the overall resampled estimate. k is 10 or sometimes 5. Why? I have no idea. When k is equal to the sample size, this procedure is known as Leave-One-Out CV. I generally don't use it and won't consider it here.
  • Repeated k-fold CV does the same as above but more than once. For example, five repeats of 10-fold CV would give 50 total resamples that are averaged. Note this is not the same as 50-fold CV.
  • Leave Group Out cross-validation (LGOCV), aka Monte Carlo CV, randomly leaves out some set percentage of the data B times. It is similar to making mini training and hold-out splits but only uses the training set.
  • The bootstrap takes a random sample with replacement from the training set B times. Since the sampling is with replacement, there is a very strong likelihood that some training set samples will be represented more than once. As a consequence of this, some training set data points will not be contained in the bootstrap sample. The model is trained on the bootstrap sample and those data points not in that sample are predicted as hold-outs.
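The four schemes in the list above differ only in how the row indices are split. Here is a minimal Python sketch of that index bookkeeping, using only the standard library; the function names are made up for illustration and this is not the caret implementation.

```python
import random


def kfold_splits(n, k, rng):
    """Yield (train, holdout) index lists for one round of k-fold CV."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k blocks of roughly equal size
    for held in folds:
        held_set = set(held)
        yield [i for i in idx if i not in held_set], held


def repeated_kfold_splits(n, k, repeats, rng):
    """Repeat k-fold CV with a fresh shuffle each time (k * repeats resamples)."""
    for _ in range(repeats):
        yield from kfold_splits(n, k, rng)


def lgocv_splits(n, holdout_frac, B, rng):
    """Leave-group-out (Monte Carlo) CV: hold out a random fraction, B times."""
    n_out = max(1, round(n * holdout_frac))
    for _ in range(B):
        held = rng.sample(range(n), n_out)
        held_set = set(held)
        yield [i for i in range(n) if i not in held_set], held


def bootstrap_splits(n, B, rng):
    """Bootstrap: sample n rows with replacement; rows never drawn are held out."""
    for _ in range(B):
        train = [rng.randrange(n) for _ in range(n)]
        in_bag = set(train)
        yield train, [i for i in range(n) if i not in in_bag]


rng = random.Random(123)
cv = list(repeated_kfold_splits(500, k=10, repeats=5, rng=rng))   # 50 resamples
lgo = list(lgocv_splits(500, holdout_frac=0.10, B=50, rng=rng))   # 50 resamples
boot = list(bootstrap_splits(500, B=50, rng=rng))                 # 50 resamples

print(len(cv), len(lgo), len(boot))
```

Note that within one repeat of k-fold CV every row is held out exactly once, whereas LGOCV and the bootstrap make no such guarantee.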

Which one should you use? It depends on the data set size and a few other factors. We statisticians tend to think about the operating characteristics of these procedures. For example, each of the methods above can be characterized in terms of their bias and precision.

Suppose that you have a regression problem and you are interested in measuring RMSE. Imagine that, for your data, there is some “true” RMSE value that a particular model could achieve. The bias is the difference between what the resampling procedure estimates your RMSE to be for that model and the true RMSE. Basically, you can think of it as accuracy of estimation. The precision measures how variable the result is. Some types of resampling have higher bias than others and the same is true for precision.
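In code, the two quantities are a one-liner each. This Python sketch uses invented numbers and a hypothetical `bias_and_variance` helper; bias is taken as the estimate minus the truth, so a positive value means the resampled RMSE is pessimistic.

```python
import statistics


def bias_and_variance(resampled_rmse, true_rmse):
    """Bias (mean estimate minus truth) and variance of a set of RMSE estimates."""
    bias = statistics.fmean(resampled_rmse) - true_rmse
    variance = statistics.variance(resampled_rmse)  # sample variance (n - 1 denominator)
    return bias, variance


# Made-up RMSE estimates from repeating one resampling scheme five times.
estimates = [2.1, 2.3, 2.2, 2.4, 2.0]
bias, variance = bias_and_variance(estimates, true_rmse=2.0)

print(bias, variance)
```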

Imagine that the true RMSE is the target we are trying to hit and suppose that we have four different types of resampling. This graphic is typically used when we discuss accuracy versus precision.

Clearly we want to be in the lower right.

Generally speaking, the bias of a resampling procedure is thought to be related to how much data is held out. If you hold-out 50% of your data using 2-fold CV, the thinking is that your final RMSE estimate will be more biased than one that held out 10%. On the other hand, the conventional wisdom is that holding less data out decreases precision since each hold-out sample has less data to get a stable estimate of performance (i.e. RMSE).

I ran some simulations to evaluate the precision and bias of these methods. I simulated some regression data (so that I know the real answers and compute the true estimate of RMSE). The model that I used was random forest with 1000 trees in the forest and the default value of the tuning parameter. I simulated 100 different data sets with 500 training set instances. For each data set, I also used each of the resampling methods listed above 25 times using different random number seeds. In the end, we can compute the precision and average bias of each of these resampling methods.

I won’t show the distributions of the precision and bias values across the simulations but use the median of these values. The median represents the distributions well and is simpler to visualize.

Question 1a and 1b: How do the variance and bias change in basic CV? Also, is it worth repeating CV?

First, let’s look at how the precision changes over the amount of data held-out and the training set size. We use the variance of the resampling estimates to measure precision.

First, a value of 5 on the x-axis is 5-fold CV and 10 is 10-fold CV. Values greater than 10 are repeated 10-fold (i.e. a 60 is six repeats of 10-fold CV). On the left-hand side of the graph (i.e. 5-fold CV), the median variance is 0.019. This measures how variable 5-fold CV is across all the simulated data sets.

There probably isn't any surprise here: if you measure additional replicates, the measured variance goes down. At some point the variance will level off but we are still gaining precision by repeating 10-fold CV more than once. Looking at the first two data points on the left (single 5-fold and 10-fold CV), the reduction in variance is probably due to how much is being left out (20% versus 10%) as well as the number of resamples (5 versus 10).

What about bias? The conventional wisdom is that the bias should be better for the 10-fold CV replicates since less is being left out in those cases. Here are the results:

From this, 5-fold CV is pessimistically biased and that bias is reduced by moving to 10-fold CV. Perhaps it is within the noise, but it would also appear that repeating 10-fold CV a few times can also marginally reduce the bias.

Question 2a and 2b: How does the amount held back affect LGOCV? Is it better than basic CV?

Looking at the leave-group-out CV results, the variance analysis shows an interesting pattern:

Visually at least, it appears that the amount held-out has a slightly bigger influence on the variance of the results than the number of times that the process is repeated. Leaving more out buys you better individual resampled RMSE values (i.e. more precision).

That's one side of the coin. What about the bias?

From this, LGOCV is overly pessimistic as you increase the amount held out. This could be because, with less data used to train the model, random forest has less substrate to create an accurate model. It is hard to say why the bias didn't flatten out towards zero when small amounts of data are left out.

Also, the number of held-out data sets doesn't appear to reduce the bias.

On these results alone, if you use LGOCV try to leave a small amount out (say 10%) and do a lot of replicates to control the variance. But... why not just do repeated 10-fold CV?

We have simulations where both LGOCV and 10-fold CV left out 10%. We can do a head-to-head comparison of these results to see which procedure seems to work better. Recall that the main difference between these two procedures is that repeated 10-fold CV splits the hold-out data points evenly within a fold. LGOCV just randomly selects samples each time. In each case, the same training set sample will show up in more than one of the hold-out data sets so the difference is more about configuration of samples.

Here are the variance curves:

That seems pretty definitive: all other things being equal, you gain about a log unit of precision using repeated 10-fold CV instead of LGOCV with a 10% hold-out.

The bias curves show a fair amount of noise (keeping in mind the scale of this graph compared to the other bias images above):

I would say that there is no real difference in bias and expected this prior to seeing the results. We are always leaving 10% behind and, if this is what drives bias, the two procedures should be about the same.

So my overall conclusion, so far, is that repeated 10-fold CV is the best in terms of variance and bias. As always, caveats apply. For example, if you have a ton of data, the precision and bias of 10- or even 5-fold CV may be acceptable. Your mileage may vary.

The next post will look at:

  • the variance and bias of the nominal bootstrap estimate
  • a comparison of repeated 10-fold CV to the bootstrap
  • the out-of-bag estimate of RMSE from the individual random forest model and how it compares to the other procedures.

EDIT: based on the comments, here is one of the simulation files. I broke them up to run in parallel on our grid but they are all the same (except the seeds). Here is the markdown file for the post if you want the plot code or are interested to see how I summarized the results.