useR! 2014 Highlights

  • My talk went well; here are the slides and a link to the paper pre-print.
  • Hadley Wickham gave an excellent tutorial on dplyr.
  • Based on the talk I saw, I think I will take the data sets from the book and make some public visualizations on the Plotly website.
  • There were a few presentations on interactive graphics that were very good (here, here and here).
  • Tal Galili gave an excellent talk on visualizing and analyzing hierarchical clustering algorithms.
  • The HeR session was great. I also learned about R-ladies.
  • Revolution Analytics and RStudio each proposed similar techniques for preserving package versions for better reproducibility.
  • The subsemble package by Erin LeDell was impressive.
  • I had dinner with John Chambers, which was pretty cool too.

New caret version with adaptive resampling

A new version of caret is on CRAN now.

There are a number of bug fixes and other changes:

  • A man page with the list of models available via train was added back into the package. See ?models.
  • Thoralf Mildenberger found and fixed a bug in the variable importance calculation for neural network models.
  • The output of varImp for pamr models was updated to clarify the ordering of the importance scores.
  • getModelInfo was updated to generate a more informative error message if the user looks for a model that is not in the package's model library.
  • A bug was fixed related to how seeds were set inside of train.
  • The model "parRF" (parallel random forest) was added back into the library.
  • When case weights are specified in train, the hold-out weights are now exposed when computing the summary function (a sketch of such a function follows this list).
  • A check was made to convert a data.table given to train to a data frame (see this post).
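
To illustrate the case-weight change above, here is a minimal sketch of a custom summary function that uses the hold-out weights. The weights column is what caret exposes to the summary function when case weights are used; the weighted-RMSE metric itself is just an illustration:

```r
## A summary function for trainControl(); when case weights are given
## to train(), `data` carries a "weights" column for the hold-out rows.
weightedRMSE <- function(data, lev = NULL, model = NULL) {
  wts <- if (is.null(data$weights)) rep(1, nrow(data)) else data$weights
  c(wRMSE = sqrt(weighted.mean((data$obs - data$pred)^2, wts)))
}

## Usage: pass it to train() via
##   trainControl(summaryFunction = weightedRMSE)
```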

One big new feature is that adaptive resampling can now be used. I'll be speaking about this at useR! this year. The manuscript is being submitted; in the meantime, a pre-print is available on arXiv.

Basically, after a minimum number of resamples have been processed, not all tuning parameter values are treated equally: those that are unlikely to be optimal are ignored as resampling proceeds. This can produce substantial speed-ups, with a low probability of settling on a poor model. Here is a plot of the median speed-up (y-axis) versus the estimated probability (x-axis) that a model at least as good as the one found using all of the resamples is identified.

[Figure: example_adaptive.jpg, median speed-up versus the probability of finding a model at least as good as the one from the full set of resamples]

The manuscript has more details about the other factors in the graph. One nice property of this methodology is that, when combined with parallel processing, the speed-ups could be as high as 30-fold (for the simulated example).
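
As a rough sketch of how this is requested in the new version (the adaptive settings shown reflect my understanding of the defaults, and the SVM/iris example is purely illustrative):

```r
library(caret)  # method = "svmRadial" also requires the kernlab package

## Adaptive cross-validation: after `min` resamples, tuning parameter
## values that are unlikely to be optimal are dropped. method = "gls"
## filters using a linear-model approach; "BT" uses a Bradley-Terry
## model. complete = TRUE fills in the full set of resamples for the
## final parameter values.
ctrl <- trainControl(method = "adaptive_cv",
                     number = 10, repeats = 5,
                     adaptive = list(min = 5, alpha = 0.05,
                                     method = "gls", complete = TRUE))

set.seed(1)
mod <- train(Species ~ ., data = iris,
             method = "svmRadial",
             tuneLength = 8,
             trControl = ctrl)
```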

These features should be considered experimental. Send me any feedback on them that you may have.

A Tutorial and Talk at useR! 2014 [Important Update]

See the update below.

I'll be doing a morning tutorial at useR! at the end of June in Los Angeles. I've done this same presentation at the last few conferences and this will probably be the last time for this specific workshop.

The tutorial outline is below (a short code sketch of the overall workflow follows the list):

  • Conventions in R
  • Data splitting and estimating performance
  • Data pre-processing
  • Over-fitting and resampling
  • Training and tuning tree models
  • Training and tuning a support vector machine
  • Comparing models (as time allows)
  • Parallel processing (as time allows)
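
For a flavor of what the tutorial covers, here is a minimal caret workflow sketch; the data set and settings are illustrative, not the tutorial's actual examples:

```r
library(caret)  # svmRadial also requires the kernlab package

## Data splitting: a stratified 75/25 train/test split
set.seed(123)
in_train <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
training <- iris[ in_train, ]
testing  <- iris[-in_train, ]

## Resampling, pre-processing, and tuning a support vector machine
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
svm_fit <- train(Species ~ ., data = training,
                 method = "svmRadial",
                 preProc = c("center", "scale"),
                 tuneLength = 8,
                 trControl = ctrl)

## Estimating performance on the held-out data
confusionMatrix(predict(svm_fit, testing), testing$Species)
```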

I'm also giving a talk called "Adaptive Resampling in a Parallel World":

Many predictive models require parameter tuning. For example, a classification tree requires the user to specify the depth of the tree. This type of "meta parameter" or "tuning parameter" cannot be estimated directly from the training data. Resampling (e.g. cross-validation or the bootstrap) is a common method for finding reasonable values of these parameters (Kuhn and Johnson, 2013). Suppose B resamples are used with M candidate values of the tuning parameters. This can quickly increase the computational complexity of the task. Some of the M models could be disregarded early in the resampling process due to poor performance. Maron and Moore (1997) and Shen et al. (2011) describe methods to adaptively filter which models are evaluated during resampling, reducing the total number of model fits. However, model parameter tuning is an "embarrassingly parallel" task; model fits can be calculated across multiple cores or machines to reduce the total training time. With the availability of parallel processing, is it still advantageous to adaptively resample?

This talk will briefly describe adaptive resampling methods and characterize their effectiveness using parallel processing via simulations.
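
For context, the "parallel world" part refers to running the resampling loop across workers. A minimal sketch using one common backend (the worker count here is arbitrary):

```r
library(doParallel)

## Register a parallel backend; once one is registered, caret's train()
## automatically farms the model fits for the resamples and tuning
## values out to the workers.
cl <- makeCluster(4)
registerDoParallel(cl)

## ... calls to train() here run in parallel ...

stopCluster(cl)
```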

The [conference website](http://user2014.stat.ucla.edu) has been updated to say:

This year there is no separate registration process or extra fee for attending tutorials.

UPDATE! Since this is the case, I won't be giving out the book to all the attendees as I originally intended. However, the conference does supply tutorial presenters with a stipend, so I will use that to purchase about a dozen copies and randomly distribute them to those who attend.

Sorry for the confusion...