A Tutorial and Talk at useR! 2014 [Important Update]

See the update below

I'll be doing a morning tutorial at useR! at the end of June in Los Angeles. I've done this same presentation at the last few conferences and this will probably be the last time for this specific workshop.

The tutorial outline is:

  • Conventions in R
  • Data splitting and estimating performance
  • Data pre-processing
  • Over-fitting and resampling
  • Training and tuning tree models
  • Training and tuning a support vector machine (see the code sketch after this outline)
  • Comparing models (as time allows)
  • Parallel processing (as time allows)
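
To give a flavor of the workflow, here is a minimal sketch of data splitting and SVM tuning with caret. It uses the built-in iris data purely as a stand-in; the actual tutorial examples and settings may differ.

```r
library(caret)

## Split the data into training and test sets, stratified by the outcome
set.seed(1)
in_train <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
training <- iris[ in_train, ]
testing  <- iris[-in_train, ]

## Resampling scheme used to tune the model
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

## Tune a radial basis function SVM over a small grid of cost values
svm_fit <- train(Species ~ ., data = training,
                 method = "svmRadial",
                 preProc = c("center", "scale"),
                 tuneLength = 8,
                 trControl = ctrl)

## Evaluate the tuned model on the held-out test set
confusionMatrix(predict(svm_fit, testing), testing$Species)
```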

I'm also giving a talk called "Adaptive Resampling in a Parallel World":

Many predictive models require parameter tuning. For example, a classification tree requires the user to specify the depth of the tree. This type of "meta parameter" or "tuning parameter" cannot be estimated directly from the training data. Resampling (e.g. cross-validation or the bootstrap) is a common method for finding reasonable values of these parameters (Kuhn and Johnson, 2013). Suppose B resamples are used with M candidate values of the tuning parameters; this requires B × M model fits and can quickly increase the computational complexity of the task. Some of the M models could be disregarded early in the resampling process due to poor performance. Maron and Moore (1997) and Shen et al. (2011) describe methods to adaptively filter which models are evaluated during resampling, reducing the total number of model fits. However, model parameter tuning is an "embarrassingly parallel" task; model fits can be calculated across multiple cores or machines to reduce the total training time. With the availability of parallel processing, is it still advantageous to adaptively resample?

This talk will briefly describe adaptive resampling methods and use simulations to characterize their effectiveness when combined with parallel processing.
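
For context, here is a rough sketch of how adaptive resampling can be combined with parallel processing in caret. It assumes a caret version that supports the `adaptive_cv` option and the doParallel backend, and it does not reproduce the talk's simulations.

```r
library(caret)
library(doParallel)

## Start a parallel backend; caret will farm the model fits out to the workers
cl <- makeCluster(4)
registerDoParallel(cl)

## Adaptive resampling: poorly performing tuning-parameter candidates are
## dropped as the resamples accumulate
adapt_ctrl <- trainControl(method = "adaptive_cv",
                           number = 10, repeats = 5,
                           adaptive = list(min = 5,       # resamples before any filtering
                                           alpha = 0.05,  # confidence level for filtering
                                           method = "gls",
                                           complete = TRUE))

set.seed(2)
svm_adapt <- train(Species ~ ., data = iris,
                   method = "svmRadial",
                   tuneLength = 10,
                   trControl = adapt_ctrl)

stopCluster(cl)
```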

The [conference website](http://user2014.stat.ucla.edu) has been updated to say:

This year there is no separate registration process or extra fee for attending tutorials.

UPDATE! Since this is the case, I won't be giving out the book to all of the attendees as I originally intended. However, the conference does supply tutorial presenters with a stipend, so I will use that to purchase about a dozen copies and randomly distribute them to whoever attends.

Sorry for the confusion...

Cross-validation pitfalls when selecting and assessing regression and classification models

Damjan Krstajic and friends have a great paper on pitfalls of cross-validation. Although the paper uses chemistry data, the meat of the article is broadly applicable. It does a great job of illustrating different resampling approaches and I learned more about double and nested cross-validation.
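
As a rough illustration of the nested idea (my own sketch, not the paper's exact procedure): an outer loop of folds is used only for assessment, while an inner resampling loop on each outer training set selects the tuning parameters.

```r
library(caret)

set.seed(3)
outer_folds <- createFolds(iris$Species, k = 10, returnTrain = TRUE)

outer_acc <- sapply(outer_folds, function(idx) {
  analysis   <- iris[ idx, ]   # outer training set
  assessment <- iris[-idx, ]   # outer hold-out, untouched by tuning

  ## Inner resampling chooses the tuning parameters
  inner_ctrl <- trainControl(method = "cv", number = 10)
  fit <- train(Species ~ ., data = analysis,
               method = "svmRadial", tuneLength = 5,
               trControl = inner_ctrl)

  ## Assess the tuned model on the outer hold-out
  mean(predict(fit, assessment) == assessment$Species)
})

mean(outer_acc)   # nested CV estimate of accuracy
```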

Figure 10 surprised me; I had assumed that the precision of resampled estimates is mostly driven by the number of resamples. For example, a resampled estimate of the RMSE using 64 resamples has a standard error of sd/8, which is half that of an estimate using 16 resamples (sd/4). In their work, the variation across 50 repeats of 10-fold CV is much smaller than across 50 repeats of 10-fold nested CV.
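
A quick back-of-the-envelope check of that scaling, using a made-up within-resample standard deviation rather than anything from the paper:

```r
## Standard error of a resampled estimate scales as sd / sqrt(B)
sd_resample <- 1            # hypothetical sd of the per-resample RMSE values
B <- c(16, 64)
sd_resample / sqrt(B)       # 0.250 with 16 resamples, 0.125 with 64
```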

Finally, the article has an excellent historical summary of the pivotal papers on this subject and does a great job of labeling and articulating the different goals that one might have when resampling predictive models.