Exercise Solutions

I'm finally recovering from the summer and will start posting again soon.

In the meantime, Kjell and I have made some progress on the exercise solutions. We'd like some feedback from readers and instructors on how to release them:

Hello,
Thank you for contacting Max and me regarding solutions to exercises in Applied Predictive Modeling.  We took a needed break after completing the manuscript, and we're now working on compiling the exercise solutions.
Over the past year, we have had requests for solutions from practitioners who are applying these techniques and approaches in their work, professors who are using the text in their classes, and students who are taking a course which uses APM as a text.
Given that the text is being used in courses and that the exercises may be assigned as part of the coursework, we'd like to be sensitive to those professors and not completely divulge our solutions.  At the same time, we would also like to provide as many solutions as possible for practitioners who are working through the exercises to improve their skills.  These are clearly two different audiences with different needs.
That said, given the general nature of predictive modeling and the way we have worded many of the exercises, there are often several possible "correct" solutions, not just the ones that we provide.
Springer, our publisher, has suggested one possible approach: hosting a password-protected website for the exercise solutions.  Another potential path would be to provide solutions to selected exercises (e.g., the odd-numbered ones).  At the other extreme, we could post all of the solutions.
With this background in mind, we have some questions for you:
For those of you who are professors using APM as a text, what approach to exercise solutions would be most useful to you?  Would you be negatively impacted if we published solutions to all exercises?
For those of you who are practitioners, what approach would be the best alternative (to publishing all solutions) for you, given that the text is being used in courses?
We welcome your feedback and other ideas you have for providing exercise solutions.
Thanks, and we look forward to hearing from you.
Best regards,
Kjell (kjell@arboranalytics.com) and Max (max.kuhn@pfizer.com)

useR! 2014 Highlights

  • My talk went well; here are the slides and a link to the paper pre-print.
  • Hadley Wickham gave an excellent tutorial on dplyr.
  • Based on the talk I saw, I think I will take the data sets from the book and make some public visualizations on the Plotly website.
  • There were a few presentations on interactive graphics that were very good (here, here and here).
  • Tal Galili gave an excellent talk on visualizing and analyzing hierarchical clustering algorithms.
  • The HeR session was great. I also learned about R-ladies.
  • Revolution Analytics and RStudio proposed two similar techniques for preserving package versions for better reproducibility.
  • The subsemble package by Erin LeDell was impressive.
  • I had dinner with John Chambers, which was pretty cool too.

New caret version with adaptive resampling

A new version of caret is on CRAN now.
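If you want to try it, updating is the usual CRAN one-liner:

```r
# Install (or update to) the latest CRAN release of caret
install.packages("caret")

# Confirm which version is now installed
packageVersion("caret")
```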

There are a number of bug fixes and smaller updates:

  • A man page with the list of models available via train was added back into the package. See ?models.
  • Thoralf Mildenberger found and fixed a bug in the variable importance calculation for neural network models.
  • The output of varImp for pamr models was updated to clarify the ordering of the importance scores.
  • getModelInfo was updated to generate a more informative error message if the user looks for a model that is not in the package's model library (see the sketch after this list).
  • A bug was fixed related to how seeds were set inside of train.
  • The model "parRF" (parallel random forest) was added back into the library.
  • When case weights are specified in train, the hold-out case weights are now exposed to the summary function.
  • A check was made to convert a data.table given to train to a data frame (see this post).
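
As a quick illustration of the model lookup mentioned above, here is a minimal sketch using the standard caret interface (the model code "parRF" is the one from the list above):

```r
library(caret)

# getModelInfo() with no arguments returns every model that train()
# knows about; ?models documents the same list
mods <- getModelInfo()
length(mods)

# Look up a single model by its exact code. Asking for a model that
# is not in the library now fails with a more informative error.
parrf <- getModelInfo("parRF", regex = FALSE)
parrf$parRF$parameters
```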

One big new feature is adaptive resampling. I'll be speaking about this at useR! this year. I'm also submitting a manuscript; in the meantime, a pre-print is available on arXiv.

Basically, after a minimum number of resamples have been processed, the tuning parameter values are no longer treated equally: values that are unlikely to be optimal are dropped as resampling proceeds. This can produce substantial speed-ups, with a low probability of ending up with a poor model. Here is a plot of the median speed-up (y-axis) versus the estimated probability that a model at least as good as the one found using all of the resamples will be identified.

[Figure: median speed-up (y-axis) versus the estimated probability of finding a model at least as good as the one from the full set of resamples]

The manuscript has more details about the other factors in the graph. One nice property of this methodology is that, when combined with parallel processing, the speed-ups could be as high as 30-fold (for the simulated example).
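
Here is a minimal sketch of turning adaptive resampling on through trainControl(); the adaptive settings below are illustrative values, not recommendations:

```r
library(caret)

# "adaptive_cv" is the adaptive analog of (repeated) cross-validation
ctrl <- trainControl(
  method = "adaptive_cv",
  number = 10,
  repeats = 5,
  adaptive = list(
    min = 5,         # resamples to complete before any filtering starts
    alpha = 0.05,    # confidence level used to eliminate parameter settings
    method = "gls",  # filtering model: "gls" or "BT" (Bradley-Terry)
    complete = TRUE  # resample the surviving setting on the full schedule
  )
)

set.seed(825)
mod <- train(Species ~ ., data = iris,
             method = "svmRadial",
             tuneLength = 8,
             trControl = ctrl)
```

As noted above, combining this with parallel processing (e.g., registering workers via the doParallel package) is where the largest speed-ups were seen.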

These features should be considered experimental, so please send me any feedback that you may have.