Some Thoughts on "Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?"

Sorry for the blogging break. I’ve got a few posts planned for the next few weeks based on some work I’ve been doing.

In the meantime, you should check out “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” by Manuel Fernández-Delgado and colleagues in JMLR. They took a large number of classifiers and ran them against a large number of data sets from UCI. I was a reviewer on this paper (it heavily relies on caret) and have been interested in seeing people’s reactions once it was made public.

My thoughts:

Before reading this manuscript, take a few minutes and read this oldie by David Hand.
Obviously, it pays to tune your model. That is not the most profound statement to make but the paper does a good job of quantifying the impact of tuning the hyper-parameters.
Random forest takes the top spot on a regular basis. I was surprised by this since, in my experience, boosting does better and bagging does almost as well.
The authors believe that the parallel version of random forest in caret had an edge in performance. That’s hard to believe since it doesn’t do anything more than split the forest across different cores. That’s it. I took it out of the package for a bit because, if you are tuning the model, parallelizing the cross-validation is faster. I put it back in a few versions ago since I knew people would want it after reading this manuscript.
I was hoping that the authors would take a better graphical and analytical approach to comparing the methods. Table after table numbs my soul.
Despite the number of models used, it would have been nice to see more emphasis on recent deep-learning models and boosting via the gbm package.
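
To make the point about the parallel random forest concrete, here is a minimal sketch of what "splitting the forest across cores" amounts to. It is not caret's code: it is written in Python, uses toy one-split "trees" on a made-up one-dimensional data set, and uses threads as a stand-in for cores. All function names are hypothetical. The idea it illustrates is that each worker grows its share of the trees independently and the pieces are simply concatenated, so the result is the same kind of forest you would get from one worker growing every tree.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def majority(labels):
    """Return the majority 0/1 label (ties go to 1)."""
    return int(2 * sum(labels) >= len(labels))


def fit_stump(rows, rng):
    """Fit a toy one-split 'tree' on a bootstrap sample of (x, label) rows."""
    sample = [rng.choice(rows) for _ in rows]
    threshold = rng.choice(sample)[0]
    left = [y for x, y in sample if x <= threshold]   # never empty
    right = [y for x, y in sample if x > threshold]   # may be empty
    left_label = majority(left)
    right_label = majority(right) if right else left_label
    return threshold, left_label, right_label


def fit_sub_forest(rows, n_trees, seed):
    """Grow one worker's share of the forest with its own random stream."""
    rng = random.Random(seed)
    return [fit_stump(rows, rng) for _ in range(n_trees)]


def fit_forest_in_parallel(rows, n_trees, n_workers):
    """Split the tree budget across workers, then concatenate the pieces."""
    share, extra = divmod(n_trees, n_workers)
    counts = [share + (i < extra) for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(fit_sub_forest, [rows] * n_workers, counts,
                         range(n_workers))
        return [tree for part in parts for tree in part]


def predict(forest, x):
    """Majority vote over every tree, regardless of which worker grew it."""
    votes = [left if x <= t else right for t, left, right in forest]
    return majority(votes)


# Made-up data: the label is 1 when x is at least 5.0.
rows = [(x / 10, int(x >= 50)) for x in range(100)]
forest = fit_forest_in_parallel(rows, n_trees=50, n_workers=4)
```

Nothing in the combined forest depends on how many workers grew it, which is why a gain in accuracy from the parallel version would be surprising; the only thing that should change is the wall-clock time.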

Exercise Solutions

I'm finally recovering from the summer and will start posting again soon.

In the meantime, Kjell and I have made some progress on the exercise solutions. We'd like some feedback from the readers and instructors on how to release them:

Hello,

Thank you for contacting Max and me regarding solutions to exercises in Applied Predictive Modeling. We took a needed break after completing the manuscript, and we're now working on compiling the exercise solutions.

Over the past year, we have had requests for solutions from practitioners who are applying these techniques and approaches in their work, professors who are using the text in their classes, and students who are taking a course which uses APM as a text.

Given that the text is being used in courses and that the exercises may be assigned as part of the coursework, we'd like to be sensitive to those professors and not completely divulge our solutions. At the same time, we would also like to provide as many solutions as possible for practitioners who are working through the exercises to improve their skills. These are clearly two different audiences with different needs.

That said, the general nature of predictive modeling and the way we have worded many of the exercises provide the potential for a number of possible "correct" solutions--not just the solutions that we provide.

Springer, our publisher, has suggested one possible solution by offering to host a password-protected website for exercise solutions. Another potential path would be to provide solutions to select exercises (e.g. odd-numbered exercises). At the other extreme, we could post all solutions.

In the context of this background, we have some questions for you:

For those of you who are professors using APM as a text, what approach to exercise solutions would be most useful to you? Would you be negatively impacted if we published solutions to all exercises?

For those of you who are practitioners, what approach would be the best alternative (to publishing all solutions) for you, given that the text is being used in courses?

We welcome your feedback and other ideas you have for providing exercise solutions.

Thanks, and we look forward to hearing from you.

Best regards,
Kjell (kjell@arboranalytics.com) and Max (max.kuhn@pfizer.com)

useR! 2014 Highlights

My talk went well; here are the slides and a link to the paper pre-print.
Hadley Wickham gave an excellent tutorial on dplyr.
Based on the talk I saw, I think I will take the data sets from the book and make some public visualizations on the Plotly website.
There were a few presentations on interactive graphics that were very good (here, here and here).
Tal Galili gave an excellent talk on visualizing and analyzing hierarchical clustering algorithms.
The HeR session was great. I also learned about R-ladies.
Revolution Analytics and RStudio proposed two similar techniques for preserving versions for better reproducibility.
The subsemble package by Erin LeDell was impressive.
I had dinner with John Chambers, which was pretty cool too.