This page will include corrections (things that are flat out wrong) and updates (incorrect now, but not at the time the book was published).
Chapter 3
- Page 34 (Correction): Reader Alistair Bloomfield astutely pointed out that the description of the spatial sign as "each sample is divided by its squared norm" is consistent with the equation shown but inconsistent with the truth. "Squared norm" was meant to mean "Euclidean norm", which involves a square root (caret does the correct math). The equation's denominator should be:
||x|| = sqrt(sum_j x_ij^2)
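As a quick check of the corrected denominator, here is a minimal base R sketch of the spatial sign transformation. It assumes the predictors have already been centered and scaled. The helper name spatial_sign and the toy matrix are made up for illustration and do not come from the book; caret ships its own spatialSign function.

```r
# Hypothetical helper (not from the book): divide each row (sample) by its
# Euclidean norm, sqrt(sum_j x_ij^2), rather than by the squared norm.
spatial_sign <- function(x) {
  x <- as.matrix(x)
  norms <- sqrt(rowSums(x^2))
  # A vector of length nrow(x) recycles down the columns, so element [i, j]
  # is divided by norms[i]. Rows that are all zero would produce NaN.
  x / norms
}

# Toy data, made up for illustration
x <- matrix(c( 3, 4,
               1, 0,
              -2, 2), ncol = 2, byrow = TRUE)

ss <- spatial_sign(x)
ss

# After the transformation every sample has unit length
rowSums(ss^2)
```

Dividing by the squared norm instead would leave samples at different distances from the origin, which defeats the purpose of the transformation.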
Chapter 4
- Page 87 (Update): If run with the current version of caret, the code to make predictedProbs will fail. The SVM function in the kernlab package requires the option prob.model = TRUE to be set in order to produce class probabilities. Prior to version 5.15-60 of caret, train would automatically use this option. It turns out that this was not the best idea. Currently, to get class probabilities you need to either pass prob.model = TRUE into the train function or use classProbs = TRUE within trainControl. Thanks to N Jog for alerting us to this.
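The second option can be sketched as follows. This is an illustration, not the book's code: it assumes the caret and kernlab packages are installed, and it uses a two-class subset of the built-in iris data in place of the credit data from the chapter. The resampling settings and tuneLength are arbitrary choices to keep the run short.

```r
library(caret)  # method = "svmRadial" also needs the kernlab package

set.seed(1)

# Illustrative data only: a two-class subset of iris
twoClass <- droplevels(subset(iris, Species != "setosa"))

# Ask for class probabilities through trainControl
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE)

svmFit <- train(Species ~ ., data = twoClass,
                method = "svmRadial",
                preProc = c("center", "scale"),
                tuneLength = 3,
                trControl = ctrl)

# With classProbs = TRUE, type = "prob" returns one column per class
predictedProbs <- predict(svmFit, newdata = twoClass, type = "prob")
head(predictedProbs)
```

Note that classProbs = TRUE requires the class levels to be valid R variable names, which is the case for the levels used here.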
Chapter 5
- Page 95 (Correction): The book has "The mean squared error (MSE) is calculated by squaring the residuals and summing them." This should be: "The mean squared error (MSE) is calculated by squaring the residuals, summing them and dividing by the number of samples." Thanks to Ed Kadyszewski.
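The corrected definition can be checked with a few lines of base R. The observed and predicted values below are made up for illustration.

```r
# Made-up values for illustration
observed  <- c(3.0, 5.0, 2.5, 7.0)
predicted <- c(2.5, 5.0, 4.0, 8.0)

res <- observed - predicted

# Squaring and summing the residuals gives the sum of squared errors only
sse <- sum(res^2)

# Dividing by the number of samples gives the mean squared error
mse <- sse / length(res)

sse  # 3.5
mse  # 0.875
```

The same value is returned by mean(res^2), which makes the "mean" in the name explicit.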
Chapter 6
Page 132 (Correction): The linear model with the filtered predictors (i.e. lmFiltered) should use the data frame trainXfiltered instead of the one shown in the text (solTrainXtrans). However, the corresponding analysis shown in section 6.2 is correct. (found by Benjamin De Baets, incorrect correction found by Hannah Schlotter)
Page 138 (Correction): The text for exercise 6.1(b) should read "the total number of predictors (100)" (found by Pekka Aalto).
Page 139 (Correction): The code to load the chemical manufacturing process data is incorrect (found by Birol Emir). The correct code is:
> library(AppliedPredictiveModeling)
> data(ChemicalManufacturingProcess)
Chapter 7
Page 146 (Clarification): The "first term" of the equation is the intercept. When describing the MARS model, the section of the equation with h(MolWeight − 5.94516) is associated with the part of the curve in Fig. 7.3 where the molecular weight is greater than 5.94516, since that hinge function is zero everywhere else. This is what we called the "second term". The "last term" of the equation has h(5.94516 − MolWeight) and corresponds to the area where the weight is less than 5.94516.
Page 155 (Correction): The text in the last paragraph states that "A linear regression model with an intercept and a term for sin(x) was fit to the model (solid black line)". In Figure 7.7, the least squares line in the bottom panel is, in fact, red. Thanks to Roberto Vitillo.
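For the Page 146 clarification, the region where each MARS term is active can be verified with a small base R sketch. It assumes the usual hinge definition h(x) = max(0, x); the variable names and the three example weights are made up for illustration, and the model coefficients are left out.

```r
# Hinge function, assuming the usual definition h(x) = max(0, x)
h <- function(x) pmax(0, x)

knot <- 5.94516

# Made-up molecular weights: below, at, and above the cut point
molWeight <- c(4, 5.94516, 7)

right_hinge <- h(molWeight - knot)  # nonzero only where MolWeight > knot
left_hinge  <- h(knot - molWeight)  # nonzero only where MolWeight < knot

data.frame(molWeight, right_hinge, left_hinge)
```

Each sample activates at most one of the two hinge terms, which is why each term can be tied to one side of the cut point in Fig. 7.3.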
Chapter 8
Page 182 (Correction): The reference for "Carolin et al 2007" is really for Strobl, C., Boulesteix, A.-L. & Augustin, T., 2007. Unbiased split selection for classification trees based on the Gini Index. Computational Statistics & Data Analysis, 52(1), pp.483–501. Thanks to Rainer Stuts.
Page 216 (Correction): The phrase "ntree for the number of bootstrap samples" should be "ntrees for the number of bootstrap samples". (found by Benjamin De Baets)
Chapter 11
Page 251 (Correction): The text states "The top panel of Fig. 11.6 shows histograms." It should read Fig 11.3. Found by Birol Emir.
Page 258 (Correction): Ingo Peter noted that the first part of the equation for PPV with balanced prevalence should not have a multiplier in the denominator, i.e.
PPV = Sensitivity / (Sensitivity + ( 1 – Specificity ))
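The balanced-prevalence form can be checked against the general formula in base R. The general expression below is the standard Bayes' rule form of PPV, stated here as an assumption about the book's equation. The function names and the sensitivity and specificity values are made up for illustration.

```r
# General PPV via Bayes' rule (assumed to match the book's general equation)
ppv <- function(sens, spec, prev) {
  (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
}

# Balanced prevalence (prev = 0.5): the prevalence terms cancel,
# leaving no multiplier in the denominator
ppv_balanced <- function(sens, spec) {
  sens / (sens + (1 - spec))
}

# Made-up values for illustration
sens <- 0.9
spec <- 0.8

ppv(sens, spec, prev = 0.5)
ppv_balanced(sens, spec)
all.equal(ppv(sens, spec, prev = 0.5), ppv_balanced(sens, spec))
```

Both expressions agree at a prevalence of 0.5 because the factor of 0.5 appears in every term of the general formula and divides out.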
Chapter 12
Page 291 (Correction): The arrow is black, not red. Found by Birol Emir.
Page 292-293 (Correction): The sentence spanning the pages should start as "When there are more predictors than samples". Found by several people.
Page 325 (Correction): The hepatic data are in the AppliedPredictiveModeling R package, not caret. Found by Birol Emir.
Chapter 13
Page 333 (Correction): The subscripts for y in Eq. 13.3 and the equation directly above that should be il and not ii. Thanks to "Anonymous".
Page 342 (Correction): The text has the wrong interpretation for the binary predictors. The effects of a Saturday submission and an unknown sponsor increase the probability of a successful grant while others listed have a negative effect (i.e. they decrease the probability). Thanks to "Anonymous".
Page 357 (Correction): The curve for the normal distribution is dark blue and not black. Thanks to "Anonymous".
Chapter 14
Page 393 (Correction): The images in Figures 14.10 and 14.11 are switched. The code in the AppliedPredictiveModeling package is correct (the images were mislabeled during book production). Found by Kent Johnson.
Page 411 (Correction): The hepatic data are in the AppliedPredictiveModeling R package, not caret. Found by Birol Emir.
Chapter 16
Section 16.8 Cost-Sensitive Training (Update): A recent update to caret contains methods for support vector machines, CART trees and C5.0 that include the cost value (or weight) as a tuning parameter that can be directly optimized during training.
Section 16.9 Computing, page 435 (Update): the DWD package is no longer on CRAN. The insurance file containing the data can be found here. Download this file and use load("ticdata.rda") to access the data instead of loading it from the DWD package.
Chapter 17
Page 448 (Correction): In the first paragraph, the text should read "fast (1–5m)" instead of "fast (1–50m)". Thanks to Rafael Wampfler.
Page 450 (Correction): The caption for Figure 17.2 should be "A mosaic plot". Thanks to Rafael Wampfler.
- Page 457 (Correction): The data are loaded using data(schedulingData) and not data(HPC). Thanks to Ross Quinlan for finding this error.
Chapter 18
- Page 470 (Correction): The text should read "Recall that the odds of a probability are p/(1 − p)." Found by Birol Emir.
Chapter 19
Page 493 (Correction): In the first paragraph (under the quote), the text should read "or the area under the ROC curve" and not "or the error under the ROC curve". Thanks to Rafael Wampfler.
Page 507 (Correction): In the last paragraph, "The y-axis has similar values across models, but these were calculated from ROC curves using predictions on the test set of 66 subjects" should refer to the other axis, as in "The x-axis has similar values across models, but these were calculated from ROC curves using predictions on the test set of 66 subjects". Thanks to Rafael Wampfler.
Chapter 20
Page 526 (Correction): The first line on the page should read "predictors were either continuous or binary". Thanks to Rafael Wampfler.
Page 526 (Correction): The y-axis on Figure 20.2 should be reformatted as "R^2 (Test Set)". Rafael Wampfler found that one too.
Page 527 (Correction): The words "may be the produce of some external process" should be "may be the product of some external process". Rafael Wampfler gets the Chapter 20 hat trick.
Appendix A
- Page 550 (Correction): In the "Allows n < p" column, the "Elastic net/lasso" row should have a check, meaning that they are well suited for high dimensional data. Thanks to Roy Kamimura for catching the error.
Page 550 (Enhance!): There is no legend for the codes used in the table. These are:
CS = centering and scaling
NZV = remove near-zero variance predictors
Corr = remove highly correlated predictors