The corresponding area under the ROC curve is 0.8, which is nearly as good as the second component. A simple transformation based on visually exploring the data can do just as good of a job as an unbiased empirical algorithm.
These data are from the cell segmentation experiment of Hill et al, and predictor A is the "surface of a sphere created from by rotating the equivalent circle about its diameter" (labeled as EqSphereAreaCh1
in the data) and predictor B is the perimeter of the cell nucleus (PerimCh1
). A specialist in high content screening might naturally take the ratio of these two features of cells because it makes good scientific sense (I am not that person). In the context of the problem, their intuition should drive the feature engineering process.
However, in defense of an algorithm such as PCA, the machine has some benefit. In total, there are almost sixty predictors in these data whose features are just as arcane as EqSphereAreaCh1
. My personal favorite is the "Haralick texture measurement of the spatial arrangement of pixels based on the co-occurrence matrix". Look that one up some time. The point is that there are often too many features to engineer and they might be completely unintuitive from the start.
Another plus for feature extraction is related to correlation. The predictors in this particular data set tend to have high between-predictor correlations and for good reasons. For example, there are many different ways to quantify the eccentricity of a cell (i.e. how elongated it is). Also, the size of a cell's nucleus is probably correlated with the size of the overall cell and so on. PCA can mitigate the effect of these correlations in one fell swoop. An approach of manually taking ratios of many predictors seems less likely to be effective and would take more time.
Last year, in one of the R&D groups that I support, there was a bit of a war being waged between the scientists who focused on biased analysis (i.e. we model what we know) versus the unbiased crowd (i.e. just let the machine figure it out). I fit somewhere in-between and believe that there is a feedback loop between the two. The machine can flag potentially new and interesting features that, once explored, become part of the standard book of "known stuff".