Principles of Applied Statistics

Applied statistics is more than data analysis, but it is easy to lose sight of the big picture. David Cox and Christl Donnelly draw on decades of scientific experience to describe usable principles for the successful application of statistics, showing how good statistical strategy shapes every stage of an investigation. As one advances from research or policy questions, to study design, through modelling and interpretation, and finally to meaningful conclusions, this book will be a valuable guide. Over 100 illustrations from a wide variety of real applications make the conceptual points concrete, illuminating and deepening understanding. This book is essential reading for anyone who makes extensive use of statistical methods in their work.


This book by David Cox and Christl Donnelly is an extensive, if condensed, coverage of most (all?) necessary steps and precautions one must go through when contemplating applied (i.e., actual!) statistics. As the authors write in their first sentence, "Applied statistics is more than data analysis." Thus, the title could have been Principled Data Analysis! Indeed, Principles of Applied Statistics reminds me of how much we (at least I) take 'the model' and 'the data' for granted when conducting statistical analyses, by going through all the pre-data and post-data steps that lead to the "idealized" (Page 188) data analysis.
The contents of the book are intentionally simple, with hardly any mathematical content, but with a clinical attention to exhaustiveness and clarity. For instance, even though I would have enjoyed more stress on probabilistic models as the basis for statistical inference, they only appear in the fourth chapter (out of 10), with errors-in-variables models. The painstakingly careful coverage of the myriad tiny, but essential, steps involved in a statistical analysis, and of the numerous corresponding pitfalls, was certainly illuminating.
Just as the book refrains from mathematical digressions ("Our emphasis is on the subject-matter, not on the statistical techniques as such." Page 12), it refrains from engaging in detailed and complex data stories. Instead, it uses little grey boxes to convey the pertinent aspects of a given data analysis, referring to a paper for the full story. (I must admit this is frustrating at times, as one would like to read more!) The book reads smoothly, and I must acknowledge I read most of it in trains, metros, and planes over a week.
"A general principle, sounding superficial but difficult to implement, is that analyses should be as simple as possible, but not simpler." Cox and Donnelly (Page 9)

In more detail, Principles of Applied Statistics covers most purposes of statistical analyses (Chapter 1); design, with special emphasis (Chapters 2-3), which is not surprising given the authors' record (and "not a moribund art form"! Page 51); measurement (Chapter 4), including the special case of latent variables and their role in model formulation; preliminary analysis (Chapter 5), by which the authors mean data screening and graphical pre-analysis (at last!); models (Chapters 6-7), separated into model formulation (debating the nature of probability) and model choice, the latter being somewhat removed from the standard meaning of the term (covered instead in §8.4.5 and §8.4.6); formal (mathematical) inference (Chapter 8), handling testing and multiple testing in particular; interpretation (Chapter 9), i.e., post-processing; and an epilogue (Chapter 10).
The intended readership is rather broad, from practitioners to students (although both categories do require a good dose of maturity to fully appreciate the book), to teachers, and to scientists designing experiments with a statistical mind. It may be deemed too philosophical by some, too allusive by others, but I think it constitutes a magnificent testimony to the depth and spectrum of our field.

"Of course, all choices are to some extent provisional." Cox and Donnelly (Page 130)
I personally appreciated the illustration using capture-recapture models (Page 36), with a remark about the impact of toe clipping on frogs, as it reminded me of a similar way of marking lizards when my (then) PhD student Jérôme Dupuis was working on a corresponding capture-recapture data set from Southern France. Conversely, while John Snow's story (using maps to explain the cause of cholera) is alluring, and his map makes for a great cover (!), I am less convinced it is particularly relevant to this book, given that Snow's scientific inference was conducted without the map, which was later used to convince the local authorities.
"The word Bayesian, however, became more widely used, sometimes representing a regression to the older usage of flat prior distributions supposedly representing initial ignorance, sometimes meaning models in which the parameters of interest are regarded as random variables and occasionally meaning little more than that the laws of probability are somewhere invoked." Cox and Donnelly (Page 144)

My main quibble with the book lies (most unsurprisingly!) with its processing of Bayesian analysis (pp. 143-144). On the one hand, the method is mostly criticized over those two pages. On the other hand, it is the only method presented with this level of detail, including historical background, which seems superfluous for a treatise on applied statistics. The drawbacks mentioned (Page 144) include the following:

• The weight of prior information or modeling as "evidence"
• The impact of "indifference or ignorance or reference priors"
• Whether empirical Bayes modeling has been used to construct the prior
• Whether the Bayesian approach is anything more than a "computationally convenient way of obtaining confidence intervals"

The empirical Bayes perspective is the original one found in Robbins (1955) and seems to find grace in the authors' eyes ("the most satisfactory formulation," Page 156). MCMC methods, by contrast, are deemed "a black box in that typically it is unclear which features of the data are driving the conclusions" (Page 149) … a bit drastic an appreciation!

"If an issue can be addressed nonparametrically then it will often be better to tackle it parametrically; however, if it cannot be resolved nonparametrically then it is usually dangerous to resolve it parametrically." Cox and Donnelly (Page 96)
Apart from a more philosophical paragraph in the final chapter on the distinction between machine learning and statistical analysis, noting the drawback of using neural nets and the like as black-box methods (Page 185), there is relatively little coverage of nonparametric models in the book, the preference for "parametric formulations" (Page 96) being stated openly. I can somewhat understand this perspective in simpler settings, in that nonparametric models offer little explanation of how the data are produced. In more complex models, however, nonparametric components are often a convenient way to dispose of burdensome nuisance parameters. Again, technical aspects are not the focus of Principles of Applied Statistics, which also explains why it does not dwell at length on nonparametric models.

"A test of meaningfulness of a possible model for a data-generating process is whether it can be used directly to simulate data." Cox and Donnelly (Page 104)
The remark above is quite interesting, especially in light of David Cox's current appreciation of ABC techniques (see my vignette on ABC in the 24(4) issue of CHANCE). The impossibility of generating from a posited model, as with some models found in econometrics, precludes using ABC, but this does not necessarily mean the model should be excluded as unrealistic.
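To make the simulate-to-check connection concrete, here is a minimal rejection-ABC sketch on a toy example of my own devising (not from the book, and with hypothetical names throughout): when a model can be simulated from, approximate posterior draws are obtained simply by keeping those prior draws whose simulated data resemble the observed data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy setting: data assumed drawn from N(theta, 1), with prior theta ~ N(0, 10).
theta_true = 2.0
x_obs = rng.normal(theta_true, 1.0, size=50)
s_obs = x_obs.mean()  # summary statistic of the observed data


def abc_rejection(n_draws=200, tol=0.05):
    """Keep prior draws whose simulated summary lies within tol of the observed one."""
    accepted = []
    while len(accepted) < n_draws:
        theta = rng.normal(0.0, np.sqrt(10.0))     # draw from the prior
        x_sim = rng.normal(theta, 1.0, size=50)    # simulate data from the model
        if abs(x_sim.mean() - s_obs) < tol:        # compare summaries
            accepted.append(theta)
    return np.array(accepted)


post = abc_rejection()
# The accepted draws approximate the posterior on theta,
# so their mean should sit close to the observed summary.
```

The whole procedure rests on the ability to simulate data from the posited model, which is exactly the meaningfulness test quoted above; when that simulation step is unavailable, the method is too.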
"The overriding general principle is that there should be a seamless flow between statistical and subject-matter considerations." Cox and Donnelly (Page 188)

As mentioned earlier, the last chapter brings a philosophical conclusion on what (applied) statistics is. It stresses the need for a careful and principled use of black-box methods, so that they preserve a general framework and lead to explicit interpretations. Once again, a must-read for all statisticians!

However, while neither book can be classified as a textbook (even though Efron's contains exercises), they differ greatly in their intended audience and purpose.
As I wrote in the review of Principles of Applied Statistics, the book has an encompassing scope with the goal of covering all the methodological steps required by a statistical study. In Large-Scale Inference, Efron focuses on empirical Bayes methodology for large-scale inference, by which he mostly means multiple testing (rather than, say, data mining). As a result, the book is centered on mathematical statistics and is more technical (which does not mean it is less of an exciting read!).
The book was recently reviewed by both Michael Chernick and Jordi Prats for Significance. Akin to the previous reviewers, and unsurprisingly, I found the book nicely written, with a wealth of (color!) R graphs. The R programs and data sets are available on Brad Efron's homepage.