What makes effective statistical practice?
Science is about asking questions, collecting data and (often) applying statistical methods to those data to answer the questions. What are some principles of effective statistical practice that statisticians would like working scientists to know? In the ongoing “Ten Simple Rules” series at PLoS Computational Biology, statisticians Kass and colleagues (2016) present some good advice and guidance. Here is a summary of five of their rules:
1. Understand that signals always come with noise
Variability (or variation) is inherent in almost all forms of data. Very often, we want to detect a signal (e.g. a biological or physical one) among lots of noise in the data. Statistics provides the mathematical tools to do this. A statistical model uses probability distributions to describe how outcomes vary, and so specifies how signal and noise combine to produce the data observed. This fundamental step is what makes statistical inference possible. Statistical analysis is about assessing the signal in the data (if there is one) and the variability of interest, in the presence of noise (irrelevant variability).
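To make the idea of "signal plus noise" concrete, here is a minimal sketch in Python. It assumes the simplest possible model, a constant signal with additive Gaussian noise; the function name, the signal value and the noise level are all made up for illustration and do not come from Kass and colleagues.

```python
import random
import statistics


def simulate_observations(signal, noise_sd, n, seed=0):
    """Generate n observations as a constant signal plus Gaussian noise."""
    rng = random.Random(seed)  # seeded so the simulation can be repeated
    return [signal + rng.gauss(0.0, noise_sd) for _ in range(n)]


# Illustrative (made-up) values: a true signal of 5.0 with noise SD of 2.0.
observations = simulate_observations(signal=5.0, noise_sd=2.0, n=1000)

# Under this model the sample mean estimates the signal,
# and the sample standard deviation estimates the noise level.
estimated_signal = statistics.mean(observations)
estimated_noise_sd = statistics.stdev(observations)

print(f"estimated signal:   {estimated_signal:.2f}")
print(f"estimated noise SD: {estimated_noise_sd:.2f}")
```

Any single observation can be far from the signal, but averaging many observations recovers it closely. Real data rarely follow such a simple model, so choosing how signal and noise are assumed to combine is itself part of the analysis.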
2. Plan w-e-l-l ahead
It’s always good to develop an analysis protocol or get statistical advice early in planning a study (i.e. before data are collected) rather than late. As the statistician Ronald Fisher put it (cited by the authors), “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”
3. Worry about data quality
After data are collected, raw data should be locked and made read-only. However, raw data quite often need pre-processing or “data cleaning” before a formal statistical analysis is applied. Whether we analyse data collected intentionally in the course of an experiment or data collected routinely (e.g. medical records), it is important to understand and record how the data are pre-processed. For example, make sure you understand and record the units of measurement for every variable, understand why some data are missing or incomplete, explore the data before formal analysis, and work out whether strange-looking values are experimental artefacts or reflect real variability of interest.
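One way to record pre-processing decisions is to screen the raw data and log anything suspicious, without changing the raw records themselves. The sketch below is only an illustration: the function name `screen_records`, the field `height_cm`, the plausible range and the records are all hypothetical.

```python
def screen_records(records, field, plausible_range):
    """Flag missing and out-of-range values without altering the raw records."""
    low, high = plausible_range
    log = []
    for index, record in enumerate(records):
        value = record.get(field)
        if value is None:
            log.append((index, field, "missing"))
        elif not low <= value <= high:
            log.append((index, field, f"outside plausible range: {value}"))
    return log


# Made-up records for illustration only.
raw_records = [
    {"id": 1, "height_cm": 172.0},
    {"id": 2, "height_cm": None},   # missing: find out why before analysing
    {"id": 3, "height_cm": 1.68},   # possibly recorded in metres, not centimetres
    {"id": 4, "height_cm": 181.5},
]

cleaning_log = screen_records(raw_records, "height_cm", (50.0, 250.0))
for entry in cleaning_log:
    print(entry)
```

The log is a prompt for investigation rather than a list of values to delete: a flagged value may be a unit error, an artefact, or a genuine observation.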
4. Provide assessments of variability
A basic purpose of inferential statistics is to assess uncertainty, usually in the form of standard errors or confidence intervals. This is because the estimate of an effect in any study is computed only from the sample tested in that study, but we are really interested in the effect in the population. For any estimate of effect (e.g. means, medians, proportions, counts, odds ratios, relative risks, hazard ratios, etc.), provide confidence intervals to indicate the precision of the estimate.
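As an illustration, here is a minimal sketch of a confidence interval for a mean. It assumes a large-sample normal approximation (for small samples a t-based interval would usually be more appropriate), and the function name and simulated sample are made up for the example.

```python
import math
import random
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Return (mean, lower, upper) using a normal approximation."""
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # Critical value of the standard normal, about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * standard_error, mean + z * standard_error


# Simulated (not real) sample: 200 draws from a normal distribution.
rng = random.Random(42)
sample = [rng.gauss(10.0, 3.0) for _ in range(200)]

mean, lower, upper = mean_confidence_interval(sample)
print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Reporting the interval alongside the point estimate shows readers how precisely the effect has been estimated, which the estimate alone cannot do.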
5. Make your analysis reproducible
One important aspect of good science is reproducibility: given the same dataset and a complete description of the analysis, is it possible to obtain the same tables, figures and statistical inferences? Findings can be made reproducible by analysing the data systematically, recording the steps taken, and sharing the data and code used to produce the results. There are now a number of open source tools for doing this collaboratively (e.g. Jupyter notebooks for Python and R, knitr for R). Very helpfully, these can combine a research report with the analysis code.
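A small sketch of what "being systematic" can look like in code: fix the random seed, and record the seed and software version next to the results. The bootstrap is used here only as an example of an analysis step that involves randomness; the function name, the measurements and the seed value are arbitrary choices for illustration.

```python
import json
import platform
import random
import statistics


def bootstrap_standard_error(data, seed, n_resamples=1000):
    """Bootstrap standard error of the mean, seeded so reruns match exactly."""
    rng = random.Random(seed)
    boot_means = []
    for _ in range(n_resamples):
        # Resample the data with replacement, keeping the original size.
        resample = [rng.choice(data) for _ in data]
        boot_means.append(statistics.mean(resample))
    return statistics.stdev(boot_means)


# Made-up measurements and an arbitrary seed, for illustration only.
measurements = [4.1, 5.3, 4.8, 6.0, 5.1, 4.7, 5.6, 5.0]
SEED = 20160609

# Store the provenance (seed, Python version) together with the results.
report = {
    "python_version": platform.python_version(),
    "seed": SEED,
    "n": len(measurements),
    "mean": statistics.mean(measurements),
    "bootstrap_se": bootstrap_standard_error(measurements, SEED),
}
print(json.dumps(report, indent=2))
```

Because the seed is fixed and recorded, anyone running the same code on the same data with the same Python version should get the same numbers.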
Kass RE, Caffo BS, Davidian M, Meng X-L, Yu B, Reid N (2016) Ten Simple Rules for Effective Statistical Practice. PLoS Comput Biol 12(6): e1004961. doi:10.1371/journal.pcbi.1004961