P values and hypothesis tests cannot indicate the size or precision of effects
P values and hypothesis testing methods are frequently misused in clinical and experimental research, perhaps because of the misconception that they provide simple, objective tools for separating true findings from false ones. In a recent paper, the cardiologist Daniel Mark and the statisticians Kerry Lee and Frank Harrell explain the role and limitations of p values and hypothesis tests in clinical research.
Clinical and experimental research is often conducted to determine the effects of an intervention or test condition on some outcome (e.g. the effect of paracetamol versus placebo on low back pain, or the effect of cold versus hot temperature on finger sensation). Researchers sample participants from a population, experimentally manipulate the test variable, and conclude whether the variable causes an effect. Mark and colleagues explain that the precision of a treatment effect is not directly related to the p value. This is because treatment effects observed in clinical research reflect a mixture of the true size of the treatment effect under perfect conditions (which cannot be known) and at least three sources of uncertainty: (1) how participants are sampled from the population, which is usually not random; (2) variability between participants; and (3) measurement error. Statistical analysis uses probability tools to try to separate true effects from measurement error and participant variability. However, any unmeasured uncertainty can influence whether a study reaches statistical significance, and unmeasured uncertainty cannot be detected or corrected with statistical or probability tools.
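The mixing of a true effect with participant variability and measurement error can be illustrated with a small simulation. This is not from the paper; the true effect, sample sizes, and noise levels below are arbitrary assumptions chosen for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is reproducible

TRUE_EFFECT = 1.0   # hypothetical true treatment effect (unknowable in practice)
N_PER_STUDY = 50    # participants per simulated study
N_STUDIES = 2000    # hypothetical repetitions of the same study

def observed_effect():
    """One simulated study: the observed mean effect mixes the true effect
    with between-participant variability and measurement error."""
    outcomes = []
    for _ in range(N_PER_STUDY):
        participant_variability = random.gauss(0, 1.0)  # assumed SD of 1.0
        measurement_error = random.gauss(0, 0.5)        # assumed SD of 0.5
        outcomes.append(TRUE_EFFECT + participant_variability + measurement_error)
    return statistics.mean(outcomes)

estimates = [observed_effect() for _ in range(N_STUDIES)]
print(f"true effect: {TRUE_EFFECT}")
print(f"observed effects range from {min(estimates):.2f} to {max(estimates):.2f}")
print(f"mean of observed effects: {statistics.mean(estimates):.2f}")
```

No single study recovers the true effect exactly; each observed estimate scatters around it, which is why an individual study's result carries uncertainty even before any unmeasured sources of error are considered.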
A key limitation of p values is that they depend on sample size and cannot indicate the size of an effect (i.e. a small p value reflects how “unexpected” an effect is under the null hypothesis, not how big it is). In addition, statistical hypothesis testing cannot indicate whether investigators conducted an experiment correctly; it only controls, at an acceptable level, the probability of errors in the long run (if the experiment were hypothetically repeated many times).
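The sample-size dependence of p values can be shown with a simple calculation. The sketch below uses a large-sample two-group z test (an approximation, not the paper's method) with an arbitrary fixed effect: the mean difference and standard deviation are identical in both scenarios, yet the p value changes dramatically with group size.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p value for a z statistic, via the normal CDF (math.erf)."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def p_value(mean_diff, sd, n_per_group):
    """Approximate two-sample z test for a given mean difference,
    common standard deviation, and equal group size."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return two_sided_p_from_z(mean_diff / standard_error)

# Identical effect (mean difference 0.5, SD 2) at two sample sizes:
p_small = p_value(0.5, 2.0, 20)    # small trial: not "significant"
p_large = p_value(0.5, 2.0, 500)   # large trial: highly "significant"
print(f"n = 20 per group:  p = {p_small:.3f}")
print(f"n = 500 per group: p = {p_large:.5f}")
```

The effect is exactly the same size in both trials; only the sample size differs. This is why a small p value says nothing about how big an effect is.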
Mark and colleagues highlight some common misconceptions about p values, e.g. that a p value is the probability that the null hypothesis is true, or the probability that the observed effect is due to chance, or that a small p value indicates that study results are reliable and likely to replicate. Some of these misconceptions have been summarised previously. In light of these limitations, it becomes evident that p values and statistical hypothesis testing cannot indicate the importance of clinical or experimental effects.
To measure clinical or experimental effects, researchers and consumers of research need to focus on how big and how precise an effect is, not on whether an effect is unexpected. That is, investigators need to quantify the magnitude of an effect (e.g. a mean difference between groups) and its precision (e.g. with a 95% confidence interval).
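Quantifying a mean difference and its 95% CI requires only the group summary statistics. The sketch below uses a large-sample (z = 1.96) approximation with hypothetical pain-score data invented for illustration; it is not an analysis from the paper.

```python
import math

def mean_diff_ci(mean1, mean2, sd1, sd2, n1, n2, z=1.96):
    """Mean difference between two groups with an approximate
    large-sample 95% confidence interval (z = 1.96)."""
    diff = mean1 - mean2
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return diff, diff - z * se, diff + z * se

# Hypothetical summary data: pain (0-10) after paracetamol vs placebo
diff, lower, upper = mean_diff_ci(mean1=3.1, mean2=4.0,
                                  sd1=2.2, sd2=2.4,
                                  n1=60, n2=60)
print(f"mean difference = {diff:.1f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Reported this way, a reader sees both the magnitude of the effect (the point estimate) and its precision (the width of the interval), which a p value alone cannot convey.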
Mark DB, Lee KL, Harrell FE Jr (2016) Understanding the role of P values and hypothesis tests in clinical research. JAMA Cardiology doi: 10.1001/jamacardio.2016.3312.