## Common misinterpretations of statistical tests

Researchers often use statistical tests to test hypotheses and to infer properties of a population from properties of a sample. A key idea is that every statistical test assumes a statistical model that provides a complete and valid representation of variability in the data, and that faithfully reflects how the study was conducted and the phenomena being tested. For example, when we read that a t test was used to compare whether the mean values in two groups differ, we assume that (1) variability in the observed data is accurately represented by the assumed distribution (here, that the test statistic follows a t distribution), (2) the study was conducted exactly as described in the protocol, including how it was powered, and (3) investigators reported the outcomes of all comparisons, not just those that were statistically significant. Any research finding based on statistical tests implicitly includes these assumptions. However, these assumptions are often not explicitly considered, are misinterpreted, or are violated, leading to incorrect conclusions.

In a recent paper, statisticians Sander Greenland and colleagues (2016) review 25 common misinterpretations of statistical tests, p values, confidence intervals and power, to promote better understanding of statistical tests.

This post highlights five of these misinterpretations; the interested reader can find a full description of all 25 misinterpretations in the original paper:

**1. The p value is the probability that the test hypothesis is true. For example, if a test of the null hypothesis gave p = 0.01, the null hypothesis has only a 1% chance of being true; if instead p = 0.40, the null hypothesis has a 40% chance of being true.**

This statement is incorrect. The p value cannot indicate the probability that the test hypothesis is true because the p value is computed *assuming* the test hypothesis is true. It simply indicates how well the observed data conform to the pattern predicted by the test hypothesis and all the other assumptions in the underlying statistical model. So p = 0.01 indicates the data are not very close to what the statistical model predicts, while p = 0.40 indicates the data are closer to the model prediction.
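To see why, a small simulation helps (a minimal sketch using only the Python standard library; the sample size, seeds and the known-variance z-test approximation are illustrative assumptions, not part of the original paper). When the test hypothesis really is true, the p value is uniformly distributed between 0 and 1, so p < 0.05 occurs 5% of the time by construction; a particular p value says nothing about the probability that the hypothesis itself is true.

```python
import math
import random

def two_sample_p(n=50, effect=0.0, seed=0):
    """Two-sided p value for a two-sample comparison of means.
    Data are simulated as normal with unit variance; 'effect' shifts
    group B. Variance is treated as known, so a z-test applies."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(effect, 1) for _ in range(n)]
    diff = sum(b) / n - sum(a) / n
    se = math.sqrt(2 / n)                    # SE of the mean difference
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p from normal tail

# Under a TRUE null (effect = 0), p values are uniform on [0, 1]:
ps = [two_sample_p(seed=s) for s in range(2000)]
print(sum(p < 0.05 for p in ps) / len(ps))   # ≈ 0.05 by design
```

The fraction of p values below 0.05 hovers around 5% even though the null is true in every simulated study, which is exactly what "the p value assumes the test hypothesis" means in practice.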

**2. The p value for the null hypothesis is the probability that chance alone produced the observed association. For example, if p = 0.08, there is an 8% probability that chance alone produced the association.**

This is a common variation of the first misinterpretation and is equally erroneous. The p value is a probability computed *assuming* chance alone was operating. That is, the p value is a probability deduced *from* a set of assumptions (i.e. the statistical model) and *cannot refer to the probability* of those assumptions being true.

**3. When the same hypothesis is tested in different studies and none or a minority of the tests are statistically significant (i.e. most tests show p > 0.05), the overall evidence supports the test hypothesis of no effect.**

Unfortunately, this belief is often used to claim that the literature supports evidence of no effect when the opposite is true; it reflects the tendency of researchers to overestimate the power of most research. In fact, every individual study could fail to reach statistical significance and yet, when the studies are combined, show a statistically significant association.
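A toy fixed-effect meta-analysis makes this concrete (a sketch only: the five identical studies, their effect estimates and standard errors are invented for illustration, and normal theory is assumed throughout). Each study alone misses the 0.05 threshold, yet the inverse-variance pooled estimate is clearly significant.

```python
import math

def two_sided_p(z):
    """Two-sided p value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Five hypothetical studies with identical estimates and standard errors
estimates = [0.20] * 5
ses = [0.12] * 5

# Each study alone is "non-significant"
for est, se in zip(estimates, ses):
    print(round(two_sided_p(est / se), 3))    # 0.096 for each study, > 0.05

# Fixed-effect (inverse-variance) pooled estimate across the studies
weights = [1 / se**2 for se in ses]
pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
print(round(two_sided_p(pooled / pooled_se), 4))  # 0.0002, clearly significant
```

Pooling shrinks the standard error by a factor of roughly the square root of the number of studies, which is why five individually "negative" studies can jointly provide strong evidence of an effect.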

**4. The 95% confidence interval (CI) in a study has a 95% chance of containing the true effect size.**

This statement is incorrect. A CI is simply a range between two numbers. The frequency with which a specific computed interval contains the true effect is either 100%, if the true effect is within the interval, or 0% if it is not. The 95% refers instead to the long-run behaviour of the procedure: if all assumptions used to compute the intervals were correct, 95% of the confidence intervals computed from very many repeated studies would contain the true effect size.
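This long-run property is easy to check by simulation (a sketch with illustrative values for the true mean, SD, sample size and seed; the variance is treated as known, so a normal z interval applies):

```python
import math
import random

rng = random.Random(42)
true_mean, sd, n = 10.0, 2.0, 30
half = 1.96 * sd / math.sqrt(n)   # 95% half-width, variance known

covered = 0
trials = 5000
for _ in range(trials):
    sample = [rng.gauss(true_mean, sd) for _ in range(n)]
    m = sum(sample) / n
    lo, hi = m - half, m + half
    # Each single interval either contains true_mean or it does not...
    covered += lo <= true_mean <= hi

# ...but ACROSS repeated studies, about 95% of the intervals cover it:
coverage = covered / trials
print(coverage)   # ≈ 0.95
```

The 95% describes the procedure across many hypothetical repetitions, not the one interval reported in any single study.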

**5. An observed 95% CI predicts that 95% of the estimates from future studies will fall inside the observed interval.**

This statement is also incorrect. Under the statistical model, 95% is the frequency with which *other unobserved intervals* will contain the true effect, not how frequently the one interval presented will contain future estimates. Even under ideal conditions, the chance that a future estimate will fall within the current interval is usually much less than 95%.
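A companion simulation illustrates the gap (again with invented parameters and a known-variance z interval): compute a 95% CI from one study, then check how often the point estimate from an independent replicate study lands inside it.

```python
import math
import random

rng = random.Random(7)
true_mean, sd, n = 0.0, 1.0, 40
half = 1.96 * sd / math.sqrt(n)   # 95% CI half-width, variance known

captured = 0
trials = 5000
for _ in range(trials):
    m1 = sum(rng.gauss(true_mean, sd) for _ in range(n)) / n  # observed study
    m2 = sum(rng.gauss(true_mean, sd) for _ in range(n)) / n  # future replicate
    captured += (m1 - half) <= m2 <= (m1 + half)

capture_rate = captured / trials
print(capture_rate)   # ≈ 0.83, noticeably below 0.95
```

Because the replicate's estimate carries its own sampling error on top of the original study's, the capture rate under these ideal normal-theory conditions is about 83%, not 95%.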

### Reference

Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016 Apr;31(4):337-50.