Why p values aren’t necessary: they are not evidence, and they can’t be replicated
In the previous post, we learned that p values are not necessary because they only show the probability of the data given a hypothesis, whereas we want to know the probability of hypotheses given the data. These two probabilities can be very different. This post covers two further reasons why p values are not necessary: they do not provide evidence, and they cannot be replicated.
P is not evidence that can change beliefs
Since a p value is calculated under a single hypothesis (the null hypothesis), it only provides information about that hypothesis. On its own, this information is not sufficient to make us abandon that hypothesis. The probability of a rare observation under a given hypothesis needs to be compared against a competing hypothesis, and the p value does not do that.
For example, observing that the same person out of 1,000,000 people wins the lottery 3 consecutive times is not, on its own, enough to make us think that the lottery was rigged. We can only judge whether the lottery was rigged by comparing that observation against the competing hypothesis that the lottery was fair. That is, observing 3 consecutive wins provides evidence that the lottery was rigged, rather than fair, only if the wins were more likely under a rigged lottery than under a fair one.
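The lottery argument can be made concrete with a likelihood ratio, which compares the two hypotheses directly in a way a p value never does. This is a minimal sketch; the 0.9 win probability under the rigged hypothesis is an assumed number, chosen only for illustration.

```python
# Likelihood-ratio sketch of the lottery example.
# Assumption: a "rigged" lottery gives our winner a 0.9 chance per draw.

p_fair = 1 / 1_000_000    # chance a given person wins one fair draw
p_rigged = 0.9            # assumed chance per draw if the lottery is rigged

# Probability of the same person winning 3 consecutive draws
# under each hypothesis
likelihood_fair = p_fair ** 3
likelihood_rigged = p_rigged ** 3

# The likelihood ratio weighs one hypothesis against the other
likelihood_ratio = likelihood_rigged / likelihood_fair

print(f"P(3 wins | fair)   = {likelihood_fair:.3g}")
print(f"P(3 wins | rigged) = {likelihood_rigged:.3g}")
print(f"Likelihood ratio   = {likelihood_ratio:.3g}")
```

The observation is astonishingly rare under either hypothesis; what makes it evidence of rigging is only that it is vastly *more* likely under the rigged hypothesis than under the fair one.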
The p value’s dependence on a single hypothesis also means that a large effect in a small trial and a tiny effect in a large trial can yield identical p values. The size of a p value cannot indicate the size or strength of an effect.
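This can be shown with a simplified one-sample z test (known standard deviation); the effect sizes and sample sizes below are made-up numbers chosen so the two trials give the same test statistic.

```python
import math

def z_test_p(effect, n, sigma=1.0):
    """Two-sided p value for a one-sample z test of mean = 0,
    assuming a known standard deviation (a simplified textbook test)."""
    z = effect * math.sqrt(n) / sigma
    return math.erfc(abs(z) / math.sqrt(2))

# A large effect in a tiny trial...
p_small_trial = z_test_p(effect=1.0, n=4)

# ...and a tiny effect in a huge trial give the same z, hence the same p
p_large_trial = z_test_p(effect=0.01, n=40_000)

print(p_small_trial, p_large_trial)   # both ≈ 0.046
```

The p value alone cannot distinguish a clinically important effect from a trivial one; that is why effect sizes and their confidence intervals must be reported.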
P values can’t be replicated
Cohen (1994) and Schmidt (1996) have called for abandonment of null hypothesis significance testing, as advanced by R.A. Fisher. But they have both very much misunderstood Fisher’s position. Fisher’s standard for establishing firm knowledge was not one of statistical significance in a single study but the ability to repeatedly produce results significant at (at least) the .05 level. Fisher held that replication is essential to confidence in the reliability (reproducibility) of a result, as well as to generalizability (external validity). The problems that Cohen and Schmidt point to result not from the use of significance testing, but from our failure to develop a tradition of replication findings. Our problem is that we have placed little value on replication studies.
This quote, as cited in Schmidt and Hunter’s chapter, rightly recognises the importance of scientific replication, but mistakenly holds that statistically significant findings can be replicated. Unfortunately, significant p values are not reproducible: Geoff Cumming wonderfully showed how fickle p values can be in the video Dance of the p values. Moreover, reproducibility requires high statistical power, but most published research is underpowered, so even if p values were otherwise reproducible, reproducing significant findings would be unlikely. And non-significant findings do not replicate any better than significant ones.
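The dance is easy to see in a small simulation. This is not Cumming’s original demonstration, just a sketch under assumed conditions: a true standardised effect of 0.5, 30 participants per group, known standard deviation, and 20 exact replications of the same experiment.

```python
import math
import random

random.seed(1)

def replicate_p(effect=0.5, n=30):
    """Simulate one two-group trial (known SD = 1) and return the
    two-sided p value of a z test comparing the group means."""
    group_a = [random.gauss(effect, 1.0) for _ in range(n)]
    group_b = [random.gauss(0.0, 1.0) for _ in range(n)]
    mean_diff = sum(group_a) / n - sum(group_b) / n
    z = mean_diff / math.sqrt(2.0 / n)
    return math.erfc(abs(z) / math.sqrt(2))

# 20 exact replications of the identical underpowered experiment
p_values = [replicate_p() for _ in range(20)]
for p in p_values:
    flag = "*" if p < 0.05 else " "
    print(f"{flag} p = {p:.3f}")
```

Even though every replication samples from exactly the same population with exactly the same true effect, the p values bounce between clearly significant and clearly non-significant, which is the fickleness Cumming’s video illustrates.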
Tests of statistical significance and p values are widely used to support or refute scientific hypotheses. However, p values are problematic: they don’t tell us whether hypotheses are true, they are not evidence, and they can’t be replicated. Use better alternatives, such as estimation statistics, to support or refute hypotheses.
Goodman SN and Royall R (1988) Evidence and scientific research. American Journal of Public Health 78:1568-1574.
Herbert R (2019) Research note: Significance testing and hypothesis testing: meaningless, misleading and mostly unnecessary. Journal of Physiotherapy 65:178-181.
Schmidt FL and Hunter JE (2016) Eight common but false objections to the discontinuance of significance testing in the analysis of research data. In What If There Were No Significance Tests? Harlow LL, Mulaik SA and Steiger JH (Eds). Routledge: New York.