## Regression to the mean can lead to false results

Bias is everywhere, and as scientists we must protect ourselves from it at every turn. One area that is particularly prone to the influence of bias is data analysis.

As humans we see patterns everywhere, even when they are not there! As scientists, there is the added pressure (conscious or not) to find a result and, if at all possible, one that fits our preconceptions. One situation where our bias to see patterns and our desire to find results can lead to false conclusions is when we split data into groups based on arbitrary thresholds.

### We are fooled by regression to the mean

As described by Motulsky in his book *Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking*, when we take repeated measures, the more extreme a value is on the first measurement, the more likely it is to be closer to the average on the second measurement. Unfortunately, because repeated measures tend to regress to the mean, it is relatively simple to *create* an effect when there is none.
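To make this concrete, here is a minimal sketch using only the Python standard library. It assumes each measurement is a stable true value plus independent random noise; that model, the sample size, the means and standard deviations, and the 10% cutoff are all illustrative assumptions and do not come from Motulsky's book.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 1000                 # hypothetical number of people

# Assumption: each person has a stable true value, and every measurement
# adds independent random noise on top of it.
true_value = [rng.gauss(100.0, 10.0) for _ in range(n)]
first = [t + rng.gauss(0.0, 10.0) for t in true_value]
second = [t + rng.gauss(0.0, 10.0) for t in true_value]

# Select the people with the most extreme (top 10%) first measurements.
cutoff = sorted(first)[int(0.9 * n)]
extreme = [i for i in range(n) if first[i] >= cutoff]

overall_mean = statistics.mean(first)
extreme_first = statistics.mean(first[i] for i in extreme)
extreme_second = statistics.mean(second[i] for i in extreme)

print(f"overall mean of first measurement:  {overall_mean:.1f}")
print(f"extreme group, first measurement:   {extreme_first:.1f}")
print(f"extreme group, second measurement:  {extreme_second:.1f}")
```

Nothing was done to these people between the two measurements, yet the group selected for its extreme first values should sit noticeably closer to the overall mean the second time around.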

The figure below plots pre- (red) and post- (blue) test data from a simulated study looking at the effect of eating chocolate every day for a month on a person’s elbow flexion strength.

As you might suspect, eating chocolate had no effect on strength (would be great if it did!).
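A pooled analysis of this kind might look like the sketch below. The data-generating model (a stable true strength plus independent measurement noise), the function names `simulate_study` and `paired_t`, and every numeric parameter are assumptions made for illustration; they are not the values behind the figures in this post.

```python
import random
import statistics


def simulate_study(n=500, true_mean=50.0, true_sd=5.0, noise_sd=5.0, seed=1):
    """Simulate pre/post strength for n people with NO treatment effect.

    Hypothetical model: stable true strength plus independent noise
    on each testing occasion.
    """
    rng = random.Random(seed)
    true_strength = [rng.gauss(true_mean, true_sd) for _ in range(n)]
    pre = [t + rng.gauss(0.0, noise_sd) for t in true_strength]
    post = [t + rng.gauss(0.0, noise_sd) for t in true_strength]
    return pre, post


def paired_t(pre, post):
    """Return the mean post-minus-pre change and its paired t statistic."""
    diffs = [b - a for a, b in zip(pre, post)]
    mean_diff = statistics.mean(diffs)
    standard_error = statistics.stdev(diffs) / len(diffs) ** 0.5
    return mean_diff, mean_diff / standard_error


pre, post = simulate_study()
mean_change, t_stat = paired_t(pre, post)
print(f"mean change (post - pre): {mean_change:.2f}")
print(f"paired t statistic:       {t_stat:.2f}")
```

Because no effect is built into the simulation, the mean change should be small relative to the spread of the data, which is what a pooled pre/post comparison is expected to show.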

However, because I had a strong suspicion that chocolate would impact strength, I decided to dig deeper into the data. Looking more closely, I note that weaker individuals seem to have improved their strength over the month of chocolate eating, whereas those who were already quite strong seem to have gotten weaker. That is very interesting! Maybe chocolate does have an effect after all, but my initial analysis missed it because I pooled the strong people with the weak people.

The figure below now plots the data based on initial strength. As you can see on the left, people who were weaker at the start of the study tended to get stronger with chocolate. Even more convincing, the right subplot shows that people who are strong get weaker if they eat chocolate for a month. Wow! That is an amazing finding. I should start drafting my manuscript right away!

But wait! The simulated data I created for this example were randomly sampled from a normal distribution. How can such a potent effect be due to chance?

The answer is that the supposed effect simply reflects regression to the mean. People land in the weak group partly because their first measurement happened to be low by chance, so their second measurement tends to be higher; the reverse holds for the strong group. By arbitrarily splitting the data into a strong group and a weak group, we created a perfect example of regression to the mean. A key point is that the data were divided *after* looking at the main result. My personal bias to see patterns where there are none and my overwhelming desire to see an effect of chocolate on strength led me to fabricate an effect.
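The split described above can be reproduced in a few lines. This is a sketch under stated assumptions, not the code used for the figures: the model (stable true strength plus independent noise at each test), the median threshold, the seed and all numeric values are hypothetical.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable
n = 500                 # hypothetical sample size

# Assumption: no chocolate effect at all. Pre and post scores differ
# only by independent measurement noise around a stable true strength.
true_strength = [rng.gauss(50.0, 5.0) for _ in range(n)]
pre = [t + rng.gauss(0.0, 5.0) for t in true_strength]
post = [t + rng.gauss(0.0, 5.0) for t in true_strength]

# Post-hoc split on the pre-test score, using the median as the
# arbitrary threshold.
threshold = statistics.median(pre)
weak_changes = [b - a for a, b in zip(pre, post) if a < threshold]
strong_changes = [b - a for a, b in zip(pre, post) if a >= threshold]

weak_mean = statistics.mean(weak_changes)
strong_mean = statistics.mean(strong_changes)

print(f"weak group   (n={len(weak_changes)}): mean change {weak_mean:+.2f}")
print(f"strong group (n={len(strong_changes)}): mean change {strong_mean:+.2f}")
```

With these settings the weak group should show a positive mean change and the strong group a negative one, even though the simulation contains no treatment effect. The apparent effect comes entirely from selecting groups on a noisy baseline measurement.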

### Summary

Care must be taken when analysing data. Splitting data into two groups, or tertiles, or quartiles to reveal what seems like a *clear pattern* can lead to false findings. Post-hoc decisions such as these are problematic, and researchers rarely report that these arbitrary decisions were made after looking at the results and trying various groupings. Thus, as a reader or reviewer of research papers, be suspicious of data that are grouped on what seems like an arbitrary criterion.

### Reference

Motulsky H. Intuitive biostatistics: a nonmathematical guide to statistical thinking. 3rd ed. (2014). Oxford University Press.