Does it matter that data are Normally distributed?
Hypothesis testing vs. Estimation
Parametric hypothesis tests such as the t test and F test assume that samples are drawn from Normally distributed populations. When that assumption holds, the distribution of the F or t statistic under the null hypothesis can be calculated for any given sample size, and the F or t statistic observed in a specific experiment can be compared against that distribution. This is how “null hypothesis significance testing” is used to make statistical inferences, i.e. to use samples to infer properties of populations.
But we are usually more interested in how big an effect is, because that tells us whether it is meaningful. (Examples of effects are within- or between-group mean differences.) Also, if the experiment is repeated, each new experiment will be based on different samples, so the estimated effect “jumps around” from one experiment to the next. So we want to know how precisely we estimated the effect, given the samples we used. This information is only made explicit when estimation methods (i.e. confidence intervals) are used to make statistical inferences about populations, because the CI indicates how much an effect “jumps around”.
So there are two distributions to think about: distributions of data, and distributions of statistics. We describe distributions of data (e.g. with the mean and SD), but to make inferences about populations we are really more interested in distributions of statistics (e.g. the mean difference and its 95% CI). 95% CIs are calculated using distributions of statistics, not distributions of data.
Does violating the assumption that data are Normally distributed matter?
So, these hypothesis tests assume that the data come from Normally distributed populations. The mean is a statistic of the sample. Does the distribution of the sample influence the distribution of the mean of the sample? That is, if the sample data are Normally or non-Normally distributed, does the distribution of the mean remain Normal or does it change? Are 95% confidence intervals of means robust to non-Normal data?
To begin, we generate a Normally distributed population of 1000 subjects from which we draw random samples of subjects, with sample sizes 10, 30 and 100. If we calculate the sample mean and its 95% CI for each of our three samples, we expect the CI of the mean to be wide for a small sample and narrow for a large sample. We would like to know: without assuming that the population or sample is Normally distributed, how does the distribution of the mean statistic change?
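The setup above can be sketched roughly as follows. This is a minimal illustration, not the code behind the figures: the population mean (50), SD (10) and the random seed are assumptions, since the text does not state them, and the t critical values are hardcoded for the three sample sizes to keep the example NumPy-only.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, for reproducibility

# Assumed population parameters (not given in the text): mean 50, SD 10
population = rng.normal(loc=50, scale=10, size=1000)

# Two-sided 95% critical values of the t distribution for df = n - 1
T_CRIT = {10: 2.262, 30: 2.045, 100: 1.984}


def mean_ci(sample):
    """Return the sample mean and a t-based 95% CI of the mean."""
    n = len(sample)
    mean = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    half_width = T_CRIT[n] * sem
    return mean, mean - half_width, mean + half_width


# Draw one random sample of each size from the population
samples = {n: rng.choice(population, size=n, replace=False) for n in T_CRIT}

results = {n: mean_ci(sample) for n, sample in samples.items()}
for n, (mean, low, high) in results.items():
    print(f"n={n:3d}  mean={mean:6.2f}  95% CI=({low:6.2f}, {high:6.2f})")
```

The interval here relies on the t distribution, so it still leans on the Normality assumption that the rest of this post examines.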
We apply a resampling technique called bootstrapping to generate many samples and calculate a mean for each sample. For the sample of size 10, we generate 500 bootstrapped samples of size 10 by sampling with replacement from the first sample, using the function numpy.random.choice(). This procedure is known as a non-parametric bootstrap because it resamples the data, and makes no assumptions about sample parameters such as the mean and standard deviation. For each bootstrapped sample, we calculate the mean. We then plot the means from all bootstrapped samples in a histogram. The same bootstrapping procedures are performed for the samples of size 30 and 100.
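A minimal sketch of this bootstrap step is shown below. It uses the Generator method rng.choice(), the newer counterpart of numpy.random.choice(); the helper name bootstrap_means, the seed, and the stand-in sample are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed, for reproducibility


def bootstrap_means(sample, n_boot=500):
    """Means of n_boot resamples drawn with replacement from `sample`."""
    sample = np.asarray(sample)
    # Each row is one bootstrapped sample of the same size as the original
    resamples = rng.choice(sample, size=(n_boot, sample.size), replace=True)
    return resamples.mean(axis=1)


# Stand-in for the first sample of size 10 (assumed mean 50, SD 10)
sample_10 = rng.normal(loc=50, scale=10, size=10)

boot_means = bootstrap_means(sample_10)
print("number of bootstrapped means:", boot_means.size)
print("original sample mean:", round(sample_10.mean(), 2))
print("mean of bootstrapped means:", round(boot_means.mean(), 2))
```

The array boot_means is what would be passed to a histogram function to produce the bottom panel of a figure like Figure 1.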
Figure 1 shows the distribution of the population (top panel), the distributions of the samples of sizes 10, 30, 100 (middle panel), and the distributions of the bootstrapped sample means for each of the samples (bottom panel). We see that the population, samples, and sample means are Normally distributed. The sample means are distributed more narrowly for larger samples.
Next, we generate a skewed population using a gamma distribution, draw random samples as above, generate bootstrapped samples and plot the distributions of the sample means. Figure 2 shows the result:
Surprisingly, while the population and samples are both skewed, the distribution of the sample means is remarkably Normal.
Larger sample sizes produce more precise estimates of effects (i.e. the spread of mean values along the x-axis gets narrower as sample size increases), but in these simulations, violating the assumption that data are Normally distributed has little effect on the distribution of the sample means; these remain approximately Normal even though the populations and samples are skewed.
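The skewed case can be sketched in the same way. The gamma parameters (shape 2, scale 2), the seed, and the helper names are assumptions, since the text does not give them. Sample skewness is used here as a rough numerical stand-in for looking at the histograms: values near 0 indicate a symmetric distribution.

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # arbitrary seed, for reproducibility


def skewness(x):
    """Sample skewness: third central moment over SD cubed."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5


def bootstrap_means(sample, n_boot=500):
    """Means of n_boot resamples drawn with replacement from `sample`."""
    resamples = rng.choice(sample, size=(n_boot, len(sample)), replace=True)
    return resamples.mean(axis=1)


# Assumed gamma parameters (not given in the text): shape 2, scale 2
population = rng.gamma(shape=2.0, scale=2.0, size=1000)
print(f"population skewness: {skewness(population):.2f}")

boot = {}
for n in (10, 30, 100):
    sample = rng.choice(population, size=n, replace=False)
    boot[n] = bootstrap_means(sample)
    print(
        f"n={n:3d}  sample skewness={skewness(sample):5.2f}  "
        f"skewness of bootstrapped means={skewness(boot[n]):5.2f}  "
        f"SD of bootstrapped means={boot[n].std(ddof=1):.2f}"
    )
```

With a small sample, the bootstrapped means may retain some of the skew; the approach to symmetry is usually clearer at the larger sample sizes.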
Bootstrap resampling simulations show that distributions of the mean remain approximately Normal even though populations and samples are non-Normal. The Normal distribution of the mean is more apparent with larger samples. This implies that confidence intervals of means (which indicate how precisely the mean was estimated) are robust, provided sample size is large enough, even when the assumption of Normality does not hold.
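One way to turn bootstrapped means into a 95% CI without assuming Normal data is the percentile method, sketched below. This is one common choice and is not necessarily how the intervals in this post were computed; the gamma parameters and seed are again assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=4)  # arbitrary seed, for reproducibility

# Skewed sample of size 100 (assumed gamma parameters: shape 2, scale 2)
sample = rng.gamma(shape=2.0, scale=2.0, size=100)

# 500 bootstrapped samples, one per row, and their means
resamples = rng.choice(sample, size=(500, sample.size), replace=True)
boot_means = resamples.mean(axis=1)

# Percentile bootstrap: the middle 95% of the bootstrapped means
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {sample.mean():.2f}")
print(f"95% percentile bootstrap CI: ({ci_low:.2f}, {ci_high:.2f})")
```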
Python code to simulate data and generate figures is available in the file