Verify if data are normally distributed in R: part 3
In the first and second posts of this series, we learned how to graph our data using histograms and Q-Q plots to see whether they are normally distributed, and how to quantify the shape of the distribution by considering skew and kurtosis. In this, the final post in the series, we will learn to use the Shapiro-Wilk test to determine whether data are normally distributed.
Looking back at our “first post”:””, we simulated three data sets. The figure below plots the histograms and density graphs for these three data sets.
We will now use the Shapiro-Wilk test to determine whether these data deviate from a comparable normal distribution. Specifically, this test compares our data to a normally distributed set of data with the same mean and standard deviation.
If the test is non-significant (p>0.05), it is telling us that our data are not significantly different from a normal distribution.
If the test is significant (p<0.05), it is telling us that our data are significantly different from a normal distribution.
Let’s run the Shapiro-Wilk test on our three data sets using the shapiro.test() function:
shapiro.test(r1$values)

    Shapiro-Wilk normality test

data:  r1$values
W = 0.98682, p-value = 0.964

shapiro.test(r2$values)

    Shapiro-Wilk normality test

data:  r2$values
W = 0.91098, p-value = 0.01575

shapiro.test(r3$values)

    Shapiro-Wilk normality test

data:  r3$values
W = 0.87679, p-value = 0.002381
In line with our previous “visual inspection”:”” and “quantification”:””, results from the Shapiro-Wilk test indicate that r1 is likely normally distributed, whereas r2 and r3 are not.
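If you would rather extract the p-values than read them off the printed output, the steps above can be scripted. The following is a minimal sketch, not the code used in this series: the three samples are simulated here (the distributions, sample size and seed are assumptions standing in for r1, r2 and r3), and it relies on the fact that shapiro.test() returns a list with `statistic` and `p.value` elements.

```r
# Stand-in data: these are NOT the r1, r2 and r3 from the earlier posts
set.seed(42)
samples <- list(
  normal_like  = rnorm(30, mean = 50, sd = 10),
  right_skewed = rexp(30, rate = 1),
  log_normal   = rlnorm(30, meanlog = 0, sdlog = 1)
)

# Run the test on each sample and keep W and the p-value
results <- t(sapply(samples, function(x) {
  res <- shapiro.test(x)
  c(W = unname(res$statistic), p_value = res$p.value)
}))

# Flag samples that differ significantly from normal at alpha = 0.05
summary_table <- data.frame(results, significant = results[, "p_value"] < 0.05)
print(summary_table)
```

Each row of `summary_table` corresponds to one sample, so the decision rule (p<0.05) is applied to all data sets in one step.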
Those of you who were attentive in the “last post”:”” will remember that we already came across the Shapiro-Wilk test. Running the stat.desc() function from the pastecs package provides an output that includes the p-value of the Shapiro-Wilk test.
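As a reminder of what that looks like, here is a minimal sketch. It assumes the pastecs package is installed and uses a simulated vector `x` as a stand-in for one of our data sets. With `norm = TRUE`, stat.desc() adds normality statistics to its output, and the Shapiro-Wilk statistic and p-value should appear under the names `normtest.W` and `normtest.p`.

```r
# install.packages("pastecs")  # if not already installed
library(pastecs)

set.seed(42)
x <- rnorm(30, mean = 50, sd = 10)  # stand-in data, not r1 from this series

# basic = FALSE drops the counts, min, max and so on;
# norm = TRUE adds skew, kurtosis and the Shapiro-Wilk results
desc <- stat.desc(x, basic = FALSE, norm = TRUE)
print(round(desc, 3))

# The p-value reported by stat.desc() should match a direct call
p_from_stat_desc <- desc[["normtest.p"]]
p_from_shapiro   <- shapiro.test(x)$p.value
print(c(p_from_stat_desc, p_from_shapiro))
```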
The main downside to the Shapiro-Wilk test is that it is quite sensitive when the sample size is large (>80). With large samples, even slight deviations from a normal distribution can produce a significant result. As always with statistical results, care must be taken when interpreting results from the Shapiro-Wilk test. Importantly, always plot your data to see whether the plots corroborate the test results.
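To get a feel for this sensitivity, we can simulate it. The sketch below is an illustration under assumptions of my choosing: the "slight deviation" is a t-distribution with 10 degrees of freedom (bell-shaped, with slightly heavier tails than a normal distribution), the sample sizes are 30 and 2000, and `rejection_rate()` is a hypothetical helper written for this example. Note that shapiro.test() only accepts between 3 and 5000 values.

```r
set.seed(1)

# Proportion of simulated samples of size n for which the Shapiro-Wilk
# test is significant at the given alpha
rejection_rate <- function(n, n_sims = 200, df = 10, alpha = 0.05) {
  p_values <- replicate(n_sims, shapiro.test(rt(n, df = df))$p.value)
  mean(p_values < alpha)
}

rate_small <- rejection_rate(30)    # small samples
rate_large <- rejection_rate(2000)  # large samples, same distribution

print(c(n_30 = rate_small, n_2000 = rate_large))
```

The underlying distribution is identical in both cases; only the sample size changes. The test should flag the large samples far more often than the small ones, which is why a significant result on a large data set should always be checked against a plot.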
We have seen three different ways to test whether data are normally distributed. None of them is perfect; each has its strengths and weaknesses. The best solution is to use all three approaches and come to an informed decision, rather than simply eyeballing the shape of a distribution or relying entirely on whether a p-value is above or below 0.05.
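One way to make a habit of using all three approaches is to wrap them in a single function. This is a rough sketch rather than a definitive implementation: `check_normality()` is a hypothetical helper, and the skew and kurtosis here are simple moment-based estimates, which may differ slightly from the values reported by other packages.

```r
check_normality <- function(x, alpha = 0.05) {
  x <- x[!is.na(x)]

  # 1. Visual inspection: histogram and Q-Q plot side by side
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))
  hist(x, main = "Histogram", xlab = "values")
  qqnorm(x)
  qqline(x)

  # 2. Shape: moment-based skew and excess kurtosis (both 0 for a normal)
  z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))
  skew <- mean(z^3)
  kurtosis <- mean(z^4) - 3

  # 3. Shapiro-Wilk test
  sw <- shapiro.test(x)

  list(
    skew = skew,
    kurtosis = kurtosis,
    W = unname(sw$statistic),
    p_value = sw$p.value,
    significant = sw$p.value < alpha
  )
}

# Usage with simulated stand-in data
set.seed(42)
result <- check_normality(rnorm(30, mean = 50, sd = 10))
str(result)
```

The function returns the numbers and draws the plots in the same call, so the test result is always read alongside the shape of the data.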