Verify if data are normally distributed in R: part 3

Posted on June 21, 2018 by Martin Héroux

In the first and second posts of this series, we learned how to graph our data using histograms and Q-Q plots to see whether they are normally distributed, and we quantified the shape of the distribution by considering skew and kurtosis. In this, the final post in the series, we will learn to use the Shapiro-Wilk test to determine whether data are normally distributed.

Shapiro-Wilk test

In our first post, we simulated three data sets. The figure below plots the histograms and density graphs for these three data sets.

Figure: histograms and density graphs of the three data sets (top = r1, middle = r2, bottom = r3).

We will now use the Shapiro-Wilk test to determine whether these data deviate from a comparable normal distribution. Specifically, this test compares our data to a normally distributed set of data with the same mean and standard deviation.

If the test is non-significant (p>0.05), it is telling us that our data are not significantly different from a normal distribution.

If the test is significant (p<0.05), it is telling us that our data are significantly different from a normal distribution.

Let’s run the Shapiro-Wilk test on our three data sets using shapiro.test():

shapiro.test(r1$values)

    Shapiro-Wilk normality test

data:  r1$values
W = 0.98682, p-value = 0.964

shapiro.test(r2$values)


    Shapiro-Wilk normality test

data:  r2$values
W = 0.91098, p-value = 0.01575

shapiro.test(r3$values)

    Shapiro-Wilk normality test

data:  r3$values
W = 0.87679, p-value = 0.002381

In line with our previous visual inspection and quantification, results from the Shapiro-Wilk test indicate that r1 is likely normally distributed, whereas r2 and r3 are not.
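The decision rule described above can also be applied programmatically, because shapiro.test() returns an object whose W statistic and p-value can be extracted. The sketch below is only an illustration: the two samples are simulated stand-ins (not the r1, r2 and r3 data from the first post), and interpret_shapiro() is a hypothetical helper name, not a function from any package.

```r
# Simulated stand-ins for illustration only; NOT the r1, r2, r3 data from the first post
set.seed(42)
normal_sample <- rnorm(30, mean = 10, sd = 2)
skewed_sample <- rexp(30, rate = 1)

# Hypothetical helper: run the test and apply the conventional 0.05 threshold
interpret_shapiro <- function(x, alpha = 0.05) {
  res <- shapiro.test(x)
  data.frame(
    W = unname(res$statistic),                  # test statistic
    p_value = res$p.value,                      # p-value of the test
    deviates_from_normal = res$p.value < alpha  # TRUE if significant
  )
}

print(interpret_shapiro(normal_sample))
print(interpret_shapiro(skewed_sample))
```

The alpha argument makes the 0.05 cut-off explicit, so it can be changed if a different threshold is preferred.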

Attentive readers will recall that we already came across the Shapiro-Wilk test in the last post. Running the stat.desc() function from the pastecs package with norm = TRUE provides an output that includes the W and p values of the Shapiro-Wilk test (reported as normtest.W and normtest.p).
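As a quick check, the Shapiro-Wilk values reported by stat.desc() can be compared with a direct call to shapiro.test(). This is a minimal sketch that assumes the pastecs package is installed and uses simulated data rather than the data sets from this series.

```r
# Assumes pastecs is installed: install.packages("pastecs")
set.seed(1)
x <- rnorm(40)  # simulated data, for illustration only

sw <- shapiro.test(x)

if (requireNamespace("pastecs", quietly = TRUE)) {
  # norm = TRUE adds skew, kurtosis and the Shapiro-Wilk results to the output;
  # basic = FALSE and desc = FALSE hide the other descriptive statistics
  desc <- pastecs::stat.desc(x, basic = FALSE, desc = FALSE, norm = TRUE)
  print(desc[c("normtest.W", "normtest.p")])
} else {
  message("pastecs is not installed; showing shapiro.test() only")
}

# The direct call, for comparison
print(sw)
```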

The main downside of the Shapiro-Wilk test is that it becomes very sensitive when the sample size is large (>80). With large samples, even slight deviations from a normal distribution will produce a significant result. As always with statistical results, care must be taken when interpreting results from the Shapiro-Wilk test. Importantly, always plot your data to see whether the plots corroborate the test results.
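The effect of sample size can be illustrated with a small simulation. The sketch below draws a small and a large sample from the same mildly right-skewed distribution; the gamma distribution and the sample sizes are assumptions chosen for illustration, with the large sample set to 5000 (the maximum shapiro.test() accepts) to make the effect obvious. With so many values, the mild skew is expected to give a very small p-value, whereas a sample of 30 from the same distribution will often pass the test.

```r
set.seed(2018)

# Same mildly right-skewed population for both samples (an assumption for illustration)
small_sample <- rgamma(30, shape = 20, rate = 1)
large_sample <- rgamma(5000, shape = 20, rate = 1)  # shapiro.test() accepts 3 to 5000 values

p_small <- shapiro.test(small_sample)$p.value
p_large <- shapiro.test(large_sample)$p.value
print(c(n30 = p_small, n5000 = p_large))

# Plot the data before trusting the p-value: how far do the points stray from the line?
qqnorm(large_sample)
qqline(large_sample)
```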

Summary

We have seen three different ways to test whether data are normally distributed. None of them is perfect; each has its strengths and weaknesses. The best solution is to use all three approaches and come to an informed decision, rather than simply eyeballing the shape of a distribution or relying entirely on whether a p-value is above or below 0.05.
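One way to combine the three approaches is a small helper that draws the plots, computes skew and kurtosis, and runs the Shapiro-Wilk test in one call. This is only a sketch: check_normality() is a hypothetical name, and the skew and kurtosis here are simple moment-based estimates that may differ slightly from the values reported by stat.desc().

```r
# Hypothetical helper combining the three approaches from this series
check_normality <- function(x, plot = TRUE) {
  x <- x[!is.na(x)]
  z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))  # standardise with the population SD

  if (plot) {
    old_par <- par(mfrow = c(1, 2))
    on.exit(par(old_par))
    hist(x, main = "Histogram", xlab = "Value")  # 1. visual inspection
    qqnorm(x)
    qqline(x)
  }

  sw <- shapiro.test(x)                          # 3. Shapiro-Wilk test
  c(
    n = length(x),
    skew = mean(z^3),                            # 2. shape: 0 for a symmetric sample
    excess_kurtosis = mean(z^4) - 3,             #    0 for a normal distribution
    W = unname(sw$statistic),
    p_value = sw$p.value
  )
}

set.seed(7)
print(round(check_normality(rnorm(40), plot = FALSE), 3))
```

Setting plot = FALSE returns only the numbers, which is convenient when checking many variables in a loop; the plots should still be inspected before drawing a conclusion.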

