Verify if data are normally distributed in R: part 2
In our previous post, we learned how to inspect whether our data were normally distributed using plots. It is always important to visualise our data. However, inspecting such plots is open to interpretation and, possibly, abuse.
We will now learn how to analyse our data and generate numerical values that describe how our data are distributed.
Quantifying the shape of distributions
Looking back at our previous post, we created three simulated data sets. The figure below plots the histograms and density graphs for these three data sets.
We can use functions from two different R packages to quantify the shape of these distributions plotted in red. The first is describe() from the psych package. The second is stat.desc() from the pastecs package.

```r
# Install packages if not already available
# install.packages("psych")
# install.packages("pastecs")

# Load libraries
library(psych)
library(pastecs)

# Assuming r1 from the previous post is available in the workspace
describe(r1$values)
stat.desc(r1$values, basic = FALSE, norm = TRUE)
```
The output of describe():

```
   vars  n mean   sd median trimmed mad min  max range  skew kurtosis   se
X1    1 30 0.02 0.95  -0.03    0.03 0.9  -2 1.83  3.83 -0.05     -0.6 0.17
```
The output of stat.desc():

```
      median        mean     SE.mean CI.mean.0.95         var     std.dev    coef.var
 -0.02845874  0.01817551  0.17279326  0.35340189  0.89572530  0.94642765 52.07159653
    skewness    skew.2SE    kurtosis    kurt.2SE  normtest.W  normtest.p
 -0.05346174 -0.06261735 -0.59726863 -0.35861409  0.98682151  0.96401280
```
Both of these functions generate various summary statistics that describe the data set r1. We can see the mean and standard deviation, as well as other familiar measures such as the median, the minimum, the maximum, the range, and the standard error.
These outputs also provide values of skew and kurtosis. In a normal distribution, skew and kurtosis are both zero. The further these values are from zero, the more likely it is that our data are not normally distributed.
Skew and kurtosis.
Skew: Positive values of skew indicate a pile-up of values on the left of the distribution (similar to data set r2), while negative values of skew indicate a pile-up of values on the right of the distribution (similar to data set r3).
Kurtosis: Positive values of kurtosis indicate a pointy and heavy-tailed distribution, whereas negative values indicate a flat and light-tailed distribution.
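As a minimal sketch, skew and kurtosis can also be computed by hand from the sample moments. Note that describe() and stat.desc() apply small-sample corrections, so their values will differ slightly from these simple formulas for small n:

```r
# Moment-based skew and excess kurtosis (simple, uncorrected formulas)
skewness <- function(x) {
  n <- length(x)
  m <- mean(x); s <- sd(x)
  sum((x - m)^3) / (n * s^3)
}
kurtosis <- function(x) {
  n <- length(x)
  m <- mean(x); s <- sd(x)
  sum((x - m)^4) / (n * s^4) - 3  # subtract 3 so a normal distribution scores 0
}

set.seed(1)
x <- rnorm(1000)
skewness(x)  # close to 0 for a normal sample
kurtosis(x)  # close to 0 for a normal sample
```

For a large normal sample, both values hover near zero, which is exactly the benchmark the text above describes.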
Let’s compare the skew and kurtosis across our three data sets.
```
     Skew  Kurtosis
r1: -0.053   -0.597
r2:  0.966    0.379
r3: -0.903   -0.402
```
As the labels in the above figure indicate, we simulated the data to have different levels of skew.
r1 was sampled from a normal distribution,
r2 was sampled from a distribution that was positively skewed, and
r3 was sampled from a distribution that was negatively skewed.
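The exact simulation code from the previous post is not reproduced here, but skewed samples like r2 and r3 can be generated, for example, from beta distributions. The shape parameters below are illustrative assumptions, not the original code:

```r
# Sketch: positively and negatively skewed samples of 30 observations
set.seed(42)
r2_like <- rbeta(30, shape1 = 2, shape2 = 8)  # positively skewed: pile-up on the left
r3_like <- rbeta(30, shape1 = 8, shape2 = 2)  # negatively skewed: pile-up on the right
```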
This is reflected in the values of skew returned for each of our data sets: close to zero for r1, almost 1 for r2, and -0.9 for r3.
The level of kurtosis in our data was not directly specified, but the reported values still describe the amount of kurtosis present in each data set.
skew.2SE and kurt.2SE
stat.desc() also provides values of skew.2SE and kurt.2SE. By converting skew and kurtosis to z-scores, it is possible to determine how common (or uncommon) the level of skew and kurtosis in our sample truly is. The values of skew.2SE and kurt.2SE are equal to skew and kurtosis divided by twice their respective standard errors. Normalizing skew and kurtosis in this way means that if the absolute value of skew.2SE or kurt.2SE is greater than 1, there is only a 5% chance (i.e. p < 0.05) of obtaining a value of skew or kurtosis as or more extreme than this by chance.
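As a sketch of how skew.2SE is obtained, we can divide the reported skew by twice the standard error of skewness. The formula below is the standard exact expression for that standard error (which appears to be what pastecs uses, since it reproduces the output above):

```r
# Standard error of skewness for a sample of size n
se_skew <- function(n) sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

n    <- 30
skew <- -0.0535          # skew of r1 from the output above
skew / (2 * se_skew(n))  # approximately -0.063, matching skew.2SE
```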
Because these normalized values involve dividing by two standard errors, they are sensitive to the size of the sample: the standard error shrinks as the sample grows, so even trivial amounts of skew or kurtosis can cross the threshold in a large sample. For this reason, skew.2SE and kurt.2SE are most appropriate for relatively small samples, roughly 30-50 observations. For larger samples, it is better to use stricter thresholds corresponding to 2.58SE (p < 0.01) and 3.29SE (p < 0.001). In very large samples, say 200 observations or more, it is best to look at the shape of the distribution visually and consider the actual values of skew and kurtosis, not their normalized values.
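This sample-size sensitivity is easy to demonstrate. Using the standard formula for the standard error of skewness, the same fixed amount of skew drifts from "not significant" to "highly significant" purely because n grows:

```r
# Standard error of skewness shrinks as n grows
se_skew <- function(n) sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

# A fixed skew of 0.5 compared against the 2SE criterion at three sample sizes:
0.5 / (2 * se_skew(30))    # below 1: not "significant" at n = 30
0.5 / (2 * se_skew(1000))  # well above 1: "significant" at n = 1000
```

The underlying shape of the data has not changed at all, only the sample size, which is why visual inspection and raw skew values are preferred for large samples.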
In this post we learned how to quantify the shape of the distribution associated with our data. In the final blog in this series, we will learn how to formally test whether a distribution is normal using the Shapiro-Wilk test.