Verify if data are normally distributed in R: part 2

In our previous post, we learned how to inspect whether our data were normally distributed using plots. It is always important to visualise our data. However, interpreting such plots is subjective and open to abuse.
We will now learn how to analyse our data and generate numerical values that describe how our data are distributed.
Quantifying the shape of distributions
Looking back at our previous post, we created three simulated data sets. The figure below plots the histograms and density graphs for these three data sets.
Figure 1: top = r1, middle = r2, bottom = r3
We can use functions from two different R packages to quantify the shape of the distributions plotted in red. The first is describe() from the psych package. The second is stat.desc() from the pastecs package.
# Install packages if not already available
# install.packages("psych")
# install.packages("pastecs")
# Load libraries
library(psych)
library(pastecs)
# Assuming r1 from previous post is available in workspace
describe(r1$values)
stat.desc(r1$values, basic = FALSE, norm = TRUE)
The output of describe(r1$values) is:
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 30 0.02 0.95 -0.03 0.03 0.9 -2 1.83 3.83 -0.05 -0.6 0.17
The output of stat.desc(r1$values, basic = FALSE, norm = TRUE) is:
median mean SE.mean CI.mean.0.95 var std.dev coef.var
-0.02845874 0.01817551 0.17279326 0.35340189 0.89572530 0.94642765 52.07159653
skewness skew.2SE kurtosis kurt.2SE normtest.W normtest.p
-0.05346174 -0.06261735 -0.59726863 -0.35861409 0.98682151 0.96401280
Both of these functions generate various summary statistics that describe the data set r1. We can see the mean and standard deviation, as well as other familiar measures such as the median, minimum, maximum, range, and standard error.
These outputs also provide values of skew and kurtosis. In a normal distribution, both should be zero. (Note that the kurtosis reported here is excess kurtosis, i.e. kurtosis minus 3, which is why a perfectly normal sample scores zero rather than 3.) The further these values are from zero, the more likely it is that our data are not normally distributed.
Skew and kurtosis.
Skew: Positive values of skew indicate a pile-up of values on the left of the distribution, with a long tail to the right (similar to data set r2), while negative values of skew indicate the values are piled up on the right, with a long tail to the left (similar to data set r3).
Kurtosis: Positive values of kurtosis indicate a pointy and heavy-tailed distribution, whereas negative values indicate a flat and light-tailed distribution.
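To see these signs in action, here is a small sketch using psych's skew() and kurtosi() helpers. The simulated data and variable names below are our own illustration, not the r1–r3 sets from the previous post:

```r
# Illustrative check of the sign of skew and kurtosis
library(psych)
set.seed(42)
sym_data   <- rnorm(1000)   # symmetric: skew near 0
right_tail <- rexp(1000)    # values piled on the left, long right tail
left_tail  <- -rexp(1000)   # values piled on the right, long left tail

skew(sym_data)      # close to 0
skew(right_tail)    # clearly positive, like r2
skew(left_tail)     # clearly negative, like r3
kurtosi(sym_data)   # excess kurtosis; close to 0 for normal data
```

Exponential samples have a strong positive skew, so negating them flips the sign, which makes the contrast with the symmetric normal sample easy to see.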
Let’s compare the skew and kurtosis across our three data sets.
      Skew     Kurtosis
r1:  -0.053    -0.597
r2:   0.966     0.379
r3:  -0.903    -0.402
As the labels in the above figure indicate, we simulated the data to have different levels of skew. r1 was sampled from a normal distribution, r2 from a positively skewed distribution, and r3 from a negatively skewed distribution.
This is reflected in the values of skew returned for each of our data sets: close to zero for r1, almost 1 for r2, and about -0.9 for r3.
The level of kurtosis was not directly specified when we simulated the data, but these values nonetheless describe the amount of kurtosis present in each data set.
skew.2SE and kurt.2SE
stat.desc() also provides skew.2SE and kurt.2SE.
By converting skew and kurtosis to z-scores, we can determine how common (or uncommon) the level of skew and kurtosis in our sample truly is. The values of skew.2SE and kurt.2SE are equal to skew and kurtosis divided by twice their standard errors. Normalized in this way, if the absolute value of skew.2SE or kurt.2SE is greater than 1, we can conclude that there is only a 5% chance (i.e. p < 0.05) of obtaining a value of skew or kurtosis at least this extreme by chance if the data were truly normal.
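As a sketch of what stat.desc() is doing under the hood, we can compute skew.2SE by hand. This assumes the usual standard-error formula for sample skewness; the exact skewness estimator can differ slightly between packages, so expect a close but not necessarily identical value:

```r
# Hand computation of skew.2SE (assumes r1 from the previous post)
x  <- r1$values
n  <- length(x)
m3 <- mean((x - mean(x))^3)
sk <- m3 / sd(x)^3                  # one common skewness estimator
se_skew  <- sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
skew.2SE <- sk / (2 * se_skew)
skew.2SE            # should be close to the stat.desc() value above
abs(skew.2SE) > 1   # TRUE would suggest significant skew at p < .05
```

kurt.2SE works the same way, with kurtosis divided by twice its own standard error.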
Because these normalized values involve dividing by standard errors, they are sensitive to sample size. skew.2SE and kurt.2SE are most appropriate for relatively small samples (roughly 30-50 observations). For larger samples, it is better to compare against stricter cut-offs corresponding to 2.58 SE (p < 0.01) and 3.29 SE (p < 0.001). In very large samples, say 200 observations or more, even trivial departures from normality become statistically significant, so it is best to inspect the shape of the distribution visually and consider the actual values of skew and kurtosis, not their normalized values.
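For a larger sample, the same z-score logic can be applied against the stricter cut-offs. A sketch with simulated data (the standard error of kurtosis would be handled analogously):

```r
# Sketch: stricter cut-offs for skew in a larger sample
library(psych)
set.seed(7)
x <- rnorm(150)                      # simulated larger sample
n <- length(x)
se_skew <- sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
z_skew  <- skew(x) / se_skew         # z-score: skew in standard-error units
abs(z_skew) > 2.58   # exceeds the p < .01 cut-off?
abs(z_skew) > 3.29   # exceeds the p < .001 cut-off?
```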
Summary
In this post we learned how to quantify the shape of the distribution associated with our data. In the final blog in this series, we will learn how to formally test whether a distribution is normal using the Shapiro-Wilk test.