Verify if data are normally distributed in R: part 2

In our previous post, we learned how to inspect whether our data were normally distributed using plots. It is always important to visualise our data. However, inspecting such plots is open to interpretation and, possibly, abuse.

We will now learn how to analyse our data and generate numerical values that describe how our data are distributed.

Quantifying the shape of distributions

Looking back at our previous post, we created three simulated data sets. The figure below plots the histograms and density graphs for these three data sets.

Figure 1: top = r1, middle = r2, bottom = r3

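If r1, r2 and r3 from the previous post are no longer in your workspace, something along these lines will produce comparable data frames. This is only a sketch: the exact distributions and parameters used in part 1 may differ, and rbeta() is just one convenient way to generate skewed samples.

# Recreate three small data sets with different shapes (a sketch; the
# previous post may have used different distributions or parameters)
set.seed(42)

# r1: sampled from a normal distribution
r1 <- data.frame(values = rnorm(30, mean = 0, sd = 1))

# r2: positively skewed sample (long right tail)
r2 <- data.frame(values = rbeta(30, shape1 = 2, shape2 = 8))

# r3: negatively skewed sample (long left tail)
r3 <- data.frame(values = rbeta(30, shape1 = 8, shape2 = 2))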

We can use functions from two different R packages to quantify the shape of these distributions plotted in red. The first is describe() from the psych package. The second is stat.desc() from the pastecs package.

# Install packages if not already available
# install.packages("psych")
# install.packages("pastecs")

# Load libraries
library(psych)
library(pastecs)

# Assuming r1 from previous post is available in workspace
describe(r1$values)

# basic = FALSE omits the basic counts; norm = TRUE adds skew, kurtosis
# and the Shapiro-Wilk normality test to the output
stat.desc(r1$values, basic = FALSE, norm = TRUE)

The output of describe(r1$values) is:

   vars  n mean   sd median trimmed mad min  max range  skew kurtosis   se
X1    1 30 0.02 0.95  -0.03    0.03 0.9  -2 1.83  3.83 -0.05     -0.6 0.17

The output of stat.desc(r1$values, basic = FALSE, norm = TRUE) is:

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var 
 -0.02845874   0.01817551   0.17279326   0.35340189   0.89572530   0.94642765  52.07159653 
    skewness     skew.2SE     kurtosis     kurt.2SE   normtest.W   normtest.p 
 -0.05346174  -0.06261735  -0.59726863  -0.35861409   0.98682151   0.96401280 

Both of these functions generate various summary statistics that describe the data set r1. We can see the mean and standard deviation, as well as other familiar measures such as the median, the minimum, the maximum, the range, and the standard error.

These outputs also provide values of skew and kurtosis. In a normal distribution, skew and kurtosis are both zero. The further away these values are from zero, the more likely it is that our data are not normally distributed.
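As a quick illustration, the skew() and kurtosi() functions from the psych package both return values close to zero when applied to a large sample drawn from a normal distribution:

# Skew and (excess) kurtosis of a large normal sample are close to zero
library(psych)

set.seed(1)
x <- rnorm(10000)

skew(x)      # approximately 0
kurtosi(x)   # approximately 0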

Skew and kurtosis

Skew: Positive values of skew indicate a pile-up of values on the left of the distribution with a long tail to the right (similar to data set r2), while negative values of skew indicate that the values are piled up on the right of the distribution with a long tail to the left (similar to data set r3).

Kurtosis: Positive values of kurtosis indicate a pointy and heavy-tailed distribution, whereas negative values indicate a flat and light-tailed distribution.
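To make this concrete, here is a small demonstration of the sign of kurtosis using a heavy-tailed and a light-tailed distribution (again relying on psych's kurtosi()):

# Heavy tails give positive kurtosis, light tails give negative kurtosis
library(psych)

set.seed(2)
kurtosi(rt(10000, df = 5))   # t-distribution with 5 df: heavy tails, positive
kurtosi(runif(10000))        # uniform distribution: light tails, negative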


Let’s compare the skew and kurtosis across our three data sets.

    Skew    Kurtosis
r1: -0.053  -0.597
r2:  0.966   0.379
r3: -0.903  -0.402

As the labels in the above figure indicate, we simulated the data to have different levels of skew. r1 was sampled from a normal distribution, r2 was sampled from a distribution that was positively skewed, and r3 was sampled from a distribution that was negatively skewed.

This is reflected in the values of skew returned for each of our data sets: close to zero for r1, almost 1 for r2, and -0.9 for r3.

The level of kurtosis was not directly specified when we simulated the data, but we can see that these values still describe the amount of kurtosis in each data set.
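One convenient way to build a comparison like the table above is to loop over the three data sets with sapply(), assuming r1, r2 and r3 are all available in the workspace:

# Skew and kurtosis for all three data sets in one step
library(psych)

datasets <- list(r1 = r1$values, r2 = r2$values, r3 = r3$values)
round(t(sapply(datasets, function(x) c(skew = skew(x), kurtosis = kurtosi(x)))), 3)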

skew.2SE and kurt.2SE

stat.desc() also provides skew.2SE and kurt.2SE.

By converting skew and kurtosis to z-scores, it is possible to determine how common (or uncommon) the level of skew and kurtosis in our sample truly is. The values of skew.2SE and kurt.2SE are equal to skew and kurtosis divided by twice their respective standard errors. Normalizing skew and kurtosis in this way means that if skew.2SE or kurt.2SE is greater than 1 in absolute value, there is less than a 5% chance (i.e. p < 0.05) of obtaining a value of skew or kurtosis that extreme if the data really came from a normal distribution.

Because these normalized values involve dividing by 2 standard errors, and the standard error shrinks as the sample grows, they are sensitive to the size of the sample. skew.2SE and kurt.2SE are most appropriate for relatively small samples (roughly 30-50 observations). For larger samples, it is better to compute the values corresponding to 2.58 SE (p < 0.01) and 3.29 SE (p < 0.001). In very large samples, say 200 observations or more, it is best to look at the shape of the distribution visually and to consider the actual values of skew and kurtosis rather than their normalized values.
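To see where skew.2SE and kurt.2SE come from, they can be reproduced by hand. The sketch below assumes stat.desc() uses the standard small-sample formulas for the standard errors of skewness and kurtosis; the results should be close to (though not necessarily identical with) the values reported above for r1.

# Manual check of skew.2SE and kurt.2SE (assuming the standard SE formulas)
library(psych)

x <- r1$values
n <- length(x)

se_skew <- sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_kurt <- 2 * se_skew * sqrt((n^2 - 1) / ((n - 3) * (n + 5)))

# Small differences can arise because packages use slightly different
# skew and kurtosis estimators
skew(x) / (2 * se_skew)      # compare with skew.2SE from stat.desc()
kurtosi(x) / (2 * se_kurt)   # compare with kurt.2SE from stat.desc()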

Summary

In this post we learned how to quantify the shape of the distribution associated with our data. In the final post of this series, we will learn how to formally test whether a distribution is normal using the Shapiro-Wilk test.
