## Verify if data are normally distributed in R: part 2

In our previous post, we learned how to inspect whether our data were normally distributed using plots. It is always important to visualise our data. However, inspecting such plots is open to interpretation and, possibly, abuse.

We will now learn how to analyse our data and generate numerical values that describe how our data are distributed.

### Quantifying the shape of distributions

Looking back at our previous post, we created three simulated data sets. The figure below plots the histograms and density graphs for these three data sets.

Figure 1: Histograms and density graphs for the three data sets (top = r1, middle = r2, bottom = r3).

We can use functions from two different R packages to quantify the shape of these distributions plotted in red. The first is `describe()` from the `psych` package. The second is `stat.desc()` from the `pastecs` package.

```
# Install packages if not already available
# install.packages("psych")
# install.packages("pastecs")

library(psych)
library(pastecs)

# Assuming r1 from the previous post is available in the workspace
describe(r1$values)

stat.desc(r1$values, basic = FALSE, norm = TRUE)
```

The output of `describe(r1$values)` is:

```
   vars  n mean   sd median trimmed mad min  max range  skew kurtosis   se
X1    1 30 0.02 0.95  -0.03    0.03 0.9  -2 1.83  3.83 -0.05     -0.6 0.17
```

The output of `stat.desc(r1$values, basic = FALSE, norm = TRUE)` is:

```
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
 -0.02845874   0.01817551   0.17279326   0.35340189   0.89572530   0.94642765  52.07159653
    skewness     skew.2SE     kurtosis     kurt.2SE   normtest.W   normtest.p
 -0.05346174  -0.06261735  -0.59726863  -0.35861409   0.98682151   0.96401280
```

Both of these functions generate various summary statistics that describe the data set `r1`. We can see the mean and standard deviation, as well as other familiar measures such as the median, the minimum, the maximum, the range, and the standard error.

These outputs also provide values of skew and kurtosis. In a normal distribution, skew and kurtosis should be zero. The further these values are from zero, the more likely it is that our data are not normally distributed.

**Skew and kurtosis**

Skew: Positive values of skew indicate a pile-up of values on the left of the distribution with a long right tail (similar to data set r2), while negative values of skew indicate the values are piled up on the right of the distribution with a long left tail (similar to data set r3).

Kurtosis: Positive values of kurtosis indicate a pointy and heavy-tailed distribution, whereas negative values indicate a flat and light-tailed distribution.
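As a quick illustration of these sign conventions, the sketch below draws from distributions with known shapes and checks the sign of the resulting statistics. The specific distributions (beta, uniform, t) are my own illustrative choices, not data from this post:

```r
library(psych)
set.seed(1)

skew(rbeta(1e4, 2, 8))     # positive: values piled up on the left, long right tail
skew(rbeta(1e4, 8, 2))     # negative: values piled up on the right, long left tail
kurtosi(runif(1e4))        # negative: flat, light-tailed distribution
kurtosi(rt(1e4, df = 5))   # positive: pointy, heavy-tailed distribution
```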

Let’s compare the skew and kurtosis across our three data sets.

```
     Skew   Kurtosis
r1: -0.053   -0.597
r2:  0.966    0.379
r3: -0.903   -0.402
```
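A table like the one above can be produced by looping over the data sets with `sapply()`. The sketch below recreates stand-ins for the three samples with the same qualitative shapes; the generating distributions are assumptions, not the exact ones used in the previous post:

```r
library(psych)
set.seed(42)

# Stand-ins for the three data sets: normal, positive skew, negative skew
r1 <- data.frame(values = rnorm(30))
r2 <- data.frame(values = rbeta(30, 2, 8))
r3 <- data.frame(values = rbeta(30, 8, 2))

# Tabulate skew and kurtosis for each data set
sapply(list(r1 = r1$values, r2 = r2$values, r3 = r3$values),
       function(x) c(skew = skew(x), kurtosis = kurtosi(x)))
```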

As the labels in the above figure indicate, we simulated the data to have different levels of skew. `r1` was sampled from a normal distribution, `r2` was sampled from a distribution that was positively skewed, and `r3` was sampled from a distribution that was negatively skewed.

This is reflected in the values of skew returned for each of our data sets: close to zero for `r1`, almost 1 for `r2`, and -0.9 for `r3`.

The level of kurtosis in our data was not directly specified when we simulated it, but we can see that these values do describe the amount of kurtosis in each data set.

**skew.2SE and kurt.2SE**

`stat.desc()` also provides `skew.2SE` and `kurt.2SE`.

By converting skew and kurtosis to z-scores, it is possible to determine how common (or uncommon) the level of skew and kurtosis in our sample truly is. The values of `skew.2SE` and `kurt.2SE` are equal to skew and kurtosis divided by 2 standard errors. By normalising skew and kurtosis in this way, if the absolute values of `skew.2SE` and `kurt.2SE` are greater than 1, we can conclude that there is only a 5% chance (i.e. p < 0.05) of obtaining values of skew and kurtosis as or more extreme than this by chance.
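To make this concrete, the `skew.2SE` value for `r1` can be reproduced by hand from the skewness reported above. The formula below is the standard expression for the standard error of skewness; I am assuming it matches what `pastecs` uses internally:

```r
# Sketch: reproduce skew.2SE for r1 by hand
n <- 30                     # sample size of r1
se_skew <- sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
skew_val <- -0.05346174     # skewness of r1 from the stat.desc() output
skew_val / (2 * se_skew)    # close to the reported skew.2SE of -0.0626
```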

Because these normalised values involve dividing by 2 standard errors, they are sensitive to the size of the sample. `skew.2SE` and `kurt.2SE` are most appropriate for relatively small samples (roughly 30-50 observations). For larger samples, it is better to compute values corresponding to `2.58SE` (p < 0.01) and `3.29SE` (p < 0.001). In very large samples, say 200 observations or more, it is best to look at the shape of the distribution visually and consider the actual values of skew and kurtosis, not their normalised values.
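For example, a skew value from a larger sample can be compared against these stricter thresholds directly. Both the sample size and the skew value below are made up for illustration:

```r
# Sketch: stricter thresholds for a larger sample (hypothetical values)
n <- 100
se_skew <- sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
skew_val <- 0.8
abs(skew_val) > 2.58 * se_skew   # exceeds the p < .01 threshold?
abs(skew_val) > 3.29 * se_skew   # exceeds the p < .001 threshold?
```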

### Summary

In this post we learned how to quantify the shape of the distribution associated with our data. In the final post in this series, we will learn how to specifically test whether a distribution is normal using the Shapiro-Wilk test.