## Verify if data are normally distributed in R: part 1

Many statistical tests assume that the sampling distribution is normally distributed. This does not mean that the data we collected for our experiment are normally distributed, but rather that the distribution of mean values from many samples of the same size will be normally distributed. Unfortunately, we do not have access to the sampling distribution. However, we know that if our sample is approximately normally distributed, so too will be the sampling distribution. In addition, the central limit theorem tells us that the sampling distribution of the mean approaches normality as the sample size grows, even when the data themselves are not normal.
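To make the idea of a sampling distribution concrete, here is a minimal simulation sketch in base R. The exponential population, the number of repetitions and the seed are assumptions chosen purely for illustration; they are not related to the data sets used in this post.

```r
# Simulate the sampling distribution of the mean (illustrative values only)
set.seed(42)          # arbitrary seed, for reproducibility
n_samples   <- 5000   # number of repeated samples (assumed)
sample_size <- 30     # observations per sample, as in this post

# Draw each sample from a clearly skewed population and keep its mean
sample_means <- replicate(n_samples, mean(rexp(sample_size, rate = 1)))

# The raw observations are skewed, but the sample means are far more symmetric
hist(sample_means, breaks = 40,
     main = "Simulated sampling distribution of the mean",
     xlab = "Sample mean")
```

In a real experiment we only ever see one such sample, which is why we have to reason about the sampling distribution indirectly.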

There are a few ways to assess whether our data are normally distributed, the first of which is to visualize them.

### Three different samples

To verify whether our data (and the underlying sampling distribution) are normally distributed, we will create three simulated data sets, which can be downloaded here (`r1.txt`, `r2.txt`, `r3.txt`). Each sample contains 30 observations from a different population. The samples are plotted below.
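If the files are not at hand, stand-in data can be simulated to follow along. The sketch below is only an assumption about what such data might look like: the three populations (normal, exponential, beta) and the seed are hypothetical and are not the ones used to create the real `r1.txt`, `r2.txt` and `r3.txt`.

```r
# Hypothetical stand-ins for r1.txt, r2.txt and r3.txt.
# The populations below are assumptions, not those behind the real files.
set.seed(1)
sim <- list(
  r1 = rnorm(30, mean = 0, sd = 1),         # roughly symmetric
  r2 = rexp(30, rate = 1),                  # right-skewed
  r3 = rbeta(30, shape1 = 5, shape2 = 1)    # left-skewed
)

# Write one value per line with no header, to a temporary directory
out_dir <- tempdir()
for (name in names(sim)) {
  write.table(sim[[name]], file.path(out_dir, paste0(name, ".txt")),
              row.names = FALSE, col.names = FALSE)
}
```

Pointing `setwd()` at `out_dir` would then let the rest of the code in this post run against these stand-ins, although the resulting figures will differ from the ones shown here.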

### Frequency distributions and density plots

The first graphs that we will learn to make in R are frequency distributions and density plots. These graphs are useful because they allow us to see the general shape of the distribution.

```
# Load ggplot2
library(ggplot2)
# Set the working directory to the location of the data (change as needed)
setwd('~/Desktop')
# Read the data (the files have no header row)
r1 <- read.csv('r1.txt', header = FALSE)
r2 <- read.csv('r2.txt', header = FALSE)
r3 <- read.csv('r3.txt', header = FALSE)
# Rename the columns
names(r1) <- 'values'
names(r2) <- 'values'
names(r3) <- 'values'
# Generate and save the histogram
p1 <- ggplot(r1, aes(x = values)) +
  geom_histogram(binwidth = 0.25, colour = "black", fill = "white")
p1
ggsave('r1.png', plot = p1)
# Repeat for r2 and r3
```
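Rather than copying the plotting code for each sample, the "repeat for r2 and r3" step can be wrapped in a small helper. This is a sketch only: `plot_histogram` is a hypothetical function name, and the data frames are simulated stand-ins so that the block runs on its own.

```r
library(ggplot2)

# Hypothetical helper: histogram of a data frame's `values` column
plot_histogram <- function(df, binwidth = 0.25) {
  ggplot(df, aes(x = values)) +
    geom_histogram(binwidth = binwidth, colour = "black", fill = "white")
}

# Simulated stand-ins; in this post these would be the r1, r2 and r3
# data frames read from file
set.seed(1)
samples <- list(r1 = data.frame(values = rnorm(30)),
                r2 = data.frame(values = rexp(30)),
                r3 = data.frame(values = runif(30)))

# One plot per sample
plots <- lapply(samples, plot_histogram)

# Save one PNG per sample (to a temporary directory in this sketch)
for (name in names(plots)) {
  ggsave(file.path(tempdir(), paste0(name, ".png")), plot = plots[[name]],
         width = 5, height = 4)
}
```

Passing the plot explicitly to `ggsave()` avoids relying on whichever plot happened to be displayed last.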

In order to better visualise the distribution of our data, we will overlay a normal density curve, with the same mean and standard deviation as our sample, on our histograms. When plotted together like this, it is easy to get a general idea of whether our data are normally distributed or not.

```
# Continuing from previous session...
# Overlay the histogram with a normal density curve
# (after_stat() needs ggplot2 >= 3.3.0; older versions use y = ..density..)
p11 <- ggplot(r1, aes(x = values)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.25,
                 colour = "black", fill = "white") +
  stat_function(fun = dnorm, lwd = 2, col = 'red',
                args = list(mean = mean(r1$values), sd = sd(r1$values)))
p11
ggsave('r11.png', plot = p11)
# Repeat for r2 and r3
```
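The red curve above is a theoretical normal curve. It can also help to draw a kernel density estimate of the data themselves with `geom_density()` and compare the two. The sketch below assumes ggplot2 3.3.0 or later (for `after_stat()`) and uses simulated values as a stand-in for `r1`.

```r
library(ggplot2)

set.seed(1)
d <- data.frame(values = rnorm(30))   # simulated stand-in for r1

p_density <- ggplot(d, aes(x = values)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.25,
                 colour = "black", fill = "white") +
  # Kernel density estimate of the observed data
  geom_density(colour = "blue") +
  # Normal curve with the sample's mean and standard deviation
  stat_function(fun = dnorm, colour = "red",
                args = list(mean = mean(d$values), sd = sd(d$values)))
p_density
```

When the blue and red curves roughly coincide, the sample looks approximately normal; a blue curve that leans to one side suggests skew.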

### Q-Q graphs

The other type of graph that is useful when investigating whether our data are normally distributed is the Q-Q graph. The quantile-quantile graph plots the quantiles of our data against the corresponding quantiles of a particular theoretical distribution. In our case, we will be using the normal distribution.

**A note on `stat_qq_line()`.**

*ggplot2 has added functionality to its Q-Q geometry: `stat_qq_line()` (available from ggplot2 3.0.0) adds the diagonal line to Q-Q graphs. This functionality was not yet available in the released version of ggplot2 when this post was written, so the diagonal line is built by hand in the code that follows.*

The values in our data are ranked and sorted, and each value is then compared to the expected value that the score should have in a normal distribution. If our data are normally distributed, the values in our data should have approximately the same values as those from a normal distribution, which would result in a straight diagonal line.
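The ranking-and-comparison step described above can be sketched in base R. The simulated vector `x` is a stand-in for one of our samples, and the variable names are only illustrative; `ppoints()` gives the plotting positions that base R's `qqnorm()` also uses.

```r
set.seed(1)
x <- rnorm(30)                 # simulated stand-in for one sample

n           <- length(x)
probs       <- ppoints(n)      # one cumulative probability per ranked value
theoretical <- qnorm(probs)    # expected quantiles under a standard normal
observed    <- sort(x)         # our values, ranked from smallest to largest

# Plotting one against the other gives the Q-Q graph
plot(theoretical, observed,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
```

If the sample comes from a normal distribution, the points fall close to a straight line whose slope and intercept reflect the sample's standard deviation and mean.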

We can generate a simple Q-Q plot with the following code:

```
p111_1 <- ggplot(r1, aes(sample = values)) + stat_qq()
p111_1
ggsave('r111_1.png')
```

As you can see from the resulting figure, it can be a little difficult to judge whether the points follow the expected diagonal line when that line is not drawn. Therefore, we can use the following R function to add the diagonal line on our Q-Q plot.

```
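# Q-Q plot with a reference line through the first and third quartiles
qqplot.data <- function(vec) {  # argument: vector of numbers
  # Sample quartiles (missing values removed) and matching normal quartiles
  y <- quantile(vec[!is.na(vec)], c(0.25, 0.75))
  x <- qnorm(c(0.25, 0.75))
  # Slope and intercept of the line through the two quartile points
  slope <- diff(y) / diff(x)
  int <- y[1L] - slope * x[1L]
  d <- data.frame(resids = vec)
  ggplot(d, aes(sample = resids)) +
    stat_qq() +
    geom_abline(slope = slope, intercept = int)
}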
```

By running this code, we now have access to our new function, `qqplot.data`. This makes it easy to generate Q-Q plots with the corresponding diagonal line.

```
p111 <- qqplot.data(r1$values)
p111
ggsave('r111.png')
# Repeat for r2 and r3
```
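With a ggplot2 version that includes `stat_qq_line()` (3.0.0 or later, which is an assumption about your installation), a similar figure can be produced without a custom function. The sketch below uses simulated values as a stand-in for `r1`.

```r
library(ggplot2)

set.seed(1)
d <- data.frame(values = rnorm(30))   # simulated stand-in for r1

# stat_qq() draws the points; stat_qq_line() adds the reference line,
# which by default passes through the first and third quartiles
p_qq <- ggplot(d, aes(sample = values)) +
  stat_qq() +
  stat_qq_line()
p_qq
```

Because the default line is also drawn through the quartiles, it should closely match the one produced by `qqplot.data`.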

As we can see, data from `r1` stay close to the ideal diagonal line, indicating they are most likely normally distributed. The story is not as clear for `r2` and `r3`. Data from these two samples do not stay as close to the ideal diagonal line, providing some evidence that our data might be skewed.

### Summary

In this post we have learned how to visually inspect our data to see if they are normally distributed. We generated histograms with density plots, as well as Q-Q plots and their corresponding diagonal lines. While this provides a good indication of whether our three data sets are normally distributed, visual inspection can lead to varied interpretations. In our next post, we will learn how to characterise, numerically, the distribution of our data.