Verify if data are normally distributed in R: part 1

Many statistical tests assume that the sampling distribution is normally distributed. This does not mean that the data we collected for our experiment are normally distributed, but rather that the distribution of mean values from many samples of the same size would be normally distributed. Unfortunately, we do not have access to the sampling distribution. However, if our sample is approximately normally distributed, it is reasonable to assume that the population, and therefore the sampling distribution, is too. (The central limit theorem adds that, with large enough samples, the sampling distribution of the mean approaches normality even when the data themselves are not normal.)
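To make the idea of a sampling distribution concrete, the following sketch simulates one in base R. The exponential population, the sample size of 30 and the 5000 repetitions are illustrative assumptions and are not part of the data used in this post.

```r
# Simulate the sampling distribution of the mean (illustrative sketch).
# Assumption: the population is exponential with rate 1 (mean 1, sd 1),
# chosen only because it is clearly not normal.
set.seed(42)

n_obs     <- 30    # observations per sample, as in our data sets
n_samples <- 5000  # number of repeated samples (arbitrary choice)

# Draw many samples and keep the mean of each one
sample_means <- replicate(n_samples, mean(rexp(n_obs, rate = 1)))

# The means cluster around the population mean (1), and their spread
# is close to the theoretical standard error, 1 / sqrt(30)
mean(sample_means)
sd(sample_means)

# A histogram of the means is more symmetric than the population itself
hist(sample_means, breaks = 40, main = 'Sampling distribution of the mean')
```

In a real experiment we only ever observe one such sample, which is why we fall back on inspecting the sample itself.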

There are a few ways to assess whether our data are normally distributed, the first of which is to visualize them.

Three different samples

To illustrate how to verify whether data (and the underlying sampling distribution) are normally distributed, we will use three simulated data sets, which can be downloaded here (r1.txt, r2.txt, r3.txt). Each sample contains 30 observations from a different population. The samples are plotted below.

Figure1

Frequency distributions and density plots

The first graphs that we will learn to make in R are frequency distributions and density plots. These graphs are useful because they allow us to see the general shape of the distribution.

# Import ggplot2
library(ggplot2)
# Set the working directory to the location of the data (change as needed)
setwd('~/Desktop')
# Read the data (the files have no header row)
r1 <- read.csv('r1.txt', header = FALSE)
r2 <- read.csv('r2.txt', header = FALSE)
r3 <- read.csv('r3.txt', header = FALSE)
# Rename the columns
names(r1) <- 'values'
names(r2) <- 'values'
names(r3) <- 'values'
# Generate and save the histogram
p1 <- ggplot(r1, aes(x = values)) +
  geom_histogram(binwidth = 0.25, colour = "black", fill = "white")
p1
ggsave('r1.png', plot = p1)
# Repeat for r2 and r3
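
Rather than copying the code for each sample, the repetition can be wrapped in a small function. This is only a sketch: save_histogram is a hypothetical helper that does not appear in the original code, and it assumes each file holds a single column of numbers with no header.

```r
library(ggplot2)

# Hypothetical helper (not part of the original post): read one data file,
# draw its histogram and save it as a PNG next to the data.
save_histogram <- function(name, dir = '.') {
  df <- read.csv(file.path(dir, paste0(name, '.txt')),
                 header = FALSE, col.names = 'values')
  p <- ggplot(df, aes(x = values)) +
    geom_histogram(binwidth = 0.25, colour = 'black', fill = 'white')
  ggsave(file.path(dir, paste0(name, '.png')), plot = p)
  invisible(p)
}

# Process whichever of the three files are present in the working directory
for (name in c('r1', 'r2', 'r3')) {
  if (file.exists(paste0(name, '.txt'))) save_histogram(name)
}
```

The function returns the plot invisibly, so it can still be assigned and printed, for example p2 <- save_histogram('r2').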

Histogram of r1

r1

 

Histogram of r2

r2

Histogram of r3

r3

In order to better visualise the distribution of our data, we will overlay each histogram with the density curve of a normal distribution that has the same mean and standard deviation as our sample. When plotted together like this, it is easy to get a general idea of whether our data are normally distributed or not.

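# Continuing from previous session...

# Overlay the histogram (on the density scale) with a normal curve
# that has the same mean and standard deviation as the sample.
# Note: ggplot2 >= 3.4.0 deprecates ..density.. in favour of after_stat(density)
p11 <- ggplot(r1, aes(x = values)) +
  geom_histogram(aes(y = ..density..), binwidth = 0.25, colour = "black", fill = "white") +
  stat_function(fun = dnorm, lwd = 2, colour = 'red',
                args = list(mean = mean(r1$values), sd = sd(r1$values)))
p11
ggsave('r11.png', plot = p11)
# Repeat for r2 and r3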
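
The histogram above is drawn on the density scale because the normal curve has an area of one, so the bars must too for the overlay to line up. The base R sketch below checks this on simulated stand-in values (an assumption made so the example runs on its own).

```r
# Why the histogram must be on the density scale before overlaying dnorm().
# Assumption: 'values' is stand-in data simulated here for illustration;
# substitute r1$values to check the real sample.
set.seed(1)
values <- rnorm(30)

# Compute (but do not draw) a histogram with bins of width 0.25
breaks <- seq(floor(min(values)), ceiling(max(values)), by = 0.25)
h <- hist(values, breaks = breaks, plot = FALSE)

# On the count scale the bar heights add up to the sample size...
sum(h$counts)

# ...whereas on the density scale the bar areas add up to 1, just like
# the area under the normal curve returned by dnorm()
sum(h$density * diff(h$breaks))
```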


Histogram and density plot of r1

r11

Histogram and density plot of r2

r22

Histogram and density plot of r3

r33

Q-Q graphs

The other type of graph that is useful when investigating whether our data are normally distributed is the Q-Q graph. The quantile-quantile graph plots the quantiles of our data against the corresponding quantiles of a particular theoretical distribution. In our case, we will be using the normal distribution.

stat_qq_line().

ggplot2 has recently added functionality to its Q-Q geometry that makes it possible to add the diagonal line to Q-Q graphs. However, at the time of writing, this functionality is not yet available in the released version of ggplot2.

You should be able to use some of these new features soon.
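
In ggplot2 3.0.0 and later, stat_qq_line() is part of the released package. Assuming such a version is installed, a Q-Q plot with its reference line can be sketched as follows. The data frame here is simulated stand-in data, not one of the samples from this post.

```r
library(ggplot2)

# Assumption: ggplot2 >= 3.0.0, where stat_qq_line() is available.
# Stand-in data for illustration; replace 'd' with r1, r2 or r3.
set.seed(1)
d <- data.frame(values = rnorm(30))

# stat_qq() draws the points; stat_qq_line() adds a reference line that,
# by default, passes through the first and third quartiles
p_qq <- ggplot(d, aes(sample = values)) +
  stat_qq() +
  stat_qq_line()
p_qq
```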

The values in our data are sorted, and each value is then compared to the value expected at that rank in a normal distribution. If our data are normally distributed, the observed and expected values should approximately agree, which results in points that fall along a straight diagonal line.
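
The steps just described can be reproduced by hand in base R. This is a sketch on simulated stand-in values (an assumption), using ppoints(), the plotting-position convention that base R's qqnorm() relies on.

```r
# Manual construction of the points in a normal Q-Q plot (sketch).
# Assumption: 'values' is simulated stand-in data; use r1$values instead
# to reproduce the plot for the first sample.
set.seed(1)
values <- rnorm(30)

# 1. Sort the observed values (the sample quantiles)
sample_q <- sort(values)

# 2. Assign each rank a cumulative probability; ppoints() is the
#    convention used by base R's qqnorm()
probs <- ppoints(length(values))

# 3. Convert those probabilities to expected standard normal values
theoretical_q <- qnorm(probs)

# Plotting one against the other gives the Q-Q plot
plot(theoretical_q, sample_q,
     xlab = 'Theoretical quantiles', ylab = 'Sample quantiles')

# The result matches what qqnorm() computes internally
qq <- qqnorm(values, plot.it = FALSE)
all.equal(sort(qq$x), theoretical_q)
```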

We can generate a simple Q-Q plot with the following code:

p111_1 <- ggplot(r1, aes(sample = values)) + stat_qq()
p111_1
ggsave('r111_1.png', plot = p111_1)

Q-Q graph of r1

r111_1

As you can see from the resulting figure, it can be difficult to judge how closely the points follow the expected diagonal when no line is drawn. Therefore, we can use the following R function to add the diagonal line to our Q-Q plot.

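qqplot.data <- function(vec) { # argument: numeric vector
  # First and third quartiles of the data (ignoring missing values)
  y <- quantile(vec[!is.na(vec)], c(0.25, 0.75))
  # Corresponding quartiles of the standard normal distribution
  x <- qnorm(c(0.25, 0.75))
  # Slope and intercept of the line through the two quartile points
  slope <- diff(y) / diff(x)
  int <- y[1L] - slope * x[1L]
  d <- data.frame(resids = vec)
  ggplot(d, aes(sample = resids)) +
    stat_qq() +
    geom_abline(slope = slope, intercept = int)
}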

By running this code, we now have access to our new function, qqplot.data. This makes it easy to generate Q-Q plots with the corresponding diagonal line.

p111 <- qqplot.data(r1$values)
p111
ggsave('r111.png')
# Repeat for r2 and r3
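
The diagonal drawn by qqplot.data passes through the first and third quartiles of the data. The base R sketch below repeats that calculation on stand-in values constructed to follow a normal distribution with mean 10 and standard deviation 2 (an assumption for illustration), to show what the slope and intercept represent.

```r
# How the reference line in qqplot.data() is obtained (base R sketch).
# Assumption: stand-in data whose quantiles follow a normal distribution
# with mean 10 and sd 2, for illustration only.
vec <- qnorm(ppoints(101), mean = 10, sd = 2)

# First and third quartiles of the sample and of the standard normal
y <- quantile(vec, c(0.25, 0.75))
x <- qnorm(c(0.25, 0.75))

# Line through the two quartile points
slope <- unname(diff(y) / diff(x))
int   <- unname(y[1L] - slope * x[1L])

# For normal data the slope approximates the standard deviation and the
# intercept approximates the mean
slope
int
```

Because the line is anchored at the quartiles rather than fitted to all points, a few extreme values in the tails do not pull it away from the bulk of the data.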

Q-Q graph with diagonal for r1

r111

Q-Q graph with diagonal for r2

r222

Q-Q graph with diagonal for r3

r333

As we can see, the data from r1 stay close to the ideal diagonal line, indicating they are most likely normally distributed. The story is not as clear for r2 and r3. Data from these two samples do not stay as close to the ideal diagonal line, providing some evidence that they might be skewed.

Summary

In this post we have learned how to visually inspect our data to see if they are normally distributed. We generated histograms with overlaid density curves, as well as Q-Q plots and their corresponding diagonal lines. While this provides a good indication of whether our three data sets are normally distributed, visual inspection can lead to varied interpretations. In our next post, we will learn how to characterise, numerically, the distribution of our data.

 
