R: Visualize and calculate a Pearson correlation

Scientists are often interested in understanding the relationship between two variables. One simple way to understand and quantify a relationship between two variables is correlation analysis.

Assumptions. This post assumes you understand the theory behind correlation analysis and have a working knowledge of R; it focuses on how to run this type of analysis in R.

The dataset: foot length and subject height

In this tutorial we will calculate the correlation between the length of a person’s foot and that person’s height. The dataset we will use contains the length of the left footprint (column 1) and the height (column 2) of 1020 adult male Tamil Indians. Right-click on the link and select Save Link As.... Save the file as indian_foot_height.dat in the working directory of your R session.

Visualizing the relationship

Before running the correlation analysis, the first thing we need to do is visualize the data. To do so, we need to install the ggplot2 package in R (if it is not already installed), load it, and then import the data into our workspace.

# Comment out the next line if ggplot2 is already installed
install.packages("ggplot2")
# Load the ggplot2 library
library(ggplot2)
# Import data: assumes .dat file is located in current working directory
foot_height <- read.table("indian_foot_height.dat", header = FALSE)
# Give columns relevant names
names(foot_height) <- c("foot","height")
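
Before plotting, it is worth confirming that the import behaved as expected. The sketch below is not part of the original analysis: it writes three made-up rows to a temporary file (the values are purely illustrative, and the whitespace-separated layout is an assumption about the real file), reads them back in the same way as above, and runs a few basic checks that you could equally run on foot_height.

```r
# Hypothetical stand-in for the real data file: three made-up rows,
# whitespace-separated, no header (an assumption about the file layout)
demo_file <- tempfile(fileext = ".dat")
writeLines(c("25.1 170.2", "26.4 175.8", "24.3 165.0"), demo_file)

# Read the file and name the columns exactly as in the tutorial
demo <- read.table(demo_file, header = FALSE)
names(demo) <- c("foot", "height")

# Basic checks: dimensions, column types, summary statistics, missing values
dim(demo)
str(demo)
summary(demo)
anyNA(demo)
```

On the real dataset, dim(foot_height) should report 1020 rows and 2 columns.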

Great, we are now ready to plot the data. We will use ggplot2 to plot an x-y scatter plot.
If you are not familiar with ggplot2, here is what the code below does. We first create a plot object, scatter_plot, specifying the data (the foot_height dataframe) and the aesthetics (foot on the x-axis and height on the y-axis). We then add a point geometry (+ geom_point()) and axis labels (+ labs()) to the plot object.

scatter_plot <- ggplot(foot_height, aes(foot, height))
scatter_plot + geom_point() + labs(x = "foot length (cm)", y = "height (cm)") 

This produces the following plot:

[Figure: scatter plot of height (cm) against foot length (cm)]

People with shorter feet tend to be shorter, whereas those with longer feet tend to be taller. Or is it the other way round: people who are shorter have shorter feet, whereas those who are taller have longer feet? Remember, correlations tell us nothing about causal relationships between variables. Importantly, there are no obviously unusual data points (e.g., outliers) and the relationship appears roughly linear (e.g., not u-shaped or exponential). This means it is appropriate to go ahead and quantify the linear relationship between foot length and subject height.

Let’s plot the line of best fit (i.e., the line that minimizes the sum of squared vertical distances between the line and each point). We will do this by adding geom_smooth() to our ggplot2 figure.
We will use the lm method (linear model) to fit and plot the line of best fit. With this method, geom_smooth() also plots the 95% confidence interval of the line by default.

scatter_plot <- ggplot(foot_height, aes(foot, height))
scatter_plot + geom_point() + labs(x = "foot length (cm)", y = "height (cm)") + geom_smooth(method="lm")

[Figure: the same scatter plot with the line of best fit and its 95% confidence interval]

The figure looks pretty good, and it is consistent with our initial hunch that subject height and foot length are correlated with one another.
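
If you want the numbers behind the fitted line, the same straight-line fit can be obtained directly with lm(), which is the function geom_smooth(method = "lm") relies on. The sketch below uses simulated data as a stand-in for foot_height (the values and the data frame name sim are made up for illustration), so the coefficients it prints are not those of the real dataset.

```r
# Simulated stand-in for foot_height (made-up values, for illustration only)
set.seed(1)
sim <- data.frame(foot = rnorm(200, mean = 25, sd = 1.2))
sim$height <- 100 + 2.7 * sim$foot + rnorm(200, sd = 4.5)

# Fit height as a linear function of foot length
fit <- lm(height ~ foot, data = sim)

# Intercept and slope of the line of best fit
coef(fit)

# Fitted height and its 95% confidence interval at the mean foot length
at_mean <- data.frame(foot = mean(sim$foot))
predict(fit, newdata = at_mean, interval = "confidence", level = 0.95)
```

To run this on the real data, replace sim with foot_height.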

Calculating Pearson’s correlation

Because foot length and subject height are both continuous variables, we will use Pearson’s product-moment correlation to quantify the strength of the relationship between these two variables. There are a few ways to do this in R, but we will only consider one method here.

# Calculating Pearson's product-moment correlation
cor.test(foot_height$foot, foot_height$height, method = "pearson", conf.level = 0.95)

We used the cor.test() function and provided it with the foot and height variables from our foot_height dataframe. We specified that we wanted to use the Pearson method (cor.test() also supports the Kendall and Spearman methods) and we specified the confidence level (i.e., 0.95). Both of these values are the defaults, but stating them makes the analysis explicit. Below is the output generated by R when you run the above command:

 Pearson's product-moment correlation

data:  foot_height$foot and foot_height$height
t = 22.598, df = 1018, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5355953 0.6174523
sample estimates:
      cor 
0.5779759 

Starting from the bottom of the output, the correlation analysis resulted in r = 0.58 [95% CI: 0.54 to 0.62]. The output also tells us that the correlation was statistically significant (t = 22.598, df = 1018, p < 2.2e-16).
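
As a check on how these numbers fit together, the t-value and the confidence interval can be recomputed from the correlation coefficient and the sample size alone. The sketch below uses the standard formulas, t = r * sqrt(n - 2) / sqrt(1 - r^2) and a confidence interval built on Fisher's z-transformation; the variable names are chosen for illustration.

```r
# Values taken from the cor.test() output above
r <- 0.5779759   # sample correlation
n <- 1020        # number of subjects (df = n - 2 = 1018)

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% confidence interval via Fisher's z-transformation
z      <- atanh(r)          # transform r to the z scale
se     <- 1 / sqrt(n - 3)   # standard error on the z scale
z_crit <- qnorm(0.975)      # two-sided 95% critical value
ci     <- tanh(z + c(-1, 1) * z_crit * se)  # back-transform to the r scale

round(t_stat, 3)
round(ci, 4)
```

The results should agree with the t-value and confidence interval reported by cor.test(), up to rounding.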

Summary

This post was an introduction to performing correlation analysis in R. To consolidate your learning, you may want to read the documentation associated with the cor.test() function (type help("cor.test") or ?cor.test in the R console) and try changing some of the parameters. You can also create two new variables and see how changing some of the values changes the output of the correlation analysis.
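
One way to follow up on that suggestion is sketched below. It builds two made-up variables (x and y are hypothetical and unrelated to the foot dataset), stores the cor.test() result so that individual components can be extracted, then alters a single value to show how sensitive Pearson's r can be to an extreme point. It also reruns the test with method = "spearman", a rank-based alternative that is typically less affected by a single extreme value.

```r
# Two made-up variables with a positive linear relationship
set.seed(42)
x <- rnorm(50, mean = 10, sd = 2)
y <- 3 + 0.8 * x + rnorm(50, sd = 1)

# Store the result so that its components can be extracted
res <- cor.test(x, y, method = "pearson", conf.level = 0.95)
res$estimate   # correlation coefficient
res$conf.int   # 95% confidence interval
res$p.value    # p-value

# Replace one value with an extreme outlier and rerun the analysis
y_out <- y
y_out[which.max(x)] <- -50
res_out <- cor.test(x, y_out, method = "pearson")
res_out$estimate

# Rank-based Spearman correlation on the same altered data
res_rank <- cor.test(x, y_out, method = "spearman")
res_rank$estimate
```

Comparing res$estimate with res_out$estimate shows how much a single altered value can move the correlation coefficient.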
