R: Visualize and calculate a Pearson correlation
Scientists are often interested in understanding the relationship between two variables. One simple way to understand and quantify a relationship between two variables is correlation analysis.
Assumptions. This post assumes you understand the theory behind correlation analysis and have a working knowledge of R; it focuses on how to run this type of analysis in R.
The dataset: foot length and subject height
In this tutorial we will calculate the correlation between the length of a person’s foot and that person’s height. The dataset we will use contains the length of the left footprint (column 1) and the height (column 2) of 1020 adult male Tamil Indians. Right-click on the link and select Save Link As.... Save the file as india_foot_height.dat in the working directory of your R session.
Visualizing the relationship
Before running the correlation analysis, we should visualize the data. To do so, we install the ggplot2 package (if it is not already installed), load it, and then read the data into our workspace.
    # Comment out the next line if ggplot2 is already installed
    install.packages("ggplot2")

    # Load the ggplot2 library
    library(ggplot2)

    # Import data: assumes .dat file is located in current working directory
    foot_height <- read.table("india_foot_height.dat", header = FALSE)

    # Give columns relevant names
    names(foot_height) <- c("foot", "height")
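If the import fails, the usual cause is that the file is not where R expects it. The sketch below wraps the import in a small helper that checks for the file first. The function name load_foot_height is hypothetical, and the demonstration uses a temporary file with made-up values rather than the real dataset.

```r
# Hypothetical helper (not part of the original post): read a two-column
# .dat file and name its columns, failing early with a readable message.
load_foot_height <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path, " (working directory: ", getwd(), ")")
  }
  dat <- read.table(path, header = FALSE)
  stopifnot(ncol(dat) == 2)            # expect foot length and height only
  names(dat) <- c("foot", "height")
  dat
}

# Demonstration on a temporary file containing made-up values
tmp <- tempfile(fileext = ".dat")
writeLines(c("25.1 168.2", "26.3 172.5", "24.0 161.9"), tmp)
demo <- load_foot_height(tmp)
str(demo)                              # structure: 3 rows, 2 numeric columns
```

With the real file in place, the call would be load_foot_height("india_foot_height.dat").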
Great, we are now ready to plot the data. We will use ggplot2 to plot an x-y scatter plot.
If you are not familiar with ggplot2, here is what the code does. We first create a plot object, scatter_plot, and specify the aesthetics for our plot: the foot and height data contained in the foot_height dataframe. We then add the point geometry (+ geom_point()) and the axis labels (+ labs()) to our plot object.
    scatter_plot <- ggplot(foot_height, aes(foot, height))
    scatter_plot +
      geom_point() +
      labs(x = "foot length (cm)", y = "height (cm)")
This produces the following plot:
People with shorter feet seem to be shorter, whereas those with longer feet appear to be taller. Or is it the other way round, with shorter people having shorter feet and taller people having longer feet? Remember, correlations tell us nothing about causal relationships between variables. Importantly, there are no unusual data points (e.g., outliers) and the relationship appears roughly linear (e.g., not u-shaped or exponential). This means it is appropriate for us to go ahead and quantify the linear relationship between foot length and subject height.
Let’s plot the line of best fit (i.e., the line that minimizes the sum of squared vertical distances between the line and the data points). We will do this by adding geom_smooth() to our ggplot2 figure. We will use the lm method (linear model) to plot the best-fit line. By default, geom_smooth() also plots the 95% CI of the best-fit line.
    scatter_plot <- ggplot(foot_height, aes(foot, height))
    scatter_plot +
      geom_point() +
      labs(x = "foot length (cm)", y = "height (cm)") +
      geom_smooth(method = "lm")
The figure looks pretty good, and it confirms our initial hunch that subject height and foot length are correlated with one another.
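The line drawn by geom_smooth(method = "lm") is an ordinary least-squares fit, so its intercept and slope can also be obtained with lm(). The sketch below is only an illustration: the data frame sim is simulated with arbitrary values and stands in for foot_height, so the numbers it prints say nothing about the real dataset.

```r
set.seed(1)

# Simulated stand-in data (hypothetical values, not the real dataset)
sim <- data.frame(foot = rnorm(200, mean = 25, sd = 1.2))
sim$height <- 100 + 2.7 * sim$foot + rnorm(200, sd = 4.5)

# Least-squares fit of height on foot
fit <- lm(height ~ foot, data = sim)

# The same coefficients from the closed-form least-squares solution
slope_by_hand <- cov(sim$foot, sim$height) / var(sim$foot)
intercept_by_hand <- mean(sim$height) - slope_by_hand * mean(sim$foot)

print(coef(fit))
print(c(intercept = intercept_by_hand, slope = slope_by_hand))
```

Running lm(height ~ foot, data = foot_height) on the real data should give the coefficients of the line shown in the figure.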
Calculating Pearson’s correlation
Because foot length and subject height are both continuous variables, we will use Pearson’s product-moment correlation to quantify the strength of the relationship between these two variables. There are a few ways to do this in R, but we will only consider one method here.
    # Calculating Pearson's product-moment correlation
    cor.test(foot_height$foot, foot_height$height,
             method = "pearson", conf.level = 0.95)
We used the cor.test() function and provided it with the foot and height variables from our foot_height dataframe. We specified that we wanted to use the Pearson method (other types of correlation analysis are available) and we specified the level of confidence (i.e., 0.95). Below is the output generated by R when you run the above command:
            Pearson's product-moment correlation

    data:  foot_height$foot and foot_height$height
    t = 22.598, df = 1018, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.5355953 0.6174523
    sample estimates:
          cor
    0.5779759
Starting from the bottom, the correlation analysis resulted in r = 0.58 [95% CI: 0.54 to 0.62]. The output also tells us that the correlation was statistically significant (p < 0.001); the t-value, degrees of freedom and p-value are all provided.
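To see where these numbers come from, the t-value and confidence interval can be rebuilt from the reported correlation and sample size alone. The sketch below takes r and n from the output above (df = 1018 implies n = 1020); it uses the standard t statistic for a correlation and the Fisher z-transformation for the interval, which is the approach cor.test() takes for the Pearson method.

```r
r <- 0.5779759   # sample correlation reported by cor.test()
n <- 1020        # number of subjects (df + 2)
df <- n - 2

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% CI via Fisher's z-transformation: transform, build the interval
# on the z scale, then transform back to the correlation scale
z <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(t_stat)
print(p_value)
print(ci)
```

The results should agree with the cor.test() output up to rounding.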
This post was an introduction to performing correlation analysis in R. To consolidate your learning, you may want to read the documentation associated with the cor.test() function (type help("cor.test") or ?cor.test in the R console) and try changing some of the parameters. You can also create two new variables and see how changing some of the values changes the output of the correlation analysis.
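As a starting point for that kind of experimentation, here is a minimal sketch using simulated variables. The variable names, the seed, the sample size and the injected extreme value are all arbitrary choices made for illustration.

```r
set.seed(42)

# Two simulated variables with a moderate positive linear relationship
n <- 100
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# Baseline analysis with a 95% confidence interval
res <- cor.test(x, y, method = "pearson", conf.level = 0.95)

# Same data, wider confidence level: the estimate is unchanged,
# but the interval becomes wider
res99 <- cor.test(x, y, method = "pearson", conf.level = 0.99)

# Replace a single value with an extreme one and rerun the analysis
y_out <- y
y_out[1] <- 50
res_out <- cor.test(x, y_out, method = "pearson", conf.level = 0.95)

print(res$estimate)
print(res$conf.int)
print(res99$conf.int)
print(res_out$estimate)
```

Comparing res$estimate with res_out$estimate shows how much a single unusual data point can move the correlation, which is why the scatter plot is worth checking before running the analysis.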