Statistics you are interested in: simple linear regression – part 1

We introduced simple linear regression in a previous series and learned how to perform it in R (1, 2). What is the theory behind simple linear regression? How is it used to understand relationships between variables? What is another way to perform it in Python?

The hsb2.csv dataset (available here) contains demographic data and academic test scores from 200 students. We import the data using Python’s pandas package and print the first few rows to view the variables:

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pd.read_csv('hsb2.csv')
print(df.head())

Out[1]:
    id  female  race  ses  schtyp  prog  read  write  math  science  socst
0   70       0     4    1       1     1    57     52    41       47     57
1  121       1     4    2       1     3    68     59    53       63     61
2   86       0     4    3       1     1    44     33    54       58     31
3  141       0     4    3       1     3    63     44    47       53     56
4  172       0     4    2       1     2    47     52    57       53     61

We are interested in the relationship between reading and science scores. First, we visualise the individual subject data in a scatterplot:

 

Each data point represents science and reading scores from a single subject. Since one data point provides no information about another data point (i.e. the data are not correlated), we say that the data are statistically independent.

Simple linear regression can be used to determine how the average value of a continuous outcome y varies with the value of a single explanatory variable x (also called a predictor). Here, the outcome is science scores and the explanatory variable is reading scores. Using ordinary least squares estimation, we draw a line of best fit through the data such that the sum of the squared vertical distances between the data points and the line is minimised. The slope of the line shows how much the average science score changes for a 1-unit increase in reading score.
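To make the least squares idea concrete, here is a minimal sketch in plain Python that uses only the five rows printed above, not the full dataset, so the estimates are purely illustrative. The helper names ols_fit and sse are made up for this example. The slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means.

```python
# First five rows of hsb2.csv as printed above (illustration only)
read = [57, 68, 44, 63, 47]
science = [47, 63, 58, 53, 53]


def ols_fit(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sse(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


intercept, slope = ols_fit(read, science)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(read, science)]

print('intercept:', intercept)
print('slope:', slope)
print('sum of squared residuals:', sse(read, science, intercept, slope))
```

Moving the slope or intercept away from these values in either direction gives a larger sum of squared residuals, which is what "line of best fit" means here.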

Using the Statsmodels package, we perform ordinary least squares (OLS) regression of science on reading scores and plot the line of best fit:

# Fit an OLS regression of science on reading scores
md = smf.ols('science ~ read', data=df)
md_fit = md.fit()

# Look up the estimated coefficients by name
intercept = md_fit.params['Intercept']
slope = md_fit.params['read']

# Scatterplot of the data with the fitted line overlaid
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df['read'], df['science'])
ax.set_xlabel('Standardized reading score')
ax.set_ylabel('Standardized science score')
xstart, xstop = df['read'].min(), df['read'].max()
ystart, ystop = xstart * slope + intercept, xstop * slope + intercept
ax.plot([xstart, xstop], [ystart, ystop], 'r-')
fig.tight_layout()
fig.savefig('regress.png', dpi=300)

 

 

The regression line captures the overall trend that science scores increase as reading scores increase. However, it is clear that for any given reading score there is a lot of variability in science scores. This variability could be due to measurement error, variability between students, and other unmeasured variables that determine science scores.

The model fitted by this estimation procedure is:

Science score = Intercept + (Slope × Reading score) + error

The statistical assumptions of this regression model are concerned with how the errors of the outcome (i.e. the last term in the equation) are distributed. Specifically, the model assumes the errors are statistically independent, Normally distributed, and have constant variance at every reading score. Moderate violations of the assumptions of Normality and constant variance generally have little effect on the results, particularly in larger samples. Violation of the assumption of independence is more serious, because it can make the standard errors, and therefore the confidence intervals and p-values, incorrect.
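The errors themselves are not observed, but the residuals (observed minus fitted values) estimate them. The sketch below is a rough illustration on simulated data, not the hsb2 scores: the true intercept (20), slope (0.6) and error standard deviation (8) are arbitrary assumptions chosen for the example. It compares the spread of the residuals in the lower and upper halves of the predictor as a crude check of constant variance. In practice a plot of residuals against fitted values is the more usual tool.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulated data are reproducible

# Simulated data (assumed values, not hsb2): y = 20 + 0.6 * x + Normal(0, 8) error
x = [random.uniform(30, 75) for _ in range(200)]
y = [20 + 0.6 * xi + random.gauss(0, 8) for xi in x]

# Closed-form ordinary least squares estimates
mean_x = statistics.mean(x)
mean_y = statistics.mean(y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals: observed outcome minus the value predicted by the fitted line
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Crude constant-variance check: residual spread below vs above the median of x
median_x = statistics.median(x)
low = [r for xi, r in zip(x, residuals) if xi <= median_x]
high = [r for xi, r in zip(x, residuals) if xi > median_x]

print('residual SD, lower half of x:', statistics.stdev(low))
print('residual SD, upper half of x:', statistics.stdev(high))
```

Because the simulated errors have the same standard deviation everywhere, the two printed values should be similar. A large difference between them would suggest that the variance changes with the predictor.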

In contrast, no assumptions about distribution of the predictor are made. It is possible to include predictors that are binary, categorical or discrete (e.g. counts) in such models.
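As a small illustration of a binary predictor, the sketch below regresses science on female using only the five rows printed earlier, so the numbers say nothing reliable about the full dataset. With a predictor coded 0/1, the least squares intercept equals the mean outcome in the group coded 0, and the slope equals the difference in means between the two groups. The helper name ols_fit is made up for this example.

```python
# First five rows of hsb2.csv as printed earlier (illustration only)
female = [0, 1, 0, 0, 0]
science = [47, 63, 58, 53, 53]


def ols_fit(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = ols_fit(female, science)

# Group means computed directly, for comparison with the regression estimates
group0 = [s for f, s in zip(female, science) if f == 0]
group1 = [s for f, s in zip(female, science) if f == 1]
mean0 = sum(group0) / len(group0)
mean1 = sum(group1) / len(group1)

print('intercept:', intercept, ' mean when female == 0:', mean0)
print('slope:', slope, ' difference in group means:', mean1 - mean0)
```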

In the next post, we will print the output from the OLS regression and learn how to interpret it.

Summary

We performed simple linear regression of a continuous dependent outcome on a single continuous explanatory variable using ordinary least squares estimation in Python’s Statsmodels package.

We viewed a scatterplot of individual subject data and fitted a straight line of best fit through the data.

Reference

Vittinghoff E, Glidden DV, Shiboski SC, McCulloch CE (2010) Regression methods in biostatistics. Linear, logistic, survival and repeated measures models. Springer, New York.

 
