Statistics you are interested in: simple linear regression – part 2

In the previous post, we performed simple linear regression of science scores on reading scores from 200 students using ordinary least squares (OLS) estimation. This was done using Python’s Statsmodels package. What does the OLS output show and how should it be interpreted? Here is the figure of the individual subject data and the line of best fit, as well as the Python output from the OLS regression:

Figure 1: Scatter plot of the individual subjects' science scores against reading scores, with the OLS line of best fit.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Load the student data (science and reading scores)
df = pd.read_csv('hsb2.csv')

# Regress science scores on reading scores by OLS
md = smf.ols('science ~ read', data=df)
md_fit = md.fit()
print(md_fit.summary())

Out[1]:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                science   R-squared:                       0.397
Model:                            OLS   Adj. R-squared:                  0.394
Method:                 Least Squares   F-statistic:                     130.4
Date:                Tue, 15 May 2018   Prob (F-statistic):           1.57e-23
Time:                        21:00:51   Log-Likelihood:                -691.21
No. Observations:                 200   AIC:                             1386.
Df Residuals:                     198   BIC:                             1393.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     20.0670      2.836      7.076      0.000      14.474      25.660
read           0.6085      0.053     11.420      0.000       0.503       0.714
==============================================================================
Omnibus:                        1.705   Durbin-Watson:                   2.002
Prob(Omnibus):                  0.426   Jarque-Bera (JB):                1.359
Skew:                           0.179   Prob(JB):                        0.507
Kurtosis:                       3.186   Cond. No.                         277.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
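
The matplotlib import in the code above is used to draw Figure 1. A minimal sketch of how such a figure could be produced from the fitted model (column names as in hsb2.csv):

plt.scatter(df['read'], df['science'], alpha=0.6, label='Observed')       # individual subjects
plt.plot(df['read'], md_fit.fittedvalues, color='red', label='OLS fit')   # line of best fit
plt.xlabel('Reading score')
plt.ylabel('Science score')
plt.legend()
plt.show()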

Most statistical packages generate a similar output from regression models. Here, the output confirms that we performed ordinary least squares regression (the Model and Method rows) with science scores as the dependent variable (the Dep. Variable row). The estimation procedure shows that science and reading scores are related such that:

Science score = 20.067 + (0.6085 x Reading score) + error

These numbers are the coefficients of Intercept and read in the coefficient table, which also shows a few other statistics; the coefficient for read is the slope of the line of best fit. The estimated intercept and slope determine a fitted science score for every subject. It is clear that the observed scores of many subjects do not lie on the fitted line; the difference between each observed and fitted score is the error (also called the residual).
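
In Statsmodels, the fitted values and residuals can be inspected directly on the results object; a minimal sketch using md_fit from above:

# Estimated coefficients: fitted score = intercept + slope * reading score
b0, b1 = md_fit.params['Intercept'], md_fit.params['read']

fitted = md_fit.fittedvalues   # b0 + b1 * df['read'] for every subject
residuals = md_fit.resid       # df['science'] - fitted
print(residuals.head())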

The coefficients of Intercept and read are statistics that follow a distribution, and the software calculates the standard error of that distribution. When the errors are Normally distributed, the coefficients of Intercept and read have a Normal distribution, and the ratio of the slope coefficient to its standard error has a t-distribution with n - 2 degrees of freedom. The p-value of read therefore tests whether the slope is significantly different from zero, i.e. whether there is any systematic relationship between predictor and outcome. Here, the p-value for read is < 0.0001, which means the slope of the line is significantly different from zero. The 95% CI of the slope indicates how precisely the slope was estimated: if the study were repeated many times, about 95% of such intervals would contain the true slope. So we say that for a 1 unit increase in reading score, science scores increase by 0.61 on average (95% CI 0.50 to 0.71).
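
These quantities can also be pulled out of the results object rather than read off the printed summary:

print(md_fit.bse)         # standard errors of the coefficients
print(md_fit.tvalues)     # t = coefficient / standard error
print(md_fit.pvalues)     # two-sided p-values
print(md_fit.conf_int())  # 95% confidence intervals (alpha=0.05 by default)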

However, the slope of the line does not show to what extent variability in science scores is explained by reading scores. The amount of explained variability is quantified by the R-squared statistic (the R-squared row in the top panel of the output, also called the coefficient of determination). R-squared is interpreted as the proportion of the total variability of the outcome that is explained by the model. Here, R-squared is 0.397, which means about 40% of the variability in science scores is explained by reading scores.
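
As a check on this definition, R-squared can be computed by hand from the residual and total sums of squares and compared with the value Statsmodels reports (numpy is used here only for the sums):

import numpy as np

ss_res = np.sum(md_fit.resid ** 2)                            # residual sum of squares
ss_tot = np.sum((df['science'] - df['science'].mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)  # approximately 0.397
print(md_fit.rsquared)      # same value reported in the output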

Note, the adjusted R-squared is not the explained variance, because it is calculated using different degrees of freedom: it penalizes R-squared for the number of predictors in the model. Adjusted R-squared values are therefore used to compare regression models with different numbers of predictors.
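
A sketch of the usual adjustment, which accounts for the degrees of freedom (n - 1 for the total variability, n - p - 1 for the residuals):

n = int(md_fit.nobs)      # number of observations (200)
p = int(md_fit.df_model)  # number of predictors (1)
adj_r2 = 1 - (1 - md_fit.rsquared) * (n - 1) / (n - p - 1)
print(adj_r2)               # approximately 0.394
print(md_fit.rsquared_adj)  # same value reported in the output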

In the next post, we will perform covariate adjustment and learn how to write the output to a text file.

Summary

We viewed and learned to interpret the output from an ordinary least squares regression model. Specifically, the relationship between outcome and predictor is described by the intercept and slope of the model, and the R-squared value shows how much variability in the outcome is explained by the predictor. You are interested in the slope and its 95% CI, and in the R-squared.

