Statistics you are interested in: simple linear regression – part 2

In the previous post, we performed simple linear regression of science scores on reading scores from 200 students using ordinary least squares (OLS) estimation. This was done using Python’s Statsmodels package. What does the OLS output show and how should it be interpreted? Here is the figure of the individual subject data and the line of best fit, as well as the Python output from the OLS regression:
Figure 1: Scatter plot of each student's science score against reading score, with the OLS line of best fit.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Load the high-school-and-beyond data (200 students)
df = pd.read_csv('hsb2.csv')

# Specify the model: regress science scores on reading scores
md = smf.ols('science ~ read', df)

# Fit by ordinary least squares (note: reml is a mixed-model option and is not accepted by OLS.fit)
md_fit = md.fit()

print(md_fit.summary())
Out[1]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                science   R-squared:                       0.397
Model:                            OLS   Adj. R-squared:                  0.394
Method:                 Least Squares   F-statistic:                     130.4
Date:                Tue, 15 May 2018   Prob (F-statistic):           1.57e-23
Time:                        21:00:51   Log-Likelihood:                -691.21
No. Observations:                 200   AIC:                             1386.
Df Residuals:                     198   BIC:                             1393.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     20.0670      2.836      7.076      0.000      14.474      25.660
read           0.6085      0.053     11.420      0.000       0.503       0.714
==============================================================================
Omnibus:                        1.705   Durbin-Watson:                   2.002
Prob(Omnibus):                  0.426   Jarque-Bera (JB):                1.359
Skew:                           0.179   Prob(JB):                        0.507
Kurtosis:                       3.186   Cond. No.                         277.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
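As a quick check, the scatter plot and line of best fit in Figure 1 can be redrawn from the fitted model. This is a minimal sketch assuming the df and md_fit objects from the listing above; the plot styling is my own choice:

# Reproduce Figure 1: observed scores and the OLS line of best fit
plt.scatter(df['read'], df['science'], alpha=0.6, label='Observed scores')

# Build the fitted line from the estimated coefficients
intercept = md_fit.params['Intercept']   # 20.067
slope = md_fit.params['read']            # 0.6085
x_vals = df['read'].sort_values()
plt.plot(x_vals, intercept + slope * x_vals, color='red', label='Line of best fit')

plt.xlabel('Reading score')
plt.ylabel('Science score')
plt.legend()
plt.show()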
Most statistical packages generate a similar output from regression models. Here, the output confirms we have performed ordinary least squares regression (the Model and Method entries) with science scores as the dependent variable (the Dep. Variable entry). Performing this estimation procedure shows science and reading scores are related such that:
Science score = 20.067 + (0.6085 x Reading score) + error
These numbers are the coefficients of Intercept and read in the coefficient table; the coefficient for read is the slope of the line of best fit (the table also shows a few other statistics for each coefficient). The estimated intercept and slope determine a fitted science score for every subject. It is clear that the observed scores of many subjects do not lie on the fitted line; the difference between each observed and fitted score is the error (also called the residual).
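To make this concrete, the fitted scores and residuals can be pulled straight out of the fitted model. A short sketch, assuming md_fit and df from the listing above (the reading score of 50 is a made-up example value):

import numpy as np

# Fitted score and residual for each subject
fitted = md_fit.fittedvalues   # 20.067 + 0.6085 * read
resid = md_fit.resid           # observed - fitted

# Each observed science score decomposes exactly into fitted + residual
assert np.allclose(df['science'], fitted + resid)

# Fitted score for a hypothetical reading score of 50
print(md_fit.predict(pd.DataFrame({'read': [50]})))   # about 50.5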
The coefficients of Intercept and read are statistics which follow a distribution, and the software calculates the standard error of that distribution. When the outcome is Normally distributed, the coefficients of Intercept and read have a Normal distribution, and under the null hypothesis of zero slope the ratio of the slope coefficient to its standard error follows a t-distribution with n - 2 degrees of freedom. So the p-value of read tests whether the slope is significantly different from zero, i.e. whether there is a systematic relationship between predictor and outcome. Here, we see that the p-value for read is < 0.0001, which means the slope of the line is significantly different from zero. The 95% CI of the slope indicates how precisely the slope was estimated: if the study were repeated many times, 95% of such intervals would contain the true slope. So we say that for a 1 unit increase in reading score, science scores increase by 0.61 on average (95% CI 0.50 to 0.71).
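These statistics can be recomputed by hand from the coefficient and its standard error. A sketch assuming md_fit from above, using scipy for the t-distribution:

from scipy import stats

coef = md_fit.params['read']   # 0.6085
se = md_fit.bse['read']        # 0.053

# t-statistic: ratio of the slope to its standard error
t_stat = coef / se             # about 11.42

# Two-sided p-value from a t-distribution with n - 2 = 198 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=md_fit.df_resid)

# 95% CI: coefficient plus or minus the critical t value times the standard error
t_crit = stats.t.ppf(0.975, df=md_fit.df_resid)
print(coef - t_crit * se, coef + t_crit * se)   # about 0.503 to 0.714

# The same interval straight from statsmodels
print(md_fit.conf_int())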
However, the slope of the line does not show to what extent variability in science scores is explained by reading scores. The amount of explained variability is quantified by the R-squared statistic (top right of the output, also called the coefficient of determination). The R-squared is interpreted as the proportion of total variability of the outcome that is explained by the model. Here, R-squared is 0.397, which means about 40% of the variability in science scores is explained by reading scores.
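R-squared can be verified directly from its definition as 1 minus the residual sum of squares divided by the total sum of squares. A sketch, again assuming md_fit and df from above:

import numpy as np

# Unexplained variability: sum of squared residuals
ss_resid = np.sum(md_fit.resid ** 2)

# Total variability of the outcome around its mean
ss_total = np.sum((df['science'] - df['science'].mean()) ** 2)

print(1 - ss_resid / ss_total)   # about 0.397
print(md_fit.rsquared)           # same value from statsmodels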
Note that the adjusted R-squared is not the proportion of explained variability, because it penalises the R-squared using the degrees of freedom; adjusted R-squared values are mainly used to compare regression models with different numbers of predictors.
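For completeness, here is the degrees-of-freedom adjustment written out, assuming md_fit from above (n = 200 observations, p = 1 predictor):

n = int(md_fit.nobs)       # 200 observations
p = int(md_fit.df_model)   # 1 predictor

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 = 1 - (1 - md_fit.rsquared) * (n - 1) / (n - p - 1)
print(adj_r2, md_fit.rsquared_adj)   # both about 0.394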
In the next post, we will perform covariate adjustment and learn how to write the output to a text file.
Summary
We viewed and learned to interpret the output from an ordinary least squares regression model. Specifically, the relationship between outcome and predictor is described by the intercept and slope of the model, and the R-squared value shows how much variability in the outcome is explained by the predictor. You are interested in the slope and its 95% CI, and in the R-squared.