Handling outlying or skewed data with robust regression

Suppose we want to compare the means of two groups when the data contain outliers or are skewed. Previously, we learned the pros and cons of log-transforming such data for analysis. The key disadvantage is that the between-group difference, whether expressed in log units or as the anti-logged ratio, is difficult to interpret meaningfully. However, transforming such data is not the only option: outlying or skewed data can also be handled using a robust approach.

Robust regression can mean different things, but it usually refers to a model that reduces the influence of extreme observations (McElreath 2020). For example, when the sample size is small, calculating a 95% CI using a t value from the t distribution, instead of a z value from the Normal distribution, is one simple robust approach. The advantage of robust regression is that we obtain the between-group difference and the 95% CI about the difference in the natural units of the outcome. Let's use robust regression to compare the heights of two groups of people, in which one group has extreme, outlying observations.
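As a quick illustration of the t-versus-z point, here is a small Python check using scipy (which the data-generation code at the end of this post already imports); the 10 degrees of freedom are an arbitrary small-sample choice, purely for illustration:

from scipy.stats import norm, t

# two-sided 95% critical values
z_crit = norm.ppf(0.975)       # about 1.96, from the Normal distribution
t_crit = t.ppf(0.975, df=10)   # about 2.23, from a t distribution with 10 df

# the t distribution's heavier tails give wider, more conservative
# 95% CIs when the sample is small
print(round(z_crit, 2), round(t_crit, 2))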

Let’s create the following data:

  1. Draw 100 random values of heights from a Normal distribution (mean = 168 cm, standard deviation = 15 cm). Assign these data to group 0.
  2. Draw 98 random values of human heights from the same Normal distribution. To these, add the heights of 2 very tall people. Assign these data to group 1.
  3. For both groups, plot histograms and scatterplots of height.
  4. Compare the mean difference in height between groups using an ordinary least-squares linear regression vs. robust regression.

Python code to generate the data, and Stata and R code to run the regressions are shown at the end of this post. Running the Python code generates this figure:

Fig 1. Histograms of heights from group 0 (left), heights from group 1 (middle), and scatterplots of heights from both groups (right). Two individuals in group 1 are extremely tall.

The mean (average) difference in height between groups should be close to 0, since we created the data in both groups using the same Normal distribution. But the 2 very tall individuals will influence the mean height of group 1 by pulling it up. How does a robust regression lessen the influence of such extreme observations?

A robust regression effectively performs a weighted regression, such that extreme observations carry less weight in the overall estimate than non-extreme observations do. Both Stata and R can perform robust regression using the same kinds of weights:

> Two types of weights are used. In Huber weighting, observations with small residuals get a weight of 1, the larger the residual, the smaller the weight. With biweighting, all cases with a non-zero residual get down-weighted at least a little. The two different kinds of weight are used because Huber weights can have difficulties with severe outliers, and biweights can have difficulties converging or may yield multiple solutions. Using the Huber weights first helps to minimize problems with the biweights.
> – UCLA tutorial on robust regression in Stata

However, Stata and R implement robust regression slightly differently.
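To make these weights concrete, here is a minimal Python sketch of the two weight functions (the function names are mine; residuals are assumed to be standardized by a robust estimate of scale, and the tuning constants 1.345 and 4.685 are the conventional defaults that give roughly 95% efficiency when the data are Normal):

import numpy as np

def huber_weight(resid, k=1.345):
    '''Huber: weight 1 for small standardized residuals,
    k/|e| (shrinking towards 0) beyond the cutoff k.'''
    e = np.abs(resid)
    return np.where(e <= k, 1.0, k / e)

def biweight(resid, k=4.685):
    '''Tukey biweight (bisquare): smooth down-weighting of all
    non-zero residuals, weight exactly 0 beyond the cutoff k.'''
    e = np.abs(resid)
    return np.where(e <= k, (1 - (e / k) ** 2) ** 2, 0.0)

Note that the biweight can assign a weight of exactly 0; this is why the 2 extreme observations end up with rweight = 0 in the Stata output further below.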

Robust regression in Stata

In Stata, the rreg command implements robust regression by first running an ordinary least-squares (OLS) regression. It calculates how influential each observation is and drops any that are too influential (by default, those with Cook's distance greater than 1). It then iteratively re-weights each observation according to how far its observed value deviates from the predicted (modelled) value. The iterations stop when the maximum change in weights between iterations falls below a threshold.

Stata shows the iteration history for both types of weights just above the robust regression output. Using Stata defaults, robust regression is approximately 95% as efficient as OLS regression. The most influential points are dropped, and observations with large deviations from predicted values are down-weighted.
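For intuition, here is a stripped-down Python sketch of the iteratively reweighted least-squares idea behind rreg. This is a simplified illustration, not Stata's exact algorithm: it omits the influence screening and the switch from Huber weights to biweights.

import numpy as np

def irls(X, y, weight_fn, tol=0.01, max_iter=50):
    '''Simplified iteratively reweighted least squares.
    X is the design matrix (with a constant column), y the outcome,
    weight_fn a weight function such as huber_weight above.'''
    beta = np.linalg.lstsq(X, y, rcond=None)[0]    # start from the OLS fit
    w = np.ones(len(y))
    for _ in range(max_iter):
        resid = y - X @ beta
        # standardize residuals using a robust scale estimate (the MAD)
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        w_new = weight_fn(resid / scale)
        # weighted least squares with the updated weights
        beta = np.linalg.solve(X.T @ (w_new[:, None] * X), X.T @ (w_new * y))
        if np.max(np.abs(w_new - w)) < tol:        # stop when weights stabilize
            break
        w = w_new
    return beta, w

Calling this with huber_weight from the sketch above mirrors the Huber iteration log that Stata prints just before the regression output.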

Stata output

First, summarise the height data by group. The means should be almost identical, but the 2 outliers in group 1 have pulled that group's mean up and inflated its SD.

. bysort group: summarize height

---------------------------------------------------------------------------------------------------------
-> group = 0

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      height |        100    167.3216     15.3432   127.3243   204.6427

---------------------------------------------------------------------------------------------------------
-> group = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      height |        100    168.9106    20.21702   116.0061        272

Now, run an OLS regression, then a robust regression. The coefficient and 95% CI for group indicate the between-group difference. Neither outlying observation was influential enough to be dropped from the robust regression (Number of obs = 200 for both models). But the robust regression estimate is closer to 0 and its 95% CI is tighter, as desired, compared with the OLS regression estimate and CI. We can loosely interpret this as "the difference in height between groups is 0.16 cm, and we are confident that this difference lies between -4.2 and 4.5 cm 95% of the time" (Herbert et al 2011).

. regress height group

      Source |       SS           df       MS      Number of obs   =       200
-------------+----------------------------------   F(1, 198)       =      0.39
       Model |  126.242209         1  126.242209   Prob > F        =    0.5320
    Residual |  63770.0403       198  322.070911   R-squared       =    0.0020
-------------+----------------------------------   Adj R-squared   =   -0.0031
       Total |  63896.2825       199  321.086847   Root MSE        =    17.946

------------------------------------------------------------------------------
      height |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       group |   1.588976   2.537995     0.63   0.532    -3.415995    6.593946
       _cons |   167.3216   1.794633    93.23   0.000     163.7825    170.8606
------------------------------------------------------------------------------

. rreg height group, gen(rweight)

   Huber iteration 1:  maximum difference in weights = .80669086
   Huber iteration 2:  maximum difference in weights = .05326543
   Huber iteration 3:  maximum difference in weights = .00894569
Biweight iteration 4:  maximum difference in weights = .24059431
Biweight iteration 5:  maximum difference in weights = .01389677
Biweight iteration 6:  maximum difference in weights = .00755576

Robust regression                               Number of obs     =        200
                                                F(  1,       198) =       0.01
                                                Prob > F          =     0.9426

------------------------------------------------------------------------------
      height |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       group |   .1591452   2.205679     0.07   0.943    -4.190493    4.508783
       _cons |   167.1414   1.559651   107.17   0.000     164.0657     170.217
------------------------------------------------------------------------------
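As a quick sanity check, the 95% CI that Stata reports is just the coefficient plus or minus the t critical value times the standard error. A short Python verification, plugging in the group coefficient and standard error from the rreg output above:

from scipy.stats import t

coef, se, df_resid = 0.1591452, 2.205679, 198
t_crit = t.ppf(0.975, df_resid)              # two-sided 95% critical value
lo, hi = coef - t_crit * se, coef + t_crit * se
print(round(lo, 2), round(hi, 2))            # -4.19 and 4.51, matching Stata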

Listing the 10 observations with the smallest weights (the observations were sorted by rweight first), the 2 outlying observations were down-weighted to exactly 0.

. clist height rweight in 1/10, noobs

   height     rweight
      250           0
      272           0
 116.0061   .20960776
 125.6731   .41330824
 127.3243   .45367491
 204.6427   .50446194
 131.1786   .53447902
 203.3943   .53560991
 200.6264   .59508308
 200.6045   .59554758

Robust regression in R

In R, the rlm function in the MASS package implements several versions of robust regression. The code below shows how to implement it using Huber or bisquare weighting. R doesn't combine both types of weights in a single regression the way Stata does; rather, it applies them separately.

R output

First, run an OLS regression and get the 95% CI. The coefficient and CI match the results from Stata's regress command.

> # run ordinary least squares linear regression
> ols <- lm(height ~ group, data = df)
> summary(ols)

Call:
lm(formula = height ~ group, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-52.904 -11.114  -1.745   8.735 103.089 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  167.322      1.795  93.234   <2e-16 ***
group          1.589      2.538   0.626    0.532    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.95 on 198 degrees of freedom
Multiple R-squared:  0.001976,  Adjusted R-squared:  -0.003065 
F-statistic: 0.3919 on 1 and 198 DF,  p-value: 0.532

> confint(ols)
                 2.5 %     97.5 %
(Intercept) 163.782527 170.860624
group        -3.415995   6.593945

Then run robust regressions using Huber or bisquare weighting. In our example, bisquare weighting produces results closer to those from Stata's rreg command, which makes sense: Stata's rreg finishes with biweight iterations.

> # run robust regression: Huber weighting
> rr.huber <- rlm(height ~ group, data = df)
> summary(rr.huber)

Call: rlm(formula = height ~ group, data = df)
Residuals:
    Min      1Q  Median      3Q     Max 
-51.695 -10.326  -1.312   9.799 104.298 

Coefficients:
            Value    Std. Error t value 
(Intercept) 167.0359   1.5934   104.8278
group         0.6656   2.2535     0.2954

Residual standard error: 14.73 on 198 degrees of freedom
> confint.default(rr.huber)
                 2.5 %     97.5 %
(Intercept) 163.912847 170.158980
group        -3.751093   5.082275
> # run robust regression: bisquare weighting
> rr.bisquare <- rlm(height ~ group, data = df, psi = psi.bisquare)
> summary(rr.bisquare)

Call: rlm(formula = height ~ group, data = df, psi = psi.bisquare)
Residuals:
     Min       1Q   Median       3Q      Max 
-51.2960 -10.1145  -0.9124   9.8228 104.6979 

Coefficients:
            Value    Std. Error t value 
(Intercept) 167.1365   1.5592   107.1912
group         0.1656   2.2051     0.0751

Residual standard error: 14.66 on 198 degrees of freedom
> confint.default(rr.bisquare)
                 2.5 %     97.5 %
(Intercept) 164.080443 170.192540
group        -4.156299   4.487512 

Summary

Robust regression performs a weighted regression such that extreme observations carry less weight in the overall estimate. This is a useful alternative to analysing transformed data: we obtain the estimate and its precision in the natural units of the outcome, which makes the overall effect easier to interpret meaningfully.

It seems possible to perform robust regression in Python using the statsmodels package or scikit-learn, but I haven't tested these options. If you have tried them and found them useful, do let us know.
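For the curious, here is a minimal, untested sketch of the statsmodels route, using sm.RLM with the HuberT and TukeyBiweight norms (this assumes the data.csv file written by the Python code below):

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('data.csv')
X = sm.add_constant(df['group'])   # add an intercept column

# Huber weighting
huber = sm.RLM(df['height'], X, M=sm.robust.norms.HuberT()).fit()
print(huber.params)
print(huber.conf_int())

# bisquare (Tukey biweight) weighting
bisquare = sm.RLM(df['height'], X, M=sm.robust.norms.TukeyBiweight()).fit()
print(bisquare.params)
print(bisquare.conf_int())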

References

UCLA tutorial on robust regression in Stata.

UCLA tutorial on robust regression in R.

McElreath R (2020) Statistical Rethinking: A Bayesian course with examples in R and Stan (2nd Ed) Florida, USA: Chapman and Hall/CRC, p 233.

Herbert R, Jamtvedt G, Birger Hagen K, Mead J (2011) Practical Evidence-Based Physiotherapy. (2nd Ed) Edinburgh: Elsevier Churchill Livingstone, p 104.

Python code

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
import pandas as pd

np.random.seed(15)

# generate height data
mean, sd = 168, 15
height_0 = norm.rvs(loc=mean, scale=sd, size=100)
height_1 = norm.rvs(loc=mean, scale=sd, size=98)  # 98 draws + 2 outliers = 100 observations
height_1 = np.append(height_1, [250.0, 272.0])
group = np.concatenate((np.zeros(len(height_0)), np.ones(len(height_1))))
height = np.concatenate((height_0, height_1))

def jitterx(array, loc=0, scale=0.1):
    '''Create jittered x values for categorical data.
    Adjust loc to change category/group number.
    Adjust scale to change amount of jitter.'''
    jitteredx = np.random.normal(loc=loc, scale=scale, size=len(array))
    return jitteredx

# plot
fig, axes = plt.subplots(1, 3, figsize=(6, 3))
ax1 = plt.subplot(1, 3, 1)
ax1.hist(height_0, density=True, histtype='stepfilled', alpha=0.2)
plt.ticklabel_format(axis='y', style='sci', scilimits=(0, 0))
ax1.set_xlabel('height (cm): group 0')
ax1.set_ylabel('probability')

ax2 = plt.subplot(1, 3, 2)
ax2.hist(height_1, density=True, histtype='stepfilled', alpha=0.2)
plt.ticklabel_format(axis='y', style='sci', scilimits=(0, 0))
ax2.set_xlabel('height (cm): group 1')
ax2.set_ylabel('probability')

ax3 = plt.subplot(1, 3, 3)
xvals_0 = jitterx(height_0, loc=0, scale=0.15)
xvals_1 = jitterx(height_1, loc=1, scale=0.15)
ax3.scatter(xvals_0, height_0, s=4, color='b', alpha=0.2)
ax3.scatter(xvals_1, height_1, s=4, color='b', alpha=0.2)
plt.xticks(ticks=[-1, 0, 1, 2], labels=['', '0', '1', ''])
ax3.set_xlabel('group')
ax3.set_ylabel('height (cm)')

plt.tight_layout()
plt.savefig('fig.png', dpi=300)
plt.close()

# save data
df = pd.DataFrame({'group': group, 'height': height})
df.to_csv('data.csv', index=False)  # omit the row index column

Stata code

clear all
import delimited data.csv

* summarise height data by group
bysort group: summarize height

* run ordinary least squares linear regression
regress height group

* run robust regression
rreg height group, gen(rweight)
sort rweight
clist height rweight in 1/10, noobs

R code

install.packages("MASS")  # package name must be quoted
library(MASS)

df <- read.csv('data.csv', header=TRUE)

# run ordinary least squares linear regression
ols <- lm(height ~ group, data = df)
summary(ols)
confint(ols)

# run robust regression: Huber weighting
rr.huber <- rlm(height ~ group, data = df)
summary(rr.huber)
confint.default(rr.huber)

# run robust regression: bisquare weighting
rr.bisquare <- rlm(height ~ group, data = df, psi = psi.bisquare)
summary(rr.bisquare)
confint.default(rr.bisquare)
