## Handling outlying or skewed data with robust regression

Suppose we want to compare the means of two groups where the data contain outliers or are skewed. Previously, we learned the pros and cons of log-transforming such data for analysis. The key disadvantage is that the between-group difference, whether in log units or as the anti-logged ratio, is difficult to interpret meaningfully. However, transforming the data is not the only option. It is possible to deal with outlying or skewed data by using a robust approach.

Robust regression can mean different things, but it usually refers to a model that reduces the influence of extreme observations (McElreath 2020). For example, when the sample size is small, calculating a 95% CI using a t value from the t distribution, rather than a z value from the Normal distribution, is one simple way of applying a robust approach. The advantage of robust regression is that we obtain the between-group difference, and the 95% CI about that difference, in the natural units of the outcome. Let's use robust regression to compare the heights of two groups of people, where one group contains extreme, outlying observations.
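As an aside, the widening effect of using a t critical value instead of a z value is easy to see numerically. A minimal sketch in Python using scipy (the degrees of freedom here are illustrative, not taken from this post):

```python
from scipy.stats import norm, t

# critical values for a 95% CI
z_crit = norm.ppf(0.975)        # ~1.96, from the Normal distribution
t_crit = t.ppf(0.975, df=10)    # ~2.23, from the t distribution with 10 df

# the t-based interval is wider, tempering overconfidence in small samples
print(round(z_crit, 3), round(t_crit, 3))

# as the degrees of freedom grow, the t critical value approaches the z value
print(round(t.ppf(0.975, df=1000), 3))
```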

Let’s create the following data:

1. Draw 100 random values of heights from a Normal distribution (mean = 168 cm, standard deviation = 15 cm). Assign these data to group 0.
2. Draw 98 random values of heights from the identical Normal distribution. To these, add the heights of 2 very tall people. Assign these data to group 1.
3. For both groups, plot histograms and scatterplots of height.
4. Compare the mean difference in height between groups using an ordinary least-squares linear regression vs. robust regression.

Python code to generate the data, and Stata and R code to run the regressions are shown at the end of this post. Running the Python code generates this figure:

Fig 1. Histograms of heights from group 0 (left), heights from group 1 (middle), and scatterplots of heights from both groups (right). Two individuals in group 1 are extremely tall.

The mean (average) difference in height between groups should be close to 0, since we created the data in both groups using the same Normal distribution. But the 2 very tall individuals will influence the mean height of group 1 by pulling it up. How does a robust regression lessen the influence of such extreme observations?

A robust regression effectively performs a weighted regression in which extreme observations carry less weight in the overall estimate than non-extreme observations. Both Stata and R can perform robust regression using the same kinds of weights:

> Two types of weights are used. In Huber weighting, observations with small residuals get a weight of 1, the larger the residual, the smaller the weight. With biweighting, all cases with a non-zero residual get down-weighted at least a little. The two different kinds of weight are used because Huber weights can have difficulties with severe outliers, and biweights can have difficulties converging or may yield multiple solutions. Using the Huber weights first helps to minimize problems with the biweights.
> – UCLA tutorial on robust regression in Stata
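To make the quoted description concrete, here is a minimal Python sketch of the two weight functions applied to standardized residuals. The tuning constants c = 1.345 (Huber) and c = 4.685 (biweight) are the common defaults from the robust-regression literature, not values taken from this post:

```python
import numpy as np

def huber_weight(u, c=1.345):
    """Huber weights: 1 for small standardized residuals |u| <= c,
    then decreasing as c/|u| for larger residuals (never exactly 0)."""
    u = np.abs(np.asarray(u, dtype=float))
    return np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))

def bisquare_weight(u, c=4.685):
    """Tukey biweight (bisquare): every non-zero residual is down-weighted
    at least a little, and residuals beyond c get a weight of exactly 0."""
    u = np.asarray(u, dtype=float)
    w = (1 - (u / c) ** 2) ** 2
    return np.where(np.abs(u) <= c, w, 0.0)

# small residual: full weight under Huber, slightly reduced under biweight
print(huber_weight(0.5), bisquare_weight(0.5))
# extreme residual: small weight under Huber, exactly zero under biweight
print(huber_weight(10.0), bisquare_weight(10.0))
```

The hard zero in the biweight is why, later in this post, the two extremely tall individuals end up with weights of exactly 0.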

However, Stata and R implement robust regression slightly differently.

### Robust regression in Stata

In Stata, the `rreg` command implements robust regression by first running an ordinary least squares (OLS) regression. It calculates how influential each observation is (via Cook's distance) and drops observations that are too influential. It then iteratively reweights each observation based on how far the observed value deviates from the predicted (modelled) value. The iterations stop when the maximum change in weights between iterations falls below a threshold.

Stata shows the iteration history for both types of weights just above the robust regression output. Using Stata defaults, robust regression is approximately 95% as efficient as OLS regression. The most influential points are dropped, and observations with large deviations from predicted values are down-weighted.
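The iterative scheme can be sketched in Python as iteratively reweighted least squares (IRLS): fit, compute standardized residuals, recompute weights, refit, and stop when the weights settle. This is a simplified illustration, not a reimplementation of `rreg`: it skips the influence screening and uses Huber weights only.

```python
import numpy as np

def irls_huber(X, y, c=1.345, tol=1e-6, max_iter=50):
    """Iteratively reweighted least squares with Huber weights.
    X must include a column of ones for the intercept."""
    w = np.ones(len(y))
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        # weighted least squares step
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        resid = y - X @ beta_new
        # robust scale estimate from the median absolute deviation
        s = np.median(np.abs(resid - np.median(resid))) / 0.6745
        u = resid / s
        # Huber weights: 1 for small residuals, c/|u| for large ones
        w_new = np.where(np.abs(u) <= c, 1.0, c / np.maximum(np.abs(u), 1e-12))
        if np.max(np.abs(w_new - w)) < tol:
            return beta_new
        w, beta = w_new, beta_new
    return beta

# intercept-only example: one wild value drags the plain mean upward,
# but the robust estimate stays near the bulk of the data
y = np.append(np.arange(1.0, 21.0), 100.0)
X = np.ones((len(y), 1))
print(y.mean())             # 14.76..., pulled up by the outlier
print(irls_huber(X, y)[0])  # stays near the bulk of the data (~11)
```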

### Stata output

First, summarise the height data by group. The means should be almost identical, but the 2 outliers in group 1 have pulled its mean up and inflated its SD.

``````. bysort group: summarize height

---------------------------------------------------------------------------
-> group = 0

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      height |        100    167.3216      15.3432   127.3243   204.6427

---------------------------------------------------------------------------
-> group = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      height |        100    168.9106    20.21702   116.0061        272
``````

Now, run an OLS regression, then a robust regression. The coefficient and 95% CI of `group` indicate the between-group difference. Neither outlying observation was influential enough to be dropped from the robust regression (Number of obs = 200 for both models). But the robust regression estimate is closer to 0 and its 95% CI is tighter, as desired, compared to the OLS estimate and CI. We can loosely interpret this as: "the difference in height between groups is 0.16 cm, and we are confident that this difference lies between -4.2 and 4.5 cm 95% of the time" (Herbert et al 2011).

``````. regress height group

      Source |       SS           df       MS      Number of obs   =       200
-------------+----------------------------------   F(1, 198)       =      0.39
       Model |  126.242209         1  126.242209   Prob > F        =    0.5320
    Residual |  63770.0403       198  322.070911   R-squared       =    0.0020
       Total |  63896.2825       199  321.086847   Root MSE        =    17.946

------------------------------------------------------------------------------
      height |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       group |   1.588976   2.537995     0.63   0.532    -3.415995    6.593946
       _cons |   167.3216   1.794633    93.23   0.000     163.7825    170.8606
------------------------------------------------------------------------------

. rreg height group, gen(rweight)

   Huber iteration 1:  maximum difference in weights = .80669086
   Huber iteration 2:  maximum difference in weights = .05326543
   Huber iteration 3:  maximum difference in weights = .00894569
Biweight iteration 4:  maximum difference in weights = .24059431
Biweight iteration 5:  maximum difference in weights = .01389677
Biweight iteration 6:  maximum difference in weights = .00755576

Robust regression                               Number of obs     =        200
                                                F(  1,       198) =       0.01
                                                Prob > F          =     0.9426

------------------------------------------------------------------------------
      height |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       group |   .1591452   2.205679     0.07   0.943    -4.190493    4.508783
       _cons |   167.1414   1.559651   107.17   0.000     164.0657     170.217
------------------------------------------------------------------------------
``````

Listing the observations and their weights shows that the two outlying observations were down-weighted to 0.

``````. clist height rweight in 1/10, noobs

    height     rweight
       250           0
       272           0
  116.0061   .20960776
  125.6731   .41330824
  127.3243   .45367491
  204.6427   .50446194
  131.1786   .53447902
  203.3943   .53560991
  200.6264   .59508308
  200.6045   .59554758
``````

### Robust regression in R

In R, the `rlm` function in the MASS package implements several versions of robust regression. The code below shows how to use Huber or bisquare weighting. R doesn't combine both types of weights in a single regression the way Stata does; rather, it implements them separately.

### R output

First run an OLS regression and get the 95% CI. The coefficient and CI correspond to results from Stata’s `regress` command.

``````> # run ordinary least squares linear regression
> ols <- lm(height ~ group, data = df)
> summary(ols)

Call:
lm(formula = height ~ group, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-52.904 -11.114  -1.745   8.735 103.089

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  167.322      1.795  93.234   <2e-16 ***
group          1.589      2.538   0.626    0.532

Residual standard error: 17.95 on 198 degrees of freedom

> confint(ols)
                 2.5 %     97.5 %
(Intercept) 163.782527 170.860624
group        -3.415995   6.593945
``````

Then run robust regressions using Huber or bisquare weighting. In our example, bisquare weighting gives results closer to those from Stata's `rreg` command.

``````> # run robust regression: Huber weighting
> rr.huber <- rlm(height ~ group, data = df)
> summary(rr.huber)

Call: rlm(formula = height ~ group, data = df)
Residuals:
    Min      1Q  Median      3Q     Max
-51.695 -10.326  -1.312   9.799 104.298

Coefficients:
            Value    Std. Error t value
(Intercept) 167.0359   1.5934   104.8278
group         0.6656   2.2535     0.2954

Residual standard error: 14.73 on 198 degrees of freedom
> confint.default(rr.huber)
                 2.5 %     97.5 %
(Intercept) 163.912847 170.158980
group        -3.751093   5.082275
``````
``````> # run robust regression: bisquare weighting
> rr.bisquare <- rlm(height ~ group, data = df, psi = psi.bisquare)
> summary(rr.bisquare)

Call: rlm(formula = height ~ group, data = df, psi = psi.bisquare)
Residuals:
     Min       1Q   Median       3Q      Max
-51.2960 -10.1145  -0.9124   9.8228 104.6979

Coefficients:
            Value    Std. Error t value
(Intercept) 167.1365   1.5592   107.1912
group         0.1656   2.2051     0.0751

Residual standard error: 14.66 on 198 degrees of freedom
> confint.default(rr.bisquare)
                 2.5 %     97.5 %
(Intercept) 164.080443 170.192540
group        -4.156299   4.487512
``````

### Summary

Robust regression performs a weighted regression in which extreme observations carry less weight in the overall estimate. This is a useful alternative to analysing transformed data: we obtain the estimate and its precision in the natural units of the outcome, which allows the overall effect to be interpreted more meaningfully.

It seems possible to perform robust regression in Python using the statsmodels package or scikit-learn, but I haven’t tested these options. If you have and find them useful, do let us know.

### References

UCLA tutorial on robust regression in Stata.

UCLA tutorial on robust regression in R.

McElreath R (2020) Statistical Rethinking: A Bayesian course with examples in R and Stan (2nd Ed) Florida, USA: Chapman and Hall/CRC, p 233.

Herbert R, Jamtvedt G, Birger Hagen K, Mead J (2011) Practical Evidence-Based Physiotherapy. (2nd Ed) Edinburgh: Elsevier Churchill Livingstone, p 104.

### Python code

``````import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
import pandas as pd

np.random.seed(15)

# generate height data
mean, sd = 168, 15
height_0 = norm.rvs(loc=mean, scale=sd, size=100)
height_1 = norm.rvs(loc=mean, scale=sd, size=98)
height_1 = np.append(height_1, [250.0, 272.0])  # add 2 very tall people
group = np.concatenate((np.zeros(len(height_0)), np.ones(len(height_1))))
height = np.concatenate((height_0, height_1))

def jitterx(array, loc=0, scale=0.1):
    '''Create jittered x values for categorical data.
    Adjust loc to change category/group number.
    Adjust scale to change amount of jitter.'''
    jitteredx = np.random.normal(loc=loc, scale=scale, size=len(array))
    return jitteredx

# plot
fig, axes = plt.subplots(1, 3, figsize=(6, 3))
ax1 = plt.subplot(1, 3, 1)
ax1.hist(height_0, density=True, histtype='stepfilled', alpha=0.2)
plt.ticklabel_format(axis='y', style='sci', scilimits=(0, 0))
ax1.set_xlabel('height (cm): group 0')
ax1.set_ylabel('probability')

ax2 = plt.subplot(1, 3, 2)
ax2.hist(height_1, density=True, histtype='stepfilled', alpha=0.2)
plt.ticklabel_format(axis='y', style='sci', scilimits=(0, 0))
ax2.set_xlabel('height (cm): group 1')
ax2.set_ylabel('probability')

ax3 = plt.subplot(1, 3, 3)
xvals_0 = jitterx(height_0, loc=0, scale=0.15)
xvals_1 = jitterx(height_1, loc=1, scale=0.15)
ax3.scatter(xvals_0, height_0, s=4, color='b', alpha=0.2)
ax3.scatter(xvals_1, height_1, s=4, color='b', alpha=0.2)
plt.xticks(ticks=[-1, 0, 1, 2], labels=['', '0', '1', ''])
ax3.set_xlabel('group')
ax3.set_ylabel('height (cm)')

plt.tight_layout()
plt.savefig('fig.png', dpi=300)
plt.close()

# save data
df = pd.DataFrame({'group': group, 'height': height})
df.to_csv('data.csv', index=False)  # keep only the group and height columns
``````

### Stata code

``````clear all
import delimited data.csv

* summarise height data by group
bysort group: summarize height

* run ordinary least squares linear regression
regress height group

* run robust regression
rreg height group, gen(rweight)
sort rweight
clist height rweight in 1/10, noobs
``````

### R code

``````install.packages("MASS")
library(MASS)

# read the data generated by the Python code
df <- read.csv("data.csv")

# run ordinary least squares linear regression
ols <- lm(height ~ group, data = df)
summary(ols)
confint(ols)

# run robust regression: Huber weighting
rr.huber <- rlm(height ~ group, data = df)
summary(rr.huber)
confint.default(rr.huber)

# run robust regression: bisquare weighting
rr.bisquare <- rlm(height ~ group, data = df, psi = psi.bisquare)
summary(rr.bisquare)
confint.default(rr.bisquare)
``````
``````