## Handling outlying or skewed data with robust regression

Suppose we want to compare the means of two groups where there are outlying data, or data are skewed. Previously, we learned of the pros and cons to log-transforming such data for analysis. The key disadvantage is that the between-group difference, either in log-units or as the anti-logged ratio, is difficult to interpret meaningfully. However, transforming such data is not the only option. It is possible to deal with outlying or skewed data by using a *robust* approach.

*Robust regression* can mean different things, but it usually refers to a model which reduces the influence of extreme observations (McElreath 2020). For example, when sample size is small, calculating a 95% CI using a *t* value from the T distribution, instead of a *z* value from the Normal distribution, is one way of applying a robust approach. The advantage of applying a robust regression is that we can obtain the between-group difference and 95% CI about the difference in the natural units of the outcome. Let’s use robust regression to compare the heights of two groups of people, in which one group has extreme, outlying observations.

Let’s create the following data:

- Draw 100 random values of heights from a Normal distribution (mean = 168 cm, standard deviation = 20 cm). Assign these data to group 0.
- Draw 98 random values of human heights from the identical Normal distribution. To these, add heights from 2
*very*tall people. Assign these data to group 1. - For both groups, plot histograms and scatterplots of height.
- Compare the mean difference in height between groups using an ordinary least-squares linear regression vs. robust regression.

Python code to generate the data, and Stata and R code to run the regressions are shown at the end of this post. Running the Python code generates this figure:

**Fig 1.** Histograms of heights from group 0 (left), heights from group 1 (middle), and scatterplots of heights from both groups (right). Two individuals in group 1 are extremely tall.

The mean (average) difference in height between groups should be close to 0, since we created the data in both groups using the same Normal distribution. But the 2 very tall individuals will influence the mean height of group 1 by pulling it up. How does a robust regression lessen the influence of such extreme observations?

A robust regression effectively performs a *weighted regression* such that extreme observations carry less weight upon the overall effect, compared to non-extreme observations. Both Stata and R can perform robust regression using the same kinds of weights:

> Two types of weights are used. In Huber weighting, observations with small residuals get a weight of 1, the larger the residual, the smaller the weight. With biweighting, all cases with a non-zero residual get down-weighted at least a little. The two different kinds of weight are used because Huber weights can have difficulties with severe outliers, and biweights can have difficulties converging or may yield multiple solutions. Using the Huber weights first helps to minimize problems with the biweights.

>
>– UCLA tutorial on robust regression in Stata

However, Stata and R implement robust regression slightly differently.

### Robust regression in Stata

In Stata, the `rreg`

command implements robust regression by first running an ordinary least squares (OLS) regression. It calculates how influential any extreme observation might be, and drops observations that are too influential. Then, it weights each observation based on how much the observed value deviates from the predicted (modelled) value in an iterative process. The iteration stops when differences in weights between iterations is below a threshold.

Stata shows the iteration history for both types of weights just above the robust regression output. Using Stata defaults, robust regression is approximately 95% as efficient as OLS regression. The most influential points are dropped, and observations with large deviations from predicted values are down-weighted.

### Stata output

First, summarise the height data by group. Means should be almost identical, but the 2 outliers in group 1 have pulled the mean up and increased the SD.

```
. bysort group: summarize height
---------------------------------------------------------------------------------------------------------
-> group = 0
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
height | 100 167.3216 15.3432 127.3243 204.6427
---------------------------------------------------------------------------------------------------------
-> group = 1
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
height | 100 168.9106 20.21702 116.0061 272
```

Now, run an OLS regression, then a robust regression. The coefficient and 95% CI of `group`

indicates the between-group difference. Both outlying observations were not influential enough to have been dropped from the robust regression (Number of obs = 200 for both models). But the robust regression estimate is closer to 0 and the 95% CI are tighter, as desired, compared to the OLS regression estimate and CI. We can loosely interpret this as "the difference between groups in height is 0.16 cm, and we are confident that this difference lies between -4.2 to 4.5 cm 95% of the time". (Herbert et al 2011)

```
. regress height group
Source | SS df MS Number of obs = 200
-------------+---------------------------------- F(1, 198) = 0.39
Model | 126.242209 1 126.242209 Prob > F = 0.5320
Residual | 63770.0403 198 322.070911 R-squared = 0.0020
-------------+---------------------------------- Adj R-squared = -0.0031
Total | 63896.2825 199 321.086847 Root MSE = 17.946
------------------------------------------------------------------------------
height | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
group | 1.588976 2.537995 0.63 0.532 -3.415995 6.593946
_cons | 167.3216 1.794633 93.23 0.000 163.7825 170.8606
------------------------------------------------------------------------------
. rreg height group, gen(rweight)
Huber iteration 1: maximum difference in weights = .80669086
Huber iteration 2: maximum difference in weights = .05326543
Huber iteration 3: maximum difference in weights = .00894569
Biweight iteration 4: maximum difference in weights = .24059431
Biweight iteration 5: maximum difference in weights = .01389677
Biweight iteration 6: maximum difference in weights = .00755576
Robust regression Number of obs = 200
F( 1, 198) = 0.01
Prob > F = 0.9426
------------------------------------------------------------------------------
height | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
group | .1591452 2.205679 0.07 0.943 -4.190493 4.508783
_cons | 167.1414 1.559651 107.17 0.000 164.0657 170.217
------------------------------------------------------------------------------
```

Listing the observations and their weights, the outlying observations were down-weighted to 0.

```
. clist height rweight in 1/10, noobs
height rweight
250 0
272 0
116.0061 .20960776
125.6731 .41330824
127.3243 .45367491
204.6427 .50446194
131.1786 .53447902
203.3943 .53560991
200.6264 .59508308
200.6045 .59554758
```

### Robust regression in R

In R, the `rlm`

command in the MASS package implements several versions of robust regression. The code below shows how to implement this using Huber or bisquare weighting. R doesn’t combine both types of weights in a single regression the way Stata does. Rather, it implements them separately.

### R output

First run an OLS regression and get the 95% CI. The coefficient and CI correspond to results from Stata’s `regress`

command.

```
> # run ordinary least squares linear regression
> ols summary(ols)
Call:
lm(formula = height ~ group, data = df)
Residuals:
Min 1Q Median 3Q Max
-52.904 -11.114 -1.745 8.735 103.089
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 167.322 1.795 93.234 confint(ols)
2.5 % 97.5 %
(Intercept) 163.782527 170.860624
group -3.415995 6.593945
```

Then run robust regressions using Huber or bisquare weighting. In our example, bisquare weighting seems to generate more similar results to results from Stata’s `rreg`

command.

```
> # run robust regression: Huber weighting
> rr.huber summary(rr.huber)
Call: rlm(formula = height ~ group, data = df)
Residuals:
Min 1Q Median 3Q Max
-51.695 -10.326 -1.312 9.799 104.298
Coefficients:
Value Std. Error t value
(Intercept) 167.0359 1.5934 104.8278
group 0.6656 2.2535 0.2954
Residual standard error: 14.73 on 198 degrees of freedom
> confint.default(rr.huber)
2.5 % 97.5 %
(Intercept) 163.912847 170.158980
group -3.751093 5.082275
```

```
> # run robust regression: bisquare weighting
> rr.bisquare summary(rr.bisquare)
Call: rlm(formula = height ~ group, data = df, psi = psi.bisquare)
Residuals:
Min 1Q Median 3Q Max
-51.2960 -10.1145 -0.9124 9.8228 104.6979
Coefficients:
Value Std. Error t value
(Intercept) 167.1365 1.5592 107.1912
group 0.1656 2.2051 0.0751
Residual standard error: 14.66 on 198 degrees of freedom
> confint.default(rr.bisquare)
2.5 % 97.5 %
(Intercept) 164.080443 170.192540
group -4.156299 4.487512
```

### Summary

Robust regression performs a weighted regression such that extreme observations carry less weight on the overall effect. This is a useful alternative to analysing transformed data: we can obtain the estimate and precision in natural units of the outcome, which allows the overall effect to be interpreted more meaningfully.

It seems possible to perform robust regression in Python using the statsmodels package or scikit-learn, but I haven’t tested these options. If you have and find them useful, do let us know.

### References

UCLA tutorial on robust regression in Stata.

UCLA tutorial on robust regression in R.

McElreath R (2020) Statistical Rethinking: A Bayesian course with examples in R and Stan (2nd Ed) Florida, USA: Chapman and Hall/CRC, p 233.

Herbert R, Jamtvedt G, Birger Hagen K, Mead J (2011) Practical Evidence-Based Physiotherapy. (2nd Ed) Edinburgh: Elsevier Churchill Livingstone, p 104.

### Posts in series

- Handling badly behaved data 1: create and plot data
- Handling badly behaved data 2: log-transform one sample
- Handling badly behaved data 3: log-transform to compare 2 means
- Handling badly behaved data 4: robust regression for outlying data

### Python code

```
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
import pandas as pd
np.random.seed(15)
# generate height data
mean, sd = 168, 15
height_0 = norm.rvs(loc=mean, scale=sd, size=100)
height_1 = norm.rvs(loc=mean, scale=sd, size=97)
height_1 = np.append(height_1, [250.0, 272.0])
group = np.concatenate((np.zeros(len(height_0)), np.ones(len(height_1))))
height = np.concatenate((height_0, height_1))
def jitterx(array, loc=0, scale=0.1):
'''Create jittered x values for categorical data.
Adjust loc to change category/group number.
Adjust scale to change amount of jitter.'''
jitteredx = np.random.normal(loc=loc, scale=scale, size=len(array))
return jitteredx
# plot
fig, axes = plt.subplots(1, 3, figsize=(6, 3))
ax1 = plt.subplot(1, 3, 1)
ax1.hist(height_0, density=True, histtype='stepfilled', alpha=0.2)
plt.ticklabel_format(axis='y', style='sci', scilimits=(0, 0))
ax1.set_xlabel('height (cm): group 0')
ax1.set_ylabel('probability')
ax2 = plt.subplot(1, 3, 2)
ax2.hist(height_1, density=True, histtype='stepfilled', alpha=0.2)
plt.ticklabel_format(axis='y', style='sci', scilimits=(0, 0))
ax2.set_xlabel('height (cm): group 1')
ax2.set_ylabel('probability')
ax3 = plt.subplot(1, 3, 3)
xvals_0 = jitterx(height_0, loc=0, scale=0.15)
xvals_1 = jitterx(height_1, loc=1, scale=0.15)
ax3.scatter(xvals_0, height_0, s=4, color='b', alpha=0.2)
ax3.scatter(xvals_1, height_1, s=4, color='b', alpha=0.2)
plt.xticks(ticks=[-1, 0, 1, 2], labels=['', '0', '1', ''])
ax3.set_xlabel('group')
ax3.set_ylabel('height (cm)')
plt.tight_layout()
plt.savefig('fig.png', dpi=300)
plt.close()
# save data
df = pd.DataFrame({'group': group, 'height': height})
df.to_csv('data.csv')
```

### Stata code

```
clear all
import delimited data.csv
* summarise height data by group
bysort group: summarize height
* run ordinary least squares linear regression
regress height group
* run robust regression
rreg height group, gen(rweight)
sort rweight
clist height rweight in 1/10, noobs
```

### R code

```
install.packages(MASS)
library(MASS)
df <- read.csv('data.csv', header=TRUE)
# run ordinary least squares linear regression
ols <- lm(height ~ group, data = df)
summary(ols)
confint(ols)
# run robust regression: Huber weighting
rr.huber <- rlm(height ~ group, data = df)
summary(rr.huber)
confint.default(rr.huber)
# run robust regression: bisquare weighting
rr.bisquare <- rlm(height ~ group, data = df, psi = psi.bisquare)
summary(rr.bisquare)
confint.default(rr.bisquare)
```