Standard Errors in Python

Estimating standard errors in Python

Prepared by Vincent Grégoire, Department of Finance, The University of Melbourne.

You can download the latest version as a Jupyter (IPython) notebook at https://github.com/vgreg/python-se. Last update: Dec. 6 2016.

Contact: vincent.gregoire@unimelb.edu.au

A common problem faced by financial economists when transitioning to Python is the apparent inability to easily do things for which we have built-in functions in Stata, SAS or R. This page aims to alleviate this pain by providing code samples replicating most of the methods discussed on Mitchell A. Petersen's standard errors programming advice page, and more.

This page is a perpetual work in progress, and there are probably better/cleaner/faster/more efficient ways to do some of these things; please email me if you have any comments or suggestions. I use Python 2.7, so while most of these examples should work without too many modifications in Python 3, I have not tested them there.

The four main modules that serve as the basis for data analysis in Python are pandas, statsmodels, numpy and scipy.

Most of these examples rely on the covariance adjustments of the fit() function for OLS regressions in statsmodels. Unfortunately, the documentation is quite sparse, but some information can be gathered from the documentation for get_robustcov_results().
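
As a quick illustration of the general pattern (on a small synthetic dataset, not Petersen's data), the covariance options are passed to fit(), and the fitted result exposes the adjusted quantities through attributes such as params, bse, tvalues, pvalues and cov_params():

import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Small synthetic dataset for illustration only.
np.random.seed(0)
data = pd.DataFrame({'x': np.random.randn(100)})
data['y'] = 1 + 2 * data['x'] + np.random.randn(100)

res = sm.ols(formula='y ~ x', data=data).fit(cov_type='HC1', use_t=True)
print(res.params)        # coefficient estimates (unchanged by the covariance choice)
print(res.bse)           # standard errors under the chosen covariance estimator
print(res.tvalues)       # t-statistics based on those standard errors
print(res.pvalues)
print(res.cov_params())  # the full adjusted covariance matrix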

Special thanks to Charles Martineau for providing some of the examples.

In [1]:
import pandas as pd
import statsmodels.formula.api as sm
import statsmodels.stats.sandwich_covariance as sw
import numpy as np

In order to have a basis for comparison, we'll use Petersen's sample dataset and compare our results with those reported on his page.

In [2]:
# Python 2; in Python 3 use: from urllib.request import urlopen
from urllib import urlopen

filehandle = urlopen('http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.txt')
df = pd.read_table(filehandle, names=['firmid','year','x','y'],
                   delim_whitespace=True)
df
Out[2]:
firmid year x y
0 1 1 -1.113973 2.251535
1 1 2 -0.080854 1.242346
2 1 3 -0.237607 -1.426376
3 1 4 -0.152486 -1.109394
4 1 5 -0.001426 0.914686
5 1 6 -1.212737 -1.424686
6 1 7 -0.127273 0.758945
7 1 8 -1.433539 0.929652
8 1 9 -0.242196 1.056465
9 1 10 0.460922 3.308434
10 2 1 -0.550791 -2.545477
11 2 2 -1.287685 -3.021920
12 2 3 -0.220503 -1.003296
13 2 4 0.814318 -0.118388
14 2 5 -0.046372 -1.279670
15 2 6 0.622044 -3.539696
16 2 7 -0.653009 -2.235361
17 2 8 0.029411 -2.552972
18 2 9 -0.529747 -2.697836
19 2 10 1.062629 -0.263351
20 3 1 -1.068723 1.526512
21 3 2 -1.487991 -1.168439
22 3 3 -0.814688 1.043587
23 3 4 -0.233863 -0.119772
24 3 5 0.152062 0.808258
25 3 6 -0.719077 0.406351
26 3 7 -0.789831 0.176693
27 3 8 0.234270 -0.055562
28 3 9 -1.761113 2.518384
29 3 10 -0.327372 -0.783275
... ... ... ... ...
4970 498 1 0.937313 2.435911
4971 498 2 -0.433560 0.137374
4972 498 3 1.249027 -0.736823
4973 498 4 0.536620 0.590931
4974 498 5 2.481492 7.002803
4975 498 6 1.458367 2.051867
4976 498 7 1.267130 2.335333
4977 498 8 1.979471 4.102230
4978 498 9 1.466995 4.206719
4979 498 10 0.981465 1.769253
4980 499 1 0.515411 -1.192308
4981 499 2 -0.082360 0.492265
4982 499 3 0.450559 -1.388515
4983 499 4 0.347505 0.794183
4984 499 5 -0.760993 -1.084250
4985 499 6 -0.891608 -3.299844
4986 499 7 -0.603919 -2.204624
4987 499 8 0.717169 1.272819
4988 499 9 0.426611 -2.184453
4989 499 10 0.662455 -1.470384
4990 500 1 0.028352 1.502491
4991 500 2 -0.527109 -0.595065
4992 500 3 -0.418416 2.122376
4993 500 4 1.557888 0.603051
4994 500 5 -0.187499 -0.818244
4995 500 6 -0.077057 3.720502
4996 500 7 0.218847 0.559121
4997 500 8 -0.155530 -3.766785
4998 500 9 -0.040172 0.903354
4999 500 10 -0.001172 -0.529761

5000 rows × 4 columns
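
As a quick sanity check, the descriptive statistics of x and y can be compared with the summary statistics reported on Petersen's page:

# Descriptive statistics for comparison with Petersen's reported summary statistics.
df[['x', 'y']].describe()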

OLS and statsmodels

OLS Coefficients and Standard Errors

There are many, many ways to run a simple OLS regression in Python. This example uses the formula API from statsmodels, which lets you use R-style formulas. The use_t parameter tells statsmodels to use the $t$ distribution to compute the $p$-values. For more information on formula construction, see the patsy documentation.

In [3]:
ols = sm.ols(formula='y ~ x', data=df).fit(use_t=True)
ols.summary()
Out[3]:
OLS Regression Results
Dep. Variable: y R-squared: 0.208
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 1311.
Date: Tue, 06 Dec 2016 Prob (F-statistic): 4.25e-255
Time: 16:31:45 Log-Likelihood: -10573.
No. Observations: 5000 AIC: 2.115e+04
Df Residuals: 4998 BIC: 2.116e+04
Df Model: 1
Covariance Type: nonrobust
coef std err z P>|z| [95.0% Conf. Int.]
Intercept 0.0297 0.028 1.047 0.295 -0.026 0.085
x 1.0348 0.029 36.204 0.000 0.979 1.091
Omnibus: 4.912 Durbin-Watson: 1.096
Prob(Omnibus): 0.086 Jarque-Bera (JB): 4.862
Skew: 0.070 Prob(JB): 0.0880
Kurtosis: 3.063 Cond. No. 1.01

OLS Coefficients and White Standard Errors

Adding heteroscedasticity-consistent standard errors is not much harder. The cov_type parameter can take many values; for heteroscedasticity-consistent standard errors, the different implementations range from HC0 (the original White estimator) to HC3 (a quick comparison of the four variants is sketched after the output below).

In [4]:
robust_ols = sm.ols(formula='y ~ x', data=df).fit(cov_type='HC1', use_t=True)
robust_ols.summary()
Out[4]:
OLS Regression Results
Dep. Variable: y R-squared: 0.208
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 1328.
Date: Tue, 06 Dec 2016 Prob (F-statistic): 4.29e-258
Time: 16:31:45 Log-Likelihood: -10573.
No. Observations: 5000 AIC: 2.115e+04
Df Residuals: 4998 BIC: 2.116e+04
Df Model: 1
Covariance Type: HC1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 0.0297 0.028 1.047 0.295 -0.026 0.085
x 1.0348 0.028 36.444 0.000 0.979 1.091
Omnibus: 4.912 Durbin-Watson: 1.096
Prob(Omnibus): 0.086 Jarque-Bera (JB): 4.862
Skew: 0.070 Prob(JB): 0.0880
Kurtosis: 3.063 Cond. No. 1.01
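
As mentioned above, HC0 through HC3 are all accepted; a quick comparison of the four variants (not part of Petersen's original exercise) shows that only the standard errors change, not the coefficients:

# Compare the standard errors produced by the four heteroscedasticity-consistent variants.
hc_se = {}
for hc in ['HC0', 'HC1', 'HC2', 'HC3']:
    hc_fit = sm.ols(formula='y ~ x', data=df).fit(cov_type=hc, use_t=True)
    hc_se[hc] = hc_fit.bse
pd.DataFrame(hc_se)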

OLS Coefficients and Standard Errors Clustered by Firm or Year

Clustering can also be achieved by passing 'cluster' to the cov_type parameter. You also need to give an additional parameter, cov_kwds, which indicates which group to cluster on. It takes an array of labels, which can be a column of a pandas DataFrame as in this example.

In [5]:
cluster_firm_ols = sm.ols(formula='y ~ x', data=df).fit(cov_type='cluster',
                                                        cov_kwds={'groups': df['firmid']},
                                                        use_t=True)
cluster_firm_ols.summary()
Out[5]:
OLS Regression Results
Dep. Variable: y R-squared: 0.208
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 418.3
Date: Tue, 06 Dec 2016 Prob (F-statistic): 5.61e-68
Time: 16:31:45 Log-Likelihood: -10573.
No. Observations: 5000 AIC: 2.115e+04
Df Residuals: 4998 BIC: 2.116e+04
Df Model: 1
Covariance Type: cluster
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 0.0297 0.067 0.443 0.658 -0.102 0.161
x 1.0348 0.051 20.453 0.000 0.935 1.134
Omnibus: 4.912 Durbin-Watson: 1.096
Prob(Omnibus): 0.086 Jarque-Bera (JB): 4.862
Skew: 0.070 Prob(JB): 0.0880
Kurtosis: 3.063 Cond. No. 1.01
In [6]:
cluster_year_ols = sm.ols(formula='y ~ x', data=df).fit(cov_type='cluster',
                                                        cov_kwds={'groups': df['year']},
                                                        use_t=True)
cluster_year_ols.summary()
Out[6]:
OLS Regression Results
Dep. Variable: y R-squared: 0.208
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 960.6
Date: Tue, 06 Dec 2016 Prob (F-statistic): 1.86e-10
Time: 16:31:45 Log-Likelihood: -10573.
No. Observations: 5000 AIC: 2.115e+04
Df Residuals: 4998 BIC: 2.116e+04
Df Model: 1
Covariance Type: cluster
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 0.0297 0.023 1.269 0.236 -0.023 0.083
x 1.0348 0.033 30.993 0.000 0.959 1.110
Omnibus: 4.912 Durbin-Watson: 1.096
Prob(Omnibus): 0.086 Jarque-Bera (JB): 4.862
Skew: 0.070 Prob(JB): 0.0880
Kurtosis: 3.063 Cond. No. 1.01

OLS Coefficients and Standard Errors Clustered by Firm and Year

Clustering along two dimensions is as easy as clustering along one dimension: all you have to do is pass an array of two columns as the group. The only caveat is that the group cannot be a pandas DataFrame (though it can be a Series), so you need to wrap it in numpy.array().

In [7]:
cluster_2ways_ols = sm.ols(formula='y ~ x', data=df).fit(cov_type='cluster',
                                                         cov_kwds={'groups': np.array(df[['firmid', 'year']])},
                                                         use_t=True)
cluster_2ways_ols.summary()
Out[7]:
OLS Regression Results
Dep. Variable: y R-squared: 0.208
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 373.3
Date: Tue, 06 Dec 2016 Prob (F-statistic): 1.23e-08
Time: 16:31:45 Log-Likelihood: -10573.
No. Observations: 5000 AIC: 2.115e+04
Df Residuals: 4998 BIC: 2.116e+04
Df Model: 1
Covariance Type: cluster
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 0.0297 0.065 0.456 0.659 -0.118 0.177
x 1.0348 0.054 19.322 0.000 0.914 1.156
Omnibus: 4.912 Durbin-Watson: 1.096
Prob(Omnibus): 0.086 Jarque-Bera (JB): 4.862
Skew: 0.070 Prob(JB): 0.0880
Kurtosis: 3.063 Cond. No. 1.01

OLS Standard Errors Clustering with N dimensions

As far as I know, there is no built-in function for $N$-way clustering in Python when $N>2$. One workaround, suggested by my colleague Zhuo Zhong, is to use the rpy2 package to link Python with R and estimate the standard errors with the multiwayvcov library. The following example replicates the previous one with 2-way clustering, but it can easily be extended to more dimensions (just pass more clustering variables to the function).

You first need to have R installed. If you are using the Anaconda environment, see http://conda.pydata.org/docs/r-with-conda.html.

To install the required R libraries, you can use the following code:

from rpy2.robjects.packages import importr
utils = importr('utils')
utils.chooseCRANmirror(ind=12)
utils.install_packages('multiwayvcov')
utils.install_packages('lmtest')

Here is a sample function that lets you retrieve coefficient estimates, standard errors, $t$-values and $p$-values.

In [8]:
from rpy2.robjects import pandas2ri
import rpy2.robjects as ro

# Hide warnings (R is quite verbose). Comment out to keep the warnings
ro.r['options'](warn=-1)

pandas2ri.activate()
In [9]:
# model is a string (R-style regression model)
# clusters is a list of strings
# returns a pandas DataFrame
def multiway_cluster(df, model, clusters):
    rdf = pandas2ri.py2ri(df)
    ro.globalenv['rdf'] = rdf

    clusters_grp = ' + '.join(['rdf$' + x for x in clusters])
    reg_command = 'reg <- lm(' + model + ', data = rdf)\n'
    vcov_command = 'reg$vcov <- cluster.vcov(reg, ~ ' + clusters_grp + ')\n'
    libraries = '''
library(zoo)
library(multiwayvcov)
library(lmtest)
'''
    output = '''
result <- coeftest(reg, reg$vcov)
regnames <- attributes(result)$dimnames[[1]]
colnames <- attributes(result)$dimnames[[2]]
'''

    command = libraries + reg_command + vcov_command + output
    ro.r(command)

    res = pd.DataFrame(ro.r('result'))
    res.columns = ro.r('colnames')
    res.index = ro.r('regnames')
    
    return res
In [10]:
multiway_cluster(df, 'y ~ x', clusters=['firmid', 'year'])

# Note: R is quite verbose with the warnings, but most of them aren't really warnings.
//anaconda/lib/python2.7/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: 
Attaching package: ‘zoo’


  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python2.7/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric


  warnings.warn(x, RRuntimeWarning)
Out[10]:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.065066 0.456145 6.483054e-01
x 1.034833 0.053561 19.320640 2.860052e-80

OLS Coefficients and Standard Errors with Firm and/or Year Fixed Effects

One way to add fixed effects is to add the corresponding dummy variables. Thankfully, this is quite easy to do within a formula by using C(var), where var is the categorical (group) variable.

In [11]:
firm_fe_ols = sm.ols(formula='y ~ x + C(firmid)', data=df).fit(use_t=True)
#firm_fe_ols.summary()  
# The summary is omitted because the large number
# of dummy variables makes it unpleasant to look at.
In [12]:
year_fe_ols = sm.ols(formula='y ~ x + C(year)', data=df).fit(use_t=True)
year_fe_ols.summary()
Out[12]:
OLS Regression Results
Dep. Variable: y R-squared: 0.209
Model: OLS Adj. R-squared: 0.207
Method: Least Squares F-statistic: 131.6
Date: Tue, 06 Dec 2016 Prob (F-statistic): 7.13e-245
Time: 16:31:47 Log-Likelihood: -10570.
No. Observations: 5000 AIC: 2.116e+04
Df Residuals: 4989 BIC: 2.123e+04
Df Model: 10
Covariance Type: nonrobust
coef std err z P>|z| [95.0% Conf. Int.]
Intercept 0.1411 0.090 1.573 0.116 -0.035 0.317
C(year)[T.2] -0.0119 0.127 -0.094 0.925 -0.261 0.237
C(year)[T.3] -0.1453 0.127 -1.145 0.252 -0.394 0.103
C(year)[T.4] -0.2038 0.127 -1.607 0.108 -0.452 0.045
C(year)[T.5] -0.0604 0.127 -0.476 0.634 -0.309 0.188
C(year)[T.6] -0.1312 0.127 -1.034 0.301 -0.380 0.117
C(year)[T.7] -0.1975 0.127 -1.557 0.120 -0.446 0.051
C(year)[T.8] -0.1555 0.127 -1.225 0.220 -0.404 0.093
C(year)[T.9] -0.1535 0.127 -1.210 0.226 -0.402 0.095
C(year)[T.10] -0.0556 0.127 -0.438 0.661 -0.304 0.193
x 1.0351 0.029 36.160 0.000 0.979 1.091
Omnibus: 4.804 Durbin-Watson: 1.096
Prob(Omnibus): 0.091 Jarque-Bera (JB): 4.752
Skew: 0.069 Prob(JB): 0.0929
Kurtosis: 3.061 Cond. No. 10.9
In [13]:
firm_year_fe_ols = sm.ols(formula='y ~ x + C(firmid) + C(year)', data=df).fit(use_t=True)
#firm_year_fe_ols.summary() 
# The summary is omitted because the large number
# of dummy variables makes it unpleasant to look at.
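
Since the full summaries are unwieldy, one can instead pull out only the coefficient of interest from each fixed-effects model; a minimal sketch:

# Collect the estimate on x from each fixed-effects specification.
rows = {}
for name, model in [('firm FE', firm_fe_ols),
                    ('year FE', year_fe_ols),
                    ('firm and year FE', firm_year_fe_ols)]:
    rows[name] = {'coef': model.params['x'],
                  'std err': model.bse['x'],
                  't': model.tvalues['x'],
                  'P>|t|': model.pvalues['x']}
pd.DataFrame(rows).T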

Fixed Effects and Clustered Standard Errors

By combining the previous examples, you can have fixed effects and clustered standard errors at the same time.

In [14]:
firm_cluster_year_fe_ols = sm.ols(formula='y ~ x + C(year)', data=df).fit(cov_type='cluster',
                                                                          cov_kwds={'groups': df['firmid']},
                                                                          use_t=True)
firm_cluster_year_fe_ols.summary()
Out[14]:
OLS Regression Results
Dep. Variable: y R-squared: 0.209
Model: OLS Adj. R-squared: 0.207
Method: Least Squares F-statistic: 43.23
Date: Tue, 06 Dec 2016 Prob (F-statistic): 1.93e-61
Time: 16:31:48 Log-Likelihood: -10570.
No. Observations: 5000 AIC: 2.116e+04
Df Residuals: 4989 BIC: 2.123e+04
Df Model: 10
Covariance Type: cluster
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 0.1411 0.089 1.587 0.113 -0.034 0.316
C(year)[T.2] -0.0119 0.086 -0.139 0.890 -0.181 0.157
C(year)[T.3] -0.1453 0.086 -1.690 0.092 -0.314 0.024
C(year)[T.4] -0.2038 0.089 -2.288 0.023 -0.379 -0.029
C(year)[T.5] -0.0604 0.087 -0.697 0.486 -0.231 0.110
C(year)[T.6] -0.1312 0.084 -1.562 0.119 -0.296 0.034
C(year)[T.7] -0.1975 0.087 -2.275 0.023 -0.368 -0.027
C(year)[T.8] -0.1555 0.094 -1.662 0.097 -0.339 0.028
C(year)[T.9] -0.1535 0.088 -1.752 0.080 -0.326 0.019
C(year)[T.10] -0.0556 0.088 -0.634 0.526 -0.228 0.117
x 1.0351 0.051 20.361 0.000 0.935 1.135
Omnibus: 4.804 Durbin-Watson: 1.096
Prob(Omnibus): 0.091 Jarque-Bera (JB): 4.752
Skew: 0.069 Prob(JB): 0.0929
Kurtosis: 3.061 Cond. No. 10.9

Fama-MacBeth Coefficients and Standard Errors

There is an undocumented Fama-MacBeth function in pandas. However, it is slated for deprecation, and in my (limited) experience I was unable to replicate the standard errors obtained using other packages. Therefore, the most sensible option for me was to write my own function.

In [15]:
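# formula is an R-style regression formula, time_label is the column defining
# the cross-sections, and lags is the number of Newey-West lags (0 disables the correction).
# First stage: one cross-sectional regression per period; second stage: regress each
# coefficient series on a constant (its time-series mean), with optional HAC standard errors.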
def fama_macbeth(formula, time_label, df, lags=3):
    res = df.groupby(time_label).apply(lambda x: sm.ols(formula=formula,
                                                     data=x).fit())

    l = [x.params for x in res]
    p = pd.DataFrame(l)

    means = {}
    params_labels = res.iloc[0].params.index

    # The ':' character used by patsy doesn't play well with pandas column names.
    p.columns = [x.replace(':', '_INTER_') for x in p.columns]

    for x in p.columns:
        if lags == 0:
            means[x.replace('_INTER_',':')] = sm.ols(formula=x + ' ~ 1',
                                                     data=p[[x]]).fit(use_t=True)
        else:
            means[x.replace('_INTER_',':')] = sm.ols(formula=x + ' ~ 1',
                                                     data=p[[x]]).fit(cov_type='HAC',
                                                                      cov_kwds={'maxlags': lags},
                                                                      use_t=True)

    params = []
    stderrs = []
    tvalues = []
    pvalues = []
    for x in params_labels:
        params.append(means[x].params['Intercept'])
        stderrs.append(means[x].bse['Intercept'])
        tvalues.append(means[x].tvalues['Intercept'])
        pvalues.append(means[x].pvalues['Intercept'])

    result = pd.DataFrame([params, stderrs, tvalues, pvalues]).T
    result.index = params_labels
    result.columns = ['coef', 'stderr', 'tvalue', 'pvalue']
    result['stars'] = ''
    result.loc[result.pvalue < 0.1, 'stars'] = '*'
    result.loc[result.pvalue < 0.05, 'stars'] = '**'
    result.loc[result.pvalue < 0.01, 'stars'] = '***'

    return result

The default behaviour of this function is to apply a Newey-West correction with 3 lags. The statsmodels implementation of the Newey-West estimator should give the same results as the SAS function written by Noah Stoffman.

In [16]:
fama_macbeth('y ~ x', 'year', df)
Out[16]:
coef stderr tvalue pvalue stars
Intercept 0.031278 0.021295 1.468810 1.759490e-01
x 1.035586 0.025883 40.010988 1.893637e-11 ***

It is possible to forgo the Newey-West correction by passing lags=0.

In [17]:
fama_macbeth('y ~ x', 'year', df, lags=0)
Out[17]:
coef stderr tvalue pvalue stars
Intercept 0.031278 0.023356 1.339155 1.805202e-01
x 1.035586 0.033342 31.059889 8.389156e-212 ***

Newey-West Adjustment for Standard Errors

The Newey-West adjustment for standard errors is built into statsmodels: pass cov_type='HAC' and the maxlags argument through the cov_kwds parameter. For much more detail, see http://stackoverflow.com/questions/23420454/newey-west-standard-errors-for-ols-in-python.

In [18]:
# Note: this adjustment doesn't really make sense for our sample dataset, it's just an illustration.
nw_ols = sm.ols(formula='y ~ x', data=df).fit(cov_type='HAC',
                                              cov_kwds={'maxlags': 3},
                                              use_t=True)
nw_ols.summary()
Out[18]:
OLS Regression Results
Dep. Variable: y R-squared: 0.208
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 831.8
Date: Tue, 06 Dec 2016 Prob (F-statistic): 2.48e-169
Time: 16:31:49 Log-Likelihood: -10573.
No. Observations: 5000 AIC: 2.115e+04
Df Residuals: 4998 BIC: 2.116e+04
Df Model: 1
Covariance Type: HAC
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 0.0297 0.043 0.695 0.487 -0.054 0.113
x 1.0348 0.036 28.841 0.000 0.964 1.105
Omnibus: 4.912 Durbin-Watson: 1.096
Prob(Omnibus): 0.086 Jarque-Bera (JB): 4.862
Skew: 0.070 Prob(JB): 0.0880
Kurtosis: 3.063 Cond. No. 1.01

Driscoll-Kraay Standard Errors

Driscoll-Kraay standard errors, which are robust to both cross-sectional and temporal dependence, are available in statsmodels through cov_type='nw-groupsum'; the time and group identifiers and maxlags are passed through cov_kwds.

In [19]:
dk_ols = sm.ols(formula='y ~ x', data=df).fit(cov_type='nw-groupsum',
                                              cov_kwds={'time': df.year,
                                                        'groups': df.firmid,
                                                        'maxlags': 5},
                                              use_t=True)
dk_ols.summary()
Out[19]:
OLS Regression Results
Dep. Variable: y R-squared: 0.208
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 2633.
Date: Tue, 06 Dec 2016 Prob (F-statistic): 3.50e-201
Time: 16:31:49 Log-Likelihood: -10573.
No. Observations: 5000 AIC: 2.115e+04
Df Residuals: 4998 BIC: 2.116e+04
Df Model: 1
Covariance Type: nw-groupsum
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 0.0297 0.023 1.301 0.194 -0.015 0.075
x 1.0348 0.020 51.317 0.000 0.995 1.074
Omnibus: 4.912 Durbin-Watson: 1.096
Prob(Omnibus): 0.086 Jarque-Bera (JB): 4.862
Skew: 0.070 Prob(JB): 0.0880
Kurtosis: 3.063 Cond. No. 1.01
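
The statsmodels.stats.sandwich_covariance module imported as sw above can also be used to compute these covariance matrices directly from a plain OLS fit; a minimal sketch (the small-sample corrections may differ slightly from the fit() defaults, so the numbers should be close but not necessarily identical to those reported above):

# Compute sandwich covariance matrices directly and turn the diagonals into standard errors.
plain_ols = sm.ols(formula='y ~ x', data=df).fit()

cov_firm = sw.cov_cluster(plain_ols, df['firmid'].values)  # clustered by firm
cov_nw = sw.cov_hac(plain_ols, nlags=3)                    # Newey-West with 3 lags

print(np.sqrt(np.diag(cov_firm)))  # compare with the firm-clustered std err above
print(np.sqrt(np.diag(cov_nw)))    # compare with the HAC std err above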
In [ ]: