
Estimating standard errors in Python

Prepared by Vincent Grégoire, Department of Finance, HEC Montréal.

You can download the latest version as a Jupyter notebook at https://github.com/vgreg/python-se. Last update: December 3, 2019.

Contact: vincent.3.gregoire@hec.ca

A common problem faced by financial economists when transitioning to Python is the apparent inability to easily do things for which Stata, SAS, or R have built-in functions. This page aims to alleviate this pain by providing code samples replicating most of the methods discussed on Mitchell A. Petersen's standard errors programming advice page, and more.

This page is a perpetual work in progress, and there are probably better/cleaner/faster/more efficient ways to do some of these things; please email me if you have any comments or suggestions. I tested this notebook with Python 3.6, but these examples should also work with Python 2.7 without too much work.

The four main modules that serve as the basis for data analysis in Python are pandas, statsmodels, numpy and scipy. For linear models with fixed effects and clustered standard errors, linearmodels is also very useful. Note that linearmodels isn't pre-installed with Anaconda and needs to be installed separately with pip.
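
If needed, you can install it from the command line with:

pip install linearmodels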

Most of these examples rely on the covariance adjustments of the fit() function for OLS regressions in statsmodels. Unfortunately, the documentation is quite sparse, but some information can be gathered from the documentation for get_robustcov_results().

Special thanks to Charles Martineau for providing some of the examples.

In [1]:
import pandas as pd
import statsmodels.formula.api as sm
import statsmodels.stats.sandwich_covariance as sw
import numpy as np
import statsmodels as statsmodels
from linearmodels import PanelOLS, FamaMacBeth
In [2]:
# Print the versions used
print('pandas: ' + str(pd.__version__))
print('numpy: ' + str(np.__version__))
print('statsmodels: ' + str(statsmodels.__version__))
pandas: 0.25.1
numpy: 1.17.2
statsmodels: 0.10.1

In order to have a basis for comparison, we'll use Petersen's sample dataset and compare our results with those reported on his page.

In [3]:
df = pd.read_table('http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.txt',
                   names=['firmid','year','x','y'],
                   delim_whitespace=True)

OLS and statsmodels

OLS Coefficients and Standard Errors

There are many, many ways to run a simple OLS regression in Python. This example uses the formula API from statsmodels, which lets you use R-style formulas. The use_t parameter tells statsmodels to use $t$-statistics to compute the $p$-values. For more information on formula construction, see the patsy documentation.

In [4]:
ols = sm.ols(formula='y ~ x', data=df).fit(use_t=True)
ols.summary()
Out[4]:
OLS Regression Results
Dep. Variable: y R-squared: 0.208
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 1311.
Date: Tue, 03 Dec 2019 Prob (F-statistic): 4.25e-255
Time: 14:19:19 Log-Likelihood: -10573.
No. Observations: 5000 AIC: 2.115e+04
Df Residuals: 4998 BIC: 2.116e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.0297 0.028 1.047 0.295 -0.026 0.085
x 1.0348 0.029 36.204 0.000 0.979 1.091
Omnibus: 4.912 Durbin-Watson: 1.096
Prob(Omnibus): 0.086 Jarque-Bera (JB): 4.862
Skew: 0.070 Prob(JB): 0.0880
Kurtosis: 3.063 Cond. No. 1.01


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

OLS Coefficients and White Standard Errors

Adding heteroscedasticity-consistent standard errors is not much harder. The cov_type parameter can take many values; for heteroscedasticity-consistent standard errors, the available implementations range from HC0 (the original White estimator) to HC3.
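
For example, to quickly compare the standard error on x across the four estimators (a minimal sketch; res is a name I made up):

# Refit the model under each heteroscedasticity-consistent estimator
# and print the resulting standard error on x
for cov in ['HC0', 'HC1', 'HC2', 'HC3']:
    res = sm.ols(formula='y ~ x', data=df).fit(cov_type=cov, use_t=True)
    print(cov, res.bse['x'])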

In [5]:
robust_ols = sm.ols(formula='y ~ x', data=df).fit(cov_type='HC1', use_t=True)
robust_ols.summary()
Out[5]:
OLS Regression Results
Dep. Variable: y R-squared: 0.208
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 1328.
Date: Tue, 03 Dec 2019 Prob (F-statistic): 4.29e-258
Time: 14:19:19 Log-Likelihood: -10573.
No. Observations: 5000 AIC: 2.115e+04
Df Residuals: 4998 BIC: 2.116e+04
Df Model: 1
Covariance Type: HC1
coef std err t P>|t| [0.025 0.975]
Intercept 0.0297 0.028 1.047 0.295 -0.026 0.085
x 1.0348 0.028 36.444 0.000 0.979 1.091
Omnibus: 4.912 Durbin-Watson: 1.096
Prob(Omnibus): 0.086 Jarque-Bera (JB): 4.862
Skew: 0.070 Prob(JB): 0.0880
Kurtosis: 3.063 Cond. No. 1.01


Warnings:
[1] Standard Errors are heteroscedasticity robust (HC1)

OLS Coefficients and Standard Errors Clustered by Firm or Year

Clustering can also be achieved by passing cluster to the cov_type parameter. You also need to supply an additional parameter, cov_kwds, which indicates which group to cluster on. The parameter takes an array of labels, which can be a column of a pandas DataFrame as in this example.

Warning: patsy automatically drops missing values from the regression, but this does not affect the clustering groups, which may result in misaligned rows. The easy solution is to manually remove missing values from the dataset using .dropna(), as in the sketch below.
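
For example, assuming the regression uses y and x and clusters on firmid (df_clean is a name I made up; adjust the column list to your own regression):

# Keep only rows that are complete for the regression and clustering variables
df_clean = df.dropna(subset=['y', 'x', 'firmid'])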

In [6]:
cluster_firm_ols = sm.ols(formula='y ~ x', data=df).fit(cov_type='cluster',
                                                        cov_kwds={'groups': df['firmid']},
                                                        use_t=True)
cluster_firm_ols.summary()
Out[6]:
OLS Regression Results
Dep. Variable: y R-squared: 0.208
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 418.3
Date: Tue, 03 Dec 2019 Prob (F-statistic): 5.61e-68
Time: 14:19:19 Log-Likelihood: -10573.
No. Observations: 5000 AIC: 2.115e+04
Df Residuals: 4998 BIC: 2.116e+04
Df Model: 1
Covariance Type: cluster
coef std err t P>|t| [0.025 0.975]
Intercept 0.0297 0.067 0.443 0.658 -0.102 0.161
x 1.0348 0.051 20.453 0.000 0.935 1.134
Omnibus: 4.912 Durbin-Watson: 1.096
Prob(Omnibus): 0.086 Jarque-Bera (JB): 4.862
Skew: 0.070 Prob(JB): 0.0880
Kurtosis: 3.063 Cond. No. 1.01


Warnings:
[1] Standard Errors are robust to cluster correlation (cluster)
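
If you only need the covariance matrix rather than a refitted model, the sandwich estimators are also exposed directly in statsmodels (this is what the sw module imported above provides). A minimal sketch using the nonrobust ols fit from earlier:

# Firm-clustered covariance matrix of the plain OLS fit,
# and the implied standard errors
cov_firm = sw.cov_cluster(ols, df['firmid'].values)
print(np.sqrt(np.diag(cov_firm)))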
In [7]:
cluster_year_ols = sm.ols(formula='y ~ x', data=df).fit(cov_type='cluster',
                                                        cov_kwds={'groups': df['year']},
                                                        use_t=True)
cluster_year_ols.summary()
Out[7]:
OLS Regression Results
Dep. Variable: y R-squared: 0.208
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 960.6
Date: Tue, 03 Dec 2019 Prob (F-statistic): 1.86e-10
Time: 14:19:19 Log-Likelihood: -10573.
No. Observations: 5000 AIC: 2.115e+04
Df Residuals: 4998 BIC: 2.116e+04
Df Model: 1
Covariance Type: cluster
coef std err t P>|t| [0.025 0.975]
Intercept 0.0297 0.023 1.269 0.236 -0.023 0.083
x 1.0348 0.033 30.993 0.000 0.959 1.110
Omnibus: 4.912 Durbin-Watson: 1.096
Prob(Omnibus): 0.086 Jarque-Bera (JB): 4.862
Skew: 0.070 Prob(JB): 0.0880
Kurtosis: 3.063 Cond. No. 1.01


Warnings:
[1] Standard Errors are robust to cluster correlation (cluster)

OLS Coefficients and Standard Errors Clustered by Firm and Year

Clustering along two dimensions is as easy as clustering along one: all you have to do is pass an array of two columns as the group. The only caveat is that the group cannot be a pandas DataFrame (a single group can be a Series); you need to wrap it in numpy.array().

In [8]:
cluster_2ways_ols = sm.ols(formula='y ~ x', data=df).fit(cov_type='cluster',
                                                         cov_kwds={'groups': np.array(df[['firmid', 'year']])},
                                                         use_t=True)
cluster_2ways_ols.summary()
Out[8]:
OLS Regression Results
Dep. Variable: y R-squared: 0.208
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 373.3
Date: Tue, 03 Dec 2019 Prob (F-statistic): 1.23e-08
Time: 14:19:19 Log-Likelihood: -10573.
No. Observations: 5000 AIC: 2.115e+04
Df Residuals: 4998 BIC: 2.116e+04
Df Model: 1
Covariance Type: cluster
coef std err t P>|t| [0.025 0.975]
Intercept 0.0297 0.065 0.456 0.659 -0.118 0.177
x 1.0348 0.054 19.322 0.000 0.914 1.156
Omnibus: 4.912 Durbin-Watson: 1.096
Prob(Omnibus): 0.086 Jarque-Bera (JB): 4.862
Skew: 0.070 Prob(JB): 0.0880
Kurtosis: 3.063 Cond. No. 1.01


Warnings:
[1] Standard Errors are robust to cluster correlation (cluster)

OLS Standard Errors Clustering with N dimensions

As far as I know, there is no built-in function for $N$-way clustering in Python when $N>2$. One workaround suggested by my former colleague Zhuo Zhong is to use the rpy2 package to link Python with R and estimate the standard errors with the multiwayvcov library. The following example replicates the previous one with 2-way clustering, but can easily be extended to more than 2 groups (just pass more than two groups to the function).

You first need to have R installed. If you are using the Anaconda environment, see http://conda.pydata.org/docs/r-with-conda.html.

To install the required R libraries, you can use the following code, which you only need to run once:

from rpy2.robjects.packages import importr
utils = importr('utils')
utils.chooseCRANmirror(ind=12)
utils.install_packages('multiwayvcov')
utils.install_packages('lmtest')

Here is a sample function that lets you retrieve coefficient estimates, standard errors, $t$-values and $p$-values.

Note: An update to pandas or rpy2 broke the code. I left it here for reference and will fix it eventually.

In [9]:
from rpy2.robjects import pandas2ri
import rpy2.robjects as ro

# Hide warnings (R is quite verbose). Comment out to keep the warnings
ro.r['options'](warn=-1)

pandas2ri.activate()
In [10]:
# model is a string (R-style regression model)
# clusters is a list of strings
# returns a pandas DataFrame
def multiway_cluster(df, model, clusters):
    rdf = pandas2ri.py2ri(df)
    ro.globalenv['rdf'] = rdf

    clusters_grp = ' + '.join(['rdf$' + x for x in clusters])
    reg_command = 'reg <- lm(' + model + ', data = rdf)\n'
    vcov_command = 'reg$vcov <- cluster.vcov(reg, ~ ' + clusters_grp + ')\n'
    libraries = '''
library(zoo)
library(multiwayvcov)
library(lmtest)
'''
    output = '''
result <- coeftest(reg, reg$vcov)
regnames <- attributes(result)$dimnames[[1]]
colnames <- attributes(result)$dimnames[[2]]
'''

    command = libraries + reg_command + vcov_command + output
    ro.r(command)

    res = pd.DataFrame(ro.r('result'))
    res.columns = ro.r('colnames')
    res.index = ro.r('regnames')
    
    return res
In [11]:
# An update to pandas or rpy2 broke the code. I left it here for reference and will fix it eventually.

#multiway_cluster(df, 'y ~ x', clusters=['firmid', 'year'])

OLS Coefficients and Standard Errors with Firm and/or Year Fixed Effects

With linearmodels

The package linearmodels has built-in support for panel OLS, including up to two-way fixed effects and clustering (see the documentation for fixed effects that are not on the entity or the date).

In [12]:
# linearmodels needs the index to be entity/date.
df2 = df.set_index(['firmid', 'year'])

firm_fe_panel = PanelOLS.from_formula('y ~ x + EntityEffects', data=df2).fit()
firm_fe_panel.summary
Out[12]:
PanelOLS Estimation Summary
Dep. Variable: y R-squared: 0.1916
Estimator: PanelOLS R-squared (Between): 0.2187
No. Observations: 5000 R-squared (Within): 0.1916
Date: Tue, Dec 03 2019 R-squared (Overall): 0.2070
Time: 14:19:19 Log-likelihood -8532.8
Cov. Estimator: Unadjusted
F-statistic: 1066.3
Entities: 500 P-value 0.0000
Avg Obs: 10.0000 Distribution: F(1,4499)
Min Obs: 10.0000
Max Obs: 10.0000 F-statistic (robust): 1066.3
P-value 0.0000
Time periods: 10 Distribution: F(1,4499)
Avg Obs: 500.00
Min Obs: 500.00
Max Obs: 500.00
Parameter Estimates
Parameter Std. Err. T-stat P-value Lower CI Upper CI
x 0.9699 0.0297 32.654 0.0000 0.9116 1.0281


F-test for Poolability: 11.372
P-value: 0.0000
Distribution: F(499,4499)

Included effects: Entity
In [13]:
year_fe_panel = PanelOLS.from_formula('y ~ x + TimeEffects', data=df2).fit()
year_fe_panel.summary
Out[13]:
PanelOLS Estimation Summary
Dep. Variable: y R-squared: 0.2077
Estimator: PanelOLS R-squared (Between): 0.2208
No. Observations: 5000 R-squared (Within): 0.1907
Date: Tue, Dec 03 2019 R-squared (Overall): 0.2078
Time: 14:19:19 Log-likelihood -1.057e+04
Cov. Estimator: Unadjusted
F-statistic: 1307.5
Entities: 500 P-value 0.0000
Avg Obs: 10.0000 Distribution: F(1,4989)
Min Obs: 10.0000
Max Obs: 10.0000 F-statistic (robust): 1307.5
P-value 0.0000
Time periods: 10 Distribution: F(1,4989)
Avg Obs: 500.00
Min Obs: 500.00
Max Obs: 500.00
Parameter Estimates
Parameter Std. Err. T-stat P-value Lower CI Upper CI
x 1.0351 0.0286 36.160 0.0000 0.9789 1.0912


F-test for Poolability: 0.6799
P-value: 0.7279
Distribution: F(9,4989)

Included effects: Time
In [14]:
firm_year_fe_panel = PanelOLS.from_formula('y ~ x + EntityEffects + TimeEffects', data=df2).fit()
firm_year_fe_panel.summary
Out[14]:
PanelOLS Estimation Summary
Dep. Variable: y R-squared: 0.1913
Estimator: PanelOLS R-squared (Between): 0.2187
No. Observations: 5000 R-squared (Within): 0.1916
Date: Tue, Dec 03 2019 R-squared (Overall): 0.2070
Time: 14:19:19 Log-likelihood -8525.9
Cov. Estimator: Unadjusted
F-statistic: 1062.0
Entities: 500 P-value 0.0000
Avg Obs: 10.0000 Distribution: F(1,4490)
Min Obs: 10.0000
Max Obs: 10.0000 F-statistic (robust): 1062.0
P-value 0.0000
Time periods: 10 Distribution: F(1,4490)
Avg Obs: 500.00
Min Obs: 500.00
Max Obs: 500.00
Parameter Estimates
Parameter Std. Err. T-stat P-value Lower CI Upper CI
x 0.9700 0.0298 32.589 0.0000 0.9117 1.0284


F-test for Poolability: 11.203
P-value: 0.0000
Distribution: F(508,4490)

Included effects: Entity, Time

With statsmodels

As far as I know, statsmodels doesn't support fixed effects directly. One way to add fixed effects is to add the corresponding dummy variables, which is quite easy to do within a formula by using C(var), where var is the categorical variable. Note that this works with a few entities/dates, but it becomes numerically unstable with a large number of entities/dates (a demeaning alternative for large panels is sketched after the examples below).

In [15]:
firm_fe_ols = sm.ols(formula='y ~ x + C(firmid)', data=df).fit(use_t=True)
#firm_fe_ols.summary()  
# The summary is omitted because the large number
# of dummy variables makes it unpleasant to look at.
In [16]:
year_fe_ols = sm.ols(formula='y ~ x + C(year)', data=df).fit(use_t=True)
year_fe_ols.summary()
Out[16]:
OLS Regression Results
Dep. Variable: y R-squared: 0.209
Model: OLS Adj. R-squared: 0.207
Method: Least Squares F-statistic: 131.6
Date: Tue, 03 Dec 2019 Prob (F-statistic): 7.13e-245
Time: 14:19:20 Log-Likelihood: -10570.
No. Observations: 5000 AIC: 2.116e+04
Df Residuals: 4989 BIC: 2.123e+04
Df Model: 10
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.1411 0.090 1.573 0.116 -0.035 0.317
C(year)[T.2] -0.0119 0.127 -0.094 0.925 -0.261 0.237
C(year)[T.3] -0.1453 0.127 -1.145 0.252 -0.394 0.103
C(year)[T.4] -0.2038 0.127 -1.607 0.108 -0.453 0.045
C(year)[T.5] -0.0604 0.127 -0.476 0.634 -0.309 0.188
C(year)[T.6] -0.1312 0.127 -1.034 0.301 -0.380 0.118
C(year)[T.7] -0.1975 0.127 -1.557 0.120 -0.446 0.051
C(year)[T.8] -0.1555 0.127 -1.225 0.220 -0.404 0.093
C(year)[T.9] -0.1535 0.127 -1.210 0.226 -0.402 0.095
C(year)[T.10] -0.0556 0.127 -0.438 0.661 -0.304 0.193
x 1.0351 0.029 36.160 0.000 0.979 1.091
Omnibus: 4.804 Durbin-Watson: 1.096
Prob(Omnibus): 0.091 Jarque-Bera (JB): 4.752
Skew: 0.069 Prob(JB): 0.0929
Kurtosis: 3.061 Cond. No. 10.9


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [17]:
firm_year_fe_ols = sm.ols(formula='y ~ x + C(firmid) + C(year)', data=df).fit(use_t=True)
#firm_year_fe_ols.summary() 
# The summary is omitted because the large number
# of dummy variables makes it unpleasant to look at.
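
If you have too many entities for dummy variables, one alternative is to absorb the fixed effects manually with a within transformation (demeaning by firm). This is only a sketch (the _dm column names are mine): the coefficient on x matches the dummy-variable approach, but the default standard errors do not account for the degrees of freedom absorbed by the fixed effects, so prefer PanelOLS from linearmodels when the standard errors matter.

# Demean y and x within each firm, then run plain OLS on the demeaned data
df_within = df.copy()
df_within['y_dm'] = df['y'] - df.groupby('firmid')['y'].transform('mean')
df_within['x_dm'] = df['x'] - df.groupby('firmid')['x'].transform('mean')
within_ols = sm.ols(formula='y_dm ~ x_dm', data=df_within).fit(use_t=True)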

Fixed Effects and Clustered Standard Errors

By combining the previous examples, you can have fixed effects and clustered standard errors at the same time.

With linearmodels

In [18]:
firm_year_fe_panel = PanelOLS.from_formula('y ~ x + TimeEffects',
                                           data=df2).fit(cov_type='clustered', cluster_entity=True, cluster_time=True)
firm_year_fe_panel.summary
Out[18]:
PanelOLS Estimation Summary
Dep. Variable: y R-squared: 0.2077
Estimator: PanelOLS R-squared (Between): 0.2208
No. Observations: 5000 R-squared (Within): 0.1907
Date: Tue, Dec 03 2019 R-squared (Overall): 0.2078
Time: 14:19:20 Log-likelihood -1.057e+04
Cov. Estimator: Clustered
F-statistic: 1307.5
Entities: 500 P-value 0.0000
Avg Obs: 10.0000 Distribution: F(1,4989)
Min Obs: 10.0000
Max Obs: 10.0000 F-statistic (robust): 387.33
P-value 0.0000
Time periods: 10 Distribution: F(1,4989)
Avg Obs: 500.00
Min Obs: 500.00
Max Obs: 500.00
Parameter Estimates
Parameter Std. Err. T-stat P-value Lower CI Upper CI
x 1.0351 0.0526 19.681 0.0000 0.9320 1.1382


F-test for Poolability: 0.6799
P-value: 0.7279
Distribution: F(9,4989)

Included effects: Time

With statsmodels

In [19]:
firm_cluster_year_fe_ols = sm.ols(formula='y ~ x + C(year)', data=df).fit(cov_type='cluster',
                                                                          cov_kwds={'groups': df['firmid']},
                                                                          use_t=True)
firm_cluster_year_fe_ols.summary()
Out[19]:
OLS Regression Results
Dep. Variable: y R-squared: 0.209
Model: OLS Adj. R-squared: 0.207
Method: Least Squares F-statistic: 43.23
Date: Tue, 03 Dec 2019 Prob (F-statistic): 1.93e-61
Time: 14:19:20 Log-Likelihood: -10570.
No. Observations: 5000 AIC: 2.116e+04
Df Residuals: 4989 BIC: 2.123e+04
Df Model: 10
Covariance Type: cluster
coef std err t P>|t| [0.025 0.975]
Intercept 0.1411 0.089 1.587 0.113 -0.034 0.316
C(year)[T.2] -0.0119 0.086 -0.139 0.890 -0.181 0.157
C(year)[T.3] -0.1453 0.086 -1.690 0.092 -0.314 0.024
C(year)[T.4] -0.2038 0.089 -2.288 0.023 -0.379 -0.029
C(year)[T.5] -0.0604 0.087 -0.697 0.486 -0.231 0.110
C(year)[T.6] -0.1312 0.084 -1.562 0.119 -0.296 0.034
C(year)[T.7] -0.1975 0.087 -2.275 0.023 -0.368 -0.027
C(year)[T.8] -0.1555 0.094 -1.662 0.097 -0.339 0.028
C(year)[T.9] -0.1535 0.088 -1.752 0.080 -0.326 0.019
C(year)[T.10] -0.0556 0.088 -0.634 0.526 -0.228 0.117
x 1.0351 0.051 20.361 0.000 0.935 1.135
Omnibus: 4.804 Durbin-Watson: 1.096
Prob(Omnibus): 0.091 Jarque-Bera (JB): 4.752
Skew: 0.069 Prob(JB): 0.0929
Kurtosis: 3.061 Cond. No. 10.9


Warnings:
[1] Standard Errors are robust to cluster correlation (cluster)

Fama-MacBeth Coefficients and Standard Errors

There is an undocumented Fama-MacBeth function in pandas, but it is slated for deprecation, and in my (limited) experience I was unable to replicate the standard errors obtained with other packages. Fortunately, linearmodels provides a nice implementation.

By default, this function applies no adjustment to the standard errors.

In [20]:
FamaMacBeth.from_formula('y ~ 1+x', data=df2).fit()
Out[20]:
FamaMacBeth Estimation Summary
Dep. Variable: y R-squared: 0.2078
Estimator: FamaMacBeth R-squared (Between): 0.2208
No. Observations: 5000 R-squared (Within): 0.1907
Date: Tue, Dec 03 2019 R-squared (Overall): 0.2078
Time: 14:19:20 Log-likelihood -1.057e+04
Cov. Estimator: Fama-MacBeth Std Cov
F-statistic: 1310.7
Entities: 500 P-value 0.0000
Avg Obs: 10.0000 Distribution: F(1,4998)
Min Obs: 10.0000
Max Obs: 10.0000 F-statistic (robust): 964.72
P-value 0.0000
Time periods: 10 Distribution: F(1,4998)
Avg Obs: 500.00
Min Obs: 500.00
Max Obs: 500.00
Parameter Estimates
Parameter Std. Err. T-stat P-value Lower CI Upper CI
Intercept 0.0313 0.0234 1.3392 0.1806 -0.0145 0.0771
x 1.0356 0.0333 31.060 0.0000 0.9702 1.1010

id: 0x1a25972750

It is very easy to apply a Newey-West correction with 3 lags.

In [21]:
FamaMacBeth.from_formula('y ~ 1 + x', data=df2).fit(cov_type='kernel',
                                                    kernel='bartlett',
                                                    bandwidth=3)
Out[21]:
FamaMacBeth Estimation Summary
Dep. Variable: y R-squared: 0.2078
Estimator: FamaMacBeth R-squared (Between): 0.2208
No. Observations: 5000 R-squared (Within): 0.1907
Date: Tue, Dec 03 2019 R-squared (Overall): 0.2078
Time: 14:19:20 Log-likelihood -1.057e+04
Cov. Estimator: Fama-MacBeth Kernel Cov
F-statistic: 1310.7
Entities: 500 P-value 0.0000
Avg Obs: 10.0000 Distribution: F(1,4998)
Min Obs: 10.0000
Max Obs: 10.0000 F-statistic (robust): 1440.8
P-value 0.0000
Time periods: 10 Distribution: F(1,4998)
Avg Obs: 500.00
Min Obs: 500.00
Max Obs: 500.00
Parameter Estimates
Parameter Std. Err. T-stat P-value Lower CI Upper CI
Intercept 0.0313 0.0224 1.3934 0.1636 -0.0127 0.0753
x 1.0356 0.0273 37.958 0.0000 0.9821 1.0891

id: 0x1a25a48b10

Newey-West Adjustment for Standard Errors

The Newey-West adjustment for standard errors is built into statsmodels with cov_type='HAC' and the maxlags argument passed in the cov_kwds parameter. For many more details, see http://stackoverflow.com/questions/23420454/newey-west-standard-errors-for-ols-in-python.

Comparison with R: The default in R for the Newey-West estimator is to do pre-whitening, which statsmodels does not support yet. The current implementation is equivalent to turning off the pre-whitening in R: coeftest(model, NeweyWest(model, lag=3, prewhite = FALSE))

Comparison with Stata: The default behavior of the newey function in Stata is to apply a small-sample correction. The default in statsmodels is the opposite, but you can activate the correction by adding 'use_correction': True to the cov_kwds parameter, as in the sketch below.
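
For example, a sketch of the Stata-like call (nw_stata_ols is a name I made up):

# Newey-West standard errors with the Stata-style small sample correction
nw_stata_ols = sm.ols(formula='y ~ x', data=df).fit(cov_type='HAC',
                                                    cov_kwds={'maxlags': 3,
                                                              'use_correction': True},
                                                    use_t=True)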

In [22]:
# Note: this adjustment doesn't really make sense for our sample dataset, it's just an illustration.
nw_ols = sm.ols(formula='y ~ x', data=df).fit(cov_type='HAC',
                                              cov_kwds={'maxlags': 3},
                                              use_t=True)
nw_ols.summary()
Out[22]:
OLS Regression Results
Dep. Variable: y R-squared: 0.208
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 831.8
Date: Tue, 03 Dec 2019 Prob (F-statistic): 2.48e-169
Time: 14:19:20 Log-Likelihood: -10573.
No. Observations: 5000 AIC: 2.115e+04
Df Residuals: 4998 BIC: 2.116e+04
Df Model: 1
Covariance Type: HAC
coef std err t P>|t| [0.025 0.975]
Intercept 0.0297 0.043 0.695 0.487 -0.054 0.113
x 1.0348 0.036 28.841 0.000 0.964 1.105
Omnibus: 4.912 Durbin-Watson: 1.096
Prob(Omnibus): 0.086 Jarque-Bera (JB): 4.862
Skew: 0.070 Prob(JB): 0.0880
Kurtosis: 3.063 Cond. No. 1.01


Warnings:
[1] Standard Errors are heteroscedasticity and autocorrelation robust (HAC) using 3 lags and without small sample correction

Driscoll-Kraay Standard Errors

Driscoll-Kraay standard errors are robust to heteroscedasticity as well as to cross-sectional and serial correlation. In statsmodels, they are available by passing cov_type='nw-groupsum' along with time, groups and maxlags in cov_kwds.

In [23]:
dk_ols = sm.ols(formula='y ~ x', data=df).fit(cov_type='nw-groupsum',
                                              cov_kwds={'time': np.array(df.year),
                                                        'groups': np.array(df.firmid),
                                                        'maxlags': 5},
                                              use_t=True)
dk_ols.summary()
Out[23]:
OLS Regression Results
Dep. Variable: y R-squared: 0.208
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 2633.
Date: Tue, 03 Dec 2019 Prob (F-statistic): 3.50e-201
Time: 14:19:20 Log-Likelihood: -10573.
No. Observations: 5000 AIC: 2.115e+04
Df Residuals: 4998 BIC: 2.116e+04
Df Model: 1
Covariance Type: hac-groupsum
coef std err t P>|t| [0.025 0.975]
Intercept 0.0297 0.023 1.301 0.194 -0.015 0.075
x 1.0348 0.020 51.317 0.000 0.995 1.074
Omnibus: 4.912 Durbin-Watson: 1.096
Prob(Omnibus): 0.086 Jarque-Bera (JB): 4.862
Skew: 0.070 Prob(JB): 0.0880
Kurtosis: 3.063 Cond. No. 1.01


Warnings:
[1] Driscoll and Kraay Standard Errors are robust to cluster correlation (HAC-Groupsum)