# [SciPy-User] tests for equality of coefficients across regression or groups

josef.pktd@gmai... josef.pktd@gmai...
Mon Mar 29 10:21:45 CDT 2010

I wrote a class that makes it more convenient to test whether all
coefficients are the same across different regressions or groups. This
can also be used to test for structural breaks at predefined change
points, and it is somewhat similar to ANCOVA (as far as I understand),
except that currently all coefficients are tested for changes across
groups, while ANCOVA assumes that some slope coefficients are the
same. (I'm not sure about all of the ANCOVA, ANOVA and MANCOVA
terminology.)

If the only regressor is a constant, then the result is identical to
scipy.stats.f_oneway.
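
For intuition, here is a small self-contained check (my own sketch using only
numpy/scipy, not the attached class) that the constant-only case reduces to
one-way ANOVA: fitting one mean per group versus a single grand mean gives
exactly the f_oneway statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
groups = [rng.randn(30) + mu for mu in (0.0, 0.0, 1.0)]  # 3 groups, one shifted
f_anova, p_anova = stats.f_oneway(*groups)

# Same test as a restricted vs. unrestricted least-squares comparison:
# unrestricted = one mean per group (full dummy set), restricted = grand mean.
y = np.concatenate(groups)
g = np.repeat(np.arange(3), 30)
dummies = (g[:, None] == np.arange(3)).astype(float)
beta = np.linalg.lstsq(dummies, y, rcond=None)[0]
rss_full = ((y - dummies.dot(beta)) ** 2).sum()   # within-group sum of squares
rss_restr = ((y - y.mean()) ** 2).sum()           # total sum of squares
df_num, df_denom = 2, len(y) - 3
f_ols = ((rss_restr - rss_full) / df_num) / (rss_full / df_denom)

print(np.allclose(f_anova, f_ols))  # True: identical F statistic
```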

The original will be in the statsmodels sandbox, until some missing
enhancements are finished. The attached file contains a standalone
version (but requires scikits.statsmodels). Several examples are
included.
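
As the notes in the attached file describe, the joint regression is built by
interacting group dummies (with the first group dropped as baseline) with the
regressors; a minimal numpy sketch of such a design matrix, with made-up data:

```python
import numpy as np

groupsint = np.array([0, 0, 1, 1, 2, 2])                # 3 groups, 2 obs each
exog = np.column_stack([np.ones(6), np.arange(6.0)])    # constant + 1 regressor

# full dummy set, then drop the first group so its coefficients act as baseline
dummies = (groupsint[:, None] == np.arange(3)).astype(float)
interactions = (exog[:, None, :] * dummies[:, 1:, None]).reshape(6, -1)
design = np.column_stack([exog, interactions])

# Null "all betas equal across groups" <=> all interaction coefficients are
# zero, i.e. a joint F-test on the last 4 columns.
print(design.shape)  # (6, 6)
```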

Below is part of the example output:
4 groups, where groups 0 and 1 and groups 2 and 3 have identical
coefficients, but there is a large difference between groups 0,1 and 2,3;
the sample size is 100, so the results are pretty clear cut.

I haven't thought much about the API for this kind of convenience
class; I will finish it up on one of the next weekends.
Any comments are welcome, in case someone finds this useful.

Josef

Test for equality of coefficients for all exogenous variables
-------------------------------------------------------------

Table of F-tests for overall or pairwise equality of coefficients
hypothesis F-statistic         p-value  df_denom df_num  reject
('all', (212.94813033530886, 5.0118071432076266e-082, 192.0, 6)) *
((0, 1), (1.8208707506170163, 0.16466906994879624, 192.0, 2))
((0, 2), (281.4877465209409, 8.2293196773132505e-058, 192.0, 2)) *
((0, 3), (322.21218144617899, 4.4051148030359891e-062, 192.0, 2)) *
((1, 2), (313.90670953958153, 3.0218699484829141e-061, 192.0, 2)) *
((1, 3), (356.39739587117498, 2.3340444544965008e-065, 192.0, 2)) *
((2, 3), (1.4442741180990379, 0.23846859348160215, 192.0, 2))
Notes: p-values are not corrected for multiple tests
       (no Bonferroni correction)
       * : reject at 5% uncorrected significance level
Null hypothesis: all or pairwise coefficients are the same
Alternative hypothesis: at least one coefficient differs
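
Since the note warns that the p-values are uncorrected: a Bonferroni
adjustment for the 6 pairwise tests is just a multiplication and a cap at 1
(a sketch, using the pairwise p-values from the table above):

```python
import numpy as np

# pairwise p-values from the table above
pvals = np.array([0.16466906994879624, 8.2293196773132505e-58,
                  4.4051148030359891e-62, 3.0218699484829141e-61,
                  2.3340444544965008e-65, 0.23846859348160215])
bonf = np.minimum(pvals * len(pvals), 1.0)  # Bonferroni-adjusted p-values
reject = bonf < 0.05
print(reject)  # the same four pairs remain significant after correction
```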

Comparison with stats.f_oneway
(5.4853751213497954, 0.0012219835303810941)

Likelihood Ratio Test
likelihood ratio    p-value       df
(407.06211448271051, 8.4739101836496682e-085, 6.0)
Null model: pooled, all coefficients are the same across groups
Alternative model: all coefficients are allowed to be different
not verified but looks close to f-test result
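
For Gaussian errors the OLS log-likelihood depends on the data only through
the residual sum of squares, so the LR statistic can be sketched like this
(my own illustration on a toy one-restriction example, not the class's
lr_test method):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(1)
n = 100
x = np.column_stack([np.ones(n), rng.randn(n)])
y = x.dot([1.0, 2.0]) + rng.randn(n)

# restricted model: intercept only; unrestricted: intercept + slope
rss0 = ((y - y.mean()) ** 2).sum()
beta = np.linalg.lstsq(x, y, rcond=None)[0]
rss1 = ((y - x.dot(beta)) ** 2).sum()

lrstat = n * np.log(rss0 / rss1)  # equals -2*(llf_null - llf_alt) for OLS
lrdf = 1                          # one restriction (the slope)
lrpval = stats.chi2.sf(lrstat, lrdf)
```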

OLS parameters by group from individual, separate OLS regressions
0 [ 14.89986056  10.49887409]
1 [ 14.56954028   9.8411222 ]
2 [ 16.95335652  18.83403447]
3 [ 17.03709822  19.43980093]

Check for heteroscedasticity,
variance and standard deviation for individual regressions
group 0          group 1          group 2          group 3
variance     [ 2.615224    2.73526096  4.38683135  2.91666872]
standard dev [ 1.61716542  1.65386244  2.09447639  1.70782573]
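
The TODO list in the attached file asks how to test equality of variances;
one ready-made option is Levene's test in scipy (a sketch on synthetic
residual-like samples, not part of the attached class):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(2)
# residual-like samples per group; the third group has three times the spread
samples = [rng.randn(100), rng.randn(100), 3.0 * rng.randn(100)]

w, p = stats.levene(*samples)  # H0: all groups have equal variance
```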
-------------- next part --------------
# -*- coding: utf-8 -*-
"""
F test for null hypothesis that coefficients in several regressions are the same

* implemented by creating groupdummies*exog and testing the appropriate
  contrast matrices
* similar to a test for structural change in all variables at predefined
  break points
* allows only one group variable
* currently tests for change in all exog variables
* allows for heteroscedasticity: the error variance may vary across groups
  (not tried out)

TODO
----

* generalize anova structure,
  - structural break in only some variables
  - compare structural breaks in several exog versus constant only
  - fast way to construct comparisons
* print anova style results
* add all pairwise comparison tests with and without Bonferroni correction
* add additional tests: likelihood-ratio, lagrange-multiplier, wald ?
* test for heteroscedasticity, equality of variances
  - how?
  - like the lagrange-multiplier test in stattools heteroscedasticity tests

Created on Sat Mar 27 01:48:01 2010
Author: josef-pktd
"""
import numpy as np
from scipy import stats
from scikits.statsmodels.regression import OLS, WLS

class OneWayLS(object):
    '''Class to test equality of regression coefficients across groups

    This class tests whether the linear regression coefficients are the
    same across pre-specified groups. This can be used to test for
    structural breaks at given change points, or for an ANOVA-style
    analysis of differences in the effect of explanatory variables across
    groups.

    Notes
    -----
    The test is implemented by regressing on the original pooled exogenous
    variables and on group dummies times the exogenous regressors.

    y_i = X_i beta_i + u_i  for all groups i

    The test is for the null hypothesis: beta_i = beta for all i
    against the alternative that at least one beta_i is different.

    By default it is assumed that all u_i have the same variance. If the
    keyword option het is True, then it is assumed that the variance is
    group specific. This uses WLS with weights given by the standard
    errors from separate regressions for each group.
    Note: het=True is not sufficiently tested.

    The F-test assumes that the errors are normally distributed.

    Motivated by an original question on the mailing list about equality
    of coefficients across regressions, and by an example in a Stata FAQ.

    *testing*:

    * if the constant is the only regressor, then the result of the F-test
      is the same as scipy.stats.f_oneway
      (which in turn is verified against NIST for not badly scaled problems)
    * the F-test for a simple structural break is the same as in the
      original script
    * power and size of the test look ok in examples
    * not checked/verified for the heteroscedastic case
      - for constant only: the F-test result is the same with WLS as with
        OLS - check?

    check: I might be mixing up group names (unique) and group ids
    (integers in arange(ngroups));
    not tested for groups that are not arange(ngroups);
    make sure group names are always consistently sorted/ordered
    '''
    def __init__(self, y, x, groups=None, het=False, data=None, meta=None):
        if groups is None:
            raise ValueError('use OLS if there are no groups')
            #maybe replace by dispatch to OLS
        if data:
            y = data[y]
            x = [data[v] for v in x]
            try:
                groups = data[groups]
            except (KeyError, ValueError):
                pass
        self.endog = np.asarray(y)
        self.exog = np.asarray(x)
        if self.exog.ndim == 1:
            self.exog = self.exog[:,None]
        self.groups = np.asarray(groups)
        self.het = het

        self.groupsint = None
        if np.issubdtype(self.groups.dtype, int):
            self.unique = np.unique(self.groups)
            if (self.unique == np.arange(len(self.unique))).all():
                self.groupsint = self.groups

        if self.groupsint is None:  # groups are not consecutive integers
            self.unique, self.groupsint = np.unique(self.groups, return_inverse=True)

    def fitbygroups(self):
        olsbygroup = {}
        sigmabygroup = []
        for gi, group in enumerate(self.unique):
            groupmask = self.groupsint == gi  # group id, also works for non-integer group names
            res = OLS(self.endog[groupmask], self.exog[groupmask]).fit()
            olsbygroup[group] = res
            sigmabygroup.append(res.mse_resid)
        self.olsbygroup = olsbygroup
        self.sigmabygroup = np.array(sigmabygroup)
        self.weights = np.sqrt(self.sigmabygroup[self.groupsint]) #TODO:chk sqrt

    def fitjoint(self):
        if not hasattr(self, 'weights'):
            self.fitbygroups()
        groupdummy = (self.groupsint[:,None] == np.arange(len(self.unique))).astype(int)
        #order of dummy variables by groups - used
        dummyexog = self.exog[:,None,:]*groupdummy[:,1:,None]
        exog = np.c_[self.exog, dummyexog.reshape(self.exog.shape[0],-1)] #self.nobs ??
        #Note: I changed to drop the first group from the dummies;
        #instead I want one full set of dummies
        if self.het:
            weights = self.weights
            res = WLS(self.endog, exog, weights=weights).fit()
        else:
            res = OLS(self.endog, exog).fit()
        self.lsjoint = res
        contrasts = {}
        nvars = self.exog.shape[1]
        nparams = exog.shape[1]
        ndummies = nparams - nvars
        contrasts['all'] = np.c_[np.zeros((ndummies, nvars)), np.eye(ndummies)]
        for groupind, group in enumerate(self.unique[1:]):  #need enumerate if groups != groupsint
            groupind = groupind + 1
            contr = np.zeros((nvars, nparams))
            contr[:,nvars*groupind:nvars*(groupind+1)] = np.eye(nvars)
            contrasts[group] = contr
            #save also for pairs, see next
            contrasts[(self.unique[0], group)] = contr

        #Note: I'm keeping some duplication for testing
        pairs = np.triu_indices(len(self.unique),1)
        for ind1, ind2 in zip(*pairs):  #replace with group1, group2 in sorted(keys)
            if ind1 == 0:
                continue  # comparison with the benchmark/normalization group is handled above
            g1 = self.unique[ind1]
            g2 = self.unique[ind2]
            group = (g1, g2)
            contr = np.zeros((nvars, nparams))
            contr[:,nvars*ind1:nvars*(ind1+1)] = np.eye(nvars)
            contr[:,nvars*ind2:nvars*(ind2+1)] = -np.eye(nvars)
            contrasts[group] = contr
        self.contrasts = contrasts

    def fitpooled(self):
        if self.het:
            if not hasattr(self, 'weights'):
                self.fitbygroups()
            weights = self.weights
            res = WLS(self.endog, self.exog, weights=weights).fit()
        else:
            res = OLS(self.endog, self.exog).fit()
        self.lspooled = res

    def ftest_summary(self):
        if not hasattr(self, 'lsjoint'):
            self.fitjoint()
        txt = []
        summarytable = []

        txt.append('F-test for equality of coefficients across groups')
        fres = self.lsjoint.f_test(self.contrasts['all'])
        txt.append(fres.__str__())
        summarytable.append(('all', (fres.fvalue, fres.pvalue, fres.df_denom, fres.df_num)))

#        for group in self.unique[1:]:  #replace with group1, group2 in sorted(keys)
#            txt.append('F-test for equality of coefficients between group'
#                       ' %s and group %s' % (group, '0'))
#            fres = self.lsjoint.f_test(self.contrasts[group])
#            txt.append(fres.__str__())
#            summarytable.append((group,(fres.fvalue, fres.pvalue, fres.df_denom, fres.df_num)))
        pairs = np.triu_indices(len(self.unique),1)
        for ind1, ind2 in zip(*pairs):  #replace with group1, group2 in sorted(keys)
            g1 = self.unique[ind1]
            g2 = self.unique[ind2]
            txt.append('F-test for equality of coefficients between group'
                       ' %s and group %s' % (g1, g2))
            group = (g1, g2)
            fres = self.lsjoint.f_test(self.contrasts[group])
            txt.append(fres.__str__())
            summarytable.append((group, (fres.fvalue, fres.pvalue, fres.df_denom, fres.df_num)))

        return '\n'.join(txt), summarytable

    def lr_test(self):
        r'''generic likelihood ratio test between nested models

        .. math::

           D = -2 (\ln L_{\text{null}} - \ln L_{\text{alternative}})
             = -2 \ln \frac{L_{\text{null}}}{L_{\text{alternative}}}

        D is distributed as chisquare with df equal to the difference in
        the number of parameters, or equivalently the difference in
        residual degrees of freedom.

        TODO: put into separate function
        '''
        if not hasattr(self, 'lsjoint'):
            self.fitjoint()
        if not hasattr(self, 'lspooled'):
            self.fitpooled()
        loglikejoint = self.lsjoint.llf
        loglikepooled = self.lspooled.llf
        lrstat = -2*(loglikepooled - loglikejoint)  # null minus alternative, so lrstat >= 0
        lrdf = self.lspooled.df_resid - self.lsjoint.df_resid
        lrpval = stats.chi2.sf(lrstat, lrdf)

        return lrstat, lrpval, lrdf

if __name__ == '__main__':

    """Example: Test for equality of coefficients across groups/regressions

    Created on Sat Mar 27 22:36:51 2010
    Author: josef-pktd
    """

    import numpy as np
    from scipy import stats
    #from numpy.testing import assert_almost_equal
    import scikits.statsmodels as sm
    #from scikits.statsmodels.sandbox.regression.try_onewaygls import OneWayLS

    #choose example
    #--------------
    example = ['null', 'diff'][1]
    example_size = [10, 100][1]
    example_groups = ['2', '2-2'][1]
    #'2-2': 4 groups,
    #       groups 0 and 1 and groups 2 and 3 have identical parameters in DGP

    #generate example
    #----------------
    np.random.seed(87654589)
    nobs = example_size

    #group/regression 1
    x1 = np.random.randn(nobs)
    y1 = 10 + 15*x1 + 2*np.random.randn(nobs)
    x1 = sm.add_constant(x1) #, prepend=True)

    #group/regression 2
    x2 = np.random.randn(nobs)
    if example == 'null':
        y2 = 10 + 15*x2 + 2*np.random.randn(nobs)  # if H0 is true
    else:
        y2 = 19 + 17*x2 + 2*np.random.randn(nobs)

    # stack
    x = np.concatenate((x1,x2),0)
    y = np.concatenate((y1,y2))
    if example_groups == '2':
        groupind = (np.arange(2*nobs)>nobs-1).astype(int)
    else:
        groupind = np.mod(np.arange(2*nobs),4)
        groupind.sort()

    def print_results(res):
        groupind = res.groups
        #res.fitjoint()  #not really necessary, because called by ftest_summary
        ft = res.ftest_summary()
        #print ft[0]  #skip because table is nicer
        print '\nTable of F-tests for overall or pairwise equality of coefficients'
        print 'hypothesis F-statistic         p-value  df_denom df_num  reject'
        for row in ft[1]:
            print row,
            if row[1][1] < 0.05:
                print '*'
            else:
                print ''
        print 'Notes: p-values are not corrected for multiple tests'
        print '       (no Bonferroni correction)'
        print '       * : reject at 5% uncorrected significance level'
        print 'Null hypothesis: all or pairwise coefficients are the same'
        print 'Alternative hypothesis: at least one coefficient differs'

        print '\nComparison with stats.f_oneway'
        print stats.f_oneway(*[y[groupind==gr] for gr in res.unique])
        print '\nLikelihood Ratio Test'
        print 'likelihood ratio    p-value       df'
        print res.lr_test()
        print 'Null model: pooled, all coefficients are the same across groups'
        print 'Alternative model: all coefficients are allowed to be different'
        print 'not verified but looks close to f-test result'

        print '\nOLS parameters by group from individual, separate OLS regressions'
        for group in sorted(res.olsbygroup):
            r = res.olsbygroup[group]
            print group, r.params

        print '\nCheck for heteroscedasticity, '
        print 'variance and standard deviation for individual regressions'
        print ' '*12, ' '.join('group %-10s' % (gr) for gr in res.unique)
        print 'variance    ', res.sigmabygroup
        print 'standard dev', np.sqrt(res.sigmabygroup)

    #get results for example
    #-----------------------

    print '\nTest for equality of coefficients for all exogenous variables'
    print   '-------------------------------------------------------------'
    res = OneWayLS(y, x, groups=groupind.astype(int))
    print_results(res)

    print '\nOne way ANOVA, constant is the only regressor'
    print   '---------------------------------------------'

    print 'this is the same as scipy.stats.f_oneway'
    res = OneWayLS(y, np.ones(len(y)), groups=groupind)
    print_results(res)

    print '\nOne way ANOVA, constant is the only regressor, with het=True'
    print   '-------------------------------------------------------------'

    print 'this is similar to scipy.stats.f_oneway,'
    print 'but the variance is not assumed to be the same across groups'
    res = OneWayLS(y, np.ones(len(y)), groups=groupind, het=True)
    print_results(res)