[SciPy-User] tests for equality of coefficients across regression or groups

josef.pktd@gmai... josef.pktd@gmai...
Mon Mar 29 10:21:45 CDT 2010


I wrote a class that is more convenient to test whether all
coefficient are the same across different regressions or groups. This
can also be used to test for structural breaks at predefined change
points, and it is a bit similar to ANCOVA (as far as I understand)
however currently all coefficients are tested for changes across
groups. (While ANCOVA assumes that some slope coefficients are the
same. But I'm not sure about all the ANCOVA, ANOVA and MANCOVA
terminology).

If the only regressor is a constant, then the result is identical to
scipy.stats.f_oneway.

The original will be in the statsmodels sandbox, until some missing
enhancements are finished. The attached file contains a standalone
version (but requires scikits.statsmodels). Several examples are
included.

Below is par of the example output.
4 groups, groups 0 and 1 and groups 2 and 3 have identical
coefficients, but large difference between 0,1 and 2,3
sample size is 100 where the results are pretty clear cut.

I haven't thought much about the API for this kind of convenience
class. I will finish it up one of the next weekends.
Any comments are welcome, in case someone finds this useful.

Josef

Test for equality of coefficients for all exogenous variables
-------------------------------------------------------------

Table of F-tests for overall or pairwise equality of coefficients
hypothesis F-statistic         p-value  df_denom df_num  reject
('all', (212.94813033530886, 5.0118071432076266e-082, 192.0, 6)) *
((0, 1), (1.8208707506170163, 0.16466906994879624, 192.0, 2))
((0, 2), (281.4877465209409, 8.2293196773132505e-058, 192.0, 2)) *
((0, 3), (322.21218144617899, 4.4051148030359891e-062, 192.0, 2)) *
((1, 2), (313.90670953958153, 3.0218699484829141e-061, 192.0, 2)) *
((1, 3), (356.39739587117498, 2.3340444544965008e-065, 192.0, 2)) *
((2, 3), (1.4442741180990379, 0.23846859348160215, 192.0, 2))
Notes: p-values are not corrected for many tests
       (no Bonferroni correction)
       * : reject at 5% uncorrected confidence level
Null hypothesis: all or pairwise coefficient are the same
Alternative hypothesis: all coefficients are different

Comparison with stats.f_oneway
(5.4853751213497954, 0.0012219835303810941)

Likelihood Ratio Test
likelihood ratio    p-value       df
(407.06211448271051, 8.4739101836496682e-085, 6.0)
Null model: pooled all coefficients are the same across groups,
Alternative model: all coefficients are allowed to be different
not verified but looks close to f-test result

Ols parameters by group from individual, separate ols regressions
0 [ 14.89986056  10.49887409]
1 [ 14.56954028   9.8411222 ]
2 [ 16.95335652  18.83403447]
3 [ 17.03709822  19.43980093]

Check for heteroscedasticity,
variance and standard deviation for individual regressions
             group 0          group 1          group 2          group 3
variance     [ 2.615224    2.73526096  4.38683135  2.91666872]
standard dev [ 1.61716542  1.65386244  2.09447639  1.70782573]
-------------- next part --------------
# -*- coding: utf-8 -*-
"""
F test for null hypothesis that coefficients in several regressions are the same

* implemented by creating groupdummies*exog and testing appropriate contrast
  matrices
* similar to test for structural change in all variables at predefined break points
* allows only one group variable
* currently tests for change in all exog variables
* allows for heterogscedasticity, error variance varies across groups (not tried out)


TODO
----

* generalize anova structure, 
  - structural break in only some variables
  - compare structural breaks in several exog versus constant only
  - fast way to construct comparisons
* print anova style results
* add all pairwise comparison tests with and without Bonferroni correction
* add additional test, likelihood-ratio, lagrange-multiplier, wald ?
* test for heteroscedasticity, equality of variances 
  - how? 
  - like lagrange-multiplier in stattools heteroscedasticity tests

Created on Sat Mar 27 01:48:01 2010
Author: josef-pktd
"""
import numpy as np
from scipy import stats
from scikits.statsmodels.regression import OLS, WLS


class OneWayLS(object):
    '''Class to test equality of regression coefficients across groups
    
    This class performs tests whether the linear regression coefficients are
    the same across pre-specified groups. This can be used to test for 
    structural breaks at given change points, or for ANOVA style analysis of
    differences in the effect of explanatory variables across groups.
    
    Notes
    -----
    The test is implemented by regression on the original pooled exogenous
    variables and on group dummies times the exogenous regressors.
    
    y_i = X_i beta_i + u_i  for all groups i
    
    The test is for the null hypothesis: beta_i = beta for all i
    against the alternative that at least one beta_i is different.
    
    By default it is assumed that all u_i have the same variance. If the
    keyword option het is True, then it is assumed that the variance is
    group specific. This uses WLS with weights given by the standard errors
    from separate regressions for each group.
    Note: het=True is not sufficiently tested
    
    The F-test assumes that the errors are normally distributed.
    
    
    
    original question from mailing list for equality of coefficients 
    across regressions, and example in Stata FAQ
    
    *testing*:
    
    * if constant is the only regressor then the result for the F-test is
      the same as scipy.stats.f_oneway 
      (which in turn is verified against NIST for not badly scaled problems) 
    * f-test for simple structural break is the same as in original script
    * power and size of test look ok in examples
    * not checked/verified for heteroscedastic case
      - for constant only: ftest result is the same with WLS as with OLS - check?
    
    check: I might be mixing up group names (unique) 
           and group id (integers in arange(ngroups) 
           not tested for groups that are not arange(ngroups)
           make sure groupnames are always consistently sorted/ordered
    '''
    def __init__(self, y, x, groups=None, het=False, data=None, meta=None):
        if groups is None:
            raise ValueError('use OLS if there are no groups')
            #maybe replace by dispatch to OLS
        if data:
            y = data[y]
            x = [data[v] for v in x]
            try:
                groups = data[groups]
            except [KeyError, ValueError]:
                pass
        self.endog = np.asarray(y)
        self.exog = np.asarray(x)
        if self.exog.ndim == 1:
            self.exog = self.exog[:,None]
        self.groups = np.asarray(groups)
        self.het = het        
        
        self.groupsint = None
        if np.issubdtype(self.groups.dtype, int):
            self.unique = np.unique(self.groups)
            if (self.unique == np.arange(len(self.unique))).all():
                self.groupsint = self.groups
        
        if self.groupsint is None: # groups are not consecutive integers
            self.unique, self.groupsint = np.unique(self.groupsint, return_inverse=True)
    
    def fitbygroups(self):
        olsbygroup = {}
        sigmabygroup = []
        for gi, group in enumerate(self.unique):
            groupmask = self.groupsint == group
            res = OLS(self.endog[groupmask], self.exog[groupmask]).fit()
            olsbygroup[group] = res
            sigmabygroup.append(res.mse_resid)
        self.olsbygroup = olsbygroup
        self.sigmabygroup = np.array(sigmabygroup)
        self.weights = np.sqrt(self.sigmabygroup[self.groupsint]) #TODO:chk sqrt
        
    def fitjoint(self):
        if not hasattr(self, 'weights'):
            self.fitbygroups()
        groupdummy = (self.groupsint[:,None] == self.unique).astype(int)
        #order of dummy variables by grous - used
        dummyexog = self.exog[:,None,:]*groupdummy[:,1:,None]
        exog = np.c_[self.exog, dummyexog.reshape(self.exog.shape[0],-1)] #self.nobs ??
        #Notes: I changed to drop first group from dummy 
        #instead I want one full set dummies
        if self.het:
            weights = self.weights
            res = WLS(self.endog, exog, weights=weights).fit()
        else:
            res = OLS(self.endog, exog).fit()
        self.lsjoint = res
        contrasts = {}
        nvars = self.exog.shape[1]
        nparams = exog.shape[1]
        ndummies = nparams - nvars
        contrasts['all'] = np.c_[np.zeros((ndummies, nvars)), np.eye(ndummies)]
        for groupind,group in enumerate(self.unique[1:]):  #need enumerate if groups != groupsint
            groupind = groupind + 1 
            contr = np.zeros((nvars, nparams))
            contr[:,nvars*groupind:nvars*(groupind+1)] = np.eye(nvars)
            contrasts[group] = contr
            #save also for pairs, see next
            contrasts[(self.unique[0], group)] = contr
        
        #Note: I'm keeping some duplication for testing
        pairs = np.triu_indices(len(self.unique),1)
        for ind1,ind2 in zip(*pairs):  #replace with group1, group2 in sorted(keys)
            if ind1 == 0:
                continue # need comparison with benchmark/normalization group separate
            g1 = self.unique[ind1]
            g2 = self.unique[ind2]
            group = (g1, g2)
            contr = np.zeros((nvars, nparams))
            contr[:,nvars*ind1:nvars*(ind1+1)] = np.eye(nvars)
            contr[:,nvars*ind2:nvars*(ind2+1)] = -np.eye(nvars)
            contrasts[group] = contr
        self.contrasts = contrasts
    
    def fitpooled(self):
        if self.het:
            if not hasattr(self, 'weights'):
                self.fitbygroups()
            weights = self.weights
            res = WLS(self.endog, self.exog, weights=weights).fit()
        else:
            res = OLS(self.endog, self.exog).fit()
        self.lspooled = res
        
    def ftest_summary(self):
        if not hasattr(self, 'lsjoint'):
            self.fitjoint()
        txt = []
        summarytable = []
        
        txt.append('F-test for equality of coefficients across groups')
        fres = self.lsjoint.f_test(self.contrasts['all'])
        txt.append(fres.__str__())
        summarytable.append(('all',(fres.fvalue, fres.pvalue, fres.df_denom, fres.df_num)))
        
#        for group in self.unique[1:]:  #replace with group1, group2 in sorted(keys)
#            txt.append('F-test for equality of coefficients between group'
#                       ' %s and group %s' % (group, '0'))
#            fres = self.lsjoint.f_test(self.contrasts[group])
#            txt.append(fres.__str__())
#            summarytable.append((group,(fres.fvalue, fres.pvalue, fres.df_denom, fres.df_num)))
        pairs = np.triu_indices(len(self.unique),1)
        for ind1,ind2 in zip(*pairs):  #replace with group1, group2 in sorted(keys)
            g1 = self.unique[ind1]
            g2 = self.unique[ind2]
            txt.append('F-test for equality of coefficients between group'
                       ' %s and group %s' % (g1, g2))
            group = (g1, g2)
            fres = self.lsjoint.f_test(self.contrasts[group])
            txt.append(fres.__str__())
            summarytable.append((group,(fres.fvalue, fres.pvalue, fres.df_denom, fres.df_num)))
        
        return '\n'.join(txt), summarytable
                            
        
    def lr_test(self):
        '''generic likelihood ration test between nested models
        
            \begin{align} D & = -2(\ln(\text{likelihood for null model}) - \ln(\text{likelihood for alternative model})) \\ & = -2\ln\left( \frac{\text{likelihood for null model}}{\text{likelihood for alternative model}} \right). \end{align}
            
            is distributed as chisquare with df equal to difference in number of parameters or equivalently
            difference in residual degrees of freedom  (sign?) 
        
        TODO: put into separate function
        '''
        if not hasattr(self, 'lsjoint'):
            self.fitjoint()
        if not hasattr(self, 'lspooled'):
            self.fitpooled()
        loglikejoint = self.lsjoint.llf
        loglikepooled = self.lspooled.llf
        lrstat = -2*(loglikepooled - loglikejoint)   #??? check sign
        lrdf = self.lspooled.df_resid - self.lsjoint.df_resid
        lrpval = stats.chi2.sf(lrstat, lrdf)
        
        return lrstat, lrpval, lrdf
        
        
if __name__ == '__main__':

    """Example: Test for equality of coefficients across groups/regressions
    
    Created on Sat Mar 27 22:36:51 2010
    Author: josef-pktd
    """
    
    import numpy as np
    from scipy import stats
    #from numpy.testing import assert_almost_equal
    import scikits.statsmodels as sm
    #from scikits.statsmodels.sandbox.regression.try_onewaygls import OneWayLS
    
    #choose example
    #--------------
    example = ['null', 'diff'][1]
    example_size = [10, 100][1]
    example_groups = ['2', '2-2'][1]
    #'2-2': 4 groups, 
    #       groups 0 and 1 and groups 2 and 3 have identical parameters in DGP
    
    #generate example
    #----------------
    np.random.seed(87654589)
    nobs = example_size
    
    #group/regression 1
    x1 = np.random.randn(nobs)
    y1 = 10 + 15*x1 + 2*np.random.randn(nobs)
    x1 = sm.add_constant(x1) #, prepend=True)
    
    #group/regression 2
    x2 = np.random.randn(nobs)
    if example == 'null':
        y2 = 10 + 15*x2 + 2*np.random.randn(nobs)  # if H0 is true
    else:   
        y2 = 19 + 17*x2 + 2*np.random.randn(nobs)
    
    x2 = sm.add_constant(x2)
    
    # stack
    x = np.concatenate((x1,x2),0)
    y = np.concatenate((y1,y2))
    if example_groups == '2':
        groupind = (np.arange(2*nobs)>nobs-1).astype(int)
    else:
        groupind = np.mod(np.arange(2*nobs),4) 
        groupind.sort()
    
    
    def print_results(res):
        groupind = res.groups
        #res.fitjoint()  #not really necessary, because called by ftest_summary
        ft = res.ftest_summary()
        #print ft[0]  #skip because table is nicer
        print '\nTable of F-tests for overall or pairwise equality of coefficients'
        print 'hypothesis F-statistic         p-value  df_denom df_num  reject'
        for row in ft[1]: 
            print row,
            if row[1][1]<0.05:
                print '*'
            else:
                print ''
        print 'Notes: p-values are not corrected for many tests'
        print '       (no Bonferroni correction)'
        print '       * : reject at 5% uncorrected confidence level'
        print 'Null hypothesis: all or pairwise coefficient are the same'
        print 'Alternative hypothesis: all coefficients are different'
        
        print '\nComparison with stats.f_oneway'
        print stats.f_oneway(*[y[groupind==gr] for gr in res.unique])
        print '\nLikelihood Ratio Test' 
        print 'likelihood ratio    p-value       df'
        print res.lr_test()
        print 'Null model: pooled all coefficients are the same across groups,' 
        print 'Alternative model: all coefficients are allowed to be different'
        print 'not verified but looks close to f-test result'
        
        print '\nOls parameters by group from individual, separate ols regressions'
        for group in sorted(res.olsbygroup):
            r = res.olsbygroup[group]
            print group, r.params 
        
        print '\nCheck for heteroscedasticity, '
        print 'variance and standard deviation for individual regressions'
        print ' '*12, ' '.join('group %-10s' %(gr) for gr in res.unique)
        print 'variance    ', res.sigmabygroup
        print 'standard dev', np.sqrt(res.sigmabygroup)
    
    
    #get results for example
    #-----------------------
    
    print '\nTest for equality of coefficients for all exogenous variables'
    print   '-------------------------------------------------------------'
    res = OneWayLS(y,x, groups=groupind.astype(int))
    print_results(res)
    
    print '\nOne way ANOVA, constant is the only regressor'
    print   '---------------------------------------------'
    
    print 'this is the same as scipy.stats.f_oneway'
    res = OneWayLS(y,np.ones(len(y)), groups=groupind)
    print_results(res)
    
    
    print '\nOne way ANOVA, constant is the only regressor with het is true'
    print   '--------------------------------------------------------------'
    
    print 'this is similar to scipy.stats.f_oneway,' 
    print 'but variance is not assumed to be the same across groups'
    res = OneWayLS(y,np.ones(len(y)), groups=groupind, het=True)
    print_results(res)




More information about the SciPy-User mailing list