[Nipy-devel] rm 2-way ANOVA

josef.pktd@gmai... josef.pktd@gmai...
Sun Nov 15 20:37:58 CST 2009

On Sun, Nov 15, 2009 at 9:04 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
> On Sun, Nov 15, 2009 at 8:37 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
>> On Sun, Nov 15, 2009 at 8:18 PM,  <josef.pktd@gmail.com> wrote:
>>> On Sun, Nov 15, 2009 at 6:43 PM, Christian Brodbeck
>>> <christianmbrodbeck@gmail.com> wrote:
>>>> Thanks for all the responses! I managed to implement
>>>> with scikits.statsmodels.OLS and dummy-coding the equivalent of R's
>>>> aov(V~f+Error(subject)). Does anyone happen to know how I would implement
>>>> subject as random factor?
>>> I assume random factors would be similar as random effects in a panel
>>> data model. I looked at it but we don't have an example yet. We have
>>> generalized least squares that takes an arbitrary error
>>> variance-covariance matrix. But until now I haven't tried to
>>> build/estimate the error covariance matrix in a more general setting.
>>> The only example we have is GLSAR. We are still missing quite a bit of
>>> supporting functionality to make this kind of models easy to use.
>>> (Actually I haven't seen random factor ANOVA before, so I have to read
>>> up a bit.)
>>>> I'd be interested to see how you went about doing this, if you think
>>>> it is general enough to contribute to the project.  Did you primarily
>>>> use Josef's code?
>>>> I wrote my own code, it depends very much on what your source data looks
>>>> like and how your factors are coded (as Josef said). It is fairly simple in
>>>> principle though and could be adapted. If in your data array subjects are
>>>> alternating inside blocks, the fixed effect of subject can be added to the
>>>> design matrix as :
>>>>     _subjects_   _conditions...
>>>> s1  1  0  0  0    1 ... \
>>>> s2  0  1  0  0    1      \ condition 1
>>>> s3  0  0  1  0    1      /
>>>> s4 -1 -1 -1 -1    1     /
>>>> s1  1  0  0  0    0     \
>>>> s2  0  1  0  0    0      \ condition 2
>>>> ...
>>> I don't think I understand the meaning of the -1 in s4.
>>>> (after Rutherford (2000) Introducing Anova and Ancova Amazon preview; I
>>>> ordered the book so I hope will know more soon)
>>>> I did this with:
>>>> n = number_of_subjects
>>>> p = number_of_conditions
>>>> one_block = np.vstack((np.eye(n-1), np.r_[[-1]*(n-1)]))
>>>> subjects_matrix = np.vstack([one_block]*p)
>>>> design_matrix = np.hstack((subjects_matrix, rest_of_design_matrix))
>>>> Also, Scikits.statsmodels still requires me to repeat the procedure and call
>>>> the fit function for each individual test seperately, right? There is no
>>>> direct call for fitting a model for multiple tests independently (i.e. for
>>>> electrode by time plots)?
>>>> I'm not quite sure what you're after here, but if you want to fit
>>>> multiple models, then you have to instantiate each model and call fit
>>>> as the code is at the moment, because the way that I see it (I have an
>>>> econometrics background) a model is both a response and a set of
>>>> covariates (not sure I want to open this can of worms again...but the
>>>> code is probably not as general as it could be given the way I am used
>>>> to approaching problems in economics).  This should be pretty
>>>> straightforward using list comprehension, but if you want to post (or
>>>> send me off list, if you think it's too off-topic) a few more details
>>>> about your code, we can have a look at extending the functionality
>>>> (probably won't happen until Christmas break for me unless someone
>>>> else wants to have a go).
>>>> Yes you would have to fit each model, what I meant was merely whether there
>>>> might be a function to avoid the for loop in Python (129 electrodes * 150
>>>> sample points) .. not very urgent, really.
>>> If I understand this correctly than you have to loop 129 times ?
>>> The main regression classes GLS and subclasses are not designed for
>>> multiple endogenous variables. The parameter estimation would be easy
>>> to extend to multiple endog, however all the statistical results are
>>> designed for a single estimation problem.
>>> There are three possible answers
>>> During summer there was a brief discussion making the model reusable.
>>> This would not avoid the loop over endog, but would make it cheaper.
>>> Currently, I think the only duplicate calculation is the pinv of the
>>> exog in OLS and additionally the cholesky decomposition of the error
>>> covariance matrix in GLS. If these are expensive calculations, then
>>> introducing a copy method in the model class could avoid duplicate
>>> calculations, but it would still return individual result instances.
>>> Since there was no more discussion on this, we didn't pursue it any
>>> further.
>>> Skipper started in sandbox.sysreg a seemingly unrelated regression SUR
>>> model. With  identical design matrix per equation (or identity error
>>> correlation matrix), this would just reduce to individual ols
>>> regressions. But I don't think, SUR is designed to provide a lot of
>>> equation specific result statistics, but I haven't looked closely at
>>> all available results.
>>> For cases where a large number of regression with only a small set of
>>> results is required, then it is better to write a dedicated function.
>>> For example, Wes in pandas is using panel regression in a moving or
>>> rolling window in a long time series. This can be much more
>>> efficiently done in a special purpose function/class, than just
>>> calling ols several hundred or thousand times.
>>> The first could be done with small changes to the current code, but we
>>> would have to be careful which attributes/properties are copied and
>>> which are used by reference. Currently, we do minimal copying, e.g.
>>> the data, endog, exog, I think, is only copied if it is not a numpy
>>> array, and the result instances keep a reference to the model
>>> instance.
>>> SUR could potentially be an easy solution for this, but it is designed
>>> for a different purpose, and I don't know if it provides the correct
>>> results, and whether it would actually be faster, given the extra
>>> calculations for the covariance matrix.
>> My initial thoughts were that some kind of solution like seemingly
>> unrelated regressions would be useful, but basically ignoring all of
>> the covariance stuff.  Ie., your response would be 129 * N x 1 (or
>> whatever) and then your design would be a big block diagonal (using
>> sparse matrices) of your covariates.  I think this is how you do
>> multiple OLS regressions at the same time, but I haven't done the
>> linear algebra proofs myself to confirm.
> Yes, this is right.  Using a trunk of scipy for block_diag you have
> import scikits.statsmodels as sm
> import numpy as np
> from scipy import linalg
> data1 = sm.datasets.longley.Load()
> data2 = sm.datasets.longley.Load()
> data1.exog = sm.add_constant(data1.exog)
> data2.exog = sm.add_constant(data2.exog)
> exog = linalg.block_diag(data1.exog, data2.exog)
> endog = np.vstack((data1.endog[:,None],data1.endog[:,None]))
> res = sm.OLS(endog,exog).fit()
> In [17]: res.params
> Out[17]:
> array([  1.50618723e+01,  -3.58191793e-02,  -2.02022980e+00,
>        -1.03322687e+00,  -5.11041057e-02,   1.82915146e+03,
>        -3.48225863e+06,   1.50618723e+01,  -3.58191793e-02,
>        -2.02022980e+00,  -1.03322687e+00,  -5.11041057e-02,
>         1.82915146e+03,  -3.48225863e+06])
> res2 = sm.OLS(data1.endog, data1.exog).fit()
> In [19]: res2.params
> Out[19]:
> array([  1.50618723e+01,  -3.58191793e-02,  -2.02022980e+00,
>        -1.03322687e+00,  -5.11041057e-02,   1.82915146e+03,
>        -3.48225863e+06])
> So if you are using OLS to do the anova, then you can use the block
> diagonal.  You should probably use sparse matrices though if your
> design is very big.  There is an example of building a block diagonal
> matrix in our sandbox (haven't had time to clean this up, write tests,
> though the results are correct via the 'eyeball' test, or get a
> polished results class yet).  It's in the __init__ method for the SUR
> class.  http://bazaar.launchpad.net/~scipystats/statsmodels/trunk/annotate/head%3A/scikits/statsmodels/sandbox/sysreg.py

For the case when all design matrices are the same, you could save on
replicating them in the block diagonal matrix.
OLS parameter estimation is easy for this case, linalg.leastsq allows
for matrix endog, the work is in getting all the anova and regression
result statistics.

Also, building the sparse block diagonal matrix might take more time
than the simple loop over individual regression. (I guess, as long as
there is not a very large number of small regression problems.)


>> Skipper
>>> Writing a specialized class is the only fast option for specialized
>>> heavy duty estimation that doesn't fit in the current framework and
>>> where the extras results are not important.
>>> We have received only limited feedback for the actual use of
>>> statsmodels, so we don't really know where the potential problems or
>>> bottlenecks could be. For our use of estimating a few econometric
>>> specifications and getting the result statistics, statsmodels works
>>> pretty well (although it still has many missing pieces.)
>>> Josef
>>>> -Christian

More information about the Nipy-devel mailing list