[SciPy-dev] stats.models report/preannouncement
Wed Aug 19 17:41:46 CDT 2009
The GSOC project for stats.models is close to its end.
We are currently finishing up some final changes and
improvements to the docstrings. We intend to "release" the
package next week. As a reminder, the GSOC project
"stats.models" had as a target to correct and update
Jonathan Taylor's statistical models (currently found in
NiPy) for eventual re-inclusion in SciPy.
models is a pure Python package.
models currently includes only:
* regression: mainly ordinary least squares (OLS) and generalized
least squares (GLS), including weighted least squares and least
squares with AR errors
* glm: generalized linear models
* rlm: robust linear models
* datasets: for examples and tests
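As a rough illustration of what the regression estimators compute (this is not the package's actual API; the names and toy data below are invented), weighted least squares can be run as ordinary least squares on "whitened" data:

```python
import numpy as np

# Invented toy data for illustration only.
rng = np.random.RandomState(0)
n = 200
X = np.column_stack([np.ones(n), rng.randn(n)])   # design matrix with intercept
beta_true = np.array([1.0, 2.0])
w = 1.0 + rng.rand(n)                             # known inverse-variance weights
y = X.dot(beta_true) + rng.randn(n) / np.sqrt(w)  # heteroscedastic noise

# OLS: minimize ||y - X b||^2
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# WLS as whitened OLS: rescale each row by sqrt(w), then run OLS
sw = np.sqrt(w)
beta_wls = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
```

Both estimators should recover the true coefficients here; WLS simply uses the known weights to downweight the noisier observations.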
The other code, which we did not have enough time to verify and fix,
was moved to a sandbox folder. The formula framework is no longer
used in the verified code. Only the verified part would go into SciPy.
Compared to the original code, the class structure and some of the
method arguments have changed. Additional estimation
results, e.g. test statistics have been included.
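As an example of the kind of additional estimation results meant here, t-statistics for OLS coefficients follow from the usual covariance formula. This is only a sketch of the underlying arithmetic, with invented names and data, not the package's interface:

```python
import numpy as np

# Invented toy data for illustration only.
rng = np.random.RandomState(42)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.randn(n)])
y = X.dot(np.array([0.5, 1.5])) + rng.randn(n)

XtX_inv = np.linalg.inv(X.T.dot(X))
beta = XtX_inv.dot(X.T).dot(y)              # OLS coefficient estimates
resid = y - X.dot(beta)
sigma2 = resid.dot(resid) / (n - k)         # unbiased error-variance estimate
bse = np.sqrt(sigma2 * np.diag(XtX_inv))    # coefficient standard errors
tvalues = beta / bse                        # t-statistics for H0: beta_j = 0
```

The slope here is large relative to its standard error, so its t-statistic is clearly significant.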
Most importantly, almost every result has been verified against at
least one other statistical package: R, Stata, or SAS. The guiding
principle for the rewrite was that all numbers have to be verified,
even if we don't manage to cover everything. There are a few remaining
issues that we hope to clear up by next week. Not all parts of the
code have been tested for unexpected inputs. We are currently adding
checks for, and conversions of, array types and dimensions. Additionally, many of
the tests call rpy to compare the results directly with R. We use an
extended wrapper for R models in the test suite. This provides greater
flexibility writing new test cases, but will eventually be replaced by
hard coded expected results.
The code is written for plain NumPy arrays.
We have also included, for the tests and examples, several datasets
that are in the public domain or used with permission. The datasets follow
fairly closely David C's datasets proposal in
scikits.learn, with some small modifications. The datasets
are set up so that it is easy to add more datasets.
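A sketch of what such a dataset convention can look like, assuming each dataset module exposes a load() function returning a record-like container; the Bunch class, field names, and numbers below are invented for illustration and are not necessarily the proposal's exact layout:

```python
import numpy as np

class Bunch(dict):
    """Dict with attribute access, a common container style for datasets."""
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

def load_example():
    """Hypothetical loader: returns the raw data plus endog/exog splits
    and a short description, so adding a new dataset is mostly a matter
    of writing one such function."""
    raw = np.array([[1.0, 2.0],
                    [2.0, 3.9],
                    [3.0, 6.1]])
    return Bunch(data=raw, endog=raw[:, 1], exog=raw[:, 0],
                 DESCR="toy dataset for illustration only")
```

With a fixed return shape like this, example scripts and tests can treat every dataset uniformly.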
The current question is: what will be the near future for
"models"? We would like to distribute it as a standalone
package to gain experience with the API, and to allow us to
make changes without being committed to backwards
compatibility. It will also give us the opportunity to find
and kill some remaining bugs, and fill some holes in our
test coverage.
We can either package it as a scikit or as an independent
package distributed through pypi. Also depending on the feedback,
it could go into scipy 0.8 to close the gap in the stats area,
but with a warning that there might still be some changes to
the API, or we could wait for 0.9 if 0.8 is coming out soon.
Earlier this summer, there was a discussion on the nipy
mailing list on the structure of the API and about possible
additional methods. The release as a standalone package or
scikit should give us the opportunity to discuss any changes
in the design or API and make adjustments to the code
without having to go through the scipy installation and
release cycle constraints.
We are also discussing future inclusion of additional
models. Separately, and only partially dependent on the
models code, we would like to create a package that can be
used as a staging ground for new models. We will focus on
models that are closer to our area, mainly econometrics and
time series analysis, and less on "pure" statistics. There has
also been some interest by others to add additional models
or new cases to existing models. This could also be the
location for the parts of "models" that we moved into the
sandbox until they can be fixed and verified. Whether
additional models are included in scipy or remain separate
can be discussed as they mature.
So, how should we package "models" for next week? And how
soon should we plan to include models in scipy?
Skipper, Josef, and Alan