[SciPy-User] statsmodels gsoc application for comments

Skipper Seabold jsseabold@gmail....
Wed Apr 7 23:52:10 CDT 2010


All,

I was hoping that anyone who is interested could give some last-minute
comments on my Google Summer of Code application (the deadline is Friday).
Given the focus on the Python 3 transition, it's a shot in the dark, but
I'd like to continue working on the project.

Note that the schedule is in order of importance for the developers,
and that most if not all of the goals are part of my own work and have
already been started in some form or another but need more focused attention.

Cheers,

Skipper

Title
------
Development of the SciPy SciKit Statsmodels

Summary
--------------
Statsmodels is a pure Python package for statistical and econometric modeling.
It began as part of the SciPy project, was spun off into NiPy [0], and then
picked up last year as my GSoC 2009 project.  I intend to make statsmodels ready
for the jump to Python 3, focus on the lingering design issues to ensure the
package is mature and generally usable for all, and extend the types of
estimators available.


Background
------------------
Last summer I successfully completed my first Google Summer of Code with the
statsmodels package for the SciPy project under the Python Software
Foundation.
Since the end of GSoC 2009, my mentor and I have been able to release our
software as a standalone package while continuing to work on the code [1,2].
The project and its aim have garnered some attention, and a group of users,
particularly in Finance, Economics, and the Social Sciences more generally,
are coming together to work on related projects in Python.  In addition to
discussions on the scipy mailing list, we are seeing some traffic on our
pystatsmodels mailing list [3].  Pandas is a package focused on quantitative
finance that leverages statsmodels, and Wes McKinney has presented on pandas at
PyCon 2010 [4] and will lead a special topic presentation at the SciPy 2010
conference [5].  Projects such as larry and tabular are working on data structures
and their analysis geared towards statistics [6,7].  Further, the scipy.stats
sprint has currently garnered the most votes for the SciPy conference [8].
Given that the project is approaching a critical mass, I would like the
opportunity to go forward towards its main aim as a statistical/econometric
library for Python users while also getting the project ready for the Python 3
transition.  Only continued efforts will see to it that Python 3 becomes a
serious option for statistical data analysis for a larger group of users.


Project Schedule
----------------
pre-GSoC: Organize code in the sandbox so that "low hanging fruit" is very
          easy to pick.  Search for appropriately licensed code that might
          be helpful to translate for the project.

Week 1 - 2 (May 24 - June 6)
           Work on design issues, particularly, the model wrapper idea [9,10]
           so that statsmodels remains a useful library for other projects that
           implement different data structures rather than just numpy arrays.
           Streamline the maximum likelihood framework including the Model
           needs (i.e., analytic gradients, Hessians; automatic differentiation,
           numerical differentiation ideas).  Generic maximum likelihood
           framework.  Design of generic bootstrap methods and generic
           post-estimation testing framework.  Internal variable name handling
           and summary functions for output using SimpleTable in the sandbox.
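
As a rough illustration of the generic maximum likelihood idea above, here is
a minimal sketch in plain Python.  It is not statsmodels code: the names
`numerical_gradient` and `fit_mle` are hypothetical, and simple gradient ascent
stands in for a real optimizer.  The point is only that a model supplies a
log-likelihood, optionally an analytic gradient (score), and the framework
falls back to numerical differentiation otherwise.

```python
import math

def numerical_gradient(f, params, eps=1e-6):
    """Central-difference approximation to the gradient of f at params."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad

def fit_mle(loglike, start, score=None, step=0.1, tol=1e-8, maxiter=10000):
    """Maximize loglike by gradient ascent (hypothetical helper).

    Uses the analytic score if one is given, otherwise the numerical one.
    """
    params = list(start)
    for _ in range(maxiter):
        if score is not None:
            grad = score(params)
        else:
            grad = numerical_gradient(loglike, params)
        params = [p + step * g for p, g in zip(params, grad)]
        if max(abs(g) for g in grad) < tol:
            break
    return params

# Example: normal model parameterized as (mu, log sigma).
data = [1.0, 2.0, 3.0, 4.0, 5.0]

def normal_loglike(params):
    """Average normal log-likelihood, dropping the constant term."""
    mu, log_sigma = params
    sigma = math.exp(log_sigma)
    return sum(-log_sigma - 0.5 * ((x - mu) / sigma) ** 2
               for x in data) / len(data)

def normal_score(params):
    """Analytic gradient of normal_loglike."""
    mu, log_sigma = params
    var = math.exp(2.0 * log_sigma)
    d_mu = sum(x - mu for x in data) / (len(data) * var)
    d_log_sigma = -1.0 + sum((x - mu) ** 2 for x in data) / (len(data) * var)
    return [d_mu, d_log_sigma]

# The MLE here is the sample mean (3.0) and sigma = sqrt(2).
mu_hat, log_sigma_hat = fit_mle(normal_loglike, [0.0, 0.0])
mu_chk, log_sigma_chk = fit_mle(normal_loglike, [0.0, 0.0], score=normal_score)
```

Working with log sigma keeps the scale parameter positive without constraints,
which is one reason a generic framework might let models choose their own
parameterization.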

Week 3 - 4 (June 7 - June 20)
           Work on time series models.  Bring together and test all helper
           functions, including solvers and algorithms.  Finish VARMA and
           GARCH models, Hodrick-Prescott filter and Kalman filter.
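
For readers unfamiliar with the Hodrick-Prescott filter mentioned above, a
minimal sketch follows.  It is an illustration under my own assumptions, not
the planned statsmodels implementation: the function name `hp_filter` and its
return order are hypothetical, and plain Python lists are used to keep it
dependency-free where real code would use numpy and a sparse solver.  The trend
solves (I + lamb * K'K) * trend = y, where K is the second-difference matrix.

```python
def hp_filter(y, lamb=1600.0):
    """Return (cycle, trend) for series y with smoothing parameter lamb.

    lamb=1600 is the conventional choice for quarterly data.
    """
    n = len(y)
    if n < 3:
        raise ValueError("need at least 3 observations")
    # Build A = I + lamb * K'K, where K is the (n-2) x n matrix whose rows
    # are the second-difference stencil (1, -2, 1).
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for r in range(n - 2):
        stencil = {r: 1.0, r + 1: -2.0, r + 2: 1.0}
        for i, ki in stencil.items():
            for j, kj in stencil.items():
                a[i][j] += lamb * ki * kj
    # Solve A * trend = y by Gaussian elimination.  A is symmetric positive
    # definite and pentadiagonal, so no pivoting is needed and only the two
    # rows and columns next to the diagonal have to be touched.
    b = [float(v) for v in y]
    for col in range(n):
        for r in range(col + 1, min(col + 3, n)):
            factor = a[r][col] / a[col][col]
            if factor == 0.0:
                continue
            for c in range(col, min(col + 3, n)):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution on the resulting upper-triangular band.
    trend = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = b[r] - sum(a[r][c] * trend[c]
                       for c in range(r + 1, min(r + 3, n)))
        trend[r] = s / a[r][r]
    cycle = [yi - ti for yi, ti in zip(y, trend)]
    return cycle, trend
```

A useful sanity check is that a perfectly linear series has zero second
differences, so the filter should return it unchanged as the trend.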

Week 5 - 6 (June 21 - July 4)
           Work on systems of equations models, mainly their design.
           Work on panel data models and finish them, including
           random-effects estimators.  Dynamic panel estimators.  Error
           specification tests.  Implicit in this is a general instrumental
           variables framework.

Week 7 - 8 (July 5 - July 18)
           Nonparametric estimators. Nonparametric regression and univariate
           kernel density estimators with bandwidth cross-validation.
           Multivariate density estimators.
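
To illustrate what a univariate kernel density estimator with bandwidth
cross-validation involves, here is a minimal sketch.  It assumes a Gaussian
kernel and leave-one-out likelihood cross-validation over a fixed grid of
candidate bandwidths; the function names are hypothetical and this is only one
of several possible cross-validation criteria, not necessarily the one the
project will adopt.

```python
import math

def gaussian_kde(data, h):
    """Return a function evaluating the Gaussian kernel density estimate."""
    n = len(data)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

    return density

def loo_log_likelihood(data, h):
    """Leave-one-out log-likelihood of bandwidth h.

    Each point is scored under the density estimated from all other points.
    """
    n = len(data)
    norm = 1.0 / ((n - 1) * h * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i, xi in enumerate(data):
        s = sum(math.exp(-0.5 * ((xi - xj) / h) ** 2)
                for j, xj in enumerate(data) if j != i)
        if s <= 0.0:
            # The kernel sum underflowed: h is far too small for this sample.
            return float("-inf")
        total += math.log(norm * s)
    return total

def select_bandwidth(data, candidates):
    """Pick the candidate bandwidth with the highest leave-one-out score."""
    return max(candidates, key=lambda h: loo_log_likelihood(data, h))

sample = [-1.2, -0.5, 0.0, 0.3, 0.9, 1.7]
h_best = select_bandwidth(sample, [0.001, 0.3, 0.6, 1.0])
density = gaussian_kde(sample, h_best)
```

Leaving each point out when scoring it is what stops the criterion from
favoring an arbitrarily small bandwidth.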

Week 9 - 10 (July 19 - August 1)
           Information theoretic measures.  Generalized maximum entropy.
           Refactor of scipy.maxentropy to be more general.

Week 11 (August 2 - August 8)
          Catch up week for any work not finished according to deadlines above.

Week 12 (August 9 - August 16)
          Polish and tie up any remaining loose ends in the code, making sure
          test coverage is good, call signatures and inheritance structures are
          consistent, and all TODOs and NOTES are covered.  Make sure there are
          no remaining sphinx/documentation build issues.


About Me
---------------
I am finishing my second year as a PhD student in Economics at American
University in Washington, DC.  Since last summer, I have continued to study
statistics and econometrics including special topics in microeconometrics,
time series and macroeconometrics, and information theory and entropy
econometrics.  I try to do as much of my work in Python as possible; when this
is not an option, I work in R, Octave/Matlab, and commercial statistical
software (mainly Stata, EViews, SAS, NLogit/Limdep). Working with these other
packages definitely helps inform design decisions on the statsmodels package
and reinforces my belief that Python should be the scientific programming and
scripting language of choice for practicing researchers!


[1] http://statsmodels.sourceforge.net/
[2] https://code.launchpad.net/statsmodels
[3] http://groups.google.ca/group/pystatsmodels
[4] http://us.pycon.org/2010/conference/schedule/event/50/
[5] http://conference.scipy.org/scipy2010/papers.html
[6] http://larry.sourceforge.net/
[7] http://www.parsemydata.com/tabular/
[8] http://conference.scipy.org/scipy2010/sprints.html
[9] http://groups.google.ca/group/pystatsmodels/browse_thread/thread/72267d8e784a318b/
[10] http://groups.google.ca/group/pystatsmodels/browse_thread/thread/a47a84be6a41c45e/

