[SciPy-dev] code for incremental least squares

Nathaniel Smith njs@pobox....
Tue Feb 16 14:18:49 CST 2010


Hello all,
I need to solve some very large least squares regression problems --
fortunately, my model matrix can be generated incrementally, so there
are tricks for doing the important calculations without ever loading
the whole matrix into memory. The resulting code is rather general and
might be of wider interest, so I thought I'd send a note here and see
what people think about upstreaming it somehow...
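
(For concreteness, here is a minimal sketch of the basic trick, with
made-up names rather than my actual API: since X'X and X'y are just
sums over blocks of rows, you can accumulate them one block at a time
and then solve the small k x k normal equations.)

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def incremental_lstsq(blocks):
        # "blocks" yields (X_i, y_i) chunks of rows; only one chunk is
        # in memory at a time. Accumulate the k x k matrix X'X and the
        # k-vector X'y, then solve the normal equations by Cholesky.
        xtx = xty = None
        for X, y in blocks:
            if xtx is None:
                k = X.shape[1]
                xtx = np.zeros((k, k))
                xty = np.zeros(k)
            xtx += np.dot(X.T, X)
            xty += np.dot(X.T, y)
        # Assumes X has full column rank, so X'X is positive definite.
        return cho_solve(cho_factor(xtx), xty)

(A QR-based version can avoid forming X'X explicitly, which is the
same idea but better conditioned, at the cost of a more involved
update step.)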

Specifically, what it implements is:
  -- Least squares regression for very large numbers of observations
  -- Using either Cholesky or QR methods
  -- Multivariate or univariate
  -- Full statistical tests (t tests, linear hypothesis F tests,
univariate and multivariate)
  -- "Grouped" weighted least squares (i.e., if your observations fall
into a small number of different classes, and are homoskedastic within
each class, but not across classes, it can handle that), with simple
EM estimation of the weights ("feasible generalized least squares")
  -- The Cholesky solver's initial "accumulation" step has a
parallelization-friendly API, and can take advantage of sparseness in
the model matrix (also sketched below)
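
To give a flavor of what "parallelization-friendly accumulation" means
(the names below are hypothetical, not the actual interface): since
X'X and X'y are sums over row blocks, each worker can accumulate its
own cross-products independently, and the partial results can be
merged before solving:

    import numpy as np

    class XtXAccumulator(object):
        # Hypothetical sketch: per-worker accumulators that can be
        # merged, because X'X and X'y are just sums over row blocks.
        def __init__(self, k):
            self.xtx = np.zeros((k, k))
            self.xty = np.zeros(k)

        def update(self, X, y):
            self.xtx += np.dot(X.T, X)
            self.xty += np.dot(X.T, y)

        def merge(self, other):
            self.xtx += other.xtx
            self.xty += other.xty

    # Each worker updates its own accumulator over a slice of the
    # data; afterwards, merge them and solve once:
    #     total.merge(partial)
    #     beta = np.linalg.solve(total.xtx, total.xty)

And the grouped FGLS is the obvious alternating scheme -- roughly (a
sketch of the idea, not the actual code):

    import numpy as np

    def fgls(X, y, groups, iters=20):
        # Feasible GLS for grouped heteroskedasticity: alternate
        # between a weighted fit and re-estimating each group's
        # residual variance. Fixed iteration count for simplicity.
        w = np.ones(len(y))                 # w ~ 1/sigma per row
        for _ in range(iters):
            beta = np.linalg.lstsq(X * w[:, np.newaxis], y * w)[0]
            resid = y - np.dot(X, beta)
            for g in np.unique(groups):
                mask = (groups == g)
                w[mask] = 1.0 / np.sqrt(np.mean(resid[mask] ** 2))
        return beta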

Most of the code is unencumbered, but I cribbed some of the
calculations for multivariate statistics (Wilks' lambda and all that)
from R, so those functions are GPL'ed (but could presumably be
rewritten if anyone cares). Also the sparse Cholesky method should
probably take advantage of the GPL'ed scikits.sparse.cholmod, but that
isn't actually implemented yet (it just uses sparse matrices for the
accumulation step, and then converts to a dense matrix at the end).
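
For reference, that sparse accumulation amounts to roughly this (again
a sketch, not the actual code):

    import numpy as np
    import scipy.sparse as sp

    def accumulate_sparse(blocks):
        # Accumulate X'X with sparse matrix products, and densify only
        # the final k x k result.
        xtx = xty = None
        for X, y in blocks:           # X is a scipy.sparse matrix
            if xtx is None:
                k = X.shape[1]
                xtx = sp.csr_matrix((k, k))
                xty = np.zeros(k)
            xtx = xtx + X.T * X       # sparse-sparse product
            xty += X.T * y            # sparse-dense -> ndarray
        return xtx.toarray(), xty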

So, any thoughts on what would be best to do with this? I'd be happy
to contribute it to SciPy or whatever. In some ways it overlaps with
scikits.statsmodels, but of course the API is quite different.

-- Nathaniel

