[SciPy-dev] code for incremental least squares
Nathaniel Smith
njs@pobox....
Tue Feb 16 23:58:57 CST 2010
On Tue, Feb 16, 2010 at 2:56 PM, <josef.pktd@gmail.com> wrote:
> On Tue, Feb 16, 2010 at 4:56 PM, Nathaniel Smith <njs@pobox.com> wrote:
> pandas has expanding ols, (besides moving ols) which is similar in the
> idea that, for each new observation (or groups of observations seems
> to be in your case), the ols estimate is calculated in a recursive
> way.
Yes, and similar code can be used -- certainly you can calculate both
expanding and moving ols using what I've been calling "the cholesky
approach". (CHOLMOD actually includes code for "update" and "downdate"
operations, i.e., adding/removing rows to your model matrix directly
on the decomposition, without saving X'X separately and
re-decomposing. Of course, if your model matrix isn't sparse then
CHOLMOD is irrelevant.)
> However your code looks more efficient because of the incremental QR
> updating, and that you update more summary/sufficient statistics, but
> I just had a brief look and I don't remember the details of pandas.
I actually don't use the QR approach much, because it is much slower
than the Cholesky approach. In principle the QR version might be more
numerically stable (so far I've been lucky and haven't had problems),
and it is useful in some other cases (e.g. the thoroughly magic spline
fitting routines in R package "mgcv" have a mode that lets you fit
very large data sets by doing the QR decomposition in advance with
something like incremental_qr, and then pass in the decomposition to
the magic fitting functions), so it's worth keeping around. But I
don't know if it's possible to do a QR "downdate" for the "moving ols"
case to toss out observations at the trailing edge.
> My application is also the more in time series when validation of the
> estimator requires continuous updating of the estimate. In pandas case
> and in my case it is more computational efficiency that makes
> incremental estimation attractive than memory requirements.
> For this part I was looking more into using Kalman Filter, but,
> although I have seen papers that use matrix decomposition to do more
> efficient updating, my knowledge of recursive QR, or cholesky or ?? is
> not great.
If your goal is robust, fast, incremental OLS for general consumption
on potentially misbehaved data, then I'd definitely recommend looking
into true incremental QR algorithms, like that "AS274" I mentioned. It
looks like the biglm folks figured out how to do incremental sandwich
estimation too: see slide 32-ish:
http://faculty.washington.edu/tlumley/tutorials/user-biglm.pdf
> Actually, having more observations than explanatory variables is not
> necessarily the only standard case anymore. In machine learning and in
> econometrics in a "Data-Rich Environment", the number of observations
> and variables might be of the same order.
Yes, but then you need some regularization to make the problem
well-defined -- simple OLS will blow up. (Of course, penalized least
squares may work fine, and is just as tractable as OLS; I think the
same techniques work, though I haven't checked the details.)
> But none of the applications
> that I know of in econometrics would struggle with memory problems.
Yes, my data sets are from EEG (brain waves), so I get 250 samples per
second (or more), and I'm trying to estimate whole impulse response
curves (so I have predictors like "event with property x1 occurred 100
ms ago", "event with property x1 occurred 104 ms ago", ...), which
leads to thousands of predictors and millions of observations = tens
of gigabyte sized model matrices. It's a somewhat unusual situation to
be in, but I guess data sets grow fast enough that it may become more
common...
-- Nathaniel
More information about the SciPy-Dev
mailing list