[Numpy-discussion] lstsq functionality
Charles R Harris
charlesr.harris@gmail....
Tue Jul 20 10:23:25 CDT 2010
On Tue, Jul 20, 2010 at 8:32 AM, Skipper Seabold <jsseabold@gmail.com>wrote:
> On Tue, Jul 20, 2010 at 1:08 AM, Charles R Harris
> <charlesr.harris@gmail.com> wrote:
> >
> >
> > On Mon, Jul 19, 2010 at 9:40 PM, Keith Goodman <kwgoodman@gmail.com>
> wrote:
> >>
> >> On Mon, Jul 19, 2010 at 8:27 PM, Charles R Harris
> >> <charlesr.harris@gmail.com> wrote:
> >> >
> >> >
> >> > On Mon, Jul 19, 2010 at 9:02 PM, Keith Goodman <kwgoodman@gmail.com>
> >> > wrote:
> >> >>
> >> >> On Mon, Jul 19, 2010 at 6:53 PM, Joshua Holbrook
> >> >> <josh.holbrook@gmail.com> wrote:
> >> >> > On Mon, Jul 19, 2010 at 5:50 PM, Charles R Harris
> >> >> > <charlesr.harris@gmail.com> wrote:
> >> >> >> Hi All,
> >> >> >>
> >> >> >> I'm thinking about adding some functionality to lstsq because I
> find
> >> >> >> myself
> >> >> >> doing the same fixes over and over. List follows.
> >> >> >>
> >> >> >> Add weights so data points can be weighted.
> >> >> >> Use column scaling so condition numbers make more sense.
> >> >> >> Compute covariance approximation?
> >> >> >>
> >> >> >> Unfortunately, the last will require using svd since there no
> linear
> >> >> >> least
> >> >> >> squares routines in LAPACK that also compute the covariance, at
> >> >> >> least
> >> >> >> that
> >> >> >> google knows about.
> >> >> >>
> >> >> >> Thoughts?
> >> >> >
> >> >> > Maybe make 2 functions--one which implements 1 and 2, and another
> >> >> > which implements 3? I think weights is an excellent idea!
> >> >>
> >> >> I'd like a lstsq that did less, like not calculate the sum of squared
> >> >> residuals. That's useful in tight loops. So I also think having two
> >> >> lstsq makes sense. One as basic as possible; one with bells. How does
> >> >> scipy's lstsq fit into all this?
> >> >
> >> > I think the computation of the residues is cheap in lstsq. The
> >> > algolrithm
> >> > used starts by reducing the design matrix to bidiagonal from and
> reduces
> >> > the
> >> > rhs at the same time. In other words an mxn problem becomes a (n+1)xn
> >> > problem. That's why the summed square of residuals is available but
> not
> >> > the
> >> > individual residuals, after the reduction there is only one residual
> and
> >> > its
> >> > square it the residue.
> >>
> >> That does sound good. But it must take some time. There's indexing,
> >> array creation, if statement, summing:
> >>
> >> if results['rank'] == n and m > n:
> >> resids = sum((transpose(bstar)[n:,:])**2,
> >> axis=0).astype(result_t)
> >>
> >> Here are the timings after removing the sum of squared residuals:
> >>
> >> >> x = np.random.rand(1000,10)
> >> >> y = np.random.rand(1000)
> >> >> timeit np.linalg.lstsq(x, y)
> >> 1000 loops, best of 3: 369 us per loop
> >> >> timeit np.linalg.lstsq2(x, y)
> >> 1000 loops, best of 3: 344 us per loop
> >> >>
> >> >> x = np.random.rand(10,2)
> >> >> y = np.random.rand(10)
> >> >> timeit np.linalg.lstsq(x, y)
> >> 10000 loops, best of 3: 102 us per loop
> >> >> timeit np.linalg.lstsq2(x, y)
> >> 10000 loops, best of 3: 77 us per loop
> >> _
> >
> > Now that I've looked through the driver program I see that you are right.
> > Also, all the info needed for the covariance is almost available in the
> > LAPACK driver program. Hmm, it seems that maybe the best thing to do here
> is
> > to pull out the lapack_lite driver program, modify it, and make it a
> > standalone python module that links to either the lapack_lite programs or
> > the ATLAS versions. But that is more work than just doing things in
> python
> > :-\ It does have the added advantage that all the work arrays can be
> handled
> > internally.
> >
>
> IIUC, I tried this with statsmodels earlier this summer, but found the
> result to be slower than our current implementation (and surely slower
> than lstsq since we have a bit more Python overhead around the LAPACK
> function calls). I will have to dig through the revisions and make
> sure about this.
>
>
You modified the C version of dgelsd in lapack_lite? That is the driver I'm
talking about.
> But for the record, we have weighted least squares (WLS) and you can
> get the covariance matrix. I think you can do column-scaling with our
> GLS (generalized least squares class). I haven't seen the terms
> row-scaling and column-scaling used much in econometrics books do you
> have a good reference offhand?
>
>
The dgelsd driver scales the matrix, but only enough to keep the numbers in
the working range. Column scaling affects the singular values, so if you are
using them to track the condition number of the matrix it can effect the
outcome. The deeper thing that is going on is that some of the usefulness of
the svd is based on the assumption that the euclidean distance between
parameter vectors means something. I just tend to normalize the columns to
unit length. It is a bit like using z-scores in statistics except the column
norm is normalized instead of the column variance. I'm sure there are
references out there but a quick google doesn't show anything obvious.
Normalization seems to be a bit neglected in the usual classroom notes.
Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/numpy-discussion/attachments/20100720/7458bae0/attachment-0001.html
More information about the NumPy-Discussion
mailing list