[Numpy-discussion] lstsq functionality
Skipper Seabold
jsseabold@gmail....
Tue Jul 20 09:32:55 CDT 2010
On Tue, Jul 20, 2010 at 1:08 AM, Charles R Harris
<charlesr.harris@gmail.com> wrote:
>
>
> On Mon, Jul 19, 2010 at 9:40 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
>>
>> On Mon, Jul 19, 2010 at 8:27 PM, Charles R Harris
>> <charlesr.harris@gmail.com> wrote:
>> >
>> >
>> > On Mon, Jul 19, 2010 at 9:02 PM, Keith Goodman <kwgoodman@gmail.com>
>> > wrote:
>> >>
>> >> On Mon, Jul 19, 2010 at 6:53 PM, Joshua Holbrook
>> >> <josh.holbrook@gmail.com> wrote:
>> >> > On Mon, Jul 19, 2010 at 5:50 PM, Charles R Harris
>> >> > <charlesr.harris@gmail.com> wrote:
>> >> >> Hi All,
>> >> >>
>> >> >> I'm thinking about adding some functionality to lstsq because I find
>> >> >> myself
>> >> >> doing the same fixes over and over. List follows.
>> >> >>
>> >> >> Add weights so data points can be weighted.
>> >> >> Use column scaling so condition numbers make more sense.
>> >> >> Compute covariance approximation?
>> >> >>
>> >> >> Unfortunately, the last will require using svd since there no linear
>> >> >> least
>> >> >> squares routines in LAPACK that also compute the covariance, at
>> >> >> least
>> >> >> that
>> >> >> google knows about.
>> >> >>
>> >> >> Thoughts?
>> >> >
>> >> > Maybe make 2 functions--one which implements 1 and 2, and another
>> >> > which implements 3? I think weights is an excellent idea!
>> >>
>> >> I'd like a lstsq that did less, like not calculate the sum of squared
>> >> residuals. That's useful in tight loops. So I also think having two
>> >> lstsq makes sense. One as basic as possible; one with bells. How does
>> >> scipy's lstsq fit into all this?
>> >
>> > I think the computation of the residues is cheap in lstsq. The
>> > algolrithm
>> > used starts by reducing the design matrix to bidiagonal from and reduces
>> > the
>> > rhs at the same time. In other words an mxn problem becomes a (n+1)xn
>> > problem. That's why the summed square of residuals is available but not
>> > the
>> > individual residuals, after the reduction there is only one residual and
>> > its
>> > square it the residue.
>>
>> That does sound good. But it must take some time. There's indexing,
>> array creation, if statement, summing:
>>
>> if results['rank'] == n and m > n:
>> resids = sum((transpose(bstar)[n:,:])**2,
>> axis=0).astype(result_t)
>>
>> Here are the timings after removing the sum of squared residuals:
>>
>> >> x = np.random.rand(1000,10)
>> >> y = np.random.rand(1000)
>> >> timeit np.linalg.lstsq(x, y)
>> 1000 loops, best of 3: 369 us per loop
>> >> timeit np.linalg.lstsq2(x, y)
>> 1000 loops, best of 3: 344 us per loop
>> >>
>> >> x = np.random.rand(10,2)
>> >> y = np.random.rand(10)
>> >> timeit np.linalg.lstsq(x, y)
>> 10000 loops, best of 3: 102 us per loop
>> >> timeit np.linalg.lstsq2(x, y)
>> 10000 loops, best of 3: 77 us per loop
>> _
>
> Now that I've looked through the driver program I see that you are right.
> Also, all the info needed for the covariance is almost available in the
> LAPACK driver program. Hmm, it seems that maybe the best thing to do here is
> to pull out the lapack_lite driver program, modify it, and make it a
> standalone python module that links to either the lapack_lite programs or
> the ATLAS versions. But that is more work than just doing things in python
> :-\ It does have the added advantage that all the work arrays can be handled
> internally.
>
IIUC, I tried this with statsmodels earlier this summer, but found the
result to be slower than our current implementation (and surely slower
than lstsq since we have a bit more Python overhead around the LAPACK
function calls). I will have to dig through the revisions and make
sure about this.
But for the record, we have weighted least squares (WLS) and you can
get the covariance matrix. I think you can do column-scaling with our
GLS (generalized least squares class). I haven't seen the terms
row-scaling and column-scaling used much in econometrics books do you
have a good reference offhand?
Skipper
More information about the NumPy-Discussion
mailing list