[SciPy-User] scipy.linalg.lstsq

Charles R Harris charlesr.harris@gmail....
Sat Sep 25 14:27:38 CDT 2010

On Tue, Sep 14, 2010 at 12:10 PM, Tim Valenta <tonightslastsong@gmail.com> wrote:

> Hello all,
>
> Longtime Python user here.  I'm a computer science guy, went through mostly
> calculus and discrete math courses-- never much in pure statistics.  Which
> leads me into my dilemma:
>
> I'm doing least-square linear regression on a scatter plot of points, where
> several inputs lead to one output.  That is, it's not just one system of x,y
> coordinates, but instead several parallel systems something more like (x1,y)
> (x2, y) (x3, y), upwards to (xn, y).  I'm attempting to get my company off
> of a severely broken calculation engine, but throwing my data at
> linalg.lstsq or stats.linregress isn't just a magic one-step fix to my
> desired solution.
>
> Basically, the company ends up using data from Microsoft Office 2003 Excel,
> which can do hardly-configurable regression calculation on a matrix of data,
> each row formatted like y, x1, x2, x3 (to follow my example given above).
>  It spits out all kinds of variables and data, and the company draws on the
> calculated 'coefficients' to make a pie graph, the coefficients taken as
> percentages.  The Excel coefficient numbers come out as something in the
> range of (-1,1).  Their goal is to take the 2 or 3 largest of the
> coefficients and look at the corresponding xn value (that is, x1, x2, or x3)
> to decide which of the x values is most influential in the resulting y.
>
> So first:
>
> linalg.lstsq gives me back four variables, none of which, but `rank`, do I
> completely understand.  The first returned value, `x`, claims to be a
> vector, but I'm a little lost on that.  If it's a vector, I assume it's
> projecting from the origin point?  But that seems too easy, and likely
> incorrect, but I've got no `intercept` value to observe.  Is this capable of
> giving me the result I'm trying to chase?
>
> Second:
>
> stats.linregress gives me much more familiar data, but seems to only
> compute one (x,y) plot at a time, leading me to conclude that I should do 3
> separate computations: linregress with all (x1,y) data points, then again
> with (x2,y) points, and then again with (x3,y) points.  Using this method, I
> get slopes, intercepts, r-values, etc.  Is it then up to me to minimize the
> r^2 value?  Simply my lack of exposure to stats leaves me unsure of this
> step.
>
> Third:
>
> Is the procedure my company uses totally flawed?  That is, if I want to
> determine which of x1, x2, and x3 is most influential on y, across all
> samples, is there a more direct calculation that yields the correct
> conclusion?
>
> Thanks in advance-- I'm eager to wrap my head around this!  I just need
> some direction as I continue to play with scipy.
>
>
I'm not clear on what the Excel spreadsheet is doing, but I suspect it is
solving something like Ax ~= y, where A is a matrix formed from the column
values and x holds the coefficients. The least squares solution to this is x,
which is indeed a vector, and the columns with the largest corresponding
elements in x would contribute the most to the solution. However, there are
other possibilities in this situation. It would help if you could supply a
small example along with the result.
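
For what it's worth, here is a minimal sketch of the Ax ~= y setup, in case it
helps as a starting point. The data, the coefficient values, and the variable
names are all made up for illustration; I'm only guessing that this matches
what Excel does. The one trick is that lstsq has no separate intercept, so a
column of ones is appended to A and its coefficient plays that role.

```python
import numpy as np
from scipy import linalg

# Made-up data: 50 samples of three inputs x1, x2, x3 (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_coef = np.array([0.5, -0.2, 0.1])  # assumed coefficients, for the demo only
true_intercept = 2.0                    # assumed intercept, for the demo only
y = X @ true_coef + true_intercept      # noise-free, so the fit recovers them

# Append a column of ones so that the last coefficient acts as the intercept.
A = np.column_stack([X, np.ones(len(y))])

# lstsq returns the solution vector, residues, the rank of A,
# and the singular values of A.
coef, residues, rank, sing_vals = linalg.lstsq(A, y)
slopes, intercept = coef[:3], coef[3]

# Order the inputs by the magnitude of their coefficients, largest first.
order = np.argsort(-np.abs(slopes))
print("slopes:", slopes)
print("intercept:", intercept)
print("inputs ordered by |coefficient|:", order)
```

Note that comparing raw coefficients only makes sense if the columns are on
comparable scales, which is one reason a small example of the real data would
help.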

Chuck