Charles R Harris
Sat Sep 25 14:27:38 CDT 2010
On Tue, Sep 14, 2010 at 12:10 PM, Tim Valenta <firstname.lastname@example.org>wrote:
> Hello all,
> Longtime Python user here. I'm a computer science guy, went through mostly
> calculus and discrete math courses-- never much in pure statistics. Which
> leads me into my dilemma:
> I'm doing least-square linear regression on a scatter plot of points, where
> several inputs lead to one output. That is, it's not just one system of x,y
> coordinates, but instead several parallel systems something more like (x1,y)
> (x2, y) (x3, y), upwards to (xn, y). I'm attempting to get my company off
> of a severely broken calculation engine, but throwing my data at
> linalg.lstsq or stats.linregress isn't just a magic one-step fix to my
> desired solution.
> Basically, the company ends up using data from Microsoft Office 2003 Excel,
> which can do hardly-configurable regression calculation on a matrix of data,
> each row formatted like y, x1, x2, x3 (to follow my example given above).
> It spits out all kinds of variables and data, and the company draws on the
> calculated 'coefficients' to make a pie graph, the coefficients taken as
> percentages. The Excel coefficient numbers come out as something in the
> range of (-1,1). Their goal is to take the 2 or 3 largest of the
> coefficients and look at the corresponding xn value (that is, x1, x2, or x3)
> to decide which of the x values is most influential in the resulting y.
> So first:
> linalg.lstsq gives me back four values, none of which, except `rank`, do I
> completely understand. The first returned value, `x`, claims to be a
> vector, but I'm a little lost on that. If it's a vector, I assume it
> projects from the origin? That seems too easy, and likely incorrect,
> but I've got no `intercept` value to observe. Is this capable of
> giving me the result I'm trying to chase?
> stats.linregress gives me much more familiar data, but seems to only
> compute one (x,y) plot at a time, leading me to conclude that I should do 3
> separate computations: linregress with all (x1,y) data points, then again
> with (x2,y) points, and then again with (x3,y) points. Using this method, I
> get slopes, intercepts, r-values, etc. Is it then up to me to minimize the
> r^2 value? My lack of exposure to stats simply leaves me unsure of this.
> Is the procedure my company uses totally flawed? That is, if I want to
> determine which of x1, x2, and x3 is most influential on y, across all
> samples, is there a more direct calculation that yields the correct answer?
> Thanks in advance-- I'm eager to wrap my head around this! I just need
> some direction as I continue to play with scipy.
I'm not clear on what the Excel spreadsheet is doing, but I suspect it is
solving something like Ax ~= y, where A is a matrix formed from the column
values and x is the vector of coefficients. The least squares solution is x,
which is indeed a vector, and the columns with the largest corresponding
elements of x contribute the most to the solution. However, there are other
possibilities in this situation. It would help if you could supply a small
example along with the result.
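Since I don't have your data, here is a minimal sketch of what I mean, using
made-up numbers where y is driven mostly by x1. The intercept comes from
adding a column of ones to A, which is probably what Excel does under the
hood:

```python
import numpy as np

# Hypothetical data for illustration: y depends strongly on x1,
# weakly on x2 and x3, plus a little noise.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
y = 3.0 * x1 + 0.5 * x2 + 0.1 * x3 + rng.normal(scale=0.1, size=n)

# Design matrix A with a leading column of ones, so the model
# y ~= c0 + c1*x1 + c2*x2 + c3*x3 includes an intercept c0.
A = np.column_stack([np.ones(n), x1, x2, x3])

# lstsq returns (solution, residuals, rank, singular values);
# the solution vector holds the intercept followed by the coefficients.
coef, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print("intercept:   ", coef[0])
print("coefficients:", coef[1:])
```

Note that comparing raw coefficient magnitudes only makes sense if the
columns are on comparable scales; otherwise you'd want to standardize the
columns first.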