[SciPy-user] How to perform a multiple linear regression analysis with Scipy?
Bruce Southey
bsouthey@gmail....
Thu Jun 25 08:34:41 CDT 2009
On 06/25/2009 07:26 AM, wierob wrote:
> Hi,
>
> I'm trying to automate multiple linear regression (e.g. trend line
> does not fit y = m*x + n) analysis for a set of samples.
>
> For the simple case where the trend line adheres to y = m*x + n I
> found stats.linregress. polyfit can fit polynomials but does not
> calculate statistical values for goodness of fit. I could not find a
> function that does a regression analysis for e.g. a parabola.
>
> Specifically, I need:
> - the parameters of the fitted function for plotting
> - residuals for plotting and evaluation of the correlation
> - Q-Q-plots ?
> - statistics for evaluating the goodness of fit:
> - (adjusted) coefficient of determination
> - std errors / deviation
> - F-test / p-values
>
> I managed to get some of these using lstsq as follows:
> from pylab import *
>
> # example data taken from http://en.wikipedia.org/wiki/Linear_regression#Example
> # Height (m)
> height = [1.47, 1.5, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
> 1.68, 1.7, 1.73, 1.75, 1.78, 1.8, 1.83]
> # Weight (kg)
> weight = [52.21, 53.12, 54.48, 55.84, 57.2, 58.57, 59.93,
> 61.29, 63.11, 64.47, 66.28, 68.1, 69.92, 72.19, 74.46]
>
> observed = weight # just to retain the terminology
> sample_size = len(height) # = len(weight)
>
> # assume parabola -> weight = p0 + p1 * height + p2 * height**2
> number_of_params = 3 # p0, p1 and p2
> X = zeros((len(height), number_of_params),float)
> X[:,0] = 1
> X[:,1] = height
> X[:,2] = [h**2for h in height]
>
> # p - paramter of parabola, sse - sum of squared errors ???
> params, sse = lstsq(X, weight)[:2]
> predicted = [ params[0] + params[1]*h + params[2]*h**2for h in height ]
> predicted_error_u = [ params[0] + params[1]*h + params[2]*h**2 + ssefor h in height ] # ???
> predicted_erro_l = [ params[0] + params[1]*h + params[2]*h**2 - ssefor h in height ] # ???
>
> # according to http://en.wikipedia.org/wiki/Coefficient_of_determination
> mean_observed = mean(observed)
> mean_predicted = mean(predicted)
> ss_tot = sum([ (o-mean_observed)**2for o in observed ])
> ss_reg = sum([ (p-mean_observed)**2for p in predicted ])
> ss_err = sum([ (o-p)**2for o, p in zip(observed, predicted) ]) # == see ??
> r_squared = 1 - ss_err / ss_tot # coefficient_of_determination
>
> r_squared_adjusted = 1 - (1 - r_squared) * ((sample_size - 1) / (sample_size - number_of_params-1 - 1)) # ??
>
>
> clf()
> plot(height, weight,"b.", label="observed")
> plot(height, predicted,'r', label='predicted')
> plot(height, predicted_error_u, 'r:', label='error')
> plot(height, predicted_erro_l, 'r:',)
> text(1.565, 77,"weight = %.2f + %.2f*height + %.2f*height^2" % tuple(params))
> text(1.565, 76,"R-squared = %f (adjusted = %f)" % (r_squared, r_squared_adjusted))
> text(1.565, 75,"ss_tot = %f" % ss_tot)
> text(1.565, 74,"ss_reg = %f" % ss_reg)
> text(1.565, 73,"ss_err = %f" % ss_err)
> title("Observed and predicted data")
> xlabel("Height (m)")
> ylabel("Weight (kg)")
> legend(loc=0)
> savefig("regression.png")
>
> residuals = [ o - efor o, e in zip(observed, predicted) ]
>
> clf()
> plot(height, residuals,"b.", label="residuals")
> plot(gca().get_xlim(), [0, 0],"k")
> title("Residual plot")
> xlabel("Height (m)")
> ylabel("observed Weight - predicted Weiht (kg)")
> savefig("residual_plot.png")
> I compared the results to R:
> height<- c(1.47, 1.5, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.7, 1.73, 1.75, 1.78, 1.8, 1.83)
> weight<- c(52.21, 53.12, 54.48, 55.84, 57.2, 58.57, 59.93,61.29, 63.11, 64.47, 66.28, 68.1, 69.92, 72.19, 74.46)
> h<- height*height
> summary(lm(weight ~ height + h))
> and they match fairly good, except for the adjusted R-squared which is
> equal to R-squared in the python code. However, I have no idea how to
> compute the F-statistic.
>
> Does Scipy have a function (or more) that already does this? If not,
> can anybody give me hint on how to compute the missing values (stderr,
> F-statistic)?
>
> Thanks in advance.
>
> regards
> robert
> ------------------------------------------------------------------------
>
> _______________________________________________
> SciPy-user mailing list
> SciPy-user@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
Hi,
Numerous options to stop reinventing the wheel plus do everything using
a more general 'matrix' terms:
http://www.scipy.org/Cookbook/OLS
http://code.google.com/p/econpy/source/browse/trunk/pytrix/ls.py
Josef's work:
http://thread.gmane.org/gmane.comp.python.scientific.user/18990/
Jonathan Taylor's statistical models also discussed on the list as part
of Skipper Seabold's GSoC project:
http://article.gmane.org/gmane.comp.python.scientific.devel/10625/
Bruce
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/scipy-user/attachments/20090625/bf641183/attachment.html
More information about the SciPy-user
mailing list