[SciPy-user] How to perform a multiple linear regression analysis with Scipy?

Skipper Seabold jsseabold@gmail....
Thu Jun 25 08:46:42 CDT 2009


On Thu, Jun 25, 2009 at 9:34 AM, Bruce Southey <bsouthey@gmail.com> wrote:
> On 06/25/2009 07:26 AM, wierob wrote:
>
> Hi,
>
> I'm trying to automate regression analysis for a set of samples in cases
> where the trend line does not follow y = m*x + n (i.e. multiple linear
> regression).
>
> For the simple case where the trend line follows y = m*x + n I found
> stats.linregress. polyfit can fit polynomials, but it does not calculate
> goodness-of-fit statistics. I could not find a function that does a full
> regression analysis for, e.g., a parabola.
>
> Specifically, I need:
>  - the parameters of the fitted function for plotting
>  - residuals for plotting and evaluation of the correlation
>  - Q-Q-plots ?
>  - statistics for evaluating the goodness of fit:
>        - (adjusted) coefficient of determination
>        - std errors / deviation
>        - F-test / p-values
>
> I managed to get some of these using lstsq as follows:
>
> from pylab import *
>
> # example data taken from
> # http://en.wikipedia.org/wiki/Linear_regression#Example
> # Height (m)
> height = [1.47, 1.5, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
>           1.68, 1.7, 1.73, 1.75, 1.78, 1.8, 1.83]
> # Weight (kg)
> weight = [52.21, 53.12, 54.48, 55.84, 57.2, 58.57, 59.93,
>           61.29, 63.11, 64.47, 66.28, 68.1, 69.92, 72.19, 74.46]
>
> observed = weight   # just to retain the terminology
> sample_size = len(height)    # = len(weight)
>
> # assume parabola -> weight = p0 + p1 * height + p2 * height**2
> number_of_params = 3 # p0, p1 and p2
> X = zeros((len(height), number_of_params), float)
> X[:,0] = 1
> X[:,1] = height
> X[:,2] = [h**2 for h in height]
>
> # p - parameters of the parabola, sse - sum of squared errors ???
> params, sse = lstsq(X, weight)[:2]
> predicted = [ params[0] + params[1]*h + params[2]*h**2 for h in height ]
> predicted_error_u = [ params[0] + params[1]*h + params[2]*h**2 + sse for h in height ]  # ???
> predicted_error_l = [ params[0] + params[1]*h + params[2]*h**2 - sse for h in height ]  # ???
>
> # according to http://en.wikipedia.org/wiki/Coefficient_of_determination
> mean_observed = mean(observed)
> mean_predicted = mean(predicted)
> ss_tot = sum([ (o-mean_observed)**2 for o in observed ])
> ss_reg = sum([ (p-mean_observed)**2 for p in predicted ])
> ss_err = sum([ (o-p)**2 for o, p in zip(observed, predicted) ]) # == sse ??
> r_squared = 1 - ss_err / ss_tot # coefficient_of_determination
>
> r_squared_adjusted = 1 - (1 - r_squared) * ((sample_size - 1) / (sample_size - number_of_params - 1 - 1)) # ??
>
>
> clf()
> plot(height, weight, "b.", label="observed")
> plot(height, predicted, 'r', label='predicted')
> plot(height, predicted_error_u, 'r:', label='error')
> plot(height, predicted_error_l, 'r:',)
> text(1.565, 77, "weight = %.2f + %.2f*height + %.2f*height^2" % tuple(params))
> text(1.565, 76, "R-squared = %f (adjusted = %f)" % (r_squared, r_squared_adjusted))
> text(1.565, 75, "ss_tot = %f" % ss_tot)
> text(1.565, 74, "ss_reg = %f" % ss_reg)
> text(1.565, 73, "ss_err = %f" % ss_err)
> title("Observed and predicted data")
> xlabel("Height (m)")
> ylabel("Weight (kg)")
> legend(loc=0)
> savefig("regression.png")
>
> residuals = [ o - e for o, e in zip(observed, predicted) ]
>
> clf()
> plot(height, residuals, "b.", label="residuals")
> plot(gca().get_xlim(), [0, 0], "k")
> title("Residual plot")
> xlabel("Height (m)")
> ylabel("observed Weight - predicted Weiht (kg)")
> savefig("residual_plot.png")
>
> I compared the results to R:
>
> height <- c(1.47, 1.5, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
>             1.68, 1.7, 1.73, 1.75, 1.78, 1.8, 1.83)
> weight <- c(52.21, 53.12, 54.48, 55.84, 57.2, 58.57, 59.93,
>             61.29, 63.11, 64.47, 66.28, 68.1, 69.92, 72.19, 74.46)
> h <- height*height
> summary(lm(weight ~ height + h))
>
> and they match fairly well, except for the adjusted R-squared, which comes
> out equal to the plain R-squared in the Python code. However, I have no idea
> how to compute the F-statistic.
>
> Does SciPy have a function (or several) that already does this? If not, can
> anybody give me a hint on how to compute the missing values (stderr,
> F-statistic)?
>
> Thanks in advance.
>
> regards
> robert
>
>
> Hi,
> There are numerous options that let you stop reinventing the wheel and do
> everything in more general matrix terms (see the short sketch after the links):
> http://www.scipy.org/Cookbook/OLS
> http://code.google.com/p/econpy/source/browse/trunk/pytrix/ls.py
> Josef's work:
> http://thread.gmane.org/gmane.comp.python.scientific.user/18990/
> Jonathan Taylor's statistical models, also discussed on the list as part of
> Skipper Seabold's GSoC project:
> http://article.gmane.org/gmane.comp.python.scientific.devel/10625/
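>
> (A minimal, untested numpy sketch of that matrix form, reusing the X and
> weight from your post: for any design matrix X, the OLS estimates solve
> the normal equations X'X beta = X'y.)
>
> import numpy as np
> # solve the normal equations directly; lstsq(X, weight) is the
> # numerically safer equivalent of the same estimate
> beta = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, weight))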
>
>
>
> Bruce
>

Those examples should get you there, but if you want to understand where
the numbers come from without reading any code, you could take a look at
the ANOVA table here
<http://www.ats.ucla.edu/stat/stata/output/reg_output.htm> and derive the
R-squared, adjusted R-squared, MSE, and F statistic from it.
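
As a rough, untested sketch of that derivation (plain numpy plus
scipy.stats; ols_summary is just an ad hoc helper name, not an existing
SciPy function):

import numpy as np
from scipy import stats

def ols_summary(X, y):
    """OLS fit plus the goodness-of-fit statistics from the ANOVA table."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape                        # n observations, k parameters (incl. intercept)

    params = np.linalg.lstsq(X, y)[0]
    resid = y - np.dot(X, params)

    ss_err = np.sum(resid ** 2)           # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares

    df_reg, df_err = k - 1, n - k         # model / residual degrees of freedom
    mse = ss_err / df_err                 # mean squared error

    r_squared = 1.0 - ss_err / ss_tot
    # float arithmetic matters here: with Python 2 integer division the
    # ratio (n - 1)/(n - k) truncates to 1, which is exactly what makes
    # an "adjusted" value collapse to the plain R-squared
    r_squared_adj = 1.0 - (1.0 - r_squared) * (n - 1.0) / df_err

    fstat = ((ss_tot - ss_err) / df_reg) / mse  # F = MSR / MSE
    fpval = stats.f.sf(fstat, df_reg, df_err)   # Prob > F

    # standard errors of the coefficients: sqrt(diag(MSE * (X'X)^-1))
    stderr = np.sqrt(np.diag(mse * np.linalg.inv(np.dot(X.T, X))))

    return params, stderr, r_squared, r_squared_adj, fstat, fpval

With the X and weight from your post, ols_summary(X, weight) should match
the output of R's summary(lm(weight ~ height + h)).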

Cheers,

Skipper

