# [SciPy-user] How to perform a multiple linear regression analysis with Scipy?

Thu Jun 25 07:26:47 CDT 2009

```Hi,

I'm trying to automate multiple linear regression (e.g. trend line does
not fit y = m*x + n) analysis for a set of samples.

For the simple case where the trend line adheres to y = m*x + n I found
stats.linregress. polyfit can fit polynomials but does not calculate
statistical values for goodness of fit. I could not find a function that
does a regression analysis for e.g. a parabola.

Specifically, I need:
- the parameters of the fitted function for plotting
- residuals for plotting and evaluation of the correlation
- Q-Q-plots ?
- statistics for evaluating the goodness of fit:
- (adjusted) coefficient of determination
- std errors / deviation
- F-test / p-values

I managed to get some of these using lstsq as follows:

from pylab import *

# example data taken from http://en.wikipedia.org/wiki/Linear_regression#Example
# Height (m)
height = [1.47, 1.5, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
1.68, 1.7, 1.73, 1.75, 1.78, 1.8, 1.83]
# Weight (kg)
weight = [52.21, 53.12, 54.48, 55.84, 57.2, 58.57, 59.93,
61.29, 63.11, 64.47, 66.28, 68.1, 69.92, 72.19, 74.46]

observed = weight   # just to retain the terminology
sample_size = len(height)    # = len(weight)

# assume parabola -> weight = p0 + p1 * height + p2 * height**2
number_of_params = 3 # p0, p1 and p2
X = zeros((len(height), number_of_params), float)
X[:,0] = 1
X[:,1] = height
X[:,2] = [h**2 for h in height]

# p - paramter of parabola, sse - sum of squared errors ???
params, sse = lstsq(X, weight)[:2]
predicted = [ params[0] + params[1]*h + params[2]*h**2 for h in height ]
predicted_error_u = [ params[0] + params[1]*h + params[2]*h**2 + sse for h in height ] # ???
predicted_erro_l = [ params[0] + params[1]*h + params[2]*h**2 - sse for h in height ]  # ???

# according to http://en.wikipedia.org/wiki/Coefficient_of_determination
mean_observed = mean(observed)
mean_predicted = mean(predicted)
ss_tot = sum([ (o-mean_observed)**2 for o in observed ])
ss_reg = sum([ (p-mean_observed)**2 for p in predicted ])
ss_err = sum([ (o-p)**2 for o, p in zip(observed, predicted) ]) # == see ??
r_squared = 1 - ss_err / ss_tot # coefficient_of_determination

r_squared_adjusted = 1 - (1 - r_squared) * ((sample_size - 1) / (sample_size - number_of_params-1 - 1)) # ??

clf()
plot(height, weight, "b.", label="observed")
plot(height, predicted, 'r', label='predicted')
plot(height, predicted_error_u, 'r:', label='error')
plot(height, predicted_erro_l, 'r:',)
text(1.565, 77, "weight = %.2f + %.2f*height + %.2f*height^2" % tuple(params))
text(1.565, 76, "R-squared = %f (adjusted = %f)" % (r_squared, r_squared_adjusted))
text(1.565, 75, "ss_tot = %f" % ss_tot)
text(1.565, 74, "ss_reg = %f" % ss_reg)
text(1.565, 73, "ss_err = %f" % ss_err)
title("Observed and predicted data")
xlabel("Height (m)")
ylabel("Weight (kg)")
legend(loc=0)
savefig("regression.png")

residuals = [ o - e for  o, e in zip(observed, predicted) ]

clf()
plot(height, residuals, "b.", label="residuals")
plot(gca().get_xlim(), [0, 0], "k")
title("Residual plot")
xlabel("Height (m)")
ylabel("observed Weight - predicted Weiht (kg)")
savefig("residual_plot.png")

I compared the results to R:

height <- c(1.47, 1.5, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.7, 1.73, 1.75, 1.78, 1.8, 1.83)
weight <- c(52.21, 53.12, 54.48, 55.84, 57.2, 58.57, 59.93,61.29, 63.11, 64.47, 66.28, 68.1, 69.92, 72.19, 74.46)
h <- height*height
summary(lm(weight ~ height + h))

and they match fairly good, except for the adjusted R-squared which is
equal to R-squared in the python code. However, I have no idea how to
compute the F-statistic.

Does Scipy have a function (or more) that already does this? If not, can
anybody give me hint on how to compute the missing values (stderr,
F-statistic)?