[SciPy-user] error estimate in stats.linregress
Mon Feb 23 14:01:45 CST 2009
Yes, the formula is incorrect. The reason is that the sum of squares
terms are not corrected by the means because the ss function just
computes the uncorrected sum of squares.
Thus the correct formula should be:
sterrest = np.sqrt((1-r*r)*ss(y-ymean)/(df*ss(x-xmean)))
or, equivalently,
sterrest = np.sqrt((1-r*r)*(ss(y)-n*ymean*ymean)/(df*(ss(x)-n*xmean*xmean)))
Note that the formula follows from the definition of R-squared. Since
R*R = 1 - SSE/Syy, we have SSE = (1-R*R)*Syy and MSE = SSE/df, so:
The estimated variance of the slope = MSE/Sxx = ((1-R*R)*Syy)/(df*Sxx)
where Syy and Sxx are the corrected sums of squares for Y and X,
respectively.
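
As a quick numerical check of the formulas above, here is a minimal sketch using only NumPy. The ss() helper below is my own stand-in for the uncorrected sum of squares described in this thread, not the stats.py code itself, and the data and the name slope_stderr_variants are made up for illustration.

```python
import numpy as np


def ss(a):
    """Uncorrected sum of squares (stand-in for the ss() helper in the thread)."""
    a = np.asarray(a, dtype=float)
    return float(np.sum(a * a))


def slope_stderr_variants(x, y):
    """Return the slope standard error computed several ways (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    df = n - 2
    xmean, ymean = x.mean(), y.mean()

    sxx = ss(x - xmean)  # corrected sum of squares for x
    syy = ss(y - ymean)  # corrected sum of squares for y
    sxy = float(np.sum((x - xmean) * (y - ymean)))

    r = sxy / np.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = ymean - slope * xmean

    return {
        # corrected formula, centered form
        "centered": np.sqrt((1 - r * r) * syy / (df * sxx)),
        # corrected formula, shortcut form
        "shortcut": np.sqrt(
            (1 - r * r) * (ss(y) - n * ymean * ymean)
            / (df * (ss(x) - n * xmean * xmean))
        ),
        # direct definition: sqrt(MSE / Sxx)
        "mse_over_sxx": np.sqrt(ss(y - slope * x - intercept) / df / sxx),
        # formula quoted from the source in this thread (uncorrected sums)
        "uncorrected": np.sqrt((1 - r * r) * ss(y) / ss(x) / df),
    }


# Made-up sample data, roughly linear with a nonzero mean.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
result = slope_stderr_variants(x, y)
for name, value in result.items():
    print(name, value)
```

The first three values should agree to rounding error, while the uncorrected one differs whenever the means of x and y are not zero.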
I was working with linear regression in scipy and ran into a problem
with the value of the standard error of the estimate returned by the
scipy.stats.linregress() function. I could not reconcile it with the
corresponding output of other linear regression routines (for example
in Origin), so I took a look at the source (stats.py).
In the source it is defined as
sterrest = np.sqrt((1-r*r)*ss(y) / ss(x) / df)
where r is the correlation coefficient, df is the number of degrees of
freedom (N-2) and ss() is the sum of squares of the elements.
After digging through the literature, the only similar-looking formula
I found was
stderrest = np.sqrt((1-r*r)*ss(y-y.mean())/df)
which gives the same result as the standard definition (in the notation
of the linregress source)
stderrest = np.sqrt(ss(y-slope*x-intercept)/df)
but the output of linregress is different.
I humbly suppose this is a bug, but if I'm wrong, maybe somebody could
explain to me what is going on...
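
For reference, the two expressions for the standard error of the estimate quoted above can be compared numerically. This is only a sketch on made-up data, with variable names of my own choosing. It also shows how that quantity relates to the standard error of the slope through MSE/Sxx.

```python
import numpy as np

# Made-up sample data, roughly linear.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

df = x.size - 2
xc = x - x.mean()  # centered x
yc = y - y.mean()  # centered y

sxx = np.sum(xc * xc)
syy = np.sum(yc * yc)
sxy = np.sum(xc * yc)

r = sxy / np.sqrt(sxx * syy)
slope = sxy / sxx
intercept = y.mean() - slope * x.mean()

# Standard error of the estimate from the correlation coefficient.
see_from_r = np.sqrt((1 - r * r) * syy / df)

# Standard error of the estimate from the residuals (standard definition).
see_from_residuals = np.sqrt(np.sum((y - slope * x - intercept) ** 2) / df)

# Standard error of the slope: sqrt(MSE / Sxx).
slope_se = see_from_residuals / np.sqrt(sxx)

print(see_from_r, see_from_residuals, slope_se)
```

The first two numbers should match. The third is a different quantity (the uncertainty of the fitted slope rather than the scatter of the residuals), so it is not expected to equal the other two.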