[SciPy-user] error estimate in stats.linregress
Mon Feb 23 18:57:13 CST 2009
On Mon, Feb 23, 2009 at 4:45 PM, Bruce Southey <firstname.lastname@example.org> wrote:
> email@example.com wrote:
>> On Mon, Feb 23, 2009 at 3:01 PM, Bruce Southey <firstname.lastname@example.org> wrote:
>>> Yes, the formula is incorrect. The reason is that the sum of squares
>>> terms are not corrected by the means because the ss function just
>>> computes the uncorrected sum of squares.
>>> Thus the correct formula should :
>>> sterrest = np.sqrt(((1-r*r)*(ss((y-ymean))))/(df*(ss(x-xmean))))
>>> sterrest = np.sqrt((1-r*r)*(ss(y)-n*ymean*ymean)/ (ss(x)-n*xmean*xmean)
>>> / df)
>>> Note the formula is derived using the definition of R-squared:
>>> The estimated variance of the slope = MSE/Sxx= ((1-R*R)*Syy)/(df*Sxx)
>>> where Syy and Sxx are the corrected sums of squares for Y and X,
>>> Hi all,
>>> I was working with linear regression in scipy and met some problems
>>> with value of standard error of the estimate returned by
>>> scipy.stats.linregress() function. I could not compare it to similar
>>> outputs of other linear regression routines (for example in Origin),
>>> so I took a look in the source (stats.py).
>>> In the source it is defined as
>>> sterrest = np.sqrt((1-r*r)*ss(y) / ss(x) / df)
>>> where r is correlation coefficient, df is degrees of freedom (N-2) and
>>> ss() is sum of squares of elements.
>>> After digging through literature the only formula looking somewhat the
>>> same was found to be
>>> stderrest = np.sqrt((1-r*r)*ss(y-y.mean())/df)
>>> which gives the same result as a standard definition (in notation of
>>> the source of linregress)
>>> stderrest = np.sqrt(ss(y-slope*x-intercept)/df)
>>> but the output of linregress is different.
>>> I humbly suppose this is a bug, but maybe somebody could explain me
>>> what is it if I'm wrong...
>> Thank you for reporting and checking this.
>> I fixed it in trunk, but still have to add a test.
>> There are still small (1e-4) numerical differences to the
>> multivariate ols version in the example I tried
>> SciPy-user mailing list
> Assuming that linregress is different from expected, linregress is just
> a scalar implementation so it is probably prone to rounding error in
> numerous places.
> If you want, I can look at the example and see.
> SciPy-user mailing list
I just made up an example on the command line and compared it with my
version of the cookbook ols class using pinv . Until now I haven't
looked carefully at linregress because I think it should be completely
replaced with a full ols regression, or at least with a call to it. I
added and corrected a test. original tests only test constant and
>>> x = np.random.randn(50,2)
>>> y = (x*[1,0.5]).sum(axis=1)
>>> res = ols(y,x[:,0])
array([ 0.07458703, 0.08676139])
array([ 5.58246087e-01, 1.40776280e-13])
(0.88322505756806569, -0.043974044431550653, 0.8267055708772757,
More information about the SciPy-user