[SciPy-user] error estimate in stats.linregress

Bruce Southey bsouthey@gmail....
Tue Feb 24 08:17:30 CST 2009


josef.pktd@gmail.com wrote:
> On Mon, Feb 23, 2009 at 4:45 PM, Bruce Southey <bsouthey@gmail.com> wrote:
>   
>> josef.pktd@gmail.com wrote:
>>     
>>> On Mon, Feb 23, 2009 at 3:01 PM, Bruce Southey <bsouthey@gmail.com> wrote:
>>>
>>>       
>>>> Hi,
>>>> Yes, the formula is incorrect. The reason is that the sum of squares
>>>> terms are not corrected for the means, because the ss function just
>>>> computes the uncorrected sum of squares.
>>>>
>>>> Thus the correct formula should be:
>>>> sterrest = np.sqrt((1-r*r)*ss(y-ymean) / (df*ss(x-xmean)))
>>>> Alternatively:
>>>> sterrest = np.sqrt((1-r*r)*(ss(y)-n*ymean*ymean) / (ss(x)-n*xmean*xmean) / df)
>>>>
>>>> Note the formula is derived using the definition of R-squared:
>>>> the estimated variance of the slope = MSE/Sxx = ((1-R*R)*Syy)/(df*Sxx),
>>>> where Syy and Sxx are the corrected sums of squares for Y and X,
>>>> respectively.
>>>>
>>>> Regards
>>>> Bruce
>>>>
>>>> Hi all,
>>>> I was working with linear regression in scipy and ran into some
>>>> problems with the value of the standard error of the estimate
>>>> returned by the scipy.stats.linregress() function. I could not match
>>>> it against similar outputs of other linear regression routines (for
>>>> example in Origin), so I took a look at the source (stats.py).
>>>>
>>>> In the source it is defined as
>>>> sterrest = np.sqrt((1-r*r)*ss(y) / ss(x) / df)
>>>> where r is the correlation coefficient, df is the degrees of freedom (N-2),
>>>> and ss() is the sum of squares of the elements.
>>>>
>>>> After digging through the literature, the only formula that looked
>>>> somewhat similar was
>>>> stderrest = np.sqrt((1-r*r)*ss(y-y.mean())/df)
>>>> which gives the same result as the standard definition (in the
>>>> notation of the linregress source)
>>>> stderrest = np.sqrt(ss(y-slope*x-intercept)/df)
>>>> but the output of linregress is different.
>>>>
>>>> I humbly suppose this is a bug, but maybe somebody could explain to me
>>>> what it is if I'm wrong...
>>>> Pavlo.
>>>>
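For concreteness, a minimal sketch contrasting the pre-fix and corrected
formulas quoted above (the data here are made up, and ss is defined
locally to match the uncorrected sum of squares computed in stats.py):

import numpy as np
from scipy.stats import linregress

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 4.1, 4.8])
n = len(x)
df = n - 2

def ss(a):
    # uncorrected sum of squares, as in stats.py
    return np.sum(a * a)

slope, intercept, r, p, stderr = linregress(x, y)

# pre-fix formula: uses uncorrected sums of squares
old = np.sqrt((1 - r*r) * ss(y) / ss(x) / df)
# corrected formula: mean-corrected sums of squares
new = np.sqrt((1 - r*r) * ss(y - y.mean()) / (df * ss(x - x.mean())))
# textbook standard error of the slope, sqrt(MSE/Sxx)
direct = np.sqrt(ss(y - slope*x - intercept) / df / ss(x - x.mean()))

print(old, new, direct)   # new and direct agree; old does not
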
>>> Thank you for reporting and checking this.
>>>
>>> I fixed it in trunk, but still have to add a test.
>>> There are still small (1e-4) numerical differences from the
>>> multivariate ols version in the example I tried.
>>>
>>> Josef
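
A test along the lines Josef mentions might look like the following
(a sketch only, with made-up data, assuming the fixed stderr is the
slope's standard error sqrt(MSE/Sxx)):

import numpy as np
from numpy.testing import assert_almost_equal
from scipy.stats import linregress

def test_linregress_stderr():
    x = np.arange(10.0)
    y = np.array([1.1, 2.9, 5.2, 6.8, 9.1,
                  10.9, 13.2, 14.8, 17.1, 18.9])
    slope, intercept, r, p, stderr = linregress(x, y)
    # compare against the direct residual-based computation
    resid = y - slope*x - intercept
    df = len(x) - 2
    sxx = np.sum((x - x.mean())**2)
    assert_almost_equal(stderr, np.sqrt(np.sum(resid**2) / df / sxx))
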
>> Assuming that linregress differs from what is expected: linregress is
>> just a scalar implementation, so it is probably prone to rounding error
>> in numerous places.
>> If you want, I can look at the example and see.
>>
>> Bruce
>
> I just made up an example on the command line and compared it with my
> version of the cookbook ols class using pinv. Until now I haven't
> looked carefully at linregress, because I think it should be completely
> replaced with a full ols regression, or at least with a call to it. I
> added and corrected a test; the original tests only check the constant
> and slope estimates.
>
> Josef
>
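For cross-checking, a pinv-based fit along the lines of the cookbook ols
class can be written directly (a sketch with made-up data, not the
cookbook class itself):

import numpy as np
from scipy.stats import linregress

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 4.1, 4.8])

X = np.column_stack((np.ones_like(x), x))
beta = np.dot(np.linalg.pinv(X), y)          # [intercept, slope]
resid = y - np.dot(X, beta)
df = len(y) - 2
mse = np.dot(resid, resid) / df
cov = mse * np.linalg.inv(np.dot(X.T, X))    # covariance of beta
se_slope = np.sqrt(cov[1, 1])

slope, intercept, r, p, stderr = linregress(x, y)
# se_slope and stderr should agree up to rounding error
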
I have not tried the ols class, but for an example based on your code
(plus another I had handy) I did not see any differences between
linregress and SAS glm.

It would be inappropriate to replace it with, or have it call, another
module, because we would have to preserve the same input and output
requirements. Rather, I think we should just leave it alone until we get
an acceptable regression/general linear model/ANOVA solution. Then we
can deprecate it before removing it.

Bruce

