[SciPy-user] error estimate in stats.linregress

josef.pktd@gmail.com
Tue Feb 24 10:06:15 CST 2009


On Tue, Feb 24, 2009 at 9:17 AM, Bruce Southey <bsouthey@gmail.com> wrote:
> josef.pktd@gmail.com wrote:
>> On Mon, Feb 23, 2009 at 4:45 PM, Bruce Southey <bsouthey@gmail.com> wrote:
>>
>>> josef.pktd@gmail.com wrote:
>>>
>>>> On Mon, Feb 23, 2009 at 3:01 PM, Bruce Southey <bsouthey@gmail.com> wrote:
>>>>
>>>>
>>>>> Hi,
>>>>> Yes, the formula is incorrect. The reason is that the sum of squares
>>>>> terms are not corrected for the means, because the ss function just
>>>>> computes the uncorrected sum of squares.
>>>>>
>>>>> Thus the correct formula should be:
>>>>> sterrest = np.sqrt((1-r*r)*ss(y-ymean) / (df*ss(x-xmean)))
>>>>> Alternatively:
>>>>> sterrest = np.sqrt((1-r*r)*(ss(y)-n*ymean*ymean) / (ss(x)-n*xmean*xmean) / df)
>>>>>
>>>>> Note the formula is derived using the definition of R-squared:
>>>>> the estimated variance of the slope = MSE/Sxx = ((1-R*R)*Syy)/(df*Sxx),
>>>>> where Syy and Sxx are the corrected sums of squares for Y and X,
>>>>> respectively.
>>>>>
>>>>> Regards
>>>>> Bruce
>>>>>
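A minimal numerical check of the corrected formula above (illustrative
only; the data are made up, and ss() is redefined here to mirror scipy's
helper): the expression with corrected sums of squares matches the
textbook slope standard error sqrt(MSE/Sxx) computed from the residuals.

import numpy as np

def ss(a):
    # uncorrected sum of squares of the elements, like scipy's ss() helper
    return np.sum(a * a)

np.random.seed(0)
n = 50
x = np.random.randn(n) + 3.0            # nonzero mean, where the old formula breaks
y = 2.0 * x + 1.0 + np.random.randn(n)
df = n - 2
xmean, ymean = x.mean(), y.mean()
r = np.corrcoef(x, y)[0, 1]

# corrected formula from above
sterrest = np.sqrt((1 - r * r) * ss(y - ymean) / (df * ss(x - xmean)))

# textbook definition: sqrt(MSE / Sxx) from the fitted residuals
slope = np.sum((x - xmean) * (y - ymean)) / ss(x - xmean)
intercept = ymean - slope * xmean
mse = ss(y - slope * x - intercept) / df
assert np.allclose(sterrest, np.sqrt(mse / ss(x - xmean)))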
>>>>> Hi all,
>>>>> I was working with linear regression in scipy and ran into problems
>>>>> with the value of the standard error of the estimate returned by
>>>>> the scipy.stats.linregress() function. It did not match the
>>>>> corresponding output of other linear regression routines (for
>>>>> example in Origin), so I took a look at the source (stats.py).
>>>>>
>>>>> In the source it is defined as
>>>>> sterrest = np.sqrt((1-r*r)*ss(y) / ss(x) / df)
>>>>> where r is the correlation coefficient, df is the degrees of freedom
>>>>> (N-2), and ss() is the sum of squares of the elements.
>>>>>
>>>>> After digging through the literature, the only formula that looks
>>>>> somewhat similar turned out to be
>>>>> stderrest = np.sqrt((1-r*r)*ss(y-y.mean())/df)
>>>>> which gives the same result as the standard definition (in the
>>>>> notation of the linregress source)
>>>>> stderrest = np.sqrt(ss(y-slope*x-intercept)/df)
>>>>> but the output of linregress is different.
>>>>>
>>>>> I humbly suppose this is a bug, but maybe somebody could explain
>>>>> what it is if I'm wrong...
>>>>> Pavlo.
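An illustrative reproduction of the mismatch Pavlo describes (made-up
data; ss() mirrors scipy's helper): the two literature expressions agree
with each other, while the formula in stats.py at the time gives a
different number once x and y have nonzero means.

import numpy as np

def ss(a):
    return np.sum(a * a)  # uncorrected sum of squares, as in stats.py

np.random.seed(1)
x = np.random.randn(20) + 5.0              # nonzero means on purpose
y = 3.0 * x + 2.0 + np.random.randn(20)
df = len(x) - 2
r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, 1)

literature = np.sqrt((1 - r * r) * ss(y - y.mean()) / df)
definition = np.sqrt(ss(y - slope * x - intercept) / df)
in_stats_py = np.sqrt((1 - r * r) * ss(y) / ss(x) / df)

assert np.allclose(literature, definition)   # these two agree ...
print(in_stats_py, definition)               # ... but linregress returned this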
>>>>>
>>>>>
>>>>>
>>>> Thank you for reporting and checking this.
>>>>
>>>> I fixed it in trunk, but still have to add a test.
>>>> There are still small (1e-4) numerical differences from the
>>>> multivariate ols version in the example I tried.
>>>>
>>>> Josef
>>>>
>>>>
>>> Assuming that linregress is different from expected, linregress is just
>>> a scalar implementation so it is probably prone to rounding error in
>>> numerous places.
>>> If you want, I can look at the example and see.
>>>
>>> Bruce
>>>
>>>
>>
>> I just made up an example on the command line and compared it with my
>> version of the cookbook ols class using pinv. Until now I haven't
>> looked carefully at linregress, because I think it should be completely
>> replaced with a full ols regression, or at least with a call to one. I
>> added and corrected a test; the original tests only check the constant
>> and slope estimates.
>>
>> Josef
>>
>>
> I have not tried the ols class but for an example based on your code
> (plus another I had very handy) I did not see any differences between
> linregress and SAS glm.

I guess I had an incomplete reload somewhere. I ran 1000 simulations
of my simple example, and the maximum absolute difference is 1e-15
(1e-13 in some other cases).
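A sketch of that kind of simulation (an assumed setup, not Josef's
actual script): fit the same line with stats.linregress and with a
pinv-based OLS like the cookbook class, and track the worst
disagreement over many runs.

import numpy as np
from scipy import stats

np.random.seed(42)
maxdiff = 0.0
for _ in range(1000):
    x = np.random.randn(30)
    y = 1.5 * x - 0.5 + np.random.randn(30)
    slope, intercept, r, p, stderr = stats.linregress(x, y)
    X = np.column_stack((np.ones_like(x), x))    # design matrix with constant
    b = np.dot(np.linalg.pinv(X), y)             # OLS via the pseudoinverse
    maxdiff = max(maxdiff, abs(b[1] - slope), abs(b[0] - intercept))
print(maxdiff)   # on the order of 1e-15 for well-conditioned data like this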

The only difference I found is that the coefficient of determination is
calculated as the correlation coefficient between x and y, which carries
a sign, in contrast to the standard (non-negative) definition of
R-squared.
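For example (made-up data), the r returned by linregress follows the
sign of the slope; squaring it gives the usual R-squared:

import numpy as np
from scipy import stats

x = np.arange(10.0)
y = -2.0 * x + np.sin(x)   # downward-sloping, so r < 0
slope, intercept, r, p, stderr = stats.linregress(x, y)
print(r)        # negative, following the sign of the slope
print(r ** 2)   # the usual non-negative coefficient of determination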

>
> It would be inappropriate to replace it or have it call another module,
> because we would have to preserve the same input and output requirements.
> Rather, I think we just have to leave it alone until we get an acceptable
> regression/general linear model/ANOVA solution. Then we can just
> deprecate it before removing it.
>
> Bruce

OK with me, I didn't think of the output requirements. And if linregress is
properly tested (as it is now), we can leave it alone. The calculation could
still be cleaned up, e.g. by using r = np.corrcoef(x, y) or by the use of
betai.
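For instance, the core of the calculation could look like this sketch
(hypothetical code, not the actual patch; the p-value computation via
betai is left out):

import numpy as np

def linregress_core(x, y):
    # hypothetical cleaned-up core of linregress; not the stats.py code
    x, y = np.asarray(x, float), np.asarray(y, float)
    df = len(x) - 2
    r = np.corrcoef(x, y)[0, 1]          # replaces the hand-rolled r
    sxx = np.sum((x - x.mean()) ** 2)    # corrected sums of squares
    syy = np.sum((y - y.mean()) ** 2)
    slope = r * np.sqrt(syy / sxx)
    intercept = y.mean() - slope * x.mean()
    sterrest = np.sqrt((1 - r * r) * syy / (df * sxx))
    return slope, intercept, r, sterrest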

Josef

