[SciPy-user] error estimate in stats.linregress

josef.pktd@gmail.com
Mon Feb 23 18:57:13 CST 2009


On Mon, Feb 23, 2009 at 4:45 PM, Bruce Southey <bsouthey@gmail.com> wrote:
> josef.pktd@gmail.com wrote:
>> On Mon, Feb 23, 2009 at 3:01 PM, Bruce Southey <bsouthey@gmail.com> wrote:
>>
>>> Hi,
>>> Yes, the formula is incorrect. The reason is that the sum-of-squares
>>> terms are not corrected for the means: the ss function computes only
>>> the uncorrected sum of squares.
>>>
>>> Thus the correct formula should be:
>>> sterrest = np.sqrt((1-r*r)*ss(y-ymean)/(df*ss(x-xmean)))
>>> Alternatively:
>>> sterrest = np.sqrt((1-r*r)*(ss(y)-n*ymean*ymean)/(ss(x)-n*xmean*xmean)/df)
>>>
>>> Note that the formula is derived from the definition of R-squared:
>>> the estimated variance of the slope = MSE/Sxx = ((1-R*R)*Syy)/(df*Sxx),
>>> where Syy and Sxx are the corrected sums of squares for Y and X,
>>> respectively (see the numerical check after the quoted thread below).
>>>
>>> Regards
>>> Bruce
>>>
>>> Hi all,
>>> I was working with linear regression in scipy and ran into problems
>>> with the value of the standard error of the estimate returned by the
>>> scipy.stats.linregress() function. It did not match the corresponding
>>> output of other linear regression routines (for example, Origin), so
>>> I took a look at the source (stats.py).
>>>
>>> In the source it is defined as
>>> sterrest = np.sqrt((1-r*r)*ss(y) / ss(x) / df)
>>> where r is the correlation coefficient, df is the degrees of freedom
>>> (N-2), and ss() is the sum of squares of the elements.
>>>
>>> After digging through the literature, the only formula I found that
>>> looks somewhat the same is
>>> stderrest = np.sqrt((1-r*r)*ss(y-y.mean())/df)
>>> which gives the same result as the standard definition (in the
>>> notation of the linregress source)
>>> stderrest = np.sqrt(ss(y-slope*x-intercept)/df)
>>> but the output of linregress is different.
>>>
>>> I humbly suppose this is a bug, but maybe somebody could explain
>>> what it is if I'm wrong...
>>> Pavlo.
>>>
>>>
>>
>> Thank you for reporting and checking this.
>>
>> I fixed it in trunk, but I still have to add a test.
>> There are still small (1e-4) numerical differences from the
>> multivariate ols version in the example I tried.
>>
>> Josef
>>
>
> Assuming that linregress differs from what is expected: linregress is
> just a scalar implementation, so it is probably prone to rounding
> error in numerous places.
> If you want, I can look at the example and see.
>
> Bruce
>
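
For reference, here is a quick self-contained check of the corrected
formula against the old one and against the direct residual-based
definition. This is only a sketch with made-up data; ss is defined
inline to mimic the uncorrected sum of squares that scipy.stats.ss
computes.

import numpy as np

def ss(a):
    # uncorrected sum of squares, as in scipy.stats.ss
    return np.sum(a*a)

np.random.seed(0)      # arbitrary data, for illustration only
x = np.random.randn(50)
y = 1.0 + 0.5*x + 0.1*np.random.randn(50)

n = len(x)
df = n - 2
xmean, ymean = x.mean(), y.mean()
r = np.corrcoef(x, y)[0, 1]
slope = np.sum((x - xmean)*(y - ymean)) / ss(x - xmean)
intercept = ymean - slope*xmean

# old (buggy) formula: uses uncorrected sums of squares
old = np.sqrt((1 - r*r)*ss(y)/ss(x)/df)
# corrected formula from this thread
new = np.sqrt((1 - r*r)*ss(y - ymean)/(df*ss(x - xmean)))
# direct definition: sqrt(MSE/Sxx)
direct = np.sqrt(ss(y - slope*x - intercept)/(df*ss(x - xmean)))

print(new, direct)     # agree to floating point
print(old)             # differs, because the means are not removed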

I just made up an example on the command line and compared it with my
version of the cookbook ols class using pinv. Until now I haven't
looked carefully at linregress, because I think it should be completely
replaced by a full ols regression, or at least by a call to one. I
added and corrected a test; the original tests only check the constant
and slope estimates.

Josef

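The ols used in the session below is my version of the cookbook OLS
class, whose code is not shown here. A minimal pinv-based sketch that
provides just the attributes used in the session (b, se, p, R2) might
look like the following; this is an illustration, not the actual
cookbook code.

import numpy as np
from scipy import stats

class ols(object):
    # Minimal OLS via pinv: prepends a constant, exposes b, se, p, R2.
    def __init__(self, y, x):
        y = np.asarray(y, float)
        x = np.asarray(x, float)
        if x.ndim == 1:
            x = x[:, None]                              # make a column
        self.X = np.column_stack((np.ones(len(y)), x))  # constant first
        self.b = np.dot(np.linalg.pinv(self.X), y)      # LS coefficients
        resid = y - np.dot(self.X, self.b)
        df = len(y) - self.X.shape[1]                   # residual dof
        sigma2 = np.dot(resid, resid) / df              # MSE
        cov = sigma2 * np.linalg.pinv(np.dot(self.X.T, self.X))
        self.se = np.sqrt(np.diag(cov))                 # standard errors
        self.p = 2*stats.t.sf(np.abs(self.b/self.se), df)  # 2-sided p
        syy = np.dot(y - y.mean(), y - y.mean())
        self.R2 = 1.0 - np.dot(resid, resid)/syy
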
>>> x = np.random.randn(50,2)
>>> y = (x*[1,0.5]).sum(axis=1)

>>> res = ols(y,x[:,0])
>>> res.b
array([-0.04397404,  0.88322506])
>>> res.se
array([ 0.07458703,  0.08676139])
>>> res.p
array([  5.58246087e-01,   1.40776280e-13])
>>> np.sqrt(res.R2)
0.82670557087727525

>>> stats.linregress(x[:,0],y)
(0.88322505756806569, -0.043974044431550653, 0.8267055708772757,
1.408136357829305e-013, 0.086581413890270145)
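
For what it's worth, the 5-tuple returned by linregress is (slope,
intercept, r-value, two-sided p-value, stderr of the slope estimate).
Slope, intercept, and r agree with the ols output above; the stderr
differs on the order of 1e-4, which is the small numerical difference
mentioned earlier in the thread.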

