# [SciPy-user] linear regression

josef.pktd@gmai... josef.pktd@gmai...
Wed May 27 19:24:36 CDT 2009

On Wed, May 27, 2009 at 7:35 PM, Robert Kern <robert.kern@gmail.com> wrote:
> On Wed, May 27, 2009 at 15:29,  <josef.pktd@gmail.com> wrote:
>> On Wed, May 27, 2009 at 3:37 PM, Robert Kern <robert.kern@gmail.com> wrote:
>
>>> For "y=f(x)" models, this is true. Both y and x can be multivariate,
>>> and you can express the covariance of the uncertainties for each, but
>>> not covariance between the y and x uncertainties. This is because of
>>> the numerical tricks used for efficient implementation
>>
>> In this case, OLS would still be unbiased in the linear case, but
>> maybe not efficient.
>
> Are you sure? I see significant deviations using a simple example
> (albeit one which is utterly rigged in ODR's favor). The X
> uncertainties start small and grow with increasing X. The Y
> uncertainties start large and shrink with increasing X. Plotting the
> estimates shows some strange structure in the OLS estimates.
>
>
> import numpy as np
>
> from scipy.odr import RealData, ODR
> from scipy.odr.models import unilinear
>
>
> beta_true = np.array([-0.47960828215176365,  5.47674024758398481])
> p_x = np.array([0.,.9,1.8,2.6,3.3,4.4,5.2,6.1,6.5,7.4])
> p_y = beta_true[0] * p_x + beta_true[1]
> p_sx = np.array([.03,.03,.04,.035,.07,.11,.13,.22,.74,1.])
> p_sy = np.array([1.,.74,.5,.35,.22,.22,.12,.12,.1,.04])
>
>
> def random_betas(n=500, prng=np.random):
>    """ Compute random parameter vectors from both ODR and OLS by generating
>    random data and fitting it.
>    """
>    odr_betas = []
>    ols_betas = []
>    for i in range(n):
>        x = np.random.normal(p_x, p_sx)
>        y = np.random.normal(p_y, p_sy)
>
>        # ODR:
>        data = RealData(x, y, sx=p_sx, sy=p_sy)
>        odr = ODR(data, unilinear, beta0=[1., 1.])
>        odr_out = odr.run()
>        odr_betas.append(odr_out.beta)
>
>        # Weighted OLS:
>        A = np.ones((len(x), 2))
>        A[:,0] = x
>        # Weight by the Y error.
>        A /= p_sy[:,np.newaxis]
>        b, res, rank, s = np.linalg.lstsq(A, y/p_sy)
>        ols_betas.append(b)
>
>        # Alternately:
>        #ols = ODR(data, unilinear, beta0=[1., 1.])
>        #ols.set_job(fit_type=2)
>        #ols_out = ols.run()
>        #ols_betas.append(ols_out.beta)
>    return np.array(odr_betas).T, np.array(ols_betas).T
>

after removing the weighting in your example to get plain OLS, I get

>>> bodr, bols = random_betas(5000)
>>> bols.mean(1)
array([-0.4757033 ,  5.46418868])
>>> bodr.mean(1)
array([-0.48364392,  5.49508047])
>>> bodr.mean(1)-beta_true
array([-0.00403564,  0.01834022])
>>> bols.mean(1)-beta_true
array([ 0.00390498, -0.01255157])

I don't see yet why the results with weighted ols are much worse. I
also confirmed with by inhouse econometrician, whether it's really
unbiased and not just asympotically unbiased.
As long as the measurement error in the x regressors are uncorrelated
with the regression error, ols is unbiased
y = X*b + u   E(X*u)=0  that's the part that is used in the proof of
unbiasedness.

Josef