# [SciPy-User] Generalized least square on large dataset

josef.pktd@gmai... josef.pktd@gmai...
Thu Mar 8 12:07:17 CST 2012

```
On Thu, Mar 8, 2012 at 12:32 PM, Peter Cimermančič
<peter.cimermancic@gmail.com> wrote:
>>
>>
>> I would use SVD or eigenvalue decomposition to get the transformation
>> matrix. With reduced rank and dropping zero eigenvalues, I think, the
>> transformation will just drop some observations that are redundant.
>>
>> Or for normal equations, use X' pinv(V) X beta = X' pinv(V) y    which
>> uses SVD inside and requires less work writing the code.
>>
>> I'm reasonably sure that I have seen the pinv used this way before.
>>
>> That still leaves going from similarity matrix to covariance matrix.
>
>
> Yes, pinv() solved the compute problem (no errors anymore). I've also found
> some papers describing how to get from a similarity matrix to correlation.
> Do you maybe know, are p-values (from MSE calculation) fairly accurate this
> way?

While parameter estimates are pretty robust, the standard errors and
the p-values depend a lot on additional assumptions.
If the assumptions are not satisfied for a given dataset, then the
p-values can be pretty far off. For example, if your error covariance
matrix (from the V) is misspecified, then the p-values might not be
very accurate.
In a small sample, assuming a normal distribution might be a problem, but
I would expect that for 1000 observations (or close to it) asymptotic
normality will be accurate enough.

With only one regressor (plus constant) multicollinearity cannot have
a negative impact, so I wouldn't expect any other numerical problems.

If your p-value is 0.04 or 0.11, then I would do some additional
specification checks. If the p-value is 0.6 or 1e-4, then I wouldn't
worry about p-value accuracy.

Comparing the GLS standard errors with the (in this case incorrect)
standard errors from OLS might give some idea about how much p-values
can change with your data.

I would be interested in hearing how you get from a similarity matrix
to correlation matrix in your case. I would like to see if it is very
difficult to include something like this in statsmodels.

Josef

>
>
> Peter
```
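A rough sketch of the two suggestions in the message, the pinv-based normal equations and the GLS versus OLS standard error comparison, might look like the following. It uses only NumPy; the helper names (`gls_pinv`, `normal_pvalues`), the simulated AR(1)-style covariance and the choice of rank(V) minus the number of parameters as residual degrees of freedom are assumptions made for illustration, not something taken from the thread or from statsmodels.

```python
import math
import numpy as np


def gls_pinv(X, y, V):
    """GLS via the normal equations X' pinv(V) X beta = X' pinv(V) y."""
    V_pinv = np.linalg.pinv(V)            # SVD based, tolerates a singular V
    XtVi = X.T @ V_pinv
    A_pinv = np.linalg.pinv(XtVi @ X)
    beta = A_pinv @ (XtVi @ y)
    resid = y - X @ beta
    # Assumption: with a rank-deficient V, rank(V) plays the role of nobs.
    df_resid = np.linalg.matrix_rank(V) - X.shape[1]
    scale = float(resid @ V_pinv @ resid) / df_resid
    se = np.sqrt(scale * np.diag(A_pinv))
    return beta, se


def normal_pvalues(beta, se):
    """Two-sided p-values from the asymptotic normal approximation."""
    z = np.abs(beta / se)
    return np.array([math.erfc(v / math.sqrt(2.0)) for v in z])


# Simulated data (hypothetical): constant plus one regressor,
# errors correlated with V[i, j] = rho ** |i - j|.
rng = np.random.default_rng(0)
n, rho = 200, 0.8
idx = np.arange(n)
V = rho ** np.abs(idx[:, None] - idx[None, :])
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
errors = np.linalg.cholesky(V) @ rng.standard_normal(n)
y = X @ np.array([1.0, 2.0]) + errors

beta_gls, se_gls = gls_pinv(X, y, V)
beta_ols, se_ols = gls_pinv(X, y, np.eye(n))   # OLS is GLS with V = I

print("GLS beta, se:", beta_gls, se_gls)
print("OLS beta, se:", beta_ols, se_ols)
print("GLS p-values:", normal_pvalues(beta_gls, se_gls))
print("OLS p-values:", normal_pvalues(beta_ols, se_ols))
```

Running the same comparison with your own X, y and V gives a feeling for how sensitive the p-values are to the assumed error covariance. If the two sets of standard errors differ a lot, a borderline p-value deserves the extra specification checks mentioned above.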