[SciPy-User] Generalized least square on large dataset
Charles R Harris
charlesr.harris@gmail....
Thu Mar 8 12:35:08 CST 2012
On Thu, Mar 8, 2012 at 11:07 AM, <josef.pktd@gmail.com> wrote:
> On Thu, Mar 8, 2012 at 12:32 PM, Peter Cimermančič
> <peter.cimermancic@gmail.com> wrote:
> >>
> >>
> >> I would use SVD or eigenvalue decomposition to get the transformation
> >> matrix. With reduced rank and dropping zero eigenvalues, I think, the
> >> transformation will just drop some observations that are redundant.
> >>
> >> Or for normal equations, use X' pinv(V) X beta = X' pinv(V) y, which
> >> uses SVD inside and requires less work writing the code.
> >>
> >> I'm reasonably sure that I have seen the pinv used this way before.
> >>
> >> That still leaves going from similarity matrix to covariance matrix.
> >
> >
> > Yes, pinv() solved the compute problem (no errors anymore). I've also found
> > some papers describing how to get from a similarity matrix to correlation.
> > Do you maybe know, are p-values (from MSE calculation) fairly accurate this
> > way?
>
> While parameter estimates are pretty robust, the standard errors and
> the pvalues depend a lot on additional assumptions.
> If the assumptions are not satisfied with a given dataset, then the
> pvalues can be pretty far off. For example if your error covariance
> matrix (from the V) is misspecified, then it could be the case that
> the pvalues are not very accurate.
> In a small sample assuming normal distribution might be a problem, but
> I would expect that for 1000 observations (or close to it) asymptotic
> normality will be accurate enough.
>
> With only one regressor (plus constant) multicollinearity cannot have
> a negative impact, so I wouldn't expect any other numerical problems.
>
> If your pvalue is 0.04 or 0.11 then I would do some additional
> specification checks. If the pvalue is 0.6 or 1.e-4, then I wouldn't
> worry about pvalue accuracy.
>
> Comparing the GLS standard errors with the (in this case incorrect)
> standard errors from OLS might give some idea about how much p-values
> can change with your data.
>
> I would be interested in hearing how you get from a similarity matrix
> to correlation matrix in your case. I would like to see if it is very
> difficult to include something like this in statsmodels.
>
>
With a model this simple there are likely to be significant systematic
errors, which would make it even more difficult to interpret significance.
OTOH, this may be a case where the residuals are as interesting as the
parameter values.
Chuck
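
For readers of the archive, here is a minimal sketch of the pinv-based GLS
discussed in the quoted text, together with the OLS comparison of standard
errors and the residuals. It is not code from the thread: the function names
gls_pinv and ols, the synthetic data, the AR(1)-style covariance V, and the
choice of rank(V) minus the number of parameters as degrees of freedom are
all assumptions made for illustration.

```python
import numpy as np


def gls_pinv(X, y, V):
    """GLS via the normal equations X' pinv(V) X beta = X' pinv(V) y.

    Hypothetical helper: returns estimates, standard errors and residuals.
    """
    V_pinv = np.linalg.pinv(V)           # SVD-based pseudo-inverse of V
    XtVi = X.T @ V_pinv
    A_pinv = np.linalg.pinv(XtVi @ X)    # pinv of X' pinv(V) X
    beta = A_pinv @ (XtVi @ y)
    resid = y - X @ beta
    # Assumption: degrees of freedom = rank(V) - number of parameters,
    # which reduces to n - k when V has full rank.
    dof = np.linalg.matrix_rank(V) - X.shape[1]
    sigma2 = (resid @ V_pinv @ resid) / dof
    se = np.sqrt(np.diag(sigma2 * A_pinv))
    return beta, se, resid


def ols(X, y):
    """Ordinary least squares with the usual (uncorrelated-error) standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = (resid @ resid) / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se, resid


# Synthetic example: one regressor plus constant, correlated errors.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Assumed error covariance: V[i, j] = rho ** |i - j| (illustrative only).
rho = 0.6
idx = np.arange(n)
V = rho ** np.abs(idx[:, None] - idx[None, :])

# Draw errors with covariance V and build the response.
errors = np.linalg.cholesky(V) @ rng.normal(size=n)
y = X @ np.array([1.0, 2.0]) + errors

beta_gls, se_gls, resid_gls = gls_pinv(X, y, V)
beta_ols, se_ols, resid_ols = ols(X, y)

print("GLS beta:", beta_gls, "se:", se_gls)
print("OLS beta:", beta_ols, "se:", se_ols)
print("largest |GLS residual|:", np.abs(resid_gls).max())
```

Comparing se_gls with se_ols gives a rough idea of how sensitive the p-values
are to the assumed V, and resid_gls can be inspected directly for systematic
structure that the simple model misses.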