[SciPy-User] Generalized least square on large dataset

Nathaniel Smith njs@pobox....
Thu Mar 8 13:04:31 CST 2012


On Thu, Mar 8, 2012 at 6:07 PM,  <josef.pktd@gmail.com> wrote:
> While parameter estimates are pretty robust, the standard errors and
> the pvalues depend a lot on additional assumptions.
> If the assumptions are not satisfied for a given dataset, then the
> pvalues can be pretty far off. For example if your error covariance
> matrix (from the V) is misspecified, then it could be the case that
> the pvalues are not very accurate.
> In a small sample assuming normal distribution might be a problem, but
> I would expect that for 1000 observations (or close to it) asymptotic
> normality will be accurate enough.
>
> With only one regressor (plus constant) multicollinearity cannot have
> a negative impact, so I wouldn't expect any other numerical problems.
>
> If your pvalue is 0.04 or 0.11 then I would do some additional
> specification checks. If the pvalue is 0.6 or 1.e-4, then I wouldn't
> worry about pvalue accuracy.
>
> Comparing the GLS standard errors with the (in this case incorrect)
> standard errors from OLS might give some idea about how much p-values
> can change with your data.

These kinds of GLS models are one of the places where having the wrong
model can give you arbitrarily spurious p values. To get an intuition,
consider the case where your errors are all very highly correlated, so
while you made N measurements, you really only effectively have 1.
Without proper correction, as N increases, your p value will get
arbitrarily small... even though you still only have 1 real data
point. Most cases aren't so extreme, of course, but that's the kind of
thing you have to be careful of -- underestimating your correlations =
overestimating your significance.
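
To put rough numbers on that intuition, here is a minimal sketch, assuming the simplest possible case: an intercept-only model whose errors all share one common correlation rho. The function name se_of_mean and the values rho=0.9 and sigma=1 are made up for illustration and are not taken from the original data. It compares the standard error you would report if you assumed independent errors with the actual standard error of the same estimate under the correlated covariance matrix V.

```python
import numpy as np

def se_of_mean(n, rho, sigma=1.0):
    """Return (naive_se, true_se) for the sample mean of n equicorrelated errors.

    Hypothetical helper for illustration only.
    """
    # Covariance matrix: sigma^2 on the diagonal, rho * sigma^2 everywhere else.
    V = sigma**2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))
    w = np.full(n, 1.0 / n)        # OLS weights for an intercept-only model
    naive_se = sigma / np.sqrt(n)  # standard error if errors were independent
    true_se = np.sqrt(w @ V @ w)   # actual sampling sd of the mean under V
    return naive_se, true_se

for n in (10, 100, 1000):
    naive, true = se_of_mean(n, rho=0.9)
    print(f"N={n:5d}  naive SE={naive:.3f}  true SE={true:.3f}")
```

The naive standard error keeps shrinking like 1/sqrt(N), so the p value keeps shrinking with it, while the true standard error levels off near sigma*sqrt(rho). That gap is the "N measurements but effectively 1 data point" problem.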

A good thing to do is check whether the resulting residuals "look
uncorrelated" -- if you have corrected for similarity in the analysis,
then bacteria that are similar to each other should not have similar
residuals, overall. A coarse check of this would be to come up with
some method for visualizing similarity spatially (like clustering your
bacteria into a dendrogram, or using factor analysis to plot coarse
similarity in 1 or 2 dimensions), and then using this to arrange your
residuals. Then you'd want to check that you don't see any overall
patterns, like one part of the plot has residuals that are
systematically larger than another part.
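
One way to sketch this check, assuming you have some per-bacterium feature matrix to measure similarity with: cluster the items, put the residuals in dendrogram leaf order, and look for structure along that order. Everything below is synthetic and hypothetical (the four made-up clusters, the "clean" and "leaky" residuals, and the helper names), and the lag-1 autocorrelation is just one crude numeric stand-in for eyeballing a plot, not a formal test.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

def residuals_in_dendrogram_order(features, resid):
    """Reorder residuals so that similar items end up next to each other."""
    Z = linkage(pdist(features), method="average")  # average-linkage dendrogram
    return resid[leaves_list(Z)]                    # leaf order, left to right

def lag1_autocorr(x):
    """Correlation between neighbouring entries of x; near 0 means no pattern."""
    x = x - x.mean()
    return float((x[:-1] @ x[1:]) / (x @ x))

# Synthetic stand-in data: 4 well-separated groups of 50 items, shuffled.
rng = np.random.default_rng(0)
labels = rng.permutation(np.repeat(np.arange(4), 50))
centers = np.array([[0, 0], [20, 0], [0, 20], [20, 20]], dtype=float)
features = centers[labels] + rng.normal(size=(200, 2))

clean = rng.normal(size=200)  # residuals unrelated to similarity
offsets = np.array([-3.0, -1.0, 1.0, 3.0])
leaky = offsets[labels] + 0.5 * rng.normal(size=200)  # residuals that track the groups

r_clean = lag1_autocorr(residuals_in_dendrogram_order(features, clean))
r_leaky = lag1_autocorr(residuals_in_dendrogram_order(features, leaky))
print("clean:", round(r_clean, 2), " leaky:", round(r_leaky, 2))
```

If the correction for similarity worked, the reordered residuals should look like the "clean" case, with neighbours no more alike than chance. A clearly positive value, as in the "leaky" case, means similar items still share residual structure. Plotting the reordered residuals is the more direct version of the same check.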

- N

