[SciPy-User] Generalized least square on large dataset

josef.pktd@gmai... josef.pktd@gmai...
Thu Mar 8 10:58:42 CST 2012

On Thu, Mar 8, 2012 at 11:31 AM, Nathaniel Smith <njs@pobox.com> wrote:
> On Thu, Mar 8, 2012 at 4:09 PM, Peter Cimermančič
> <peter.cimermancic@gmail.com> wrote:
>>> I agree with Josef, the first thing that comes to mind is controlling
>>> for spatial effects (which happens in various fields; ecology folks
>>> worry about this a lot too).
>>> In this case, though, I think you may need to think more carefully
>>> about whether your similarity measure is really appropriate. If your
>>> matrix is uninvertible, then IIUC that means you think that you
>>> effectively have less than 1000 distinct genomes -- some of your
>>> genomes are "so similar" to other ones that they can be predicted
>>> *exactly*.
>> That's exactly true - some of the bacteria are almost identical (I'll try
>> filtering those and see if it changes anything).
> You aren't just telling the computer that they're almost identical --
> that would be fine, the model would just mostly-but-not-entirely
> ignore the near-duplicates. You're telling the computer that they are
> exactly identical and you had no reason to even collect the data
> because you knew ahead of time exactly what it would be. This is the
> sort of thing that really confuses statistical programs :-).
>>> In terms of the underlying probabilistic model: you have some
>>> population of bacteria genomes, and you picked 1000 of them to study.
>>> Each genome you picked has some length, and it also has a number of
>>> genes. The number of genes is determined probabilistically by taking
>>> some linear function of the length, and then adding some Gaussian
>>> noise. Your goal is to figure out what that linear function looks
>>> like.
>>> In OLS, we assume that each of those Gaussian noise terms is IID. In
>>> GLS, we assume that they're correlated. The way to think about this is
>>> that we take 1000 uncorrelated IID gaussian samples, let's call this
>>> vector "g", and then we mix them together by multiplying by a matrix
>>> chol(V), chol(V)*g. (cholV) is the cholesky decomposition; it's
>>> triangular, and chol(V)*chol(V)' = V.) So the noise added to each
>>> measurement is a mixture of these underlying IID gaussian terms, and
>>> bacteria that are more similar have noise terms that overlap more.
>> I'm also unable to calculate chol of my V matrix, because it doesn't appear
>> to be a positive definite. Any suggestion here?
> Singular matrices can't be positive definite, by definition. They can
> be positive semi-definite. (The analogy is numbers -- a number that is
> zero cannot be greater than zero, by definition. But it can be >=
> zero.) Any well-defined covariance matrix is necessarily positive
> semi-definite. If your covariance matrix isn't positive semi-definite,
> then that's like claiming that you have three random variables where A
> and B have a correlation of 0.99, and B and C have a correlation of
> 0.99, but A and C are uncorrelated. That's impossible. ([[1, 0.99, 0],
> [0.99, 1, 0.99], [0, 0.99, 1]] is not a positive-definite matrix.)
> Singular, positive semi-definite matrices do *have* Cholesky
> decompositions, but your average off-the-shelf Cholesky routine can't
> compute them. (Again by analogy -- in theory you can compute the
> square root of zero, but in practice you can't reliably with floating
> point, because your "zero" may turn out to actually be represented as
> "-2.2e-16" or something, and an off-the-shelf square root routine will
> blow up on this because it looks negative.) You can look around for a
> "rank-revealing Cholesky", perhaps.

I would use SVD or eigenvalue decomposition to get the transformation
matrix. With reduced rank and dropping zero eigenvalues, I think, the
transformation will just drop some observations that are redundant.

Or for normal equations, use X pinv(V) X beta = X pinv(V) y    which
uses SVD inside and requires less work writing the code.

I'm reasonably sure that I have seen the pinv used this way before.

That still leaves going from similarity matrix to covariance matrix.


> Anyway, the question is whether your matrix is positive semi-definite.
> If it is, then this is all expected, and your problem is just that you
> need to fix your covariances to be more realistic, as discussed. If it
> isn't, then you don't even have a covariance matrix, and again you
> need to figure out how to get one :-). You can check for positive
> (semi-)definiteness by looking at the eigenvalues -- they should be
> all >= 0 for semi-definite, > 0 for definite.
> The easiest way to manufacture a positive-definite matrix on command
> is to take a non-singular matrix A and compute A'A.
> HTH,
> - N
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user

More information about the SciPy-User mailing list