[SciPy-User] Generalized least square on large dataset
Thu Mar 8 10:31:52 CST 2012
On Thu, Mar 8, 2012 at 4:09 PM, Peter Cimermančič wrote:
>> I agree with Josef, the first thing that comes to mind is controlling
>> for spatial effects (which happens in various fields; ecology folks
>> worry about this a lot too).
>> In this case, though, I think you may need to think more carefully
>> about whether your similarity measure is really appropriate. If your
>> matrix is uninvertible, then IIUC that means you think that you
>> effectively have less than 1000 distinct genomes -- some of your
>> genomes are "so similar" to other ones that they can be predicted
>> exactly from the others.
> That's exactly true - some of the bacteria are almost identical (I'll try
> filtering those and see if it changes anything).
You aren't just telling the computer that they're almost identical --
that would be fine, the model would just mostly-but-not-entirely
ignore the near-duplicates. You're telling the computer that they are
exactly identical and you had no reason to even collect the data
because you knew ahead of time exactly what it would be. This is the
sort of thing that really confuses statistical programs :-).
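To see why exact duplicates break things, here's a toy example (with a made-up similarity matrix, not your data) where genome 2 is recorded as exactly identical to genome 1 -- the matrix loses a rank and becomes singular:

```python
import numpy as np

# Hypothetical 3-genome similarity matrix; rows/columns 0 and 1
# are exact copies of each other
V = np.array([[1.0, 1.0, 0.3],
              [1.0, 1.0, 0.3],
              [0.3, 0.3, 1.0]])

print(np.linalg.matrix_rank(V))   # 2, not 3: the matrix is singular
```

With rank 2 out of 3, any routine that needs to invert V (as GLS does) will fail.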
>> In terms of the underlying probabilistic model: you have some
>> population of bacteria genomes, and you picked 1000 of them to study.
>> Each genome you picked has some length, and it also has a number of
>> genes. The number of genes is determined probabilistically by taking
>> some linear function of the length, and then adding some Gaussian
>> noise. Your goal is to figure out what that linear function looks
>> like.
>> In OLS, we assume that each of those Gaussian noise terms is IID. In
>> GLS, we assume that they're correlated. The way to think about this is
>> that we take 1000 uncorrelated IID gaussian samples, let's call this
>> vector "g", and then we mix them together by multiplying by a matrix
>> chol(V), chol(V)*g. (chol(V) is the Cholesky decomposition; it's
>> triangular, and chol(V)*chol(V)' = V.) So the noise added to each
>> measurement is a mixture of these underlying IID gaussian terms, and
>> bacteria that are more similar have noise terms that overlap more.
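As a sketch of that construction (using a made-up, well-behaved 3x3 covariance matrix V, not your actual data), the chol(V)*g recipe looks like this in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical positive-definite covariance matrix V
V = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])

L = np.linalg.cholesky(V)      # lower triangular, with L @ L.T == V
g = rng.standard_normal(3)     # IID standard-normal draws
noise = L @ g                  # correlated noise with covariance V
```

Averaged over many such draws, the sample covariance of the noise vectors converges to V.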
> I'm also unable to calculate chol of my V matrix, because it doesn't appear
> to be positive definite. Any suggestion here?
Singular matrices can't be positive definite, by definition. They can
be positive semi-definite. (The analogy is numbers -- a number that is
zero cannot be greater than zero, by definition. But it can be >=
zero.) Any well-defined covariance matrix is necessarily positive
semi-definite. If your covariance matrix isn't positive semi-definite,
then that's like claiming that you have three random variables where A
and B have a correlation of 0.99, and B and C have a correlation of
0.99, but A and C are uncorrelated. That's impossible. ([[1, 0.99, 0],
[0.99, 1, 0.99], [0, 0.99, 1]] is not even positive semi-definite; its
smallest eigenvalue is about -0.4.)
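You can verify both claims about that matrix directly -- the eigenvalues show it isn't positive semi-definite, and an off-the-shelf Cholesky routine rejects it:

```python
import numpy as np

# The "impossible correlations" matrix from above
C = np.array([[1.0, 0.99, 0.0],
              [0.99, 1.0, 0.99],
              [0.0, 0.99, 1.0]])

print(np.linalg.eigvalsh(C))   # smallest eigenvalue is about -0.40

try:
    np.linalg.cholesky(C)
except np.linalg.LinAlgError as e:
    print("Cholesky failed:", e)
```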
Singular, positive semi-definite matrices do *have* Cholesky
decompositions, but your average off-the-shelf Cholesky routine can't
compute them. (Again by analogy -- in theory you can compute the
square root of zero, but in practice you can't do so reliably in
floating point, because your "zero" may actually be represented as
"-2.2e-16" or something, and an off-the-shelf square-root routine will
blow up on it because it looks negative.) You can look around for a
"rank-revealing Cholesky", perhaps.
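The floating-point analogy is easy to reproduce: a quantity that is zero on paper can come out as a tiny negative float, and sqrt rejects it.

```python
import math

z = 0.3 - (0.1 + 0.2)    # zero on paper, about -5.6e-17 in floats
print(z)

try:
    math.sqrt(z)
except ValueError:
    print("sqrt refused a 'zero' that came out negative")
```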
Anyway, the question is whether your matrix is positive semi-definite.
If it is, then this is all expected, and your problem is just that you
need to fix your covariances to be more realistic, as discussed. If it
isn't, then you don't even have a covariance matrix, and again you
need to figure out how to get one :-). You can check for positive
(semi-)definiteness by looking at the eigenvalues -- they should be
all >= 0 for semi-definite, > 0 for definite.
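A minimal sketch of that eigenvalue check (the function name and tolerance are my own, chosen to absorb the floating-point jitter discussed above):

```python
import numpy as np

def is_positive_semidefinite(V, tol=1e-10):
    """Eigenvalue test for positive semi-definiteness.

    tol absorbs floating-point jitter, so tiny negative
    eigenvalues like -2.2e-16 still count as "zero".
    """
    V = np.asarray(V)
    if not np.allclose(V, V.T):
        return False              # a covariance matrix must be symmetric
    return np.linalg.eigvalsh(V).min() >= -tol

print(is_positive_semidefinite(np.eye(3)))          # True
print(is_positive_semidefinite([[1, 0.99, 0],
                                [0.99, 1, 0.99],
                                [0, 0.99, 1]]))     # False
```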
The easiest way to manufacture a positive-definite matrix on command
is to take a non-singular matrix A and compute A'A.
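For example (a sketch with a random A, which is almost surely non-singular):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))   # random square A: almost surely non-singular
V = A.T @ A                       # symmetric and positive definite

print(np.linalg.eigvalsh(V).min())   # all eigenvalues strictly positive
np.linalg.cholesky(V)                # and Cholesky succeeds
```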