# [SciPy-User] Generalized least square on large dataset

Peter Cimermančič peter.cimermancic@gmail....
Thu Mar 8 10:09:04 CST 2012

>
>
>
> I agree with Josef, the first thing that comes to mind is controlling
> for spatial effects (which happens in various fields; ecology folks
>
> In this case, though, I think you may need to think more carefully
> about whether your similarity measure is really appropriate. If your
> matrix is uninvertible, then IIUC that means you think that you
> effectively have less than 1000 distinct genomes -- some of your
> genomes are "so similar" to other ones that they can be predicted
> *exactly*.
>

That's exactly true - some of the bacteria are almost identical (I'll try
filtering those and see if it changes anything).

>
> In terms of the underlying probabilistic model: you have some
> population of bacteria genomes, and you picked 1000 of them to study.
> Each genome you picked has some length, and it also has a number of
> genes. The number of genes is determined probabilistically by taking
> some linear function of the length, and then adding some Gaussian
> noise. Your goal is to figure out what that linear function looks
> like.
>
> In OLS, we assume that each of those Gaussian noise terms is IID. In
> GLS, we assume that they're correlated. The way to think about this is
> that we take 1000 uncorrelated IID gaussian samples, let's call this
> vector "g", and then we mix them together by multiplying by a matrix
> chol(V), chol(V)*g. (cholV) is the cholesky decomposition; it's
> triangular, and chol(V)*chol(V)' = V.) So the noise added to each
> measurement is a mixture of these underlying IID gaussian terms, and
> bacteria that are more similar have noise terms that overlap more.
>
>
I'm also unable to calculate chol of my V matrix, because it doesn't appear
to be a positive definite. Any suggestion here?

> If V is singular, then this means that the last k rows of chol(V) are
> all-zero, which means that when you compute chol(V)*g, there are some
> elements of g that get ignored entirely -- they don't mixed in at all,
> and don't effect any bacteria. So, your choice of V is encoding an
> assumption that there are really only, like, 900 noise terms and 1000
> bacteria, and 'g' effectively only has 900 entries. So if you make the
> measurements for the first 900 bacteria, you should be able to
> reconstruct the full vector 'g', and then you can use it to compute
> *exactly* what measurements you will see for the last 100 bacteria.
> And also you can compute the linear relationship exactly. No need to
> do any hypothesis tests on the result (and in fact you can't do any
> hypothesis tests on the result, the math won't work), because you know
> The Truth!
>
> Of course none of these assumptions are actually true. Your bacteria
> are less similar to each other -- and your measurements more noisy --
> than your V matrix claims.
>
> So you need a better way to compute V. The nice thing about the above
> derivation -- and the reason I bothered to go through it -- is that it
> tells you what entries in V mean, numerically. Ideally you should
> figure out how to re-calibrate your similarity score so that bacteria
> which are 0.5 similar have a covariance of 0.5 in their noise --
> perhaps by calculating empirical covariances on other measures or
> something, figuring out the best way to do this calibration will take
> domain knowledge. The ecology folks might have some practical ideas on
> how to calibrate such things.
>
> Or, you could just replace V with V+\lambda*I. That'd solve the
> numerical problem, but you should be very suspicious of any p-values
> you get out, since they are based on lies ;-).
>

Yes, p-values are something I'd eventually like to come close to.