# [SciPy-User] Generalized least square on large dataset

Peter Cimermančič peter.cimermancic@gmail....
Wed Mar 7 22:25:36 CST 2012

```To describe my problem into more details, I have a list of ~1000 bacterial
genome lengths and number of certain genes for each one of them. I'd like
to see if there is any correlation between genome lengths and number of the
genes. It may look like an easy linear regression problem; however, one has
to be a bit more careful as the measurements aren't sampled independently.
Bacteria, whose genomes are similar, tend to also contain similar number of
the genes. Bacterial similarity is what is described with matrix V - it
contains similarity values for each pair of bacteria, ranging from 0 to 1.

On Wed, Mar 7, 2012 at 8:00 PM, Charles R Harris
<charlesr.harris@gmail.com>wrote:

>
>
> On Wed, Mar 7, 2012 at 8:46 PM, Charles R Harris <
> charlesr.harris@gmail.com> wrote:
>
>>
>>
>> On Wed, Mar 7, 2012 at 7:39 PM, Peter Cimermančič <
>> peter.cimermancic@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'd like to linearly fit the data that were NOT sampled independently. I
>>> came across generalized least square method:
>>>
>>> b=(X'*V^(-1)*X)^(-1)*X'*V^(-1)*Y
>>>
>>> X and Y are coordinates of the data points, and V is a "variance matrix".
>>>
>>> The equation is Matlab format - I've tried solving problem there too,
>>> bit it didn't work - but eventually I'd like to be able to solve problems
>>> like that in python. The problem is that due to its size (1000 rows and
>>> columns), the V matrix becomes singular, thus un-invertable. Any
>>> suggestions for how to get around this problem? Maybe using a way of
>>> solving generalized linear regression problem other than GLS?
>>>
>>>
>> Plain old least squares will probably do a decent job for the fit, where
>> you will run into trouble is if you want to estimate the covariance. The
>> idea of using the variance matrix is to transform the data set into
>> independent observations of equal variance, but except in extreme cases
>> that shouldn't really be necessary if you have sufficient data points.
>> Weighting the data is a simple case of this that merely equalizes the
>> variance, and it often doesn't make that much difference.
>>
>>
> To expand a bit, if it is simply the case that the measurement errors
> aren't independent and you know their covariance, then you want to minimize
> (y - Ax)^T * cov^-1 * (y - ax) and if you factor cov^-1 into U^T * U, then
> you can solve the ordinary least squares problem U*A*x = U*y. I can't
> really tell what your data/problem is like without more details.
>
> Chuck
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/scipy-user/attachments/20120307/10c8b9a8/attachment-0001.html
```