[SciPy-User] Generalized least square on large dataset
Wed Mar 7 23:36:35 CST 2012
On Wed, Mar 7, 2012 at 11:25 PM, Peter Cimermančič wrote:
> To describe my problem in more detail, I have a list of ~1000 bacterial
> genome lengths and the number of certain genes for each one of them. I'd
> like to see if there is any correlation between genome lengths and the
> number of genes. It may look like an easy linear regression problem;
> however, one has to be a bit more careful, as the measurements aren't
> sampled independently. Bacteria whose genomes are similar tend to also
> contain a similar number of genes. Bacterial similarity is what is
> described with matrix V - it contains similarity values for each pair of
> bacteria, ranging from 0 to 1.
> Anybody encountered similar problem already?
The closest I can think of is spatial econometrics, where V is the
spatial weight matrix. But I never looked at the details of the spatial
specification, and I don't know whether pysal covers this case.
(The V is also similar to the correlation in Gaussian Processes, but
they do only local estimation as far as I have looked at it.)
(I have only vague ideas about this case.) If you have the
distances/similarity, then you need to estimate the correlation as a
function of the similarity. If the correlation matrix is not
invertible, then it should be possible to just use the generalized
inverse, pinv, of V. 1000x1000 doesn't sound too big for pinv or for an
svd. But I don't see any reason that the covariance matrix should be
singular in this case.
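As a sketch of the pinv route (with synthetic stand-ins for the genome
lengths, gene counts, and similarity matrix - none of these numbers come
from the actual data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: genome lengths (x) and gene counts (y).
n = 50
x = rng.uniform(1e6, 10e6, size=n)
X = np.column_stack([np.ones(n), x])          # design matrix with intercept
y = 0.001 * x + rng.normal(scale=500.0, size=n)

# A similarity/correlation matrix V (Toeplitz here, purely illustrative).
V = 0.9 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

# GLS with the generalized inverse: b = (X' V^+ X)^+ X' V^+ y.
# pinv coincides with inv when V is invertible, and stays defined when
# V is singular.
V_pinv = np.linalg.pinv(V)
b = np.linalg.pinv(X.T @ V_pinv @ X) @ (X.T @ V_pinv @ y)
print(b)   # [intercept, slope]
```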
But as Chuck said, OLS would still be a consistent estimator, but for
standard errors a correction will be necessary. (If it's not in pysal,
then it might not be trivial to work out the correction of the standard
errors.)
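One textbook form of that correction, when the error covariance is known
up to scale (sigma^2 * V), is the sandwich formula
cov(b) = (X'X)^-1 X' (sigma^2 V) X (X'X)^-1. A minimal sketch on
synthetic data (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(1e6, 10e6, size=n)
X = np.column_stack([np.ones(n), x])

# Error correlation matrix and correlated noise: if V = L L',
# then L @ z has covariance V for iid standard normal z.
V = 0.9 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(V)
y = 0.001 * x + 500.0 * (L @ rng.normal(size=n))

# OLS point estimate (consistent even with correlated errors).
XtX_inv = np.linalg.inv(X.T @ X)
b_ols = XtX_inv @ (X.T @ y)

# Sandwich covariance using the known error covariance sigma^2 * V.
cov_b = XtX_inv @ (X.T @ (500.0**2 * V) @ X) @ XtX_inv
se = np.sqrt(np.diag(cov_b))
print(b_ols, se)
```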
(...) == maybe on topic
> On Wed, Mar 7, 2012 at 8:00 PM, Charles R Harris <firstname.lastname@example.org> wrote:
>> On Wed, Mar 7, 2012 at 8:46 PM, Charles R Harris
>> <email@example.com> wrote:
>>> On Wed, Mar 7, 2012 at 7:39 PM, Peter Cimermančič
>>> <firstname.lastname@example.org> wrote:
>>>> I'd like to linearly fit the data that were NOT sampled independently. I
>>>> came across the generalized least squares method:
>>>>
>>>> b = inv(X'*inv(V)*X)*X'*inv(V)*Y
>>>>
>>>> X and Y are coordinates of the data points, and V is a "variance
>>>> matrix". The equation is in Matlab format - I've tried solving the
>>>> problem there too, but it didn't work - but eventually I'd like to be
>>>> able to solve problems like that in python. The problem is that due to
>>>> its size (1000 rows and columns), the V matrix becomes singular, and
>>>> thus un-invertible. Any suggestions for how to get around this problem?
>>>> Maybe using a way of solving the generalized linear regression problem
>>>> other than GLS?
>>> Plain old least squares will probably do a decent job for the fit;
>>> where you will run into trouble is if you want to estimate the
>>> covariance. The
>>> idea of using the variance matrix is to transform the data set into
>>> independent observations of equal variance, but except in extreme cases that
>>> shouldn't really be necessary if you have sufficient data points. Weighting
>>> the data is a simple case of this that merely equalizes the variance, and it
>>> often doesn't make that much difference.
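This can be checked numerically: on synthetic data with correlated
errors, the plain lstsq fit and the full GLS fit typically land on
nearly the same coefficients (true slope 2.0 here; everything below is
illustrative, not the poster's data):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# AR(1)-style error correlation and correlated noise via Cholesky.
V = 0.7 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Lc = np.linalg.cholesky(V)
y = 3.0 + 2.0 * x + Lc @ rng.normal(size=n)

# Ordinary least squares, ignoring the correlation.
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Full GLS using the known V.
Vi = np.linalg.inv(V)
b_gls = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)
print(b_ols, b_gls)
```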
>> To expand a bit, if it is simply the case that the measurement errors
>> aren't independent and you know their covariance, then you want to minimize
>> (y - Ax)^T * cov^-1 * (y - Ax) and if you factor cov^-1 into U^T * U, then
>> you can solve the ordinary least squares problem U*A*x = U*y. I can't really
>> tell what your data/problem is like without more details.
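The whitening step Chuck describes can be sketched in a few lines. With
the Cholesky factor cov = L L', taking U = L^-1 gives cov^-1 = U'U, and
U can be applied by triangular solves rather than forming an inverse
(synthetic data, true coefficients [1.0, 2.0]):

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(2)
n = 40
A = np.column_stack([np.ones(n), rng.normal(size=n)])

# Known error covariance and correlated noise drawn from it.
cov = 0.8 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Lc = np.linalg.cholesky(cov)                  # cov = Lc @ Lc.T
y = A @ np.array([1.0, 2.0]) + Lc @ rng.normal(size=n)

# Whitening: U = Lc^{-1}, applied via triangular solves.
UA = linalg.solve_triangular(Lc, A, lower=True)
Uy = linalg.solve_triangular(Lc, y, lower=True)

# Ordinary least squares on the whitened system U*A*x = U*y.
x_gls, *_ = np.linalg.lstsq(UA, Uy, rcond=None)
print(x_gls)
```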