[SciPy-User] Generalized least square on large dataset

Sturla Molden sturla@molden...
Sat Mar 10 09:05:58 CST 2012

On 10.03.2012 14:57, josef.pktd@gmail.com wrote:
> He explained the between sample correlation with the similarity (my
> analogy autocorrelation in time series, or spatial correlation).

Look at his attachment ives.tiff.

If the categories are known in advance (right panel in
ives.tiff), I think what he actually needs is to compute
the likelihood ratio between the full model

     log(lambda) = b[0] + b[1] * genome_length
           + np.dot(b[2:N+1], group[0:N-1])

and a reduced model

     log(lambda) = b[0] + np.dot(b[1:N], group[0:N-1])

That is, adding genome length as a predictor should not
improve the fit given that bacterial groups are already in
the model.

If he does not have groups, but some sort of dendrogram
(left panel in ives.tiff), perhaps he could preprocess the
data by clustering the bacteria based on his dendrogram?

A full dendrogram (e.g. used as nested log-linear model)
would overfit the data and explain it perfectly. So adding
genome length would always give zero improvement. But if
the dendrogram can be reduced into a few discrete categories,
he could use a likelihood ratio test for the genome length.
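One way to do that reduction with scipy's hierarchical clustering
tools (a sketch; the linkage here is built from random placeholder
features, not a real phylogenetic dendrogram):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
features = rng.rand(20, 4)             # 20 "bacteria", 4 dummy features

# Build a hierarchy and cut it into at most 4 flat clusters; the
# resulting labels can serve as the categorical group predictor
# in the models above.
Z = linkage(pdist(features), method='average')
labels = fcluster(Z, t=4, criterion='maxclust')
```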
