[SciPy-User] Generalized least square on large dataset
Sturla Molden
sturla@molden...
Sat Mar 10 09:05:58 CST 2012
Den 10.03.2012 14:57, skrev josef.pktd@gmail.com:
>
> He explained the between-sample correlation with the similarity (my
> analogy: autocorrelation in time series, or spatial correlation).
>
>
Look at his attachment ives.tiff.
If the categories are known in advance (right panel in
ives.tiff), I think what he actually needs is to compute
the likelihood ratio between the model
log(lambda) = b[0] + b[1] * genome_length
+ np.dot(b[2:N+1], group[0:N-1])
and a reduced model
log(lambda) = b[0] + np.dot(b[1:N], group[0:N-1])
That is, adding genome length as a predictor should not
improve the fit given that bacterial groups are already in
the model.
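A minimal sketch of such a likelihood ratio test, on made-up Poisson count
data with hypothetical group labels and genome lengths (all names and data
here are illustrative, not from his dataset; plain scipy.optimize is used,
though a GLM package would work too):

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)

# Hypothetical data: n observations in N bacterial groups.
N = 3
n = 60
group = rng.integers(0, N, size=n)             # group label per bacterium
genome_length = rng.normal(0.0, 1.0, size=n)   # standardized genome length
# Simulate counts that depend only on group (so the null holds here).
y = rng.poisson(np.exp(0.5 + 0.3 * group))

# Design matrices: intercept plus N-1 group dummies; the full model
# additionally includes genome length as a predictor.
dummies = (group[:, None] == np.arange(1, N)).astype(float)
X_reduced = np.column_stack([np.ones(n), dummies])
X_full = np.column_stack([np.ones(n), genome_length, dummies])

def negloglik(b, X):
    # Poisson log-likelihood up to an additive constant (log(y!) cancels
    # in the likelihood ratio, so it is dropped here).
    eta = X @ b
    return -(y * eta - np.exp(eta)).sum()

def max_loglik(X):
    res = optimize.minimize(negloglik, np.zeros(X.shape[1]), args=(X,))
    return -res.fun

# Likelihood ratio statistic: chi-square with 1 df under the null,
# since the full model has exactly one extra parameter (genome length).
lr = 2.0 * (max_loglik(X_full) - max_loglik(X_reduced))
p = stats.chi2.sf(lr, df=1)
print(f"LR = {lr:.3f}, p = {p:.3f}")
```

A small p-value would indicate that genome length still improves the fit
after the group structure is accounted for.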
If he does not have groups, but some sort of dendrogram
(left panel in ives.tiff), perhaps he could preprocess the
data by clustering the bacteria based on his dendrogram?
A full dendrogram (e.g. used as a nested log-linear model)
would overfit the data and explain it perfectly, so adding
genome length would always give zero improvement. But if
the dendrogram can be reduced to a few discrete categories,
he could use a likelihood ratio test for the genome length.
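The reduction step could look something like this: cut the hierarchy at a
small number of clusters with scipy.cluster.hierarchy. The trait matrix
below is a made-up stand-in for whatever distances underlie his dendrogram:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# Hypothetical feature matrix for 20 bacteria (stand-in for real distances).
traits = rng.normal(size=(20, 5))

# Build the hierarchy (his dendrogram would replace this linkage matrix).
Z = linkage(pdist(traits), method="average")

# Cut the tree into at most 4 discrete clusters; these labels then serve
# as the group variable in the reduced log-linear model above.
labels = fcluster(Z, t=4, criterion="maxclust")
print(labels)
```

The resulting labels play the role of the known categories, at the cost of
choosing how many clusters to keep.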
Sturla