[SciPy-User] Generalized least square on large dataset
Sturla Molden
sturla@molden...
Sat Mar 10 07:45:11 CST 2012
On 09.03.2012 21:13, josef.pktd@gmail.com wrote:
> I think Sturla has a point in that both count and length are positive.
> It doesn't look like it's relevant for length, but in the counts there
> is a bunching just above zero, this creates either a non-linearity or
> requires another distribution log-normal (?) or Poisson (without
> zeros, or loc=1)? Josef
You can see that the dependent variable is counts with most of them
below 10. So I maintain that the appropriate model is Poisson regression.
That is,
COX_count ~ Poisson(lambda)
with
log(lambda) = b0 + b1 * genome_length
Or if there are N groups of bacteria,
log(lambda) = b[0] + b[1] * genome_length
+ np.dot(b[2:N+1], group[0:N-1])
with N-1 dummy indicator variables in the vector "group".
One could of course consider even more complicated models, such as
interaction terms between bacterial group and genome length. It's just a
matter of adding in the appropriate predictor variables.
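To make the model above concrete, here is a minimal sketch that fits it by Newton-Raphson using only NumPy. The data are simulated stand-ins, not Peter's data set: the sample size, the number of groups, the coefficient values and the helper name fit_poisson are all assumptions made for illustration. In practice a GLM routine from a statistics package would do the same job.

```python
import numpy as np

def fit_poisson(X, y, n_iter=50, tol=1e-10):
    """Maximum-likelihood fit of log(lambda) = X @ b by Newton-Raphson.

    Assumes column 0 of X is the intercept and that y has a positive mean.
    """
    b = np.zeros(X.shape[1])
    b[0] = np.log(y.mean())              # start from the intercept-only fit
    for _ in range(n_iter):
        lam = np.exp(X @ b)
        grad = X.T @ (y - lam)           # score vector
        hess = X.T @ (X * lam[:, None])  # Fisher information
        step = np.linalg.solve(hess, grad)
        b += step
        if np.max(np.abs(step)) < tol:
            break
    return b

# Simulated stand-in data (hypothetical, not the data discussed in the thread)
rng = np.random.default_rng(0)
n, N = 2000, 3                                  # samples, bacterial groups
genome_length = rng.uniform(1.0, 10.0, size=n)  # arbitrary units
labels = rng.integers(0, N, size=n)             # group membership 0..N-1
# N-1 dummy indicators; group 0 is the reference level
group = (labels[:, None] == np.arange(1, N)).astype(float)

# Columns: intercept, genome_length, then the N-1 dummies
X = np.column_stack([np.ones(n), genome_length, group])
b_true = np.array([0.2, 0.15, 0.3, -0.4])       # made-up coefficients
COX_count = rng.poisson(np.exp(X @ b_true))

b_hat = fit_poisson(X, COX_count)
print("true b:     ", b_true)
print("estimated b:", np.round(b_hat, 3))
```

An interaction between group and genome length would just be extra columns in X, e.g. group * genome_length[:, None].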
Normally, the p-value of a Poisson regression model can be inferred from
the likelihood ratio against a reduced model if samples are independent.
But if samples are not independent, one cannot assume that the total
log-likelihood for the whole data is the sum of log-likelihoods for each
data point. So Peter would need to derive a correction for this. I
cannot be more specific because I don't know the specifics about how
this between-sample dependency is generated. Perhaps Peter could explain it?
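For the independent case only, the likelihood ratio test mentioned above could be sketched as follows. This compares a model with genome length against an intercept-only model on simulated data; the data, the coefficients and the helper names are assumptions for illustration. It does not attempt any correction for dependent samples, so it would not be valid as-is if the samples are dependent.

```python
import math
import numpy as np

def poisson_loglik(X, y, b):
    """Poisson log-likelihood, omitting the log(y!) term that cancels in the ratio."""
    eta = X @ b
    return np.sum(y * eta - np.exp(eta))

def fit_poisson(X, y, n_iter=50, tol=1e-10):
    """Newton-Raphson fit of log(lambda) = X @ b; column 0 of X is the intercept."""
    b = np.zeros(X.shape[1])
    b[0] = np.log(y.mean())
    for _ in range(n_iter):
        lam = np.exp(X @ b)
        step = np.linalg.solve(X.T @ (X * lam[:, None]), X.T @ (y - lam))
        b += step
        if np.max(np.abs(step)) < tol:
            break
    return b

# Simulated, independent samples (hypothetical stand-in data)
rng = np.random.default_rng(1)
n = 500
genome_length = rng.uniform(1.0, 10.0, size=n)
COX_count = rng.poisson(np.exp(0.2 + 0.15 * genome_length))

X_full = np.column_stack([np.ones(n), genome_length])  # b0 + b1 * genome_length
X_red = np.ones((n, 1))                                # reduced model: b0 only

ll_full = poisson_loglik(X_full, COX_count, fit_poisson(X_full, COX_count))
ll_red = poisson_loglik(X_red, COX_count, fit_poisson(X_red, COX_count))

# Likelihood ratio statistic; guard against tiny negative rounding error
lr_stat = max(2.0 * (ll_full - ll_red), 0.0)
# One extra parameter gives 1 degree of freedom; the upper tail of a
# chi-square(1) variable at x equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2.0))
print("LR statistic:", lr_stat, " p-value:", p_value)
```

The sum inside poisson_loglik is exactly the step that assumes independence: it adds up one log-likelihood term per data point.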
Sturla
