# [SciPy-User] Generalized least square on large dataset

josef.pktd@gmai... josef.pktd@gmai...
Sat Mar 10 07:57:01 CST 2012

```
On Sat, Mar 10, 2012 at 8:45 AM, Sturla Molden <sturla@molden.no> wrote:
> Den 09.03.2012 21:13, skrev josef.pktd@gmail.com:
>> I think Sturla has a point in that both count and length are positive.
>> It doesn't look like it's relevant for length, but in the counts there
>> is a bunching just above zero. This either creates a non-linearity or
>> requires another distribution: log-normal (?) or Poisson (without
>> zeros, or loc=1)? Josef
>
> You can see that the dependent variable is counts, with most of them
> below 10. So I maintain that the appropriate model is Poisson regression.
>
> That is,
>
>    COX_count ~ Poisson(lambda)
>
> with
>
>    log(lambda) = b0 + b1 * genome_length
>
> Or if there are N groups of bacteria,
>
>    log(lambda) = b[0] + b[1] * genome_length
>          + np.dot(b[2:N+1], group[0:N-1])
>
> with N-1 dummy indicator variables in the vector "group".
>
> One could of course consider even more complicated models, such as
> interaction terms between bacterial group and genome length. It's just a
> matter of adding in the appropriate predictor variables.
>
> Normally, the p-value of a Poisson regression model can be inferred from
> the likelihood ratio against a reduced model if samples are independent.
>
> But if samples are not independent, one cannot assume that the total
> log-likelihood for the whole data is the sum of log-likelihoods for each
> data point. So Peter would need to derive a correction for this. I
> cannot be more specific because I don't know the specifics about how
> this between-sample dependency is generated. Perhaps Peter could explain it?

He explained the between-sample correlation in terms of similarity (my
analogy: autocorrelation in time series, or spatial correlation).

The main problem I see with using Poisson is that I wouldn't know how
to include the correlation.
I have never looked at this, and statsmodels doesn't implement it. (I
looked a bit at count processes for time series with serial
dependence, but not much.)
My guess is that a log-linear model or something like that would be easier.
Is there a multivariate version of Poisson with correlated
observations similar to GLS for the linear model?

Josef

>
>
> Sturla
```
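
For readers who want to try the single-predictor model quoted above, log(lambda) = b0 + b1 * genome_length, here is a minimal sketch in plain Python. It is not code from the thread. The data are simulated, and the names `simulate_counts`, `fit_poisson`, and `poisson_loglik`, the sample size, the length range, and the "true" coefficients are all assumptions made for illustration. The fit uses Newton-Raphson on the Poisson log-likelihood, and the p-value comes from a likelihood ratio test against an intercept-only model. That test assumes independent samples, which is exactly the assumption the thread calls into question, so it does not address the between-sample correlation.

```python
import math
import random


def simulate_counts(lengths, b0, b1, rng):
    """Draw one Poisson count per length with log(lambda) = b0 + b1 * x."""
    counts = []
    for x in lengths:
        lam = math.exp(b0 + b1 * x)
        # Knuth's multiplication method; adequate for the small means used here.
        limit, k, p = math.exp(-lam), 0, rng.random()
        while p > limit:
            k += 1
            p *= rng.random()
        counts.append(k)
    return counts


def poisson_loglik(counts, lengths, b0, b1):
    """Poisson log-likelihood, summed over observations (assumes independence)."""
    return sum(
        y * (b0 + b1 * x) - math.exp(b0 + b1 * x) - math.lgamma(y + 1)
        for x, y in zip(lengths, counts)
    )


def fit_poisson(counts, lengths, n_iter=50, tol=1e-10):
    """Newton-Raphson for the two-parameter log-linear Poisson model."""
    b0 = math.log(sum(counts) / len(counts))  # start at the intercept-only fit
    b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(lengths, counts):
            mu = math.exp(b0 + b1 * x)
            g0 += y - mu          # score with respect to b0
            g1 += x * (y - mu)    # score with respect to b1
            h00 += mu             # entries of the Fisher information matrix
            h01 += x * mu
            h11 += x * x * mu
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det  # solve the 2x2 system by hand
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Simulated stand-in data: hypothetical genome lengths (arbitrary units) and counts.
rng = random.Random(0)
lengths = [rng.uniform(1.0, 10.0) for _ in range(200)]
counts = simulate_counts(lengths, b0=0.0, b1=0.2, rng=rng)

b0_hat, b1_hat = fit_poisson(counts, lengths)

# Likelihood ratio test against the reduced model log(lambda) = b0.
ll_full = poisson_loglik(counts, lengths, b0_hat, b1_hat)
ll_null = poisson_loglik(counts, lengths, math.log(sum(counts) / len(counts)), 0.0)
lr_stat = max(2.0 * (ll_full - ll_null), 0.0)
# Chi-square survival function with 1 degree of freedom.
p_value = math.erfc(math.sqrt(lr_stat / 2.0))

print("b0 = %.3f, b1 = %.3f, LR = %.2f, p = %.3g" % (b0_hat, b1_hat, lr_stat, p_value))
```

Adding the N-1 group dummies from the quoted model would mean more columns in the design matrix and a general linear solve in place of the hand-written 2x2 step. The correlated case raised in the reply needs a different likelihood or estimating equation, which this sketch does not attempt.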