# [SciPy-User] Generalized least square on large dataset

Sturla Molden sturla@molden...
Fri Mar 9 09:47:03 CST 2012

```On 08.03.2012 05:25, Peter Cimermančič wrote:

> To describe my problem into more details, I have a list of ~1000
> bacterial genome lengths and number of certain genes for each one of
> them. I'd like to see if there is any correlation between genome lengths
> and number of the genes.

> It may look like an easy linear regression
> problem;

No, it does not. If you are working with counts, the appropriate model
would usually be Poisson regression. I.e. Generalized linear model with
log-link function and Possion probability family. I have seen many
examples of microbiologists using linear regression when they should
actually use Poisson regression (e.g. counting genes) or logistic
regression (e.g. dose-response and titration curves).

This will do it for you:

MATLAB: glmfit from the statistics toolbox
R: glm
SAS: PROC GLIM
Python: statmodels scikit

Another example of inappropriate use of linear regression in
microbiology is the Lineweaver-Burk plot as substitute for non-linear
least-squares (usually Levenberg-Marquardt) to fit a Michelis-Menten
curve. Some microbiologists are bevare of this, but they seem to prefer
all sorts of ad hoc trickeries like linearizations and
variance-stabilizing transforms instead of "just doing it right".

As for samples that are not independent, that will affect the final
likelihood. If you want to optimize the log-likelhood yourself, to
control for this, getting ML estimates by maximizing the log-likelhood
is easy with fmin_powell or fmin_bgfs from scipy.optimize. (Powell's
method does not even need the gradient.) And if you need the "p-value",
you can either use the likelihood ratio or Monte Carlo (e.g. permutation
test).

Sturla

P.S. I think biostatistics courses that biologists are tought do not
cover the tools that are most commonly needed. Ronald Fisher (famous for
multiple regression and ANOVA) worked with quantitative genetics (e.g.
animal and plant breeding). But today most biologists work in molecular
biology labs, and other methods for data analysis are often needed. That
includes generalized linear models, non-linear regression, image
processing (for microscopy), and general signal processing (e.g.
electrophysiology).

```