[SciPy-User] Generalized least square on large dataset

Sturla Molden sturla@molden...
Fri Mar 9 09:47:03 CST 2012


On 08.03.2012 05:25, Peter Cimermančič wrote:

> To describe my problem into more details, I have a list of ~1000
> bacterial genome lengths and number of certain genes for each one of
> them. I'd like to see if there is any correlation between genome lengths
> and number of the genes.

> It may look like an easy linear regression
> problem;

No, it does not. If you are working with counts, the appropriate model 
would usually be Poisson regression. I.e. Generalized linear model with 
log-link function and Possion probability family. I have seen many 
examples of microbiologists using linear regression when they should 
actually use Poisson regression (e.g. counting genes) or logistic 
regression (e.g. dose-response and titration curves).

This will do it for you:

MATLAB: glmfit from the statistics toolbox
R: glm
SAS: PROC GLIM
Python: statmodels scikit

Another example of inappropriate use of linear regression in 
microbiology is the Lineweaver-Burk plot as substitute for non-linear 
least-squares (usually Levenberg-Marquardt) to fit a Michelis-Menten 
curve. Some microbiologists are bevare of this, but they seem to prefer 
all sorts of ad hoc trickeries like linearizations and 
variance-stabilizing transforms instead of "just doing it right".

As for samples that are not independent, that will affect the final 
likelihood. If you want to optimize the log-likelhood yourself, to 
control for this, getting ML estimates by maximizing the log-likelhood 
is easy with fmin_powell or fmin_bgfs from scipy.optimize. (Powell's 
method does not even need the gradient.) And if you need the "p-value", 
you can either use the likelihood ratio or Monte Carlo (e.g. permutation 
test).


Sturla


P.S. I think biostatistics courses that biologists are tought do not 
cover the tools that are most commonly needed. Ronald Fisher (famous for 
multiple regression and ANOVA) worked with quantitative genetics (e.g. 
animal and plant breeding). But today most biologists work in molecular 
biology labs, and other methods for data analysis are often needed. That 
includes generalized linear models, non-linear regression, image 
processing (for microscopy), and general signal processing (e.g. 
electrophysiology).


























More information about the SciPy-User mailing list