[SciPy-Dev] Deprecate stats.glm?

Nathaniel Smith njs@pobox....
Thu Jun 3 16:20:40 CDT 2010


On Thu, Jun 3, 2010 at 12:55 PM, Bruce Southey <bsouthey@gmail.com> wrote:
> On 06/03/2010 11:16 AM, Nathaniel Smith wrote:
>> But the general linear model is basically identical to a simple linear
>> model, both in interface and implementation.
> Depends what you mean by 'simple'. Stealing from the SAS manual, these
> are some of the models fitted by the GLM procedure which I would not
> call simple:
> simple regression
> multiple regression
> analysis of variance (ANOVA), especially for unbalanced data
> analysis of covariance
> multivariate analysis of variance (MANOVA)
> weighted regression
> polynomial regression

Well, I didn't mean to start an argument; certainly 'simple' is
underdefined, and there's a lot of conceptual richness to the linear
model framework. (And perhaps some obfuscation from the historical
tendency to use different names for mathematically equivalent ideas
when used in different contexts.) But at the implementation level,
everything in the above list is (1) solved in about 2 lines of code,
(2) they're the same 2 lines of code for all of them. Making a
friendly interface is more complicated than that, of course, but
that's just more reason to do it once for all of them, instead of
piece-meal.

> response surface models
> partial correlation

These I'm not sure about off-hand.

> repeated measures analysis of variance

And this is a very complicated area; I know of at least 3 totally
different approaches (traditional repeated measures ANOVA with or
without sphericity corrections, MANOVAs on contrasts, and multi-level
mixed-effect modelling), and none is very simple. I assume SAS has
picked one to implement (probably the first?). But these issues are
totally orthogonal to whether a "linear model" is "general" (in fact,
I don't know how to apply *any* of these techniques to *general*
linear models, i.e., multivariate ones, and would very much appreciate
references if you have them!). So I don't see how this argues for
treating "general linear models" separately from "linear models".

> These include interactions...
>>   There's no reason to have
>> a separate function for it, one should just accept a matrix for the
>> "y" variable in the OLS code. But *generalized* linear models are
>> different in interface, implementation, and are almost as much of a
>> stats workhorse as standard linear models. So every book I've ever
>> seen uses the abbreviation "glm" to refer to the generalized version.
>> (Also, this is what R calls the function ;-).)
>>
> Yeah, it is interesting that you forget older statistical packages (SAS,
> SPSS, don't remember what Genstat did ) and the first GLIM (the first?
> generalized linear model package).

I didn't forget them; I've just never used them.

Can I also mention that I'm finding your tone quite combative and
off-putting? If I've offended you somehow then I apologize, and would
appreciate hearing why.

If those packages have useful ideas, then I'm interested in hearing
them, but just hearing the list of names unfortunately doesn't give me
much to go on.

>> The implementation of dummy coding is kind of useful, but this is the
>> wrong place and the wrong name...
>>
> Why?
> That is exactly what is needed and what stats.glm does.

I'm sorry, I don't know how to explain better. My statement is that
dummy coding is (1) useful, (2) neither called "glm" in any context,
nor in any way specific to the general linear model. Do you disagree
with any of this...?

>> (Also, its least squares implementation calls inv -- the textbook
>> example of bad numerics!)
>>
> Actually it should call pinv() here but you going to have to prove that
> this is 'bad numerics'! Especially given how the numpy computes it and
> that design matrices tend to have poor numerics to start with
> (especially if you do anova and use condition number to assess
> numerics). [I strong dislike people complaining of the apparent bad
> numerics just because they see the word inverse.]

Not sure I follow here either. If design matrices have poor numerics
to start with, then that's exactly the case where forming the inverse
is *bad*! If not, then it doesn't make much difference either way, but
since it's no more effort to write code that is both faster and more
robust, doing otherwise is just irresponsible in a widely-used
library. But in any case, this was a side point.

>> ...Okay, you know all that anyway, the question is what to do with it.
>> If the problem were just that it needed a better implementation and
>> some new features added, then maybe we would keep it and let it be
>> improved incrementally. But the interface is just wrong, so we'll be
>> removing it sooner or later, and it might as well be sooner, rather
>> than prolong the agony.
>
> The simple reason is that there is no alternative for users to use yet
> such as pystatsmodels.

Well, and this is a philosophical difference, I guess. Personally, as
a user, if given the choice between a stats library that was missing
many things, but everything there was well-engineered, reliable,
documented, etc., versus one that technically had more code but half
the things I started to use turned out to be broken, or do something
similar-but-different from what I expected, or just weren't
documented, then I would choose the first library, no question. And
I'd be more likely to contribute to make it more complete, too. It's
just easier to work in an area that's not cluttered with broken
machinery. "Add missing stuff following existing style" is a much
easier goal to work on than "pick your way through the rubble to find
usable pieces and cobble stuff out of them". But that's just me; I can
see your perspective too, and don't know the SciPy community's
preference.

-- Nathaniel


More information about the SciPy-Dev mailing list