[Numpy-discussion] consensus (was: NA masks in the next numpy release?)

Matthew Brett matthew.brett@gmail....
Fri Oct 28 16:43:07 CDT 2011


On Fri, Oct 28, 2011 at 2:41 PM, Charles R Harris
<charlesr.harris@gmail.com> wrote:
> On Fri, Oct 28, 2011 at 3:16 PM, Nathaniel Smith <njs@pobox.com> wrote:
>> On Tue, Oct 25, 2011 at 2:56 PM, Travis Oliphant <oliphant@enthought.com>
>> wrote:
>> > I think Nathaniel and Matthew provided very specific feedback that was
>> > helpful in understanding other perspectives of a difficult problem. In
>> > particular, I really wanted bit-patterns implemented. However, I also
>> > understand that Mark did quite a bit of work and altered his original
>> > designs quite a bit in response to community feedback. I wasn't a major
>> > part of the pull request discussion, nor did I merge the changes, but I
>> > support Charles if he reviewed the code and felt like it was the right
>> > thing to do. I likely would have done the same thing rather than let
>> > Mark Wiebe's work languish.
>> My connectivity is spotty this week, so I'll stay out of the technical
>> discussion for now, but I want to share a story.
>> Maybe a year ago now, Jonathan Taylor and I were debating what the
>> best API for describing statistical models would be -- whether we
>> wanted something like R's "formulas" (which I supported), or another
>> approach based on sympy (his idea). To summarize, I thought his API
>> was confusing, pointlessly complicated, and didn't actually solve the
>> problem; he thought R-style formulas were superficially simpler but
>> hopelessly confused and inconsistent underneath. Now, obviously, I was
>> right and he was wrong. Well, obvious to me, anyway... ;-) But it
>> wasn't like I could just wave a wand and make his arguments go away,
>> no matter how annoying and wrong-headed I thought they were... I could
>> write all the code I wanted but no-one would use it unless I could
>> convince them it's actually the right solution, so I had to engage
>> with him, and dig deep into his arguments.
>> What I discovered was that (as I thought) R-style formulas *do* have a
>> solid theoretical basis -- but (as he thought) all the existing
>> implementations *are* broken and inconsistent! I'm still not sure I
>> can actually convince Jonathan to go my way, but, because of his
>> stubbornness, I had to invent a better way of handling these formulas,
>> and so my library[1] is actually the first implementation of these
>> things that has a rigorous theory behind it, and in the process it
>> avoids two fundamental, decades-old bugs in R. (And I'm not sure the R
>> folks can fix either of them at this point without breaking a ton of
>> code, since they both have API consequences.)
>> --
>> It's extremely common for healthy FOSS projects to insist on consensus
>> for almost all decisions, where consensus means something like "every
>> interested party has a veto"[2]. This seems counterintuitive, because
>> if everyone's vetoing all the time, how does anything get done? The
>> trick is that if anyone *can* veto, then vetoes turn out to actually
>> be very rare. Everyone knows that they can't just ignore alternative
>> points of view -- they have to engage with them if they want to get
>> anything done. So you get buy-in on features early, and no vetoes are
>> necessary. And by forcing people to engage with each other, like me
>> with Jonathan, you get better designs.
>> But what about the cost of all that code that doesn't get merged, or
>> written, because everyone's spending all this time debating instead?
>> Better designs are nice and all, but how does that justify letting
>> working code languish?
>> The greatest risk for a FOSS project is that people will ignore you.
>> Projects and features live and die by community buy-in. Consider the
>> "NA mask" feature right now. It works (at least the parts of it that
>> are implemented). It's in mainline. But IIRC, Pierre said last time
>> that he doesn't think the current design will help him improve or
>> replace numpy.ma. Up-thread, Wes McKinney is leaning towards ignoring
>> this feature in favor of his library pandas' current hacky NA support.
>> Members of the neuroimaging crowd are saying that the memory overhead
>> is too high and the benefits too marginal, so they'll stick with NaNs.
>> Together these folk are a huge proportion of this feature's target
>> audience. So what have we actually accomplished by merging this to
>> mainline? Are we going to be stuck supporting a feature that only a
>> fraction of the target audience actually uses? (Maybe they're being
>> dumb, but if people are ignoring your code for dumb reasons... they're
>> still ignoring your code.)
>> The consensus rule forces everyone to do the hardest and riskiest part
>> -- building buy-in -- up front. Because you *have* to do it sooner or
>> later, and doing it sooner doesn't just generate better designs. It
>> drastically reduces the risk of ending up in a huge trainwreck.
>> --
>> In my story at the beginning, I wished I had a magic wand to skip this
>> annoying debate and political stuff. But giving it to me would have
>> been a bad idea. I think that's what went wrong with the NA discussion in
>> the first place. Mark's an excellent programmer, and he tried his best
>> to act in the good of everyone in the project -- but in the end, he
>> did have a wand like that. He didn't have that sense that he *had* to
>> get everyone on board (even the people who were saying dumb things),
>> or he'd just be wasting his time. He didn't ask Pierre if the NA
>> design would actually work for numpy.ma's purposes -- I did.
>> You may have noticed that I do have some ideas about how NA
>> support should work. But my ideas aren't really the important thing.
>> The alter-NEP was my attempt to find common ground between the
>> different needs people were bringing up, so we could discuss whether
>> it would work for people or not. I'm not wedded to anything in it. But
>> this is a complicated issue with a lot of conflicting interests, and
>> we need to find something that actually does work for everyone (or as
>> large a subset as is practical).
>> So here's what I think we should do:
>>  1) I will submit a pull request backing Mark's NA work out of
>> mainline, for now. (This is more or less done, I just need to get it
>> onto github, see above re: connectivity)
>>  2) I will also put together a new branch containing that work,
>> rebased against current mainline, so it doesn't get lost. (Ditto.)
>>  3) And we'll decide what to do with it *after* we hammer out a
>> design that the various NA-supporting groups all find convincing. Or
>> at least a design for some of the less controversial pieces (like the
>> 'where=' ufunc argument?), get those merged, and then iterate
>> incrementally.
>> What do you all think?
> Why don't you and Matthew work up an alternative implementation so we can
> compare the two?

Do you have comments on the changes I suggested?


