[Numpy-discussion] consensus (was: NA masks in the next numpy release?)

Matthew Brett matthew.brett@gmail....
Fri Oct 28 16:56:23 CDT 2011


Hi,

On Fri, Oct 28, 2011 at 2:43 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
> Hi,
>
> On Fri, Oct 28, 2011 at 2:41 PM, Charles R Harris
> <charlesr.harris@gmail.com> wrote:
>>
>>
>> On Fri, Oct 28, 2011 at 3:16 PM, Nathaniel Smith <njs@pobox.com> wrote:
>>>
>>> On Tue, Oct 25, 2011 at 2:56 PM, Travis Oliphant <oliphant@enthought.com>
>>> wrote:
>>> > I think Nathaniel and Matthew provided very
>>> > specific feedback that was helpful in understanding other perspectives
>>> > of a
>>> > difficult problem.  In particular, I really wanted bit-patterns
>>> > implemented.  However, I also understand that Mark did quite a bit of
>>> > work
>>> > and altered his original designs quite a bit in response to community
>>> > feedback.  I wasn't a major part of the pull request discussion, nor
>>> > did I
>>> > merge the changes, but I support Charles if he reviewed the code and
>>> > felt
>>> > like it was the right thing to do.  I likely would have done the same
>>> > thing
>>> > rather than let Mark Wiebe's work languish.
>>>
>>> My connectivity is spotty this week, so I'll stay out of the technical
>>> discussion for now, but I want to share a story.
>>>
>>> Maybe a year ago now, Jonathan Taylor and I were debating what the
>>> best API for describing statistical models would be -- whether we
>>> wanted something like R's "formulas" (which I supported), or another
>>> approach based on sympy (his idea). To summarize, I thought his API
>>> was confusing, pointlessly complicated, and didn't actually solve the
>>> problem; he thought R-style formulas were superficially simpler but
>>> hopelessly confused and inconsistent underneath. Now, obviously, I was
>>> right and he was wrong. Well, obvious to me, anyway... ;-) But it
>>> wasn't like I could just wave a wand and make his arguments go away,
>>> no matter how annoying and wrong-headed I thought they were... I could
>>> write all the code I wanted but no-one would use it unless I could
>>> convince them it's actually the right solution, so I had to engage
>>> with him, and dig deep into his arguments.
>>>
>>> What I discovered was that (as I thought) R-style formulas *do* have a
>>> solid theoretical basis -- but (as he thought) all the existing
>>> implementations *are* broken and inconsistent! I'm still not sure I
>>> can actually convince Jonathan to go my way, but, because of his
>>> stubbornness, I had to invent a better way of handling these formulas,
>>> and so my library[1] is actually the first implementation of these
>>> things that has a rigorous theory behind it, and in the process it
>>> avoids two fundamental, decades-old bugs in R. (And I'm not sure the R
>>> folks can fix either of them at this point without breaking a ton of
>>> code, since they both have API consequences.)
>>>
>>> --
>>>
>>> It's extremely common for healthy FOSS projects to insist on consensus
>>> for almost all decisions, where consensus means something like "every
>>> interested party has a veto"[2]. This seems counterintuitive, because
>>> if everyone's vetoing all the time, how does anything get done? The
>>> trick is that if anyone *can* veto, then vetoes turn out to actually
>>> be very rare. Everyone knows that they can't just ignore alternative
>>> points of view -- they have to engage with them if they want to get
>>> anything done. So you get buy-in on features early, and no vetoes are
>>> necessary. And by forcing people to engage with each other, like me
>>> with Jonathan, you get better designs.
>>>
>>> But what about the cost of all that code that doesn't get merged, or
>>> written, because everyone's spending all this time debating instead?
>>> Better designs are nice and all, but how does that justify letting
>>> working code languish?
>>>
>>> The greatest risk for a FOSS project is that people will ignore you.
>>> Projects and features live and die by community buy-in. Consider the
>>> "NA mask" feature right now. It works (at least the parts of it that
>>> are implemented). It's in mainline. But IIRC, Pierre said last time
>>> that he doesn't think the current design will help him improve or
>>> replace numpy.ma. Up-thread, Wes McKinney is leaning towards ignoring
>>> this feature in favor of his library pandas' current hacky NA support.
>>> Members of the neuroimaging crowd are saying that the memory overhead
>>> is too high and the benefits too marginal, so they'll stick with NaNs.
>>> Together these folks make up a huge proportion of this feature's target
>>> audience. So what have we actually accomplished by merging this to
>>> mainline? Are we going to be stuck supporting a feature that only a
>>> fraction of the target audience actually uses? (Maybe they're being
>>> dumb, but if people are ignoring your code for dumb reasons... they're
>>> still ignoring your code.)
>>>
>>> The consensus rule forces everyone to do the hardest and riskiest part
>>> -- building buy-in -- up front. Because you *have* to do it sooner or
>>> later, and doing it sooner doesn't just generate better designs. It
>>> drastically reduces the risk of ending up in a huge trainwreck.
>>>
>>> --
>>>
>>> In my story at the beginning, I wished I had a magic wand to skip this
>>> annoying debate and political stuff. But giving it to me would have
>>> been a bad idea. I think that's what went wrong with the NA discussion in
>>> the first place. Mark's an excellent programmer, and he tried his best
>>> to act for the good of everyone in the project -- but in the end, he
>>> did have a wand like that. He didn't have that sense that he *had* to
>>> get everyone on board (even the people who were saying dumb things),
>>> or he'd just be wasting his time. He didn't ask Pierre if the NA
>>> design would actually work for numpy.ma's purposes -- I did.
>>>
>>> You may have noticed that I do have some ideas about how NA
>>> support should work. But my ideas aren't really the important thing.
>>> The alter-NEP was my attempt to find common ground between the
>>> different needs people were bringing up, so we could discuss whether
>>> it would work for people or not. I'm not wedded to anything in it. But
>>> this is a complicated issue with a lot of conflicting interests, and
>>> we need to find something that actually does work for everyone (or as
>>> large a subset as is practical).
>>>
>>> So here's what I think we should do:
>>>  1) I will submit a pull request backing Mark's NA work out of
>>> mainline, for now. (This is more or less done, I just need to get it
>>> onto github, see above re: connectivity)
>>>  2) I will also put together a new branch containing that work,
>>> rebased against current mainline, so it doesn't get lost. (Ditto.)
>>>  3) And we'll decide what to do with it *after* we hammer out a
>>> design that the various NA-supporting groups all find convincing. Or
>>> at least a design for some of the less controversial pieces (like the
>>> 'where=' ufunc argument?), get those merged, and then iterate
>>> incrementally.
>>>
>>> What do you all think?
>>>
>>
>> Why don't you and Matthew work up an alternative implementation so we can
>> compare the two?
>
> Do you have comments on the changes I suggested?

This was too short and a little rude.  I'm sorry.

I was reacting to what I perceived as intolerance for discussing the
issues, and I may be wrong in that perception.

I think what Nathaniel is saying is that it is not in the best
interests of numpy to push through code on which there is not good
agreement.  In reverting the change, he is, I think, appealing for a
commitment to that process, for the good of numpy.

I have in the past taken some of your remarks to imply that if someone
is prepared to write code then that overrides most potential
disagreement.

The reason I think Nathaniel is the more right is that most of us,
I believe, do honestly have the interests of numpy at heart, want
to fully understand the problem, and are prepared to be proven wrong.
In that situation, in my experience of writing code at least, by far
the most fruitful way to proceed is by letting all voices be heard.
On the other hand, if the rule becomes 'unless I see an implementation
I'm not listening to you' - then we lose the great benefits, to the
code, of having what is fundamentally a good and strong community.

Best,

Matthew

