[Numpy-discussion] consensus (was: NA masks in the next numpy release?)

Matthew Brett matthew.brett@gmail....
Sat Oct 29 19:20:31 CDT 2011


Hi,

On Sat, Oct 29, 2011 at 11:14 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
> On Fri, Oct 28, 2011 at 9:32 PM, Charles R Harris
> <charlesr.harris@gmail.com> wrote:
>>
>>
>> On Fri, Oct 28, 2011 at 6:45 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>>
>>> On Fri, Oct 28, 2011 at 7:53 PM, Benjamin Root <ben.root@ou.edu> wrote:
>>> >
>>> >
>>> > On Friday, October 28, 2011, Matthew Brett <matthew.brett@gmail.com>
>>> > wrote:
>>> >> Hi,
>>> >>
>>> >> On Fri, Oct 28, 2011 at 4:21 PM, Ralf Gommers
>>> >> <ralf.gommers@googlemail.com> wrote:
>>> >>>
>>> >>>
>>> >>> On Sat, Oct 29, 2011 at 12:37 AM, Matthew Brett
>>> >>> <matthew.brett@gmail.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> On Fri, Oct 28, 2011 at 3:14 PM, Charles R Harris
>>> >>>> <charlesr.harris@gmail.com> wrote:
>>> >>>> >
>>> >>>> >
>>> >>>> > On Fri, Oct 28, 2011 at 3:56 PM, Matthew Brett
>>> >>>> > <matthew.brett@gmail.com>
>>> >>>> > wrote:
>>> >>>> >>
>>> >>>> >> Hi,
>>> >>>> >>
>>> >>>> >> On Fri, Oct 28, 2011 at 2:43 PM, Matthew Brett
>>> >>>> >> <matthew.brett@gmail.com>
>>> >>>> >> wrote:
>>> >>>> >> > Hi,
>>> >>>> >> >
>>> >>>> >> > On Fri, Oct 28, 2011 at 2:41 PM, Charles R Harris
>>> >>>> >> > <charlesr.harris@gmail.com> wrote:
>>> >>>> >> >>
>>> >>>> >> >>
>>> >>>> >> >> On Fri, Oct 28, 2011 at 3:16 PM, Nathaniel Smith
>>> >>>> >> >> <njs@pobox.com>
>>> >>>> >> >> wrote:
>>> >>>> >> >>>
>>> >>>> >> >>> On Tue, Oct 25, 2011 at 2:56 PM, Travis Oliphant
>>> >>>> >> >>> <oliphant@enthought.com>
>>> >>>> >> >>> wrote:
>>> >>>> >> >>> > I think Nathaniel and Matthew provided very specific
>>> >>>> >> >>> > feedback that was helpful in understanding other
>>> >>>> >> >>> > perspectives of a difficult problem.  In particular, I
>>> >>>> >> >>> > really wanted bit-patterns implemented.  However, I also
>>> >>>> >> >>> > understand that Mark did quite a bit of work and altered
>>> >>>> >> >>> > his original designs quite a bit in response to community
>>> >>>> >> >>> > feedback.  I wasn't a major part of the pull request
>>> >>>> >> >>> > discussion, nor did I merge the changes, but I support
>>> >>>> >> >>> > Charles if he reviewed the code and felt like it was the
>>> >>>> >> >>> > right thing to do.  I likely would have done the same
>>> >>>> >> >>> > thing rather than let Mark Wiebe's work languish.
>>> >>>> >> >>>
>>> >>>> >> >>> My connectivity is spotty this week, so I'll stay out of the
>>> >>>> >> >>> technical discussion for now, but I want to share a story.
>>> >>>> >> >>>
>>> >>>> >> >>> Maybe a year ago now, Jonathan Taylor and I were debating
>>> >>>> >> >>> what the best API for describing statistical models would be
>>> >>>> >> >>> -- whether we wanted something like R's "formulas" (which I
>>> >>>> >> >>> supported), or another approach based on sympy (his idea). To
>>> >>>> >> >>> summarize, I thought his API was confusing, pointlessly
>>> >>>> >> >>> complicated, and didn't actually solve the problem; he
>>> >>>> >> >>> thought R-style formulas were superficially simpler but
>>> >>>> >> >>> hopelessly confused and inconsistent underneath. Now,
>>> >>>> >> >>> obviously, I was right and he was wrong. Well, obvious to me,
>>> >>>> >> >>> anyway... ;-) But it wasn't like I could just wave a wand and
>>> >>>> >> >>> make his arguments go away,
>>> >>>> >> >>> no [...]
>>> >>
>>> >> I should point out that the implementation hasn't - as far as I can
>>> >> see - changed the discussion.  The discussion was about the API.
>>> >> Implementations are useful for agreed APIs because they can point out
>>> >> where the API does not make sense or cannot be implemented.  In this
>>> >> case, the API Mark said he was going to implement - he did implement -
>>> >> at least as far as I can see.  Again, I'm happy to be corrected.
>>> >>
>>> >>>> In saying that we are insisting on our way, you are saying,
>>> >>>> implicitly, 'I am not going to negotiate'.
>>> >>>
>>> >>> That is only your interpretation. The observation that Mark
>>> >>> compromised quite a bit while you didn't seems largely correct to me.
>>> >>
>>> >> The problem here stems from our inability to work towards agreement,
>>> >> rather than standing on set positions.  I set out what changes I think
>>> >> would make the current implementation OK.  Can we please, please have
>>> >> a discussion about those points instead of trying to argue about who
>>> >> has given more ground?
>>> >>
>>> >>> That commitment would of course be good. However, even if that were
>>> >>> possible before writing code and everyone agreed that the ideas of
>>> >>> you and Nathaniel should be implemented in full, it's still not
>>> >>> clear that either of you would be willing to write any code.
>>> >>> Agreement without code still doesn't help us very much.
>>> >>
>>> >> I'm going to return to Nathaniel's point - it is a highly valuable
>>> >> thing to set ourselves the target of resolving substantial discussions
>>> >> by consensus.   The route you are endorsing here is 'implementor
>>> >> wins'.   We don't need to do it that way.  We're a mature sensible
>>> >> bunch of adults who can talk out the issues until we agree they are
>>> >> ready for implementation, and then implement.  That's all Nathaniel is
>>> >> saying.  I think he's obviously right, and I'm sad that it isn't as
>>> >> clear to y'all as it is to me.
>>> >>
>>> >> Best,
>>> >>
>>> >> Matthew
>>> >>
>>> >
>>> > Everyone, can we please not do this?! I had enough of adults doing
>>> > finger pointing back over the summer during the whole debt ceiling
>>> > debate.  I think we can all agree that we are better than the US
>>> > congress?
>>> >
>>> > Forget about rudeness or decision processes.
>>> >
>>> > I will start by saying that I am willing to separate ignore and absent,
>>> > but only on the write side of things.  On read, I want a single way to
>>> > identify the missing values.  I also want only a single way to perform
>>> > calculations (either skip or propagate).
>>> >
>>> > An indicator of success would be that people stop using NaNs and magic
>>> > numbers (-9999, anyone?) and we could even deprecate nansum(), or at
>>> > least strongly suggest in its docs to use NA.
>>>
>>> Well, I haven't completely made up my mind yet, will have to do some
>>> more prototyping and playing (and potentially have some of my users
>>> eat the differently-flavored dogfood), but I'm really not very
>>> satisfied with the API at the moment. I'm mainly worried about the
>>> abstraction leaking through to pandas users (this is a pretty large
>>> group of people judging by # of downloads).
>>>
>>> The basic position I'm in is that I'm trying to push Python into a new
>>> space, namely mainstream data analysis and statistical computing, one
>>> that is solidly occupied by R and other such well-known players. My
>>> target users are not computer scientists. They are not going to invest
>>> in understanding dtypes very deeply or the internals of ndarray. In
>>> fact I've spent a great deal of effort making it so that pandas users
>>> can be productive and successful while having very little
>>> understanding of NumPy. Yes, I essentially "protect" my users from
>>> NumPy because using it well requires a certain level of sophistication
>>> that I think is unfair to demand of people. This might seem totally
>>> bizarre to some of you but it is simply the state of affairs. So far I
>>> have been successful because more people are using Python and pandas
>>> to do things that they used to do in R. The NA concept in R is dead
>>> simple and I don't see why we are incapable of also implementing
>>> something that is just as dead simple. To us, the scipy elite let's
>>> call us, it seems simple: "oh, just pass an extra flag to all my array
>>> constructors!" But this along with the masked array concept is going
>>> to have two likely outcomes:
>>>
>>> 1) Create a great deal more complication in my already very large codebase
>>>
>>> and/or
>>>
>>> 2) force pandas users to understand the new masked arrays after I've
>>> carefully made it so they can be largely ignorant of NumPy
>>>
>>> The mostly-NaN-based solution I've cobbled together and tweaked over
>>> the last 42 months actually *works really well*, amazingly, with
>>> relatively little cost in code complexity. Having found a reasonably
>>> stable equilibrium I'm extremely resistant to upset the balance.
>>>
>>> So I don't know. After watching these threads bounce back and forth
>>> I'm frankly not all that hopeful about a solution arising that
>>> actually addresses my needs.
>>
>> But Wes, what *are* your needs? You keep saying this, but we need examples
>> of how you want to operate and how numpy fails. As to dtypes, internals, and
>> all that, I don't see any of that in the current implementation, unless you
>> mean the maskna and skipna keywords. I believe someone on the previous
>> thread mentioned a way to deal with that.
>>
>> Chuck
>>
>
> Here are my needs:
>
> 1) How NAs are implemented cannot be end user visible. Having to pass
> maskna=True is a problem. I suppose a solution is to set the flag to
> true on every array inside of pandas so the user never knows (you
> mentioned someone else had some other solution, I could go back and
> dig it up?)

I guess this would be the same with bitpatterns, in that the user
would have to specify a custom dtype.
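
To make the "set it inside the library so the user never knows" option
concrete, here is a rough sketch.  It uses numpy.ma as a stand-in for the
maskna flag, and make_series is a hypothetical helper invented for this
example, not pandas or numpy API.  The only point is that the caller passes
plain values and never sees how missingness is stored.

```python
import numpy as np

def make_series(values, missing=None):
    """Hypothetical library-side constructor: the caller passes plain
    values; the library decides how missing entries are represented."""
    arr = np.asarray(values, dtype=np.float64)
    # Treat NaN as missing, plus an optional sentinel such as -9999.
    mask = np.isnan(arr)
    if missing is not None:
        mask |= (arr == missing)
    # Stand-in for "set the flag on every array internally".
    return np.ma.array(arr, mask=mask)

s = make_series([1.0, -9999.0, 3.0, float("nan")], missing=-9999.0)
total = s.sum()        # masked entries are skipped in the reduction
n_present = s.count()  # number of unmasked entries
```

The same wrapper shape would apply to a custom bitpattern dtype: the
library picks the dtype, not the end user.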

Is it possible to add a bitpattern NA (in the NaN values) to the
current floating point types, at least in principle?  So that np.float
etc would have bitpattern NAs without a custom dtype?
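
Here is a minimal sketch of what I mean, assuming we reserve one particular
NaN payload as "NA".  The payload value, NA_BITS, set_na and is_na are all
made up for illustration; nothing like this exists in numpy as far as I
know.  The sketch only touches the bits through an integer view, because
IEEE 754 does not promise that a payload survives arithmetic.

```python
import numpy as np

# Hypothetical NA marker: a quiet NaN with an arbitrary payload in the
# low bits, so it is distinct from the default np.nan bit pattern.
NA_BITS = np.uint64(0x7FF80000000007A2)

def set_na(arr, index):
    """Write the NA bit pattern into a float64 array in place."""
    arr.view(np.uint64)[index] = NA_BITS

def is_na(arr):
    """True only where the element has exactly the NA bit pattern."""
    return arr.view(np.uint64) == NA_BITS

data = np.array([1.0, np.nan, 3.0])
set_na(data, 2)

na_flags = is_na(data)      # distinguishes NA from an ordinary NaN
nan_flags = np.isnan(data)  # existing code still sees NA as a NaN
total = np.nansum(data)     # NaN-aware reductions skip both
```

So code that knows about NA can tell it apart from an ordinary NaN, while
code that does not know about it just sees another NaN.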

See you,

Matthew

