[Numpy-discussion] consensus (was: NA masks in the next numpy release?)
Sun Oct 30 01:19:52 CDT 2011
On Oct 29, 2011, at 7:24 PM, Eric Firing wrote:
> On 10/29/2011 12:57 PM, Charles R Harris wrote:
>> On Sat, Oct 29, 2011 at 4:47 PM, Eric Firing wrote:
>> On 10/29/2011 12:02 PM, Olivier Delalleau wrote:
>>> I haven't been following the discussion closely, but wouldn't it
>> be instead:
>>> a.mask[0:2] = True?
>> That would be consistent with numpy.ma and the
>> opposite of Mark's
>> I can live with either, but I much prefer the numpy.ma version because
>> it fits with the use of bit-flags for editing data; set bit 1 if it
>> fails check A, set bit 2 if it fails check B, etc. So, if it evaluates
>> as True, there is a problem, and the value is masked *out*.
>> Similarly, in Mark's implementation, 7 bits are available for a payload
>> to describe what kind of masking is meant. This seems more consistent
>> with True as masked (or NA) than with False as masked.
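For anyone following the convention being argued over, here is a minimal sketch of the "True means masked out" reading using numpy.ma, the only one of the two implementations whose API is described in this thread. The flag names, thresholds, and sample values are hypothetical and chosen only for illustration; this is not Mark's in-core API.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical quality-control flags, one bit per failed check.
FAIL_CHECK_A = 0b01  # bit 1: value failed check A
FAIL_CHECK_B = 0b10  # bit 2: value failed check B

data = np.array([10.0, 11.0, 500.0, 12.0, -999.0])
flags = np.zeros(data.shape, dtype=np.uint8)

# Check A: value above a plausible maximum (threshold is illustrative).
flags[data > 100.0] |= FAIL_CHECK_A
# Check B: value equals a sentinel meaning "no reading" (also illustrative).
flags[data == -999.0] |= FAIL_CHECK_B

# Any nonzero flag evaluates as True, so that value is masked *out*.
a = ma.masked_array(data, mask=flags != 0)

# The mask can also be edited directly, as in the a.mask[0:2] = True example.
a.mask[0:2] = True

# Only unmasked values contribute to reductions.
valid_mean = a.mean()
```

With this reading, "is there a problem with this value?" and "is this value masked?" are the same boolean test, which is the consistency being described above.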
>> I wouldn't rely on the 7 bits yet. Mark left them available to keep open
>> possible future use, but didn't implement anything using them yet. If
>> memory use turns out to exclude whole sectors of application we will
>> have to go to bit masks.
> Right; I was only commenting on a subjective sense of internal
> consistency. A minor point.
> The larger context of all this is how users end up being able to work
> with all the different types and specifications of "NA" (in the most
> general sense) data:
> 1) nans
> 2) numpy.ma
> 3) masks in the core (Mark's new code)
> 4) bit patterns
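As a rough illustration of how the first two items in this list differ in practice, the sketch below computes the same mean with NaN and with numpy.ma. The sample values and the use of -1 as a "missing" sentinel are assumptions made up for the example. Items 3 and 4 are left out, since this thread describes them as incomplete or not yet coded.

```python
import numpy as np
import numpy.ma as ma

# 1) NaN: the missing marker lives inside the data itself,
#    so it is only available for floating-point arrays.
floats = np.array([1.0, 2.0, -1.0, 4.0])
floats[floats == -1.0] = np.nan      # -1.0 is a hypothetical sentinel
nan_mean = np.nanmean(floats)        # NaN-aware reduction skips the NaN

# 2) numpy.ma: a separate boolean mask travels with the data,
#    so integer arrays can carry missing values as well.
ints = np.array([1, 2, -1, 4])
masked_ints = ma.masked_equal(ints, -1)
ma_mean = masked_ints.mean()         # masked entries are skipped
```

Both reductions ignore the missing entry, but only the masked version keeps the integer dtype and leaves the original value recoverable underneath the mask.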
> Substantial code now in place--including matplotlib--relies on numpy.ma.
> It has some rough edges, it can be slow, it is a pain having it as a
> bolted-on module, it may be more complicated than it needs to be, but it
> fits a lot of use cases pretty well. There are many users. Everyone
> using matplotlib is using it, whether they know it or not.
> The ideal from my numpy.ma-user's standpoint would be an NA-handling
> implementation in the core that would do two things:
> (1) allow a gradual transition away from numpy.ma, so that the latter
> would become redundant.
> (2) allow numpy.ma to be reasonably easily modified to use the in-core
> facilities for greater efficiency during the long transition. Implicit
> is the hope that someone (most likely not me, although I might be able
> to help a bit) would actually perform this modification.
> Mark's mission, paid for by Enthought, was not to please numpy.ma users,
> but to add NA-handling that would be comfortable for R-users. He chose
> to do so with the idea that two possible implementations (masks and
> bitpatterns) were desirable, each with strengths and weaknesses, and
> that so as to get *something* done in the very short time he had left,
> he would start with the mask implementation. We now have the result,
> incomplete, but not breaking anything. Additional development (coding
> as well as designing) will be needed.
> The main question raised by Matthew and Nathaniel is, I think, whether
> Mark's code should develop in a direction away from the R-compatibility
> model, with the idea that the latter would be handled via a bit-pattern
> implementation, some day, when someone codes it; or whether it should
> remain as the prototype and first implementation of an API to handle the
> R-compatible use case, minimizing any divergence from any eventual
> bit-pattern implementation.
> The answer to this depends on several questions, including:
> 1) Who is available to do how much implementation of any of the
> possibilities? My reading of Travis's blog and rare posts to this list
> suggests that he hopes and expects to be able to free up coding time.
> Perhaps he will clarify that soon.
> 2) What sorts of changes would actually be needed to make the present
> implementation good enough for the R use case? Evolutionary, or
> revolutionary?
> 3) What sorts of changes would help with the numpy.ma use case?
> Evolutionary, or revolutionary.
> 4) Given available resources, how can we maximize progress: making numpy
> more capable, easier to use, etc.
> Unless the answers to questions 2 *and* 3 are "revolutionary", I don't
> see the point in pulling Mark's changes out of master. At most, the
> documentation might be changed to mark the NA API as "experimental" for
> a release or two.
I appreciate Nathaniel's idea to pull the changes and I can respect his desire to do that. It seemed like there was a lot more heat than light in the discussion this summer. The differences seemed to be inflamed by the discussion instead of illuminated by it. Perhaps that is why Nathaniel felt like merging Mark's pull request was too strong-armed and not a proper resolution.
However, I did not interpret Matthew's or Nathaniel's explanations of their positions as manipulative or inappropriate. Nonetheless, I don't think removing Mark's changes is a productive direction to take at this point. I agree, it would have been much better to reach a rough consensus before the code was committed. At least, those who felt like their ideas were not accounted for should have felt like there was some plan to either accommodate them, or some explanation of why that was not a good idea. The only thing I recall being said was that there was nobody to implement their ideas. I wish that weren't the case. I think we can still continue to discuss their concerns and look for ways to reasonably incorporate their use-cases if possible.
I have probably contributed in the past to the idea that "he who writes the code gets the final say". In early-stage efforts, this is approximately right, but the success of anything relies on satisfied users, and as projects mature the voice of users becomes, in my mind, more relevant than the voice of contributors. I've certainly had to learn that in terms of ABI changes to NumPy.
Personally, I am very, very interested in users of NumPy and their ideas about how things should be done. I have my own use cases from my experience, but I've always found that the code is better if it incorporates use-cases of others. In the end, I'm much more interested in users of NumPy and their use-cases and experience than even in contributors. Historically, contributors to NumPy have been scarce and development slow. I am working to change that right now. I will say more when I have more to say in that direction.
To be clear, in this particular case I know that there are multiple users, and as best I can tell there is some disagreement between those users about the appropriate APIs. But this disagreement is actually lost in some of the discussion. In fact, it seems to me that the different perspectives are not all that different and there ought to be a way to work it out. Perhaps this is hopeless naivete, but it's my current perspective.
I really appreciate the efforts of people who have been active on NumPy development and maintenance for the past 4 years. I also appreciate the activity of all the users of NumPy: matplotlib, Pandas, scikits, SciPy, statsmodels, and so on. The larger NumPy community is much broader than the discussions that take place on this list (or even on the SciPy list). I have seen NumPy in use in a lot of places over the past 4 years. I have also seen NumPy *not* in use where it really could be (with some adaptations).
I'm still hopeful that we will continue to make this forum a place where even "just users" of NumPy always feel able to raise their voice and say, "Hey, I wish things were done this way." It is rare when all voices can be satisfied, of course, but a priori it is worth a college try. If anything I hope for emerges, the user-base of NumPy will be growing significantly over the coming months and years and I really hope this list continues to be a place where I can be comfortable sending them.
More to come,