[Numpy-discussion] NEP mask code and the 1.7 release
Charles R Harris
Sun Apr 22 19:49:44 CDT 2012
On Sun, Apr 22, 2012 at 6:26 PM, Charles R Harris <email@example.com> wrote:
> On Sun, Apr 22, 2012 at 4:15 PM, Nathaniel Smith <firstname.lastname@example.org> wrote:
>> We need to decide what to do with the NA masking code currently in
>> master, vis-a-vis the 1.7 release. While this code is great at what it
>> is, we don't actually have consensus yet that it's the best way to
>> give our users what they want/need -- or even an appropriate way. So
>> we need to figure out how to release 1.7 without committing ourselves
>> to supporting this design in the future.
>> Background: what does the code currently in master do?
>> It adds 3 pointers at the end of the PyArrayObject struct (which is
>> better known as the numpy.ndarray object). These new struct members,
>> and some accessors for them, are exposed as part of the public API.
>> There are also a few additions to the Python-level API (mask= argument
>> to np.array, skipna= argument to ufuncs, etc.)
>> What does this mean for compatibility?
>> The change in the ndarray struct is not as problematic as it might
>> seem, compatibility-wise, since Python objects are almost always
>> referred to by pointers. Since the initial part of the struct will
>> continue to have the same memory layout, existing source and binary
>> code that works with PyArrayObject *pointers* will continue to work.
>> One place where the actual struct size matters is for any C-level
>> ndarray subclasses, which will have their memory layout change, and
>> thus will need to be recompiled. (Python-level ndarray subclasses will
>> have their memory layout change as well -- e.g., they will have
>> different __dictoffset__ values -- but it's unlikely that any existing
>> Python code depends on such details.)
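
A minimal sketch of the layout argument above, using ctypes to model the structs. The field names are hypothetical, simplified stand-ins and are not copied from the NumPy headers; the only point is that appending pointers leaves the leading offsets alone while shifting the fields of anything that embeds the struct, as a C-level subclass would.

```python
import ctypes

# Hypothetical, heavily simplified stand-ins for the real struct.
class OldArray(ctypes.Structure):
    _fields_ = [
        ("data", ctypes.c_void_p),
        ("nd", ctypes.c_int),
        ("flags", ctypes.c_int),
    ]

class NewArray(ctypes.Structure):
    # Same leading fields, plus three pointers appended at the end.
    _fields_ = OldArray._fields_ + [
        ("mask_a", ctypes.c_void_p),
        ("mask_b", ctypes.c_void_p),
        ("mask_c", ctypes.c_void_p),
    ]

def c_subclass(base):
    # A C-level subclass embeds the base struct and appends its own fields.
    class Sub(ctypes.Structure):
        _fields_ = [("base", base), ("extra", ctypes.c_int)]
    return Sub

# Code that only follows a pointer to the leading fields sees the same offsets.
same_prefix = all(
    getattr(OldArray, name).offset == getattr(NewArray, name).offset
    for name, _ in OldArray._fields_
)

# The struct grows by three pointers, so the subclass field moves by that much.
growth = ctypes.sizeof(NewArray) - ctypes.sizeof(OldArray)
old_extra = c_subclass(OldArray).extra.offset
new_extra = c_subclass(NewArray).extra.offset
print(same_prefix, growth, new_extra - old_extra)
```

A subclass binary compiled against the old layout would still read `extra` at the old offset, which is why it needs a recompile.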
>> What if we want to change our minds later?
>> For the same reasons as given above, any new code which avoids
>> referencing the new struct fields referring to masks, or using the new
>> masking APIs, will continue to work even if the masking is later removed.
>> Any new code which *does* refer to the new masking APIs, or references
>> the fields directly, will break if masking is later removed.
>> Specifically, source will fail to compile, and existing binaries will
>> silently access memory that is past the end of the PyArrayObject
>> struct, which will have unpredictable consequences. (Most likely
>> segfaults, but no guarantees.) This applies even to code which simply
>> tries to check whether a mask is present.
>> So I think the preconditions for leaving this code as-is for 1.7 are
>> that we must agree:
>> * We are willing to require a recompile of any C-level ndarray
>> subclasses (do any exist?)
>> * We are willing to make absolutely no guarantees about future
>> compatibility for code which uses APIs marked "experimental"
>> * We are willing for this breakage to occur in the form of random segfaults
>> * We are okay with the extra 3 pointers worth of memory overhead on
>> each ndarray
>> Personally I can live with all of these if everyone else can, but I'm
>> nervous about reducing our compatibility guarantees like that, and
>> we'd probably need, at a minimum, a flashier EXPERIMENTAL sign than we
>> currently have. (Maybe we should resurrect the weasels ;-) )
>> Any other options?
>> Alternative 1: The obvious other option is to go through and move all
>> the strictly mask-related code out of master and into a branch.
>> Presumably this wouldn't include all the infrastructure that Mark
>> added, since a lot of it is e.g. shared with where=, and that would
>> stay. Even so, this would be a big and possibly time-consuming change.
>> Alternative 2: After auditing the code a bit, the cleanest third
>> option I can think of is:
>> 1. Go through and make sure that all numpy-internal access to the new
>> maskna fields happens via the accessor functions. (This patch would
>> produce no functionality change.)
>> 2. Move the accessors into some numpy-internal header file, so that
>> user code can't call them.
>> 3. Remove the mask= argument to Python-level ndarray constructors,
>> remove the new maskna_ fields from PyArrayObject, and modify the
>> accessors so that they always return NULL, 0, etc., as if the array
>> does not have a mask.
>> This would make 1.7 completely compatible with 1.6 API and ABI-wise.
>> But it would also be a minimal code change, leaving the mask-related
>> code paths in place but inaccessible. If we decided to re-enable them,
>> it would just be a matter of reverting steps (3) and (2).
>> The main downside I see with this approach is that leaving a bunch of
>> inaccessible code paths lying around might make it harder to maintain
>> 1.7 as a "long term support" release.
>> I'm personally willing to implement either of these changes. Or
>> perhaps there's another option that I'm not thinking of!
> I'm not deeply invested in the current version of masked NA. OTOH, code
> development usually goes through several cycles of implementation and
> trial. My own rule of thumb is that everything needs to be rewritten three
> times, which in fact has happened with Numpy with Numeric and Numarray as
> precursors. I think a fourth rewrite of much of the code is going to happen
> in the future. What I do disagree with is the idea that everything has to
> be planned and designed up front based on consensus. I prefer a certain
> amount of trial and error leading to evolution. Numpy does need some way to
> experiment, and unless someone is willing to develop and maintain separate
> trees (which happened in Linux), there needs to be some wiggle room. Which
> is why I proposed a LTS release.
> In any case, I think a good topic for discussion is what we have learned
> from the current prototype exclusive of politics. Hopefully you have used
> it and can give us some feedback based on your own experience. I'd also
> like to hear from anyone else who is using it at the moment. Then we can
> discuss at a technical level what should be changed, alternative APIs, and
> what works/sucks about what we have. I thought there were some good points
> along those lines made in the thread following Travis' first post.
To expand on the last, I think that if we are going to have masked arrays,
all arrays should be masked, but that an actual mask doesn't get allocated
until it is used, so the masked keyword would go away. Ignored values also
need better support, i.e., erasure. Re the ndarray structure, it needs to
be hidden at some point and that has been under discussion since 1.3, but
doing that will take time and needs planning and a preliminary timeline
since it will affect a lot of people. I'm thinking several years at least.
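
As a rough illustration of the "mask doesn't get allocated until it is used" idea, here is a minimal pure-Python sketch. The class and method names are hypothetical, and it says nothing about how this would be done inside ndarray itself.

```python
class LazyMaskedArray:
    """Hypothetical container: every instance can be masked, lazily."""

    def __init__(self, values):
        self.values = list(values)
        self._mask = None  # nothing allocated until a value is first masked

    def mask_at(self, index):
        if self._mask is None:
            # First use: allocate an all-zero mask (nothing masked yet).
            self._mask = bytearray(len(self.values))
        self._mask[index] = 1

    def is_masked(self, index):
        return self._mask is not None and bool(self._mask[index])

    def visible_sum(self):
        if self._mask is None:
            # Fast path for arrays that never had anything masked.
            return sum(self.values)
        return sum(v for v, m in zip(self.values, self._mask) if not m)

a = LazyMaskedArray([1.0, 2.0, 3.0])
before = a.visible_sum()   # no mask exists at this point
a.mask_at(1)               # the mask is allocated here
after = a.visible_sum()
print(before, after)
```

Arrays that never mask anything pay only for one extra reference, and no constructor keyword is needed to opt in.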
OT, it also seems that several folks would like more efficient small
arrays. It might be worth devoting some time to profiling the current code
and removing bottlenecks.