[Numpy-discussion] Missing data again

Nathaniel Smith njs@pobox....
Tue Mar 6 15:51:45 CST 2012


On Tue, Mar 6, 2012 at 9:14 PM, Ralf Gommers
<ralf.gommers@googlemail.com> wrote:
> On Tue, Mar 6, 2012 at 9:25 PM, Nathaniel Smith <njs@pobox.com> wrote:
>> On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant <travis@continuum.io>
>> wrote:
>> > Hi all,
>>
>> Hi Travis,
>>
>> Thanks for bringing this back up.
>>
>> Have you looked at the summary from the last thread?
>>  https://github.com/njsmith/numpy/wiki/NA-discussion-status
>
> Re-reading that summary and the main documents and threads linked from it, I
> could find either examples of statistical software that treats missing and
> ignored data explicitly separately, or links to relevant literature. Those
> would probably help the discussion a lot.

(I think you mean "couldn't find"?)

I'm not aware of any software that supports the IGNORED concept at
all, whether in combination with missing data or not. np.ma is
probably the closest example. I think we'd be breaking new ground
there. This is also probably why it is less clear how it should work
:-).

IIUC, the basic reason that people want IGNORED in the core is that it
provides convenience and syntactic sugar for efficient "in place"
operation on subsets of large arrays. So there are actually two parts
there -- the efficient operation, and the convenience/syntactic sugar.
The key feature for efficient operation is the where= feature, which
is not controversial at all. So, there's an argument that for now we
should focus on where=, give people some time to work with it, and
then use that experience to decide what kind of convenience/sugar
would be useful, if any. But, that's just my own idea; I definitely
can't claim any consensus on it.
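
For concreteness, here is a minimal sketch of the kind of "in place"
operation on a subset that where= allows. It assumes a NumPy version
whose ufuncs accept the where= keyword; the array contents and the
names `values`, `mask`, and `other` are made up for illustration and
do not come from the thread.

```python
import numpy as np

# Hypothetical data: an array and a boolean selection of the elements
# we want to operate on.
values = np.arange(10, dtype=float)
mask = values % 2 == 0

# Ufunc form: compute values * 10 only where mask is True and write the
# result back into values. Elements where mask is False are left alone,
# because out= already holds their old contents.
np.multiply(values, 10, out=values, where=mask)

# The same result via boolean indexing, which gathers the selected
# elements into a temporary array before writing them back.
other = np.arange(10, dtype=float)
other[mask] *= 10

print(values)
```

Note that out= matters here: without it, the positions where the mask
is False would be left uninitialized in a freshly allocated result.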

>> In project management terms, I see three options:
>> 1) Put a big warning label on the functionality and leave it for now
>> ("If this option is given, np.asarray returns a masked array. NOTE: IN
>> THE NEXT RELEASE, IT MAY INSTEAD RETURN A BAG OF RABID, HUNGRY
>> WEASELS. NO GUARANTEES.")
>
> I've opened http://projects.scipy.org/numpy/ticket/2072 for that.

Cool, thanks.

> Assuming
> we stick with this option, I'd appreciate it if you could check in the first
> beta that comes out whether or not the warnings are obvious enough and in
> all the right places. There probably won't be weasels though:)

Of course. I've added myself to the CC list. (Err, if the beta won't
be for a bit, though, then please remind me if you remember? I'm
juggling a lot of balls right now.)

>> 2) Move the code back out of mainline and into a branch until
>> there's consensus.
>> 3) Hold up the release until this is all sorted.
>>
>> I come from the project-management school that says you should always
>> have a releasable mainline, keep unready code in branches, and never
>> hold up the release for features, so (2) seems obvious to me.
>
> While it may sound obvious, I hope you've understood why in practice it's
> not at all obvious and why you got such strong reactions to your proposal of
> taking out all that code. If not, just look at what happened with the
> numpy-refactor work.

Of course, and that's why I'm not pressing the point. These trade-offs
might be worth talking about at some point -- there are reasons that
basically all the major FOSS projects have moved towards time-based
releases :-) -- but that'd be a huge discussion at a time when we
already have more than enough of those on our plate...

>> But I seem to be very much in the minority on that[1], so oh well :-). I
>> don't have any objection to (1), personally. (3) seems like a bad
>> idea. Just my 2 pence.
>
>
> Agreed that (3) is a bad idea. +1 for (1).
>
> Ralf

Cheers,
-- Nathaniel

