[Numpy-discussion] Missing data again

Nathaniel Smith njs@pobox....
Wed Mar 7 17:10:51 CST 2012


On Wed, Mar 7, 2012 at 7:37 PM, Charles R Harris
<charlesr.harris@gmail.com> wrote:
>
>
> On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith <njs@pobox.com> wrote:
>> When it comes to "missing data", bitpatterns can do everything that
>> masks can do, are no more complicated to implement, and have better
>> performance characteristics.
>>
>
> Maybe for float; for other things, no. And we have lots of other things.

It would be easier to discuss this if you'd, like, discuss :-(. If you
know of some advantage that masks have over bitpatterns when it comes
to missing data, can you please share it, instead of just asserting
it?

Not that I'm immune... I perhaps should have been more explicit
myself. When I said "performance characteristics", I was thinking of
both speed (for floats) and memory (for most-but-not-all things).
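
To make the memory point concrete, here is a minimal sketch in plain
Python (standard library only, not NumPy's implementation). The NA
payload in NA_BITS, the one-byte-per-element mask layout, and the
helper names is_na and nbytes are all assumptions made for
illustration.

```python
import struct
from array import array

# Hypothetical NA marker: a quiet NaN with an arbitrary payload. This value is
# an assumption for the sketch, not something NumPy defines.
NA_BITS = 0x7FF80000000007A2
NA = struct.unpack("<d", struct.pack("<Q", NA_BITS))[0]


def is_na(x):
    """True only for the reserved NA bit pattern, not for ordinary values."""
    return struct.unpack("<Q", struct.pack("<d", x))[0] == NA_BITS


def nbytes(arr):
    """Size of the array's element buffer in bytes."""
    return len(arr) * arr.itemsize


n = 1000

# Bitpattern approach: the NA marker lives inside the data buffer itself.
bitpattern_data = array("d", [1.0] * n)
bitpattern_data[3] = NA

# Mask approach: a parallel one-byte-per-element buffer marks missing entries.
masked_data = array("d", [1.0] * n)
mask = array("b", [0] * n)
mask[3] = 1

bitpattern_bytes = nbytes(bitpattern_data)
masked_bytes = nbytes(masked_data) + nbytes(mask)
print(bitpattern_bytes, masked_bytes)
```

Under these assumptions the mask adds one byte per element on top of
the data: 12.5% extra for 8-byte floats, but a doubling for 1-byte
integers. The bitpattern version adds nothing, at the cost of giving up
one representable value.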

> The
> performance is a strawman,

How many users need to speak up to say that this is a serious problem
they have with the current implementation before you stop calling it a
strawman? Because when Wes says that it's not going to fly for his
stats/econometrics cases, and the neuroimaging folk like Gary and Matt
say it's not going to fly for their use cases... surely just waving
that away is a bit dismissive?

I'm not saying that we *have* to implement bitpatterns because
performance is *the most important feature* -- I'm just saying, well,
what I said. For *missing data* use cases, bitpatterns have better
performance characteristics than masks. If we decide that these use
cases are important, then we should take this into account and weigh
it against other considerations. Maybe what you think is that these
use cases shouldn't be the focus of this feature and it should focus
on the "ignored" use cases instead? That would be a legitimate
argument... but if that's what you want to say, say it, don't just
dismiss your users!

> and it *isn't* easier to implement.

If I thought bitpatterns would be easier to implement, I would have
said so... What I said was that they're not harder. You have some
extra complexity, mostly in casting, and some reduced complexity -- no
need to allocate and manipulate the mask. (E.g., simple same-type
assignments and slicing require special casing for masks, but not for
bitpatterns.) In many places the complexity is identical -- printing
routines need to check for either special bitpatterns or masked
values, whatever. Ufunc loops need to either find the appropriate part
of the mask, or create a temporary mask buffer by calling a dtype
func, whatever. On net they seem about equivalent, complexity-wise.
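
The same-type assignment and slicing point can be sketched in plain
Python (standard library only). Everything here is hypothetical: the
choice of -128 as the NA value for signed bytes, and the toy
MaskedArray1D class with its slice and assign methods, are assumptions
for illustration and not NumPy's API.

```python
from array import array

# Hypothetical: reserve the minimum signed-byte value as the NA bit pattern.
NA = -128

# Bitpattern: NA is stored in the data itself, so ordinary slicing and
# same-type assignment carry it along with no extra code.
data = array("b", [1, NA, 3, 4])
sub = data[1:3]
dest = array("b", [0, 0, 0, 0])
dest[0:2] = sub


class MaskedArray1D:
    """Toy masked container: a data buffer plus a parallel mask buffer."""

    def __init__(self, data, mask):
        self.data = array("b", data)
        self.mask = array("b", mask)

    def slice(self, start, stop):
        # Special case: the mask must be sliced in lockstep with the data.
        return MaskedArray1D(self.data[start:stop], self.mask[start:stop])

    def assign(self, start, other):
        # Special case: assignment must copy the mask as well as the data.
        stop = start + len(other.data)
        self.data[start:stop] = other.data
        self.mask[start:stop] = other.mask


masked = MaskedArray1D([1, 2, 3, 4], [0, 1, 0, 0])
masked_sub = masked.slice(1, 3)
masked_dest = MaskedArray1D([0, 0, 0, 0], [0, 0, 0, 0])
masked_dest.assign(0, masked_sub)
```

The extra work on the mask side is small, and the bitpattern side pays
for it elsewhere (casting between types, for instance), which is why
the two come out roughly even.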

...I assume you disagree with this analysis, since I've said it
before, written up a sketch for how the implementation would work at the
C level, etc., and you continue to claim that simplicity is a
compelling advantage for the masked approach. But I still don't know
why you think that :-(.

>> > Also, different folks adopt different values
>> > for 'missing' data, and distributing one or several masks along with the
>> > data is another common practice.
>>
>> True, but not really relevant to the current debate, because you have
>> to handle such issues as part of your general data import workflow
>> anyway, and none of these is any more complicated no matter which
>> implementations are available.
>>
>> > One inconvenience I have run into with the current API is that it should be
>> > easier to clear the mask from an "ignored" value without taking a new view
>> > or assigning known data. So maybe two types of masks (different payloads),
>> > or an additional flag could be helpful. The process of assigning masks could
>> > also be made a bit easier than using fancy indexing.
>>
>> So this, uh... this was actually the whole goal of the "alterNEP"
>> design for masks -- making all this stuff easy for people (like you,
>> apparently?) that want support for ignored values, separately from
>> missing data, and want a nice clean API for it. Basically having a
>> separate .mask attribute which was an ordinary, assignable array
>> broadcastable to the attached array's shape. Nobody seemed interested
>> in talking about it much then but maybe there's interest now?
>>
>
> Come off it, Nathaniel, the problem is minor and fixable. The intent of the
> initial implementation was to discover such things.

Implementation can be wonderful, I absolutely agree. But you
understand that I'd be more impressed by this example if your
discovery weren't something I had been arguing for since before the
implementation began :-).

> These things are less
> accessible with the current API *precisely* because of the feedback from R
> users. It didn't start that way.
>
> We now have something to evolve into what we want. That is a heck of a lot
> more useful than endless discussion.

No, you are still missing the point completely! There is no "what *we*
want", because what you want is different than what I want. The
masking stuff in the alterNEP was an attempt to give people like you
who wanted "ignored" support what they wanted, and the bitpattern
stuff was to satisfy people like me who want "missing data" support.
The NEP took a different approach to trying to make everyone happy...
unfortunately it sounds like it made no-one happy. Blaming the R users
for this isn't *wrong*, exactly, but it's a bit one-sided.

If you have a proposal for how the current code can be "evolved" into
something that will make the neuro/econ/stats people happy, then
please tell us. But I don't see how it's possible, and your current
proposals are going in the wrong direction. Unless we can actually
talk about these disagreements, we're just going to have more endless
discussion.

-- Nathaniel