[Numpy-discussion] Medians that ignore values

Anne Archibald peridot.faceted@gmail....
Sat Sep 20 00:02:16 CDT 2008


2008/9/19 David Cournapeau <david@ar.media.kyoto-u.ac.jp>:

> I guess my formulation was poor: I never use NaN as missing values
> because I never use missing values, which is why I wanted the opinion of
> people who use NaN in a different manner (because I don't have a good
> idea of how those people would like to see numpy behave). I was
> certainly not arguing that they should not be used for the purpose of
> missing values.

I, on the other hand, was making specifically that suggestion: users
should not use nans to indicate missing values. Users should use
masked arrays to indicate missing values.

> The problem with NaN is that you cannot mix the missing value behavior
> and the error behavior. Dealing with them in a consistent manner is
> difficult. Because numpy is a general numerical computation tool, I
> think that NaN should be propagated and never ignored *by default*. If
> you have NaN because of divide by 0, etc... it should not be ignored at
> all. But if you want it to ignore, then numpy should make it possible:
>
>    - max, min: should return NaN if NaN is in the array, or maybe even
> fail?
>    - argmax, argmin?
>    - sort: should fail?
>    - mean, std, variance: should return NaN
>    - median: should fail (to be consistent with sort failing)? Should
> return NaN?

This part I pretty much agree with.
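
For concreteness, here is a quick probe of what the functions in question
do today when a NaN is present. Treat it as something to run locally rather
than a statement of what numpy guarantees; results can differ between NumPy
versions and platforms:

    import numpy as np

    a = np.array([1.0, np.nan, 3.0])

    print("max:   ", a.max())        # often nan, but historically unreliable
    print("argmax:", a.argmax())     # whatever index the comparisons land on
    print("sort:  ", np.sort(a))     # where the NaN lands is not guaranteed
    print("mean:  ", a.mean())       # nan propagates through the sum
    print("median:", np.median(a))   # depends on where sort puts the NaN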

> We could then add an argument to failing functions to tell them either
> to ignore NaN or to put them at some special location (like R does, for
> example). The ones I am not sure about are median and argmax/argmin. For
> median, failing when sort does is consistent, but this could break a lot
> of code. For argmin/argmax, failing is the most logical, but OTOH,
> making argmin/argmax fail but not max/min is not consistent either.
> Breaking the code is maybe not that bad because currently, neither
> max/min nor argmax/argmin nor sort returns a meaningful result.
> Does that sound reasonable to you?

The problem with this approach is that all those decisions need to be
made and all that code needs to be implemented for masked arrays. In
fact I suspect that it has already been done in that case. So really
what you are suggesting here is that we duplicate all this effort to
implement the same functions for nans as we have for masked arrays.
It's important, too, that the masked array implementation and the nan
implementation behave the same way, or users will become badly
confused. Who gets the task of keeping the two implementations in
sync?
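
For what it is worth, numpy.ma already expresses the "ignore" semantics
explicitly. A rough sketch, assuming your NumPy provides ma.masked_invalid
and ma.median (recent versions do):

    import numpy as np
    import numpy.ma as ma

    a = np.array([1.0, np.nan, 3.0, 7.0])

    # Explicitly declare NaN/inf entries to be missing data.
    m = ma.masked_invalid(a)

    print(ma.median(m))   # median of the unmasked entries only
    print(m.mean())       # likewise computed over the valid data
    print(m.max())        # max over the valid data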

The current situation is that numpy has two ways to indicate bad data
for floating-point arrays: nans and masked arrays. We can't get rid of
either: nans appear on their own, and masked arrays are the only way
to mark bad data in non-floating-point arrays. We can try to make them
behave the same, which will be a lot of work to provide redundant
capabilities. Or we can make them behave drastically differently.
Masked arrays clearly need to be able to handle masked values flexibly
and explicitly. So I think nans should be handled simply and
conservatively: propagate them if possible, raise if not.
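
One way to illustrate the distinction, using the error-state machinery that
already exists (np.errstate) rather than any new behaviour: a NaN already in
the data propagates quietly, while operations that would newly produce an
invalid result can be made to raise.

    import numpy as np

    a = np.array([1.0, np.nan, 3.0])

    # Existing NaNs propagate silently through arithmetic and reductions.
    print(a + 1.0)    # [ 2.  nan  4.]
    print(a.sum())    # nan

    # Operations that would create an invalid result can raise instead.
    with np.errstate(invalid='raise'):
        try:
            np.sqrt(np.array([-1.0]))
        except FloatingPointError:
            print("invalid operation raised, as requested")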

If users are concerned about performance, it's worth noting that on
some machines nans force a fallback to software floating-point
handling, with a corresponding very large performance hit. This
includes some but not all x86 (and I think x86-64) CPUs. How this
compares to the performance of masked arrays is not clear.
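
If someone wants numbers, an unscientific timing probe along these lines
would show the effect (or lack of one) on a given machine; the array size
and NaN density here are arbitrary:

    import numpy as np
    import timeit

    clean = np.random.rand(1000000)
    dirty = clean.copy()
    dirty[::100] = np.nan                  # sprinkle NaNs through the data
    masked = np.ma.masked_invalid(dirty)

    for name, arr in [("clean", clean), ("nans", dirty), ("masked", masked)]:
        t = timeit.timeit(lambda: arr.sum(), number=100)
        print(name, t)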

Anne
