[Numpy-discussion] Medians that ignore values

David Cournapeau david@ar.media.kyoto-u.ac...
Fri Sep 19 22:44:00 CDT 2008


Alan G Isaac wrote:
> On 9/19/2008 4:35 AM David Cournapeau apparently wrote:
>> I never use NaN as missing value
>
> What do you use?
>
> Recently I needed to fill a 2d array with values
> from computations that could "go wrong".
> I created an array of NaN and then replaced
> the elements where the computation produced
> a useful value.  I then applied ``nanmax``,
> to get the maximum of the useful values.
>
> What should I have done?

I guess my formulation was poor: I never use NaN as a missing value
because I never use missing values, which is why I wanted the opinion of
people who use NaN in a different manner (because I don't have a good
idea of how those people would like to see numpy behave). I was
certainly not arguing that NaN should not be used to represent missing
values.

The problem with NaN is that you cannot mix the missing-value behavior
and the error behavior: dealing with both in a consistent manner is
difficult. Because numpy is a general numerical computation tool, I
think that NaN should be propagated and never ignored *by default*. If
you have NaN because of a division by 0, etc., it should not be ignored
at all. But if you want NaN to be ignored, then numpy should make that
possible. For the default behavior:

    - max, min: should return NaN if NaN is in the array, or maybe even
      fail?
    - argmax, argmin?
    - sort: should fail?
    - mean, std, variance: should return NaN
    - median: should fail (to be consistent if sort fails)? Should
      return NaN?
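
To make the propagate-versus-ignore distinction concrete, here is a
minimal sketch. It assumes a recent NumPy release, whose behavior is not
necessarily that of the 2008 version discussed in this thread; the array
contents are made up for illustration.

```python
import numpy as np

# Illustrative data: one NaN among otherwise valid values.
values = np.array([1.0, np.nan, 3.0])

# Plain reductions propagate NaN in recent NumPy releases.
propagated_max = np.max(values)
propagated_mean = np.mean(values)

# The nan-prefixed variants explicitly skip NaN entries.
ignored_max = np.nanmax(values)
ignored_mean = np.nanmean(values)

print(propagated_max, propagated_mean)
print(ignored_max, ignored_mean)
```

The point is only that ignoring NaN is something the caller asks for by
name, while the plain functions do not silently drop anything.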

We could then add an argument to the failing functions to tell them
either to ignore NaN or to put them at some special location (as R does,
for example). The ones I am not sure about are median and argmax/argmin.
For median, failing when sort does is consistent, but this can break a
lot of code. For argmin/argmax, failing is the most logical, but OTOH,
making argmin/argmax fail and not max/min is not consistent either.
Breaking the code is maybe not that bad because currently, none of
max/min, argmax/argmin, or sort returns a meaningful result when NaN is
present. Does that sound reasonable to you?
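
As a rough sketch of what such an argument could look like for median:
the function name `median_with_policy` and the `nan_policy` argument with
its three values are hypothetical, chosen only for illustration, and are
not an existing or proposed numpy API.

```python
import numpy as np

def median_with_policy(a, nan_policy="raise"):
    """Hypothetical median that fails on NaN unless told otherwise."""
    if nan_policy not in ("raise", "propagate", "omit"):
        raise ValueError("unknown nan_policy: %r" % (nan_policy,))

    a = np.asarray(a, dtype=float).ravel()
    nan_mask = np.isnan(a)

    # No NaN present: every policy gives the ordinary median.
    if not nan_mask.any():
        return float(np.median(a))

    if nan_policy == "raise":
        # Default: refuse to compute a median over an unorderable value.
        raise ValueError("input contains NaN")
    if nan_policy == "propagate":
        # Treat NaN as an error indicator and pass it on.
        return float("nan")

    # nan_policy == "omit": treat NaN as a missing value and drop it.
    kept = a[~nan_mask]
    return float(np.median(kept)) if kept.size else float("nan")
```

With this, `median_with_policy([1.0, np.nan, 3.0])` raises, while passing
`nan_policy="omit"` computes the median of the remaining values.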

cheers,

David

