[Numpy-discussion] min() of array containing NaN
Joe Harrington
jh@physics.ucf....
Tue Aug 12 02:22:36 CDT 2008
Masked arrays are a bit clunky for something as simple and standard as
NaN handling. They also have the inverse of the standard truth sense,
at least as used in my field. 1 (or True) usually means the item is
allowed, not denied, so that you can multiply the mask by the data to
zero all bad values, add and subtract masks in sensible ways and get
what's expected, etc. For example, in the "stacked, masked mean"
image processing algorithm, you sum the data along an axis, sum the
masks along that axis, and divide the results to get the mean image
without bad pixels. This is much more accurate than taking a median,
and it admits of error analysis, which the median does not (easily).
While the standard truth sense is "just a ~ away", as Stefan pointed
out to me once, negating the mask is not acceptable if the image cube
is large and memory or speed is at issue, and it's also very prone to
bugs if you're negating everything all the time.
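To make the "stacked, masked mean" concrete, here is a minimal NumPy sketch. The function name, the array shapes, and the sample values are assumptions for illustration only. The mask uses the 1 = good sense described above, and the bad pixel is assumed to hold a finite value, since NaN times 0 is still NaN.

```python
import numpy as np

def stacked_masked_mean(cube, good):
    """Mean along axis 0 using only pixels flagged good (1 = allowed).

    Hypothetical helper: cube and good have the same shape, with the
    stacking axis first. Bad pixels must be finite for the multiply
    to zero them out.
    """
    good = good.astype(cube.dtype)
    total = (cube * good).sum(axis=0)   # sum of data with bad values zeroed
    count = good.sum(axis=0)            # number of good frames per pixel
    # Pixels with no good frame give 0/0, which becomes NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return total / count

# Three frames of a two-pixel image; 1000.0 is a bad (hot) pixel.
cube = np.array([[1.0, 2.0],
                 [3.0, 1000.0],
                 [5.0, 6.0]])
good = np.array([[1, 1],
                 [1, 0],
                 [1, 1]])
mean_image = stacked_masked_mean(cube, good)
print(mean_image)
```

With numpy.ma the mask sense is reversed (True = masked), so the same computation would need ~mask at each step, which is the extra negation discussed above.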
Further, with ma you have to convert to using an entirely different
and redundant set of routines instead of having the very standard
handling of NaNs found in our competitor programs, such as IDL. The
issue of not having an in-place method in ma was also raised earlier.
I'll add the difficulty of converting code if a standard thing like
NaN handling has to be simulated in multiple calls.
So, I endorse extending min() and all other statistical routines to
handle NaNs, possibly with a switch to turn it on if a suitably fast
algorithm cannot be found (which is competitor IDL's solution).
Certainly without a switch the default behavior should be to return
NaN, not to return some random value, if a NaN is present. Otherwise
the user may never know a NaN is present, and therefore has to check
every use for NaNs. That constant manual NaN checking costs more time
and is more error-prone than any numerical speed advantage is worth.
So, to sum up, proposed for statistical routines:
  if NaN is not present, return value
  if NaN is present, return NaN
  if NaN is present and nan=True, return value ignoring all NaNs
OR:
  if NaN is not present, return value
  if NaN is present, return value ignoring all NaNs
  if NaN is present and nan=True, return NaN
I'd prefer the latter. IDL does the former and it is a pain to do
/nan all the time. However, the latter might trip up the unwary,
whereas the former never does.
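The two alternatives can be sketched for min() as below. This is only an illustration: min_option_a, min_option_b, and the nan keyword are hypothetical names standing in for the proposed switch, not an existing NumPy interface. The sketch leans on np.isnan and np.nanmin, which do exist in NumPy.

```python
import numpy as np

def min_option_a(a, nan=False):
    """Former proposal: return NaN if any is present; nan=True ignores NaNs."""
    a = np.asarray(a, dtype=float)
    if nan:
        return np.nanmin(a)             # minimum over the non-NaN values
    return np.nan if np.isnan(a).any() else a.min()

def min_option_b(a, nan=False):
    """Latter proposal: ignore NaNs by default; nan=True returns NaN if present."""
    a = np.asarray(a, dtype=float)
    if nan:
        return np.nan if np.isnan(a).any() else a.min()
    return np.nanmin(a)

data = [3.0, np.nan, 1.0]
print(min_option_a(data), min_option_a(data, nan=True))
print(min_option_b(data), min_option_b(data, nan=True))
```

Either way the user gets a defined answer rather than one that depends on where the NaN sits in the array; the only difference is which behavior costs the extra keyword.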
This would apply at least to:
min
max
sum
prod
mean
median
std
and possibly many others.
--jh--