[Numpy-discussion] sorting -inf, nan, inf
tim.hochberg at ieee.org
Tue Sep 19 20:45:36 CDT 2006
Charles R Harris wrote:
> On 9/19/06, *A. M. Archibald* <peridot.faceted at gmail.com
> <mailto:peridot.faceted at gmail.com>> wrote:
> On 19/09/06, Charles R Harris <charlesr.harris at gmail.com
> <mailto:charlesr.harris at gmail.com>> wrote:
> > For floats we could use something like:
> > lessthan(a,b) := a < b || (a == nan && b != nan)
I believe this would have to be some sort of isnan macros since
everything compares not equal to nan. I forget the correct spelling at
the moment though,
> > Which would put all the nans at one end and might not add too
> much overhead.
> You could put an any(isnan()) out front and run this slower version
> only if there are any NaNs (also, you can't use == for NaNs, you have
> to use C isNaN). But I'm starting to see the wisdom in simply throwing
> an exception, since sorting is not well-defined with NaNs.
> Looks like mergesort can be modified to sort around the NaNs without
> too much trouble if there is a good isnan function available: just
> cause the pointers to skip over them. I see that most of the isnan
> stuff seems to be in the ufunc source and isn't terribly simple. Could
> be broken out into a separate include, I suppose.
> I still wonder if it is worth the trouble. As to raising an exception,
> I seem to recall reading somewhere that exception code tends to be
> expensive, I haven't done any benchmarks myself.
I'm still somewhat mystified by the desire to move the nans to one end
of the sorted object. I see two scenarios:
1. The user is not expecting to have nans in the data. In this case
moving the nans to end is not helpful. The resulting array is
still not sorted in the sense that a[i-1]<=a[i]<=a[i+1] does not
hold and thus is likely to break code that relies on the array
being sorted. The most prominent example of which is searchsorted.
In this case you really want to raise an exception if possible
since no good will come from letting the code continue to run. In
this case the time in involved in throwing and catching an
exception is irrelevant.
2. The user *is* expecting to have nans in the data. This is
presumably the case that the sorting-nans-to-the-end idea is aimed
at. So far at least the suggested use has been to sort and then
strip the nans. I suggest that a better approach is to test for
and strip the nans before the sort. For example:
# a is an array that may have some nans
# you can do this more pithily, but I'm aiming to minimize isnan calls
# note that this *sometimes* makes a copy.
nanmask = isnan(a)
a = a[not nanmask]
I presume that isnan is somewhat more expensive than the basic '<'
operator. In the proposed sort to end version we need N*log(N) isnan
calls versus just N in the above case. The sort to end case probably
won't look any cleaner than the above either since you still need to
count the nans to determine how many to strip.
Perhaps there's some use for the sort to end behaviour that I'm missing,
but the raise an exception behaviour sure looks a lot more appealing to me.
Here's a strawman proposal:
Sort the array. Then examine numpy.geterr()['invalid']. If it is not
'ignore', then check examine sometrue(isnan(thearray)). If the
latter is true then raise and error, issue a warning or call the
error reporting functioni as appropriate. Note that we always sort
the array to be consistent with the behaviour of the ufuncs that
proceed even when they end up raising an exception.
More information about the Numpy-discussion