[SciPy-dev] Statistics toolbox and nans
Pearu Peterson
pearu at cens.ioc.ee
Fri Nov 1 15:56:28 CST 2002
On Fri, 1 Nov 2002, Travis Oliphant wrote:
> >
> > On 1 Nov 2002, A.J. Rossini wrote:
> >
> > > >>>>> "travis" == Travis Oliphant <oliphant.travis at ieee.org> writes:
> > >
> > > travis> Hello developers.
> > > travis> What should we do about nan's and the stats toolbox. Stats is one
> > > travis> package where people may use nans to represent missing values.
> > >
> > > Yech. This is a hard issue, but NAN isn't the solution.
> >
> > I think so too that using NANs for representing missing values cannot be
> > reliable. There's too much weirdness going on with NaNs depending on the
> > local C library. For example, on linux
>
> Well, MATLAB is cross-platform and it uses NANs like this extensively. So
> I'm not sure I buy this argument.
The main problem is that Python does not support NANs. So anything weird
with NANs follows from that, like the example where NaN==1 gives 1 on linux
but 0 on win32. A few other examples:
>>> int(NaN)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: float too large to convert
>>> NaN/0
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ZeroDivisionError: float division
> > >>> nan=float('nan')
> > >>> nan==nan
> > 1
> > >>> nan==1
> > 1
>
> So don't use nan's that way. That's why we have isnan(x) to test where
> the nan's are in an array. This function should work on the platforms
> where scipy works.
>
> I agree that equality testing of nans against another float should not be
> used in an algorithm.
I know. But most scipy users will probably shoot themselves in the foot at
least once when trying to use the NaN provided by scipy. Many of them will
probably consider this a bug in SciPy/Python/...
My point is that it's not good to provide support for something
that does not work as expected. Sure, isnan(..) works, but everybody
would assume that NaN==.. works too.
For example, the codelet
    if r==2:
        r = 1
    else:
        r = 2
would behave unexpectedly if r happens to be NaN. On Linux r becomes 1,
while on Windows r becomes 2. To make the above codelet platform
independent, it should read
    if r==2 and not isnan(r):
        r = 1
    else:
        r = 2
Assuming that one is free to use NaN, or that scipy functions are
expected to return NaN (sometimes), then all scipy and user code would have
to be littered with these isnan checks; otherwise the code is potentially
buggy.
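The guard above can be wrapped in a small helper. This is only a sketch,
assuming a present-day Python 3 interpreter with the standard math.isnan;
the names eq_not_nan and branch are made up for illustration and are not
part of scipy.

```python
import math


def eq_not_nan(x, value):
    """Return True only if x is not NaN and compares equal to value."""
    # Test isnan first, so the result never depends on how the
    # platform compares a NaN against an ordinary number.
    return (not math.isnan(x)) and x == value


def branch(r):
    """The codelet from above, written with the explicit NaN guard."""
    if eq_not_nan(r, 2):
        return 1
    return 2
```

With this guard, branch(float('nan')) takes the else branch regardless of
what the raw == comparison would have returned.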
Note that the whole issue is just due to the fact that Python currently
does not support NaN values, and trying to support them in SciPy is too
painful if NaN is defined as in scipy_base. See below for
a possible alternative.
> > Tim Peters has explained these NAN issues several times on the
> > Usenet, google for 'Tim Peters NaN'.
>
> Sure, but he hasn't gone into enough detail. Matlab successfully does it
> so obviously it can be done (especially on modern machines that use
> IEEE754)
There's no doubt that it can be done; it just has not been done in Python
yet. For more discussion of where Python stands on the IEEE standard, see
http://www.python.org/dev/summary/2000-10-1.html
> > Since "all IEEE-754 behavior visible from Python is a platform-dependent
> > accident" [T.P.], I don't see that NaNs could be used in SciPy for
> > anything useful in a platform-independent way.
> > I would avoid using NaNs
> > and Infs as much as possible until they become less platform-dependent,
> > maybe by implementing special objects for Python instead of using
> > float('nan'), float('inf') (that even should not work on Win32).
>
> Right now, to me this is a straw man (a hypothetical argument).
>
> We already are supporting nan's in scipy. See what scipy_base.nan or
> scipy_base.inf gives you on your platform.
>
> I would prefer specific examples that show where what scipy is doing now
> is not working on specific platforms that we want to support, rather than
> general arguments that refer to T.P.'s apparent distaste of nan's. We
> have already borrowed heavily from the ideas T.P. espoused. Look deeper
> into scipy_base to see what I'm talking about.
>
> In short, I don't agree with the statements that nans don't or can't work.
nan's can work, but they do not work with the current Python; see
the example above.
I remember tracking down a difficult bug that turned out to be
caused by nothing more than the NaN==1 -> 1 "feature". It was difficult to
track down because all the tests seemingly passed, when actually one of
them should have failed, since the bug was producing a NaN result. As a
result I spent quite some time finding out which of the tests should have
failed... And this convinced me not to use NaNs with the current Python.
> Now, I agree that treating missing values using NaNs is somewhat of a
> kludge. And there are perhaps better ways to handle it. It is a rather
> efficient kludge that works much of the time.
>
> Even if you don't officially bless nan's as "missing values," if they
> ever show up in your calculation, they essentially are missing values and
> the question still remains as to how to deal with them (should you ignore
> them or let them ruin the rest of your calculation?)
The interpretation of nan's is application dependent. They can be
interpreted as "missing values", but they are certainly not "essentially
missing values". NaN means "Not a Number" or "Not any Number", and it is
given as the result of an operation whose argument is out of range.
See
http://www.cs.berkeley.edu/~wkahan/ieee754status/ieee754.ps
http://grouper.ieee.org/groups/754/
Depending on the situation, certain nan's can be ignored, but ignoring
nan's can also ruin the whole calculation.
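The two choices (let a nan propagate, or treat it as missing and skip it)
can be put side by side. The following is a minimal sketch, assuming a
present-day Python 3 with IEEE 754 float arithmetic; mean_propagate and
mean_ignore_nan are hypothetical names, not scipy functions.

```python
import math


def mean_propagate(values):
    """Plain arithmetic mean; a single NaN makes the result NaN."""
    return sum(values) / len(values)


def mean_ignore_nan(values):
    """Mean over the non-NaN entries only (NaN treated as 'missing')."""
    kept = [v for v in values if not math.isnan(v)]
    if not kept:
        # Nothing left to average: report NaN rather than guess.
        return float('nan')
    return sum(kept) / len(kept)


data = [1.0, float('nan'), 3.0]
```

Which of the two is right depends on where the nan came from: skipping a
genuinely missing observation is reasonable, while skipping a nan produced
by an earlier invalid operation hides the error.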
Here is an idea for how to improve NaN support in scipy.
Instead of using
    NaN=Inf-Inf
as a definition of nan, let's define it as follows:
# File nan.py
class NaN_class(float):
    # Every comparison with NaN is false.
    def __eq__(self,other): return 0
    __lt__ = __gt__ = __le__ = __ge__ = __eq__
    # Arithmetic with NaN gives NaN again.
    def __add__(self,other): return self
    __sub__ = __mul__ = __div__ = __pow__ = __radd__ = __rsub__ \
              = __rmul__ = __rdiv__ = __rpow__ = __add__
    def __neg__(self): return self
    __pos__ = __abs__ = __neg__
    def __str__(self): return 'NaN'
    def __repr__(self): return 'NaN'
    def __float__(self): return self
    def __int__(self): raise ValueError('cannot convert NaN to integer')
    # ...

NaN = NaN_class()

def test():
    assert not NaN==1
    assert not NaN==NaN
    assert not NaN<1
    assert not NaN>1
    assert not NaN>=1.0
    assert not NaN<=1
    assert not 1<NaN
    assert not 1<=NaN
    assert not 1>NaN
    assert not 1>=NaN
    assert not NaN<NaN
    assert not NaN>NaN
    assert NaN+1 is NaN
    assert 1+NaN is NaN

if __name__ == "__main__":
    test()
# eof
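For readers on a present-day interpreter, the same idea might look roughly
like this. It is a hypothetical Python 3 port of the NaN_class sketch, not
scipy code: the name NaNClass is invented here, it defines all six
comparisons, restores __hash__ (which Python 3 drops when __eq__ is
overridden), and uses __truediv__ in place of __div__.

```python
class NaNClass(float):
    """Hypothetical Python 3 port of the NaN_class idea."""

    def __new__(cls):
        # Store a real IEEE NaN as the underlying float value.
        return super().__new__(cls, 'nan')

    # Every ordering or equality comparison is false ...
    def __eq__(self, other):
        return False
    __lt__ = __gt__ = __le__ = __ge__ = __eq__

    # ... except "not equal", which is true, as IEEE 754 specifies.
    def __ne__(self, other):
        return True

    # Overriding __eq__ disables the inherited hash, so restore it.
    __hash__ = float.__hash__

    # Arithmetic with NaN gives back the same NaN object.
    def __add__(self, other):
        return self
    __sub__ = __mul__ = __truediv__ = __floordiv__ = __pow__ = __add__
    __radd__ = __rsub__ = __rmul__ = __rtruediv__ = __rfloordiv__ = __add__
    __rpow__ = __add__

    def __neg__(self):
        return self
    __pos__ = __abs__ = __neg__

    def __repr__(self):
        return 'NaN'
    __str__ = __repr__

    def __int__(self):
        raise ValueError('cannot convert NaN to integer')


NaN = NaNClass()
```

Because NaNClass is a subclass of float, Python gives its reflected methods
priority when the left operand is a plain float, so 1.0 + NaN and 1.0 < NaN
also go through the overrides.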
Inf should probably be constructed similarly. In addition, these
classes should be implemented in C so that other extension modules can
use them. There are a number of other issues that I haven't considered
(NaN's in Numeric arrays, complex NaN's, etc.), and resolving all of them
is not a trivial project.
Pearu