[Numpy-discussion] Missing data again

Nathaniel Smith njs@pobox....
Tue Mar 6 14:59:03 CST 2012


On Tue, Mar 6, 2012 at 4:38 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:
> On Tue, Mar 6, 2012 at 5:48 AM, Pierre Haessig <pierre.haessig@crans.org>
> wrote:
>> From a potential user's perspective, I feel it would be nice to have NA
>> and non-NA cases look as similar as possible. Your code example is
>> particularly striking: two different dtypes to store (from a user
>> perspective) the exact same content! If this *could* be avoided, it
>> would be great...
>
> The biggest reason to keep the two types separate is performance. The
> straight float dtypes map directly to hardware floating-point operations,
> which can be very fast. The NA-float dtypes have to use additional logic to
> handle the NA values correctly. NA is treated as a particular NaN, and if
> the hardware float operations were used directly, NA would turn into NaN.
> This additional logic usually means more branches, so is slower.

Actually, no -- hardware float operations preserve NA-as-NaN. You
might well need to be careful around more exotic code like optimized
BLAS kernels, but all the basic ufuncs should Just Work at full speed.
Demo:

>>> import numpy as np
>>> def hexify(x): return hex(np.float64(x).view(np.int64))
>>> hexify(np.nan)
'0x7ff8000000000000L'
# IIRC R's NA uses the payload 1954 (presumably someone's birth year)
>>> NA = np.int64(0x7ff8000000000000 + 1954).view(np.float64)
# It is a NaN...
>>> NA
nan
# But it has a distinct bitpattern:
>>> hexify(NA)
'0x7ff80000000007a2L'
# Like any NaN, it propagates through floating point operations:
>>> NA + 3
nan
# But, critically, so does the bitpattern; ordinary Python "+" is
# returning NA on this operation:
>>> hexify(NA + 3)
'0x7ff80000000007a2L'

This is how R does it, which is more evidence that this actually works
on real hardware.
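
For anyone who wants to try the same thing on whole arrays rather than
scalars, here is a minimal sketch. The helper names (make_na,
payload_of, is_na) and the payload constant are made up for this
example and are not part of numpy. Whether the payload survives
arithmetic depends on the hardware, so the last line prints the result
instead of promising one.

```python
import numpy as np

QUIET_NAN_BITS = 0x7FF8000000000000   # bit pattern of the default quiet NaN
PAYLOAD_MASK = 0x0007FFFFFFFFFFFF     # low 51 mantissa bits
NA_PAYLOAD = 1954                     # arbitrary payload picked for this sketch


def make_na(payload=NA_PAYLOAD):
    """Build a float64 NaN whose low mantissa bits carry `payload`."""
    bits = np.array([QUIET_NAN_BITS | payload], dtype=np.uint64)
    return bits.view(np.float64)[0]


def payload_of(values):
    """Return the low 51 mantissa bits of each float64 in `values`."""
    bits = np.asarray(values, dtype=np.float64).view(np.uint64)
    return bits & np.uint64(PAYLOAD_MASK)


def is_na(values):
    """True where a value is a NaN carrying the NA payload."""
    return np.isnan(values) & (payload_of(values) == NA_PAYLOAD)


data = np.array([1.0, make_na(), np.nan, 4.0])
result = data * 2.0 + 3.0   # plain ufuncs, no NA-specific branches

print(is_na(data))     # NA is distinguishable from an ordinary NaN
print(is_na(result))   # shows whether this machine kept the payload
```

The point of the sketch is that is_na only looks at bits after the
fact; the arithmetic itself is the ordinary float64 code path.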

There is one place where it fails. In a binary operation with *two*
NaN values, there's an ambiguity about which payload should be
returned. IEEE 754 only says it should be one of the inputs' payloads,
and typical hardware just returns the first one. This means
that NA + NaN = NA, NaN + NA = NaN. This is ugly, but it's an obscure
case that nobody cares about, so it's probably worth it for the speed
gain. (In fact, if you type those two expressions at the R prompt,
then that's what you get, and I can't find any reference to anyone
even noticing this.)
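
The operand-order dependence can be checked with a short sketch like
the one below. The helper names and the payload value are assumptions
made for this example. Which payload comes back is up to the hardware
(and possibly the compiler), so the script prints what it finds rather
than asserting a particular answer.

```python
import numpy as np

QUIET_NAN_BITS = 0x7FF8000000000000   # default quiet NaN
PAYLOAD_MASK = 0x0007FFFFFFFFFFFF     # low 51 mantissa bits
NA_PAYLOAD = 1954                     # arbitrary payload picked for this sketch


def nan_with_payload(payload):
    """Return a 1-element float64 array holding a NaN with `payload`."""
    bits = np.array([QUIET_NAN_BITS | payload], dtype=np.uint64)
    return bits.view(np.float64)


def payload_of(arr):
    """Extract the payload of the first element of a float64 array."""
    return int(arr.view(np.uint64)[0] & np.uint64(PAYLOAD_MASK))


na = nan_with_payload(NA_PAYLOAD)
plain_nan = nan_with_payload(0)

left = na + plain_nan    # NA is the first operand
right = plain_nan + na   # NA is the second operand

print("NA + NaN payload:", payload_of(left))
print("NaN + NA payload:", payload_of(right))
```

If the two printed payloads differ, the machine is picking one operand
by position, which is the asymmetry described above.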

>> I don't know how the NA machinery works in R. Does it work with a
>> kind of "nafloat64" all the time or is there some type inference
>> mechanism involved in choosing the appropriate type?
>
> My understanding of R is that it works with the "nafloat64" for all its
> operations, yes.

Right -- R has a very impoverished type system as compared to numpy.
There are basically four types: "numeric" (meaning double precision
float), "integer", "logical" (boolean), and "character" (string). And
in practice the integer type is essentially unused, because R parses
numbers like "1" as being floating point, not integer; the only way to
get an integer value is to explicitly cast to it. Each of these types
has a specific bit-pattern set aside for representing NA. And...
that's it. It's very simple when it works, but also very limited.
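
To make the "reserved bit-pattern" idea concrete for a type with no
NaN, here is a rough sketch of a sentinel-based integer NA. It assumes
the minimum int32 value is the reserved pattern (which, as far as I
know, is what R does for its integer NA); the names INT_NA and na_add
are invented for this example. Unlike the float case, the hardware
gives no help, so the check has to be written out, and this sketch
also maps overflow to NA so that a real result can never collide with
the sentinel.

```python
import numpy as np

INT_NA = np.iinfo(np.int32).min    # assumed sentinel: the minimum int32
INT_MAX = np.iinfo(np.int32).max


def na_add(a, b):
    """Add two int32 arrays, treating INT_NA as a missing value."""
    a = np.asarray(a, dtype=np.int32)
    b = np.asarray(b, dtype=np.int32)
    # Explicit NA check: integers have no NaN to propagate for us.
    missing = (a == INT_NA) | (b == INT_NA)
    # Add in int64 so the sum itself cannot wrap around.
    total = a.astype(np.int64) + b.astype(np.int64)
    # Anything outside the usable int32 range becomes NA as well.
    missing |= (total > INT_MAX) | (total <= INT_NA)
    return np.where(missing, INT_NA, total).astype(np.int32)


print(na_add([1, INT_NA, 3], [10, 20, INT_NA]))
```

The extra mask and np.where are the "additional logic" that the float
case gets to skip.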

I'm still skeptical that we could make the floating point types
NA-aware by default -- until we have an implementation in hand, I'm
nervous there'd be some corner case that broke everything. (Maybe
ufuncs are fine but np.dot has an unavoidable overhead, or maybe it
would mess up casting from float types to non-NA-aware types, etc.)
But who knows. Probably not something we can really make a meaningful
decision about yet.

-- Nathaniel

