[Numpy-discussion] Missing data again
Sat Mar 3 15:46:29 CST 2012
On Sat, Mar 3, 2012 at 12:30 PM, Travis Oliphant <email@example.com> wrote:
> First of all, I want to be clear that I think there is much great work
> that has been done in the current missing data code. There are some nice
> features in the where clause of the ufunc and the machinery for the
> iterator that allows re-using ufunc loops that are not re-written to check
> for missing data. I'm sure there are other things as well that I'm not
> quite aware of yet. However, I don't think the API presented to the
> numpy user presently is the correct one for NumPy 1.X.
I thought I might chime in with some implementation-detail notes: although
Travis has dug into the code, I'm still the person who knows it best.
A few particulars:
> * the reduction operations need to default to "skipna" --- this is
> the most common use case which has been re-inforced again to me today by a
> new user to Python who is using masked arrays presently
This is a completely trivial change. I chose the current default, in which
NA propagates through reductions, because that is what R, the primary
inspiration for the NA design, does. If the default changes, we'll have to
mark the difference clearly in the "NumPy NA for R users" documentation.
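
To make the two candidate defaults concrete, here is a minimal plain-Python
sketch, not NumPy code. The NA sentinel, the na_sum function and its skipna
flag are hypothetical names used only for illustration; the default shown
propagates NA, as R's sum does with na.rm=FALSE.

```python
class _NAType:
    """Hypothetical stand-in for a missing-value marker."""

    def __repr__(self):
        return "NA"


NA = _NAType()


def na_sum(values, skipna=False):
    """Sum values; propagate NA unless skipna is True."""
    total = 0.0
    for v in values:
        if v is NA:
            if skipna:
                continue   # drop the missing element and keep summing
            return NA      # a single NA makes the whole result NA
        total += v
    return total


data = [1.0, NA, 3.0]
print(na_sum(data))                # propagating default
print(na_sum(data, skipna=True))   # the behaviour Travis asks for
```

Switching the default amounts to flipping the value of skipna in the
signature, which is why the change itself is small; the cost is in
documenting it for users who expect R's behaviour.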
> * the mask needs to be visible to the user if they use that
> approach to missing data (people should be able to get a hold of the mask
> and work with it in Python)
This is relatively easy. Probably the way to do it is with an
ndarray.maskna property. It could be in 1.7 if we really push. For the
multi-NA future, I think the NPY_MASK dtype, currently an alias for
NPY_UBYTE, would need to become its own dtype with separate .exposed and
> * bit-pattern approaches to missing data (at least for float64 and
> int32) need to be implemented.
I strongly wanted to do masks first, because of the greater generality and
because the bit-patterns would best be implemented sharing mask
implementation details. I still believe this was the correct choice, and it
set the stage for bit-patterns. It will be possible to make inner loops
that specialize for the default hard-coded bit-pattern dtypes. I paid very
careful attention in the design making sure high performance is possible
without significant rework. The immense scale of the required code changes
meant I couldn't actually implement high performance in the time frame.
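
As a rough illustration of what a bit-pattern NA for float64 can look like,
the sketch below reserves one particular NaN payload as "missing". The
payload borrows R's convention (a NaN whose low 32 bits are 1954) purely as
an assumption; nothing above says which pattern NumPy would pick, and the
helper names are hypothetical.

```python
import math
import struct

# Assumed pattern, borrowed from R: exponent all ones, low word 1954 (0x7A2).
NA_BITS = 0x7FF00000000007A2


def float_bits(x):
    """Return the 64-bit IEEE 754 pattern of a Python float as an int."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def make_na():
    """Build the float whose bit pattern is NA_BITS."""
    return struct.unpack("<d", struct.pack("<Q", NA_BITS))[0]


def is_na(x):
    """True only for NaNs that carry the NA payload in the low word."""
    return math.isnan(x) and (float_bits(x) & 0xFFFFFFFF) == 1954


na = make_na()
print(math.isnan(na), is_na(na))     # NA still looks like a NaN to float code
print(is_na(math.nan), is_na(1.0))   # ordinary NaN and ordinary values are not NA
```

Because the missing flag travels inside the value itself, no separate mask
array is needed, which is what makes specialized inner loops for such
dtypes attractive.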
The place I think this affects 1.7 the most is in the default choice for
what np.array([1.0, np.NA, 3.0]) and np.array([1, np.NA, 3]) mean. In 1.7,
both mean an NA-masked array. In 1.8, I can see a strong case that the
first should mean an NA-dtype, and the second an NA-masked array.
Also, here's a thought for the usability of NA-float64. As much as global
state is a bad idea, something which determines whether implicit float
dtypes are NA-float64 or float64 could help. In IPython, "pylab" mode would
default to float64, and "statlab" or "pystat" would default to NA-float64.
One way to write this might be:
>>> np.array([1.0, 2.0, 3.0])
array([ 1., 2., 3.], dtype=nafloat64)
>>> np.array([1.0, 2.0, 3.0])
array([ 1., 2., 3.], dtype=float64)
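
A minimal sketch of how such a switch could be structured, in plain Python
rather than NumPy. Every name here (set_default_float, default_float,
infer_dtype) is hypothetical and only illustrates the idea of a
process-wide default that array construction consults.

```python
import contextlib

_ALLOWED = ("float64", "nafloat64")
_default_float = "float64"   # process-wide state


def set_default_float(name):
    """Change which dtype name implicit float arrays receive."""
    global _default_float
    if name not in _ALLOWED:
        raise ValueError("unknown float dtype: %r" % (name,))
    _default_float = name


def infer_dtype(values):
    """Pick a dtype name for a literal list, consulting the global default."""
    if any(isinstance(v, float) for v in values):
        return _default_float
    return "int"   # placeholder for the non-float case


@contextlib.contextmanager
def default_float(name):
    """Temporarily switch the default and restore it afterwards."""
    previous = _default_float
    set_default_float(name)
    try:
        yield
    finally:
        set_default_float(previous)


print(infer_dtype([1.0, 2.0, 3.0]))
with default_float("nafloat64"):
    print(infer_dtype([1.0, 2.0, 3.0]))
print(infer_dtype([1.0, 2.0, 3.0]))
```

Wrapping the switch in a context manager is one way to limit the damage
global state can do: the default is restored even if the enclosed code
raises.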
> * there should be some way when using "masks" (even if it's hidden
> from most users) for missing data to separate the low-level ufunc operation
> from the operation on the masks...
This is completely trivial to implement. Maybe
ndarray.view(maskna='ignore') is a reasonable way to spell direct access
without a mask.
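
The separation being asked for can be sketched in plain Python as follows.
This is not the NumPy implementation: MaskedValues and its methods are
hypothetical, and the convention that True in the mask means "exposed" is
an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class MaskedValues:
    data: list   # underlying values, including those hidden by the mask
    mask: list   # assumed convention: True = exposed, False = NA

    def apply(self, func):
        """Masked path: apply func to exposed elements only."""
        new_data = [func(v) if m else v for v, m in zip(self.data, self.mask)]
        return MaskedValues(new_data, list(self.mask))

    def raw(self):
        """Unmasked path: hand back the underlying data, ignoring the mask."""
        return list(self.data)


mv = MaskedValues([1.0, -999.0, 3.0], [True, False, True])

masked_result = mv.apply(lambda v: v * 2)     # hidden element left alone
raw_result = [v * 2 for v in mv.raw()]        # low-level operation on everything

print(masked_result.data, masked_result.mask)
print(raw_result)
```

The low-level operation (doubling) is written once; whether the mask is
consulted is decided by which path the caller takes.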
> I have heard from several users that they will *not use the missing data*
> in NumPy as currently implemented, and I can now see why. For better or
> for worse, my approach to software is generally very user-driven and very
> pragmatic. On the other hand, I'm also a mathematician and appreciate the
> cognitive compression that can come out of well-formed structure.
> None-the-less, I'm an *applied* mathematician and am ultimately motivated
> by applications.
> I will get a hold of the NEP and spend some time with it to discuss some
> of this in that document. This will take several weeks (as PyCon is next
> week and I have a tutorial I'm giving there). For now, I do not think
> 1.7 can be released unless the masked array is labeled *experimental*.