[Numpy-discussion] Missing data again
Mark Wiebe
mwwiebe@gmail....
Sat Mar 3 15:46:29 CST 2012
On Sat, Mar 3, 2012 at 12:30 PM, Travis Oliphant <travis@continuum.io> wrote:
> <snip>
>
> First of all, I want to be clear that I think there is much great work
> that has been done in the current missing data code. There are some nice
> features in the where clause of the ufunc and the machinery for the
> iterator that allows re-using ufunc loops that are not re-written to check
> for missing data. I'm sure there are other things as well that I'm not
> quite aware of yet. However, I don't think the API presented to the
> numpy user presently is the correct one for NumPy 1.X.
>
I thought I might chime in with some implementation-detail notes: Travis
has dug into the code, but I'm still the person who knows it best.
> A few particulars:
>
> * the reduction operations need to default to "skipna" --- this is
> the most common use case, which has been reinforced again to me today by a
> new user to Python who is presently using masked arrays
>
This is a completely trivial change. I chose the current default because
it's what R, the primary inspiration for the NA design, does. We'll have to
be sure the difference is well marked in the documentation about "NumPy NA
for R users".
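To make the two defaults concrete, here is a rough sketch using only functions from released NumPy. It does not use the NA API discussed in this thread; NaN and numpy.ma stand in for NA as an assumption, purely to show the difference between a propagating reduction and a "skipna" one.

```python
import numpy as np

# NaN stands in for a missing value here (an assumption for illustration;
# it is not the np.NA object discussed in this thread).
values = np.array([1.0, np.nan, 3.0])

# Propagating default (R-style): one missing value poisons the result.
propagated = np.sum(values)

# "skipna"-style reduction: missing values are dropped before summing.
skipped = np.nansum(values)

# Masked arrays already skip masked elements in reductions.
masked = np.ma.masked_invalid(values)
masked_sum = masked.sum()

print(propagated, skipped, masked_sum)
```

The first result is NaN, while the other two sum only the values that are present.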
> * the mask needs to be visible to the user if they use that
> approach to missing data (people should be able to get a hold of the mask
> and work with it in Python)
>
This is relatively easy. Probably the way to do it is with an
ndarray.maskna property. It could be in 1.7 if we really push. For the
multi-NA future, I think the NPY_MASK dtype, currently an alias for
NPY_UBYTE, would need to become its own dtype with separate .exposed and
.payload attributes.
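As a point of comparison only, the existing numpy.ma module already exposes its mask to Python. The sketch below uses numpy.ma, not the proposed ndarray.maskna property (whose final form is not settled here), to show the kind of access being asked for.

```python
import numpy as np

a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# The mask is an ordinary boolean ndarray the user can inspect.
# A copy is taken because assigning a.mask below updates the mask in place.
mask = np.ma.getmaskarray(a).copy()

# The mask can also be modified from Python; here the last element
# is masked in addition to the second one.
a.mask = mask | np.array([False, False, True])

# Number of elements that are still unmasked.
visible_count = a.count()
print(mask, a.mask, visible_count)
```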
> * bit-pattern approaches to missing data (at least for float64 and
> int32) need to be implemented.
>
I strongly wanted to do masks first, because of the greater generality and
because the bit-patterns would best be implemented sharing mask
implementation details. I still believe this was the correct choice, and it
set the stage for bit-patterns. It will be possible to make inner loops
that specialize for the default hard-coded bit-pattern dtypes. In the
design I took great care to ensure that high performance is possible
without significant rework, but the immense scale of the required code
changes meant I couldn't actually implement it in the time frame.
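For readers unfamiliar with the bit-pattern idea, the following is a minimal sketch of how a float64 NA could be encoded as a NaN with a reserved payload. The payload value 1954 (the convention R is commonly described as using), the helper name is_na, and the constants are all assumptions for illustration; this is not the NumPy implementation and it defines no new dtype.

```python
import numpy as np

# Assumed NA encoding: a NaN whose low 32 bits hold the payload 1954.
NA_PAYLOAD = np.uint64(1954)
NA_BITS = np.uint64(0x7FF0000000000000) | NA_PAYLOAD
LOW_WORD = np.uint64(0xFFFFFFFF)


def is_na(arr):
    """Return True where a float64 array holds the assumed NA bit pattern."""
    # Reinterpret the float64 storage as raw 64-bit integers (no conversion).
    bits = np.ascontiguousarray(arr, dtype=np.float64).view(np.uint64)
    return np.isnan(arr) & ((bits & LOW_WORD) == NA_PAYLOAD)


data = np.array([1.0, 0.0, np.nan, 3.0])
# Overwrite element 1 with the NA bit pattern through an integer view.
data.view(np.uint64)[1] = NA_BITS

# Only the element carrying the payload is flagged; the plain NaN is not.
na_flags = is_na(data)
print(na_flags)
```

Because the NA lives inside the value itself, no separate mask array is needed, which is the trade-off against the more general mask-based approach.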
The place I think this affects 1.7 the most is in the default choice for
what np.array([1.0, np.NA, 3.0]) and np.array([1, np.NA, 3]) mean. In 1.7,
both mean an NA-masked array. In 1.8, I can see a strong case that the
first should mean an NA-dtype, and the second an NA-masked array.
Also, here's a thought for the usability of NA-float64. As much as global
state is a bad idea, something which determines whether implicit float
dtypes are NA-float64 or float64 could help. In IPython, "pylab" mode would
default to float64, and "statlab" or "pystat" would default to NA-float64.
One way to write this might be:
>>> np.set_default_float(np.nafloat64)
>>> np.array([1.0, 2.0, 3.0])
array([ 1., 2., 3.], dtype=nafloat64)
>>> np.set_default_float(np.float64)
>>> np.array([1.0, 2.0, 3.0])
array([ 1., 2., 3.], dtype=float64)
> * there should be some way when using "masks" (even if it's hidden
> from most users) for missing data to separate the low-level ufunc operation
> from the operation on the masks...
>
This is completely trivial to implement. Maybe
ndarray.view(maskna='ignore') is a reasonable way to spell direct access
without a mask.
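The view(maskna='ignore') spelling above is only a suggestion. As a hedged analogy, numpy.ma already separates the two levels through its .data attribute, which the sketch below uses to contrast a mask-aware reduction with one on the raw values.

```python
import numpy as np

a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# Mask-aware reduction: the masked element is skipped.
masked_total = a.sum()

# Raw view of the same values as a plain ndarray; the mask plays no part.
raw = a.data
raw_total = raw.sum()

print(masked_total, raw_total)
```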
Cheers,
Mark
> I have heard from several users that they will *not use the missing data*
> in NumPy as currently implemented, and I can now see why. For better or
> for worse, my approach to software is generally very user-driven and very
> pragmatic. On the other hand, I'm also a mathematician and appreciate the
> cognitive compression that can come out of well-formed structure.
> Nonetheless, I'm an *applied* mathematician and am ultimately motivated
> by applications.
>
> I will get a hold of the NEP and spend some time with it to discuss some
> of this in that document. This will take several weeks (as PyCon is next
> week and I have a tutorial I'm giving there). For now, I do not think
> 1.7 can be released unless the masked array is labeled *experimental*.
>
> Thanks,
>
> -Travis