[Numpy-discussion] alterNEP - was: missing data discussion round 2

Charles R Harris charlesr.harris@gmail....
Thu Jun 30 09:17:04 CDT 2011


On Thu, Jun 30, 2011 at 7:31 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

> Hi,
>
> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith <njs@pobox.com> wrote:
> > Anyway, it's pretty clear that in this particular case, there are two
> > distinct features that different people want: the missing data
> > feature, and the masked array feature. The more I think about it, the
> > less I see how they can be combined into one dessert topping + floor
> > wax solution. Here are three particular points where they seem to
> > contradict each other:
> ...
> [some proposals]
>
> In the interest of making the discussion as concrete as possible, here
> is my draft of an alternative proposal for NAs and masking, based on
> Nathaniel's comments.  Writing it, it seemed to me that Nathaniel is
> right, that the ideas become much clearer when the NA idea and the
> MASK idea are separate.   Please do pitch in for things I may have
> missed or misunderstood:
>
> ################################################
> An alternative-NEP on masking and missing values
> ################################################
>
> The principle of this aNEP is to separate the APIs for masking and for
> missing values, according to
>
> * The current implementation of masked arrays
> * Nathaniel Smith's proposal.
>
> This discussion is only of the API, and not of the implementation.
>
> **************
> Initialization
> **************
>
> First, missing values can be set and be displayed as ``np.NA, NA``::
>
>    >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
>    array([1., 2., NA, 7.], dtype='NA[<f8]')
>
> As the initialization is not ambiguous, this can be written without the NA
> dtype::
>
>    >>> np.array([1.0, 2.0, np.NA, 7.0])
>    array([1., 2., NA, 7.], dtype='NA[<f8]')
>
> Masked values can be set and be displayed as ``np.MASKED, MASKED``::
>
>    >>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True)
>    array([1., 2., MASKED, 7.], masked=True)
>
> As the initialization is not ambiguous, this can be written without
> ``masked=True``::
>
>    >>> np.array([1.0, 2.0, np.MASKED, 7.0])
>    array([1., 2., MASKED, 7.], masked=True)
>
> ******
> Ufuncs
> ******
>
> By default, NA values propagate::
>
>    >>> na_arr = np.array([1.0, 2.0, np.NA, 7.0])
>    >>> np.sum(na_arr)
>    NA('float64')
>
> unless the ``skipna`` flag is set::
>
>    >>> np.sum(na_arr, skipna=True)
>    10.0
>
> By default, masking does not propagate::
>
>    >>> masked_arr = np.array([1.0, 2.0, np.MASKED, 7.0])
>    >>> np.sum(masked_arr)
>    10.0
>
> unless the ``propmsk`` flag is set::
>
>    >>> np.sum(masked_arr, propmsk=True)
>    MASKED
>
> An array can be masked, and contain NA values::
>
>    >>> both_arr = np.array([1.0, 2.0, np.MASKED, np.NA, 7.0])
>
> In the default case, the behavior is obvious::
>
>    >>> np.sum(both_arr)
>    NA('float64')
>
> It's also obvious what to do with ``skipna=True``::
>
>    >>> np.sum(both_arr, skipna=True)
>    10.0
>    >>> np.sum(both_arr, skipna=True, propmsk=True)
>    MASKED
>
> To break the tie between NA and MSK, NAs propagate harder::
>
>    >>> np.sum(both_arr, propmsk=True)
>    NA('float64')
>
> **********
> Assignment
> **********
>
> is obvious in the NA case::
>
>    >>> arr = np.array([1.0, 2.0, 7.0])
>    >>> arr[2] = np.NA
>    TypeError('dtype does not support NA')
>    >>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]')
>    >>> na_arr[2] = np.NA
>    >>> na_arr
>    array([1., 2., NA], dtype='NA[<f8]')
>
> Direct assignment in the masked case is magic and confusing, and so happens
> only via the mask::
>
>    >>> masked_arr = np.array([1.0, 2.0, 7.0], masked=True)
>    >>> masked_arr[2] = np.NA
>    TypeError('dtype does not support NA')
>    >>> masked_arr[2] = np.MASKED
>    TypeError('float() argument must be a string or a number')
>    >>> masked_arr.visible[2] = False
>    >>> masked_arr
>    array([1., 2., MASKED], masked=True)
>
> See y'all,
>
>
I honestly don't see the problem here. The difference isn't between
masked_values/missing_values, it is between masked arrays and masked views
of unmasked arrays. I think the view concept is central to what is going on.
It may not be what folks are used to, but it strikes me as a clarifying
advance rather than a mixed-up confusion. Admittedly, it depends on the
numpy-centric ability to have views, but views are a wonderful thing.
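
To make the view idea concrete, here is a minimal sketch using the existing
numpy.ma module rather than the proposed np.NA / np.MASKED API, which is not
implemented. The names base and view are made up for the example. It only
shows that a masked array can sit on top of an unmasked array's buffer without
copying it, so the hidden value is never destroyed.

```python
import numpy as np

# Plain, unmasked array that owns the data.
base = np.array([1.0, 2.0, 3.0, 7.0])

# Masked array built on top of it. numpy.ma does not copy the data by
# default, so this behaves like a masked view of the unmasked array.
view = np.ma.masked_array(base, mask=[False, False, True, False])

# Both objects refer to the same underlying buffer.
shares = np.shares_memory(view.data, base)

# Reductions on the view skip the masked element; the base is unaffected.
masked_sum = float(view.sum())
base_sum = float(base.sum())

# Writing through the view to an unmasked element changes the base array.
view[0] = 10.0

# The masked element still holds its original value underneath the mask.
hidden_value = float(view.data[2])
```

The masked element is hidden from the reduction, yet the value stays in the
base array, which is what separates a masked view from a missing value that
overwrites the data.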

Chuck