[Numpy-discussion] alterNEP - was: missing data discussion round 2

Charles R Harris charlesr.harris@gmail....
Thu Jun 30 09:27:43 CDT 2011


On Thu, Jun 30, 2011 at 8:17 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:

>
>
> On Thu, Jun 30, 2011 at 7:31 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
>
>> Hi,
>>
>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith <njs@pobox.com> wrote:
>> > Anyway, it's pretty clear that in this particular case, there are two
>> > distinct features that different people want: the missing data
>> > feature, and the masked array feature. The more I think about it, the
>> > less I see how they can be combined into one dessert topping + floor
>> > wax solution. Here are three particular points where they seem to
>> > contradict each other:
>> ...
>> [some proposals]
>>
>> In the interest of making the discussion as concrete as possible, here
>> is my draft of an alternative proposal for NAs and masking, based on
>> Nathaniel's comments.  Writing it, it seemed to me that Nathaniel is
>> right, that the ideas become much clearer when the NA idea and the
>> MASK idea are separate.   Please do pitch in for things I may have
>> missed or misunderstood:
>>
>> ################################################
>> An alternative-NEP on masking and missing values
>> ################################################
>>
>> The principle of this aNEP is to separate the APIs for masking and for
>> missing values, according to
>>
>> * The current implementation of masked arrays
>> * Nathaniel Smith's proposal.
>>
>> This discussion is only of the API, and not of the implementation.
>>
>> **************
>> Initialization
>> **************
>>
>> First, missing values can be set and be displayed as ``np.NA, NA``::
>>
>>    >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
>>    array([1., 2., NA, 7.], dtype='NA[<f8]')
>>
>> As the initialization is not ambiguous, this can be written without the NA
>> dtype::
>>
>>    >>> np.array([1.0, 2.0, np.NA, 7.0])
>>    array([1., 2., NA, 7.], dtype='NA[<f8]')
>>
>> Masked values can be set and be displayed as ``np.MASKED, MASKED``::
>>
>>    >>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True)
>>    array([1., 2., MASKED, 7.], masked=True)
>>
>> As the initialization is not ambiguous, this can be written without
>> ``masked=True``::
>>
>>    >>> np.array([1.0, 2.0, np.MASKED, 7.0])
>>    array([1., 2., MASKED, 7.], masked=True)
>>
>> ******
>> Ufuncs
>> ******
>>
>> By default, NA values propagate::
>>
>>    >>> na_arr = np.array([1.0, 2.0, np.NA, 7.0])
>>    >>> np.sum(na_arr)
>>    NA('float64')
>>
>> unless the ``skipna`` flag is set::
>>
>>    >>> np.sum(na_arr, skipna=True)
>>    10.0
>>
>> By default, masking does not propagate::
>>
>>    >>> masked_arr = np.array([1.0, 2.0, np.MASKED, 7.0])
>>    >>> np.sum(masked_arr)
>>    10.0
>>
>> unless the ``propmsk`` flag is set::
>>
>>    >>> np.sum(masked_arr, propmsk=True)
>>    MASKED
>>
>> An array can be masked, and contain NA values::
>>
>>    >>> both_arr = np.array([1.0, 2.0, np.MASKED, np.NA, 7.0])
>>
>> In the default case, the behavior is obvious::
>>
>>    >>> np.sum(both_arr)
>>    NA('float64')
>>
>> It's also obvious what to do with ``skipna=True``::
>>
>>    >>> np.sum(both_arr, skipna=True)
>>    10.0
>>    >>> np.sum(both_arr, skipna=True, propmsk=True)
>>    MASKED
>>
>> To break the tie between NA and MSK, NAs propagate harder::
>>
>>    >>> np.sum(both_arr, propmsk=True)
>>    NA('float64')
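
To make the flag combinations above concrete, here is a minimal pure-Python
sketch of the reduction rules as I read them from the draft. The names NA,
MASKED and sketch_sum are hypothetical stand-ins, not NumPy names, and the
sketch returns a bare NA rather than the typed NA('float64') shown above.

```python
class _Sentinel:
    """Hypothetical stand-in for a special marker such as NA or MASKED."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


NA = _Sentinel("NA")          # assumption: models the draft's np.NA
MASKED = _Sentinel("MASKED")  # assumption: models the draft's np.MASKED


def sketch_sum(values, skipna=False, propmsk=False):
    """Sum a list following the draft's propagation rules (illustrative only)."""
    has_na = any(v is NA for v in values)
    has_masked = any(v is MASKED for v in values)
    # NA propagates by default, and wins the tie against MASKED.
    if has_na and not skipna:
        return NA
    # MASKED propagates only when explicitly requested.
    if has_masked and propmsk:
        return MASKED
    # Otherwise both kinds of special value are skipped.
    return sum(v for v in values if v is not NA and v is not MASKED)


both_arr = [1.0, 2.0, MASKED, NA, 7.0]
print(sketch_sum(both_arr))                             # NA
print(sketch_sum(both_arr, skipna=True))                # 10.0
print(sketch_sum(both_arr, skipna=True, propmsk=True))  # MASKED
print(sketch_sum(both_arr, propmsk=True))               # NA
```

Checking NA first is what encodes "NAs propagate harder": propmsk only gets a
say once NA has been ruled out or skipped.
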
>>
>> **********
>> Assignment
>> **********
>>
>> is obvious in the NA case::
>>
>>    >>> arr = np.array([1.0, 2.0, 7.0])
>>    >>> arr[2] = np.NA
>>    TypeError('dtype does not support NA')
>>    >>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]')
>>    >>> na_arr[2] = np.NA
>>    >>> na_arr
>>    array([1., 2., NA], dtype='NA[<f8]')
>>
>> Direct assignment in the masked case is magic and confusing, and so
>> happens only via the mask::
>>
>>    >>> masked_arr = np.array([1.0, 2.0, 7.0], masked=True)
>>    >>> masked_arr[2] = np.NA
>>    TypeError('dtype does not support NA')
>>    >>> masked_arr[2] = np.MASKED
>>    TypeError('float() argument must be a string or a number')
>>    >>> masked_arr.visible[2] = False
>>    >>> masked_arr
>>    array([1., 2., MASKED], masked=True)
>>
>> See y'all,
>>
>>
> I honestly don't see the problem here. The difference isn't between
> masked_values/missing_values, it is between masked arrays and masked views
> of unmasked arrays. I think the view concept is central to what is going on.
> It may not be what folks are used to, but it strikes me as a clarifying
> advance rather than a mixed-up confusion. Admittedly, it depends on the
> numpy-centric ability to have views, but views are a wonderful thing.
>
>
OK, I can see a problem: currently the only way to unmask a value is to
assign a valid value to the underlying data array, which is the missing data
idea. For masked data, it might be convenient to have something that affects
only the mask, instead of having to take another view of the unmasked data
and reconstruct the mask with some modifications. That could maybe be done
with a "soft" np.CLEAR that only works on views of unmasked arrays.
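
A rough pure-Python sketch of the distinction, under heavy assumptions: the
MaskedView class and its mask/clear/assign methods are made up for
illustration and are not a proposed API. The view shares its data with the
base list and owns only the mask, so clear() touches the mask alone while
assign() writes through to the base.

```python
class MaskedView:
    """Hypothetical masked view: shares `data` with its owner, owns its mask."""

    def __init__(self, data):
        self.data = data                   # shared with the base, not copied
        self.hidden = [False] * len(data)  # per-view mask, True means masked

    def mask(self, i):
        self.hidden[i] = True

    def clear(self, i):
        # "Soft" clear: only the mask changes, the data are untouched.
        self.hidden[i] = False

    def assign(self, i, value):
        # Assignment writes through to the shared data and unmasks.
        self.data[i] = value
        self.hidden[i] = False

    def visible(self):
        return ["MASKED" if h else v for v, h in zip(self.data, self.hidden)]


base = [1.0, 2.0, 7.0]
view = MaskedView(base)

view.mask(2)
print(view.visible())  # [1.0, 2.0, 'MASKED']

view.clear(2)          # unmask without supplying a new value
print(view.visible())  # [1.0, 2.0, 7.0]
print(base)            # [1.0, 2.0, 7.0]

view.mask(2)
view.assign(2, 5.0)    # unmask by assignment: the base changes too
print(base)            # [1.0, 2.0, 5.0]
```
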

Chuck