[Numpy-discussion] alterNEP - was: missing data discussion round 2

Matthew Brett matthew.brett@gmail....
Thu Jun 30 08:31:53 CDT 2011


Hi,

On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith <njs@pobox.com> wrote:
> Anyway, it's pretty clear that in this particular case, there are two
> distinct features that different people want: the missing data
> feature, and the masked array feature. The more I think about it, the
> less I see how they can be combined into one dessert topping + floor
> wax solution. Here are three particular points where they seem to
> contradict each other:
...
[some proposals]

In the interest of making the discussion as concrete as possible, here
is my draft of an alternative proposal for NAs and masking, based on
Nathaniel's comments.  While writing it, it seemed to me that Nathaniel
is right: the ideas become much clearer when the NA idea and the MASK
idea are kept separate.  Please do pitch in with anything I may have
missed or misunderstood:

################################################
An alternative-NEP on masking and missing values
################################################

The principle of this aNEP is to separate the APIs for masking and for missing
values, following:

* The current implementation of masked arrays
* Nathaniel Smith's proposal

This discussion is only of the API, and not of the implementation.

**************
Initialization
**************

First, missing values can be set as ``np.NA`` and are displayed as ``NA``::

    >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
    array([1., 2., NA, 7.], dtype='NA[<f8]')

As the initialization is not ambiguous, this can be written without the NA
dtype::

    >>> np.array([1.0, 2.0, np.NA, 7.0])
    array([1., 2., NA, 7.], dtype='NA[<f8]')

Masked values can be set as ``np.MASKED`` and are displayed as ``MASKED``::

    >>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True)
    array([1., 2., MASKED, 7.], masked=True)

As the initialization is not ambiguous, this can be written without
``masked=True``::

    >>> np.array([1.0, 2.0, np.MASKED, 7.0])
    array([1., 2., MASKED, 7.], masked=True)
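
The "not ambiguous" claim can be sketched in plain Python. This is only an
illustration under stated assumptions: ``NA`` and ``MASKED`` below are
stand-in sentinel objects, and ``sketch_infer`` is a hypothetical helper, not
part of NumPy or of this proposal's API.

```python
# Hypothetical sentinels standing in for np.NA and np.MASKED.
class _Sentinel:
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


NA = _Sentinel("NA")
MASKED = _Sentinel("MASKED")


def sketch_infer(values):
    """Return (dtype_string, masked_flag) implied by the sentinels present.

    Assumption: ordinary values are floats, so the base dtype is 'f8'.
    """
    has_na = any(v is NA for v in values)
    has_masked = any(v is MASKED for v in values)
    dtype = "NA[f8]" if has_na else "f8"
    return dtype, has_masked


na_kind = sketch_infer([1.0, 2.0, NA, 7.0])          # NA dtype, not masked
masked_kind = sketch_infer([1.0, 2.0, MASKED, 7.0])  # plain dtype, masked
plain_kind = sketch_infer([1.0, 2.0, 7.0])           # neither
```

The point is only that each sentinel identifies its own feature, so neither
``dtype='NA[f8]'`` nor ``masked=True`` has to be spelled out.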

******
Ufuncs
******

By default, NA values propagate::

    >>> na_arr = np.array([1.0, 2.0, np.NA, 7.0])
    >>> np.sum(na_arr)
    NA('float64')

unless the ``skipna`` flag is set::

    >>> np.sum(na_arr, skipna=True)
    10.0

By default, masking does not propagate::

    >>> masked_arr = np.array([1.0, 2.0, np.MASKED, 7.0])
    >>> np.sum(masked_arr)
    10.0

unless the ``propmsk`` flag is set::

    >>> np.sum(masked_arr, propmsk=True)
    MASKED
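
For comparison, the closest existing behaviour can be reproduced with NaN and
``numpy.ma``. This is an analogy only: NaN is not the proposed NA, and
``numpy.ma`` has no ``propmsk`` flag. The value ``99.0`` is an arbitrary
placeholder hidden behind the mask.

```python
import math

import numpy as np

# NaN stands in for NA here: it propagates through a plain sum.
nan_arr = np.array([1.0, 2.0, np.nan, 7.0])
propagated = np.sum(nan_arr)   # NaN, the analogue of the default NA case
skipped = np.nansum(nan_arr)   # NaN entries ignored, like skipna=True

# numpy.ma skips masked elements in reductions by default.
masked_arr = np.ma.array(
    [1.0, 2.0, 99.0, 7.0],
    mask=[False, False, True, False],
)
masked_total = masked_arr.sum()  # the masked 99.0 does not contribute

print(math.isnan(propagated), skipped, masked_total)
```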

An array can be masked, and contain NA values::

    >>> both_arr = np.array([1.0, 2.0, np.MASKED, np.NA, 7.0])

In the default case, the behavior is obvious::

    >>> np.sum(both_arr)
    NA('float64')

It's also obvious what to do with ``skipna=True``::

    >>> np.sum(both_arr, skipna=True)
    10.0
    >>> np.sum(both_arr, skipna=True, propmsk=True)
    MASKED

To break the tie between NA and MASKED, NAs propagate harder::

    >>> np.sum(both_arr, propmsk=True)
    NA('float64')
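
The rules above for ``np.sum`` can be summarized as a small pure-Python
sketch. Everything here is a stand-in: ``NA`` and ``MASKED`` are hypothetical
sentinel objects and ``sketch_sum`` is an illustrative function, not an
implementation proposal.

```python
# Hypothetical sentinels standing in for np.NA and np.MASKED.
class _Sentinel:
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


NA = _Sentinel("NA")
MASKED = _Sentinel("MASKED")


def sketch_sum(values, skipna=False, propmsk=False):
    """Sum a list of floats and sentinels following the rules in this section."""
    has_na = any(v is NA for v in values)
    has_masked = any(v is MASKED for v in values)
    # NA propagates by default and wins any tie with MASKED.
    if has_na and not skipna:
        return NA
    # Masking propagates only on request.
    if has_masked and propmsk:
        return MASKED
    # Otherwise both kinds of sentinel are skipped.
    return sum(v for v in values if v is not NA and v is not MASKED)


both = [1.0, 2.0, MASKED, NA, 7.0]
results = {
    "default": sketch_sum(both),
    "skipna": sketch_sum(both, skipna=True),
    "skipna+propmsk": sketch_sum(both, skipna=True, propmsk=True),
    "propmsk": sketch_sum(both, propmsk=True),
}
print(results)
```

Checking NA first is what makes "NAs propagate harder": ``propmsk=True`` is
only consulted once NA has been ruled out or skipped.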

**********
Assignment
**********

Assignment is obvious in the NA case::

    >>> arr = np.array([1.0, 2.0, 7.0])
    >>> arr[2] = np.NA
    TypeError('dtype does not support NA')
    >>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]')
    >>> na_arr[2] = np.NA
    >>> na_arr
    array([1., 2., NA], dtype='NA[<f8]')

Direct assignment in the masked case is magic and confusing, and so happens only
via the mask::

    >>> masked_arr = np.array([1.0, 2.0, 7.0], masked=True)
    >>> masked_arr[2] = np.NA
    TypeError('dtype does not support NA')
    >>> masked_arr[2] = np.MASKED
    TypeError('float() argument must be a string or a number')
    >>> masked_arr.visible[2] = False
    >>> masked_arr
    array([1., 2., MASKED], masked=True)
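
A minimal sketch of the mask-only assignment idea, assuming a hypothetical
wrapper class ``VisibleSketch`` with a boolean ``visible`` array. Neither the
class nor the ``MASKED`` object below exists in NumPy; they only illustrate
that the data and the mask are changed through separate doors.

```python
import numpy as np

# Hypothetical sentinel; any non-numeric object behaves the same way here.
MASKED = object()


class VisibleSketch:
    """Illustrative stand-in for a masked array; not a NumPy class."""

    def __init__(self, data):
        self.data = np.array(data, dtype=float)
        # True means the element is visible (unmasked).
        self.visible = np.ones(self.data.shape, dtype=bool)

    def __setitem__(self, index, value):
        # Item assignment only touches the data, so a sentinel is rejected
        # by the ordinary float conversion.
        self.data[index] = float(value)

    def sum(self):
        # Hidden elements are skipped, matching the default in the Ufuncs section.
        return float(self.data[self.visible].sum())


m = VisibleSketch([1.0, 2.0, 7.0])
try:
    m[2] = MASKED
    rejected = False
except TypeError:
    rejected = True

m.visible[2] = False      # masking goes through the mask only
total = m.sum()           # 1.0 + 2.0
hidden_value = m.data[2]  # the underlying 7.0 is untouched
```

Because hiding an element never writes to the data, setting
``m.visible[2] = True`` again would bring the original value back.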

See y'all,

Matthew

