[Numpy-discussion] alterNEP - was: missing data discussion round 2
Dag Sverre Seljebotn
d.s.seljebotn@astro.uio...
Thu Jun 30 09:26:31 CDT 2011
On 06/30/2011 04:17 PM, Charles R Harris wrote:
>
>
> On Thu, Jun 30, 2011 at 7:31 AM, Matthew Brett <matthew.brett@gmail.com
> <mailto:matthew.brett@gmail.com>> wrote:
>
> Hi,
>
> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith <njs@pobox.com
> <mailto:njs@pobox.com>> wrote:
> > Anyway, it's pretty clear that in this particular case, there are two
> > distinct features that different people want: the missing data
> > feature, and the masked array feature. The more I think about it, the
> > less I see how they can be combined into one dessert topping + floor
> > wax solution. Here are three particular points where they seem to
> > contradict each other:
> ...
> [some proposals]
>
> In the interest of making the discussion as concrete as possible, here
> is my draft of an alternative proposal for NAs and masking, based on
> Nathaniel's comments. Writing it, it seemed to me that Nathaniel is
> right, that the ideas become much clearer when the NA idea and the
> MASK idea are separate. Please do pitch in for things I may have
> missed or misunderstood:
>
> ################################################
> An alternative-NEP on masking and missing values
> ################################################
>
> The principle of this aNEP is to separate the APIs for masking and
> for missing values, according to
>
> * The current implementation of masked arrays
> * Nathaniel Smith's proposal.
>
> This discussion is only of the API, and not of the implementation.
>
> **************
> Initialization
> **************
>
> First, missing values can be set and be displayed as ``np.NA, NA``::
>
> >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
> array([1., 2., NA, 7.], dtype='NA[<f8]')
>
> As the initialization is not ambiguous, this can be written without
> the NA dtype::
>
> >>> np.array([1.0, 2.0, np.NA, 7.0])
> array([1., 2., NA, 7.], dtype='NA[<f8]')
>
> Masked values can be set and be displayed as ``np.MASKED, MASKED``::
>
> >>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True)
> array([1., 2., MASKED, 7.], masked=True)
>
> As the initialization is not ambiguous, this can be written without
> ``masked=True``::
>
> >>> np.array([1.0, 2.0, np.MASKED, 7.0])
> array([1., 2., MASKED, 7.], masked=True)
>
> ******
> Ufuncs
> ******
>
> By default, NA values propagate::
>
> >>> na_arr = np.array([1.0, 2.0, np.NA, 7.0])
> >>> np.sum(na_arr)
> NA('float64')
>
> unless the ``skipna`` flag is set::
>
> >>> np.sum(na_arr, skipna=True)
> 10.0
>
> By default, masking does not propagate::
>
> >>> masked_arr = np.array([1.0, 2.0, np.MASKED, 7.0])
> >>> np.sum(masked_arr)
> 10.0
>
> unless the ``propmsk`` flag is set::
>
> >>> np.sum(masked_arr, propmsk=True)
> MASKED
>
> An array can be masked, and contain NA values::
>
> >>> both_arr = np.array([1.0, 2.0, np.MASKED, np.NA, 7.0])
>
> In the default case, the behavior is obvious::
>
> >>> np.sum(both_arr)
> NA('float64')
>
> It's also obvious what to do with ``skipna=True``::
>
> >>> np.sum(both_arr, skipna=True)
> 10.0
> >>> np.sum(both_arr, skipna=True, propmsk=True)
> MASKED
>
> To break the tie between NA and MSK, NAs propagate harder::
>
> >>> np.sum(both_arr, propmsk=True)
> NA('float64')
>
> **********
> Assignment
> **********
>
> is obvious in the NA case::
>
> >>> arr = np.array([1.0, 2.0, 7.0])
> >>> arr[2] = np.NA
> TypeError('dtype does not support NA')
> >>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]')
> >>> na_arr[2] = np.NA
> >>> na_arr
> array([1., 2., NA], dtype='NA[<f8]')
>
> Direct assignment in the masked case is magic and confusing, and so
> happens only via the mask::
>
> >>> masked_arr = np.array([1.0, 2.0, 7.0], masked=True)
> >>> masked_arr[2] = np.NA
> TypeError('dtype does not support NA')
> >>> masked_arr[2] = np.MASKED
> TypeError('float() argument must be a string or a number')
> >>> masked_arr.visible[2] = False
> >>> masked_arr
> array([1., 2., MASKED], masked=True)
>
> See y'all,
>
>
> I honestly don't see the problem here. The difference isn't between
> masked_values/missing_values, it is between masked arrays and masked
> views of unmasked arrays. I think the view concept is central to what is
> going on. It may not be what folks are used to, but it strikes me as a
> clarifying advance rather than a mixed-up confusion. Admittedly, it
> depends on the numpy-centric ability to have views, but views are a
> wonderful thing.
So a) how do you propose that reductions behave, and b) what semantics
do you propose for the []= operator?
That would clarify why you don't see a problem.
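
To make the two questions concrete, here is a minimal sketch of how the
closest existing tools answer them today. It uses only np.nan, np.nansum
and numpy.ma as stand-ins; it does not use the proposed np.NA / np.MASKED
API, so treat it as an analogy under that assumption, not as the semantics
of either proposal:

```python
import numpy as np
import numpy.ma as ma

# a) Reductions.
# NaN is used here as a stand-in for NA: it propagates by default ...
na_arr = np.array([1.0, 2.0, np.nan, 7.0])
na_sum = np.sum(na_arr)
# ... unless it is skipped explicitly (analogous to the proposed skipna=True).
na_sum_skip = np.nansum(na_arr)

# numpy.ma is used as a stand-in for masking: masked entries are ignored.
masked_arr = ma.masked_array([1.0, 2.0, 3.0, 7.0],
                             mask=[False, False, True, False])
masked_sum = masked_arr.sum()

# b) The []= operator.
# A dtype that cannot represent the missing marker rejects the assignment.
int_arr = np.array([1, 2, 7])
try:
    int_arr[2] = np.nan
    int_assign_failed = False
except ValueError:
    int_assign_failed = True

# In numpy.ma, assigning the masked constant sets the mask for that element.
masked_arr[0] = ma.masked
mask_after = masked_arr.mask.tolist()
sum_after = masked_arr.sum()
```

The point of the sketch is only that "propagate by default" and "skip by
default" are two different answers to a), and that b) depends on whether
assignment touches the data or the mask.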
Dag Sverre