[Numpy-discussion] missing data discussion round 2

Mark Wiebe mwwiebe@gmail....
Thu Jun 30 13:31:38 CDT 2011


On Thu, Jun 30, 2011 at 11:42 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

> Hi,
>
> On Thu, Jun 30, 2011 at 5:13 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:
> > On Thu, Jun 30, 2011 at 11:04 AM, Gary Strangman
> > <strang@nmr.mgh.harvard.edu> wrote:
> >>
> >>>      Clearly there are some overlaps between what masked arrays are
> >>>      trying to achieve and what R's NA mechanisms are trying to achieve.
> >>>      Are they really similar enough that they should function using
> >>>      the same API?
> >>>
> >>> Yes.
> >>>
> >>>      And if so, won't that be confusing?
> >>>
> >>> No, I don't believe so, any more than NA's in R, NaN's, or Inf's are
> >>> already confusing.
> >>
> >> As one who's been silently following (most of) this thread, and a
> >> heavy R and numpy user, perhaps I should chime in briefly here with a
> >> use case. I more-or-less always work with partially masked data, like
> >> Matthew, but not numpy masked arrays because the memory overhead is
> >> prohibitive. And, sad to say, my experiments don't always go
> >> perfectly. I therefore have arrays in which there is /both/ (1) data
> >> that is simply missing (np.NA?)--it never had a value and never
> >> will--as well as simultaneously (2) data that is temporarily masked
> >> (np.IGNORE? np.MASKED?) where I want to mask/unmask different portions
> >> for different purposes/analyses. I consider these two separate,
> >> completely independent issues and I unfortunately currently have to
> >> kluge a lot to handle this.
> >>
> >> Concretely, consider a list of 100,000 observations (rows), with 12
> >> measures per observation-row (a 100,000 x 12 array). Every now and
> >> then, sprinkled throughout this array, I have missing values (someone
> >> didn't answer a question, or a computer failed to record a response,
> >> or whatever). For some analyses I want to mask the whole row (e.g.,
> >> complete-case analysis), leaving me with array entries that should be
> >> tagged with all 4 possible labels:
> >>
> >> 1) not masked, not missing
> >> 2) masked, not missing
> >> 3) not masked, missing
> >> 4) masked, missing
> >>
> >> Obviously #4 is "overkill" ... but only until I want to unmask that row.
> >> At that point, I need to be sure that missing values remain missing when
> >> unmasked. Can a single API really handle this?
> >
> > The single API does support a masked array with an NA dtype, and the
> > behavior in this case will be that the value is considered NA if
> > either it is masked or the value is the NA bit pattern. So you could
> > add a mask to an array with an NA dtype to temporarily treat the data
> > as if more values were missing.
>
> Right - but I think the separated API is cleaner and easier to
> explain.  Do you disagree?
>

Kind of, yeah. I think the important thing to understand from the Python
perspective is that there are two ways of doing missing values with NA, and
they look exactly the same except for how you create the arrays. Since you
know that the mask approach takes more memory, and that matters for your
application, you can decide to use the NA dtype without needing to
understand anything deeper.
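
To put a rough number on the memory point, here is a small sketch using
today's numpy.ma rather than the proposed API. It assumes float64 data of
the 100,000 x 12 shape from Gary's example and a full boolean mask; an
in-band NA signal would need no storage beyond the data array itself.

```python
import numpy as np

rows, cols = 100_000, 12
data = np.zeros((rows, cols), dtype=np.float64)

# Mask-based approach: one boolean (1 byte) per element, stored
# alongside the data.
mask = np.zeros((rows, cols), dtype=bool)
masked = np.ma.masked_array(data, mask=mask)

data_bytes = data.nbytes
mask_bytes = np.ma.getmaskarray(masked).nbytes

print("data bytes:", data_bytes)
print("mask bytes:", mask_bytes)
```

The relative cost of the mask grows as the element size shrinks, since the
mask stays at one byte per element regardless of the data dtype.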

Knowing that one of them uses a special signal for NA while the other uses
masks in the background probably isn't even necessary in order to use them.
I bet lots of people who use R regularly couldn't come up with a correct
explanation of how NA works there.

If someone doesn't understand masks, they can use their intuition based on
the special signal idea without any difficulty. The idea that you can
temporarily make some values NA without overwriting your data may not be
intuitive at first glance, but I expect people will find it useful even if
they don't fully understand the subtle details of the masking mechanism.
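
A minimal sketch of the combined behavior described above, where a value
counts as NA if it is either masked or holds the NA signal. This is only an
illustration, not the NEP implementation: NaN stands in for the proposed NA
bit pattern, and effective_na is a hypothetical helper name.

```python
import numpy as np

# Assumption: NaN stands in for the proposed in-band NA signal.
data = np.array([[1.0, 2.0, np.nan],
                 [4.0, 5.0, 6.0],
                 [7.0, np.nan, 9.0]])
original = data.copy()

def effective_na(values, mask):
    """Hypothetical helper: missing if masked OR holding the NA signal."""
    return mask | np.isnan(values)

# Complete-case analysis: temporarily mask every row that has a missing value.
mask = np.zeros(data.shape, dtype=bool)
incomplete_rows = np.isnan(data).any(axis=1)
mask[incomplete_rows, :] = True
during = effective_na(data, mask)

# Unmask: the data was never overwritten, and the values that were
# genuinely missing remain missing.
mask[:] = False
after = effective_na(data, mask)
```

While the mask is on, all of rows 0 and 2 are treated as missing; once it is
cleared, only the two elements that hold the in-band signal are.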

> > One important reason I'm doing it this way is so that each NumPy
> > algorithm and any 3rd party code only needs to be updated once to
> > support both forms of missing data.
>
> Could you explain what you mean?  Maybe a couple of examples?
>

Yeah, I've started adding some implementation notes to the NEP. First I need
volunteers to review my current pull requests though. ;)

-Mark


>
> Whatever API results, it will surely be with us for a long time, and
> so it would be good to make sure we have the right one even if it
> costs a bit more to update current code.
>
> Cheers,
>
> Matthew
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>