[Numpy-discussion] missing data discussion round 2
Thu Jun 30 11:13:00 CDT 2011
On Thu, Jun 30, 2011 at 11:04 AM, Gary Strangman <firstname.lastname@example.org> wrote:
> Clearly there are some overlaps between what masked arrays are
>> trying to achieve and what R's NA mechanisms are trying to achieve.
>> Are they really similar enough that they should function using
>> the same API?
>> And if so, won't that be confusing?
>> No, I don't believe so, any more than NA's in R, NaN's, or Inf's are
> As one who's been silently following (most of) this thread, and a heavy R
> and numpy user, perhaps I should chime in briefly here with a use case. I
> more-or-less always work with partially masked data, like Matthew, but not
> numpy masked arrays because the memory overhead is prohibitive. And, sad to
> say, my experiments don't always go perfectly. I therefore have arrays in
> which there is /both/ (1) data that is simply missing (np.NA?)--it never had
> a value and never will--as well as simultaneously (2) data that is
> temporarily masked (np.IGNORE? np.MASKED?) where I want to mask/unmask
> different portions for different purposes/analyses. I consider these two
> separate, completely independent issues and I unfortunately currently have
> to kluge a lot to handle this.
> Concretely, consider a list of 100,000 observations (rows), with 12
> measures per observation-row (a 100,000 x 12 array). Every now and then,
> sprinkled throughout this array, I have missing values (someone didn't
> answer a question, or a computer failed to record a response, or whatever).
> For some analyses I want to mask the whole row (e.g., complete-case
> analysis), leaving me with array entries that should be tagged with all 4
> possible labels:
> 1) not masked, not missing
> 2) masked, not missing
> 3) not masked, missing
> 4) masked, missing
> Obviously #4 is "overkill" ... but only until I want to unmask that row. At
> that point, I need to be sure that missing values remain missing when
> unmasked. Can a single API really handle this?
The single API does support a masked array with an NA dtype, and the
behavior in this case will be that the value is considered NA if either it
is masked or the value is the NA bit pattern. So you could add a mask to an
array with an NA dtype to temporarily treat the data as if more values were
missing.
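To make the rule above concrete, here is a minimal sketch of the "NA if either masked or holding the NA bit pattern" behavior for the complete-case example in the quoted message. It uses only plain NumPy and is not the proposed API: NaN stands in for the NA bit pattern, a separate boolean array stands in for the mask, and the names `missing`, `row_mask`, `masked`, and `na` are hypothetical.

```python
import numpy as np

# Assumption: NaN stands in for the proposed "NA bit pattern".
values = np.array([[1.0, 2.0],
                   [np.nan, 4.0],
                   [5.0, 6.0]])

# Permanently missing entries (the value itself says "NA").
missing = np.isnan(values)

# Complete-case analysis: temporarily mask every row that has a missing entry.
row_mask = missing.any(axis=1)
masked = np.broadcast_to(row_mask[:, None], values.shape)

# An element counts as NA if it is masked OR holds the NA stand-in.
na = missing | masked

# Column means over the rows that survive the mask.
complete_case_means = values[~row_mask].mean(axis=0)

# Unmasking: drop the mask entirely. Missing values stay missing,
# while the merely masked value 4.0 becomes visible again.
na_after = missing | np.zeros(values.shape, dtype=bool)
```

Because the mask lives apart from the stored values in this sketch, removing it cannot turn a missing entry back into data, which is the property the "masked, missing" label in the quoted message asks for.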
One important reason I'm doing it this way is so that each NumPy algorithm
and any 3rd party code only needs to be updated once to support both forms
of missing data. The C API with masks is also a lot cleaner to work with
than one for NA dtypes with the ability to have different NA bit patterns.
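As a rough illustration of the "updated once" argument, the sketch below shows one reduction that handles both forms of missing data through a single code path. It is an assumption-laden stand-in in plain NumPy, not the C API being described: NaN plays the role of the NA bit pattern, and the function name `mean_skipping_na` and its `masked` parameter are hypothetical.

```python
import numpy as np

def mean_skipping_na(values, masked=None):
    """Mean over elements that are neither NaN (NA stand-in) nor masked."""
    values = np.asarray(values, dtype=float)
    # Form 1: the value itself marks the element as NA.
    na = np.isnan(values)
    # Form 2: an optional boolean mask marks further elements as NA.
    if masked is not None:
        na = na | np.asarray(masked, dtype=bool)
    if na.all():
        return float("nan")  # nothing left to average
    return float(values[~na].mean())

# The same function serves sentinel-only, mask-only, and combined inputs.
only_sentinel = mean_skipping_na([2.0, np.nan, 4.0])
only_mask = mean_skipping_na([2.0, 100.0, 4.0], masked=[False, True, False])
both = mean_skipping_na([2.0, np.nan, 4.0, 6.0],
                        masked=[False, False, False, True])
```

Both forms are folded into one boolean array before any arithmetic happens, so the reduction itself never needs to know which form was used.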