[Numpy-discussion] missing data discussion round 2
Mon Jun 27 12:44:21 CDT 2011
On Mon, Jun 27, 2011 at 6:55 PM, Mark Wiebe <email@example.com> wrote:
> First I'd like to thank everyone for all the feedback you're providing,
> clearly this is an important topic to many people, and the discussion has
> helped clarify the ideas for me. I've renamed and updated the NEP, then
> placed it into the master NumPy repository so it has a more permanent home.
> In the NEP, I've tried to address everything that was raised in the
> original thread and in Nathaniel's followup 'Concepts' thread. To deal with
> the issue of whether a mask is True or False for a missing value, I've
> removed the 'mask' attribute entirely, except for ufunc-like functions
> np.ismissing and np.isavail which return the two styles of masks. Here's a
> high level summary of how I'm thinking of the topic, and what I will
> *Missing Data Abstraction*
> There appear to be two useful ways to think about missing data that are
> worth supporting.
> 1) Unknown yet existing data
> 2) Data that doesn't exist
> In 1), an NA value causes outputs to become NA except in a small number of
> exceptions such as boolean logic, and in 2), operations treat the data as if
> there were a smaller array without the NA values.
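A minimal sketch of the two abstractions above, in plain Python rather than NumPy. The `NA` sentinel and both function names are hypothetical stand-ins chosen for illustration, not the API proposed in the NEP:

```python
# Hypothetical stand-in for np.NA; this is not NumPy's API.
NA = object()


def total_propagate(values):
    """Abstraction 1: an unknown input makes the result unknown."""
    if any(v is NA for v in values):
        return NA
    return sum(values)


def total_skip(values):
    """Abstraction 2: behave as if the NA entries were not in the array."""
    return sum(v for v in values if v is not NA)


data = [1.0, NA, 3.0]
print(total_propagate(data) is NA)  # True
print(total_skip(data))             # 4.0
```

The same input gives two different answers, which is why a choice between 1) and 2) has to be made for every operation.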
> *Temporarily Ignoring Data*
> In some cases, it is useful to flag data as NA temporarily, possibly in
> several different ways, for particular calculations or testing out different
> ways of throwing away outliers. This is independent of the missing data
> abstraction, still requiring a choice of 1) or 2) above.
> *Implementation Techniques*
> There are two mechanisms generally used to implement missing data:
> 1) An NA bit pattern
> 2) A mask
> I've described a design in the NEP which can include both techniques using
> the same interface. The mask approach is strictly more general than the NA
> bit pattern approach, except for a few things like the idea of supporting
> the dtype 'NA[f8,InfNan]' which you can read about in the NEP.
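To make the difference between the two techniques concrete, here is a rough sketch in plain Python. The specific bit pattern (a quiet NaN with an arbitrary payload) and all names are assumptions made for this example, not what the NEP specifies:

```python
import struct

# Hypothetical NA bit pattern: a quiet NaN carrying an arbitrary payload.
NA_BITS = 0x7FF80000000007A2
NA_FLOAT = struct.unpack("<d", struct.pack("<Q", NA_BITS))[0]


def is_na_bits(x):
    """Bit-pattern technique: NA is a reserved value stored in the data itself."""
    return struct.unpack("<Q", struct.pack("<d", x))[0] == NA_BITS


# Technique 1: missingness lives inside the values, so the slot holds nothing else.
bitpattern_data = [1.0, NA_FLOAT, float("nan")]
missing_from_bits = [is_na_bits(x) for x in bitpattern_data]

# Technique 2: values are left untouched and a parallel mask records missingness.
masked_data = [1.0, 2.0, float("nan")]
missing_from_mask = [False, True, False]
```

With the mask, the underlying 2.0 is still there and can be unmasked later, which is what makes temporarily ignoring data possible. With the bit pattern, the original value is overwritten, but no extra storage is needed.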
> My intention is to implement the mask-based design, and possibly also
> implement the NA bit pattern design, but if anything gets cut it will be the
> NA bit patterns.
> Thanks again for all your input so far, and thanks in advance for your
> suggestions for improving this new revision of the NEP.
A very impressive NEP indeed.
However, how would corner cases like the following be handled?
>>> a = np.array([np.NA, np.NA], dtype='f8', masked=True)
>>> np.mean(a, skipna=True)
My concern here is that there always seem to be such corner cases which can
only be handled with specific context knowledge. Thus producing 100% generic
code to handle 'missing data' is not doable.
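A small plain-Python sketch of why this case needs a decision from the caller. The `NA` sentinel, the function name and the `strict` flag are all hypothetical, and this is not how NumPy behaves or is proposed to behave:

```python
NA = None  # hypothetical stand-in for np.NA


def mean_skipna(values, strict=False):
    """Mean over the non-NA entries; the all-NA case is a policy choice."""
    present = [v for v in values if v is not NA]
    if present:
        return sum(present) / len(present)
    # Nothing is left to average, so there is no obviously right answer.
    if strict:
        raise ValueError("mean of an all-NA array is undefined")
    return NA


print(mean_skipna([1.0, NA, 3.0]))  # 2.0
print(mean_skipna([NA, NA]) is NA)  # True
```

Whether the all-NA case should return NA, raise, or fall back to some default depends on the application, which is the context knowledge referred to above.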