[Numpy-discussion] missing data discussion round 2
eat
e.antero.tammi@gmail....
Mon Jun 27 12:44:21 CDT 2011
Hi,
On Mon, Jun 27, 2011 at 6:55 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:
> First I'd like to thank everyone for all the feedback you're providing.
> Clearly this is an important topic to many people, and the discussion has
> helped clarify the ideas for me. I've renamed and updated the NEP, then
> placed it into the master NumPy repository so it has a more permanent home
> here:
>
> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst
>
> In the NEP, I've tried to address everything that was raised in the
> original thread and in Nathaniel's followup 'Concepts' thread. To deal with
> the issue of whether a mask is True or False for a missing value, I've
> removed the 'mask' attribute entirely, except for ufunc-like functions
> np.ismissing and np.isavail which return the two styles of masks. Here's a
> high level summary of how I'm thinking of the topic, and what I will
> implement:
>
> *Missing Data Abstraction*
>
> There appear to be two useful ways to think about missing data that are
> worth supporting.
>
> 1) Unknown yet existing data
> 2) Data that doesn't exist
>
> In 1), an NA value causes outputs to become NA, with a small number of
> exceptions such as boolean logic. In 2), operations treat the data as if
> there were a smaller array without the NA values.
>
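To make the two abstractions concrete, here is a minimal sketch that uses NaN as a stand-in for NA. It assumes np.mean and np.nanmean are fair analogues of 1) and 2); neither is the API proposed in the NEP.

```python
import numpy as np

# NaN stands in for NA here; this is an analogy, not the NEP's API.
a = np.array([1.0, np.nan, 3.0])

# 1) Unknown yet existing data: the missing value propagates to the result.
propagated = np.mean(a)

# 2) Data that doesn't exist: behaves like the smaller array [1.0, 3.0].
skipped = np.nanmean(a)

print(propagated, skipped)
```
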
> *Temporarily Ignoring Data*
>
> In some cases, it is useful to flag data as NA temporarily, possibly in
> several different ways, for particular calculations or testing out different
> ways of throwing away outliers. This is independent of the missing data
> abstraction, still requiring a choice of 1) or 2) above.
>
> *Implementation Techniques*
>
> There are two mechanisms generally used to implement missing data
> abstractions:
>
> 1) An NA bit pattern
> 2) A mask
>
> I've described a design in the NEP which can include both techniques using
> the same interface. The mask approach is strictly more general than the NA
> bit pattern approach, except for a few things like the idea of supporting
> the dtype 'NA[f8,InfNan]' which you can read about in the NEP.
>
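A rough illustration of the two mechanisms, again only by analogy: NaN plays the role of an NA bit pattern, and numpy.ma plays the role of a mask. The variable names are made up for the example, and nothing here is the NEP's interface.

```python
import numpy as np

# Bit pattern: a reserved value inside the element itself (here NaN)
# marks it as missing. This only works for dtypes with a spare pattern.
values_f = np.array([1.0, np.nan, 3.0])
missing_f = np.isnan(values_f)

# Mask: a separate boolean array marks missing elements, so it also
# works for integer data, which has no spare bit pattern to reserve.
values_i = np.ma.array([1, 2, 3], mask=[False, True, False])
missing_i = np.ma.getmaskarray(values_i)

print(missing_f.tolist(), missing_i.tolist())
print(values_i.sum())  # the masked element is left out of the sum
```

The integer case is one way to see why the mask approach is described as the more general of the two.
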
> My intention is to implement the mask-based design, and possibly also
> implement the NA bit pattern design, but if anything gets cut it will be the
> NA bit patterns.
>
> Thanks again for all your input so far, and thanks in advance for your
> suggestions for improving this new revision of the NEP.
>
A very impressive NEP indeed.
However, how would corner cases, like
>>> a = np.array([np.NA, np.NA], dtype='f8', masked=True)
>>> np.mean(a, skipna=True)
>>> np.mean(a)
be handled?
My concern here is that there always seem to be corner cases like this, which
can only be handled with specific knowledge of the context. Producing 100%
generic code to handle 'missing data' therefore does not seem doable.
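For comparison, here is a sketch of how the closest existing analogues treat the all-missing case. It assumes NaN as a stand-in for NA, np.nanmean as a stand-in for skipna=True, and numpy.ma as a stand-in for a masked array; the NEP's own behaviour may differ.

```python
import warnings
import numpy as np

a = np.array([np.nan, np.nan])  # NaN as a stand-in for NA

with warnings.catch_warnings():
    # np.nanmean warns ("Mean of empty slice") when nothing is left.
    warnings.simplefilter("ignore", RuntimeWarning)
    skip_result = np.nanmean(a)   # analogue of np.mean(a, skipna=True)

plain_result = np.mean(a)         # analogue of np.mean(a): NaN propagates

m = np.ma.masked_all(2, dtype="f8")
masked_result = m.mean()          # numpy.ma hands back a masked result

print(skip_result, plain_result, np.ma.is_masked(masked_result))
```

Both routes have to pick some answer for the mean of zero elements, and whether that answer is appropriate depends on the calling context.
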
Thanks,
- eat
>
> -Mark