[Numpy-discussion] missing data discussion round 2
Keith Goodman
kwgoodman@gmail....
Mon Jun 27 19:07:18 CDT 2011
On Mon, Jun 27, 2011 at 8:55 AM, Mark Wiebe <mwwiebe@gmail.com> wrote:
> First I'd like to thank everyone for all the feedback you're providing;
> clearly this is an important topic to many people, and the discussion has
> helped clarify the ideas for me. I've renamed and updated the NEP, then
> placed it into the master NumPy repository so it has a more permanent home
> here:
> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst
> In the NEP, I've tried to address everything that was raised in the original
> thread and in Nathaniel's followup 'Concepts' thread. To deal with the issue
> of whether a mask is True or False for a missing value, I've removed the
> 'mask' attribute entirely, except for ufunc-like functions np.ismissing and
> np.isavail which return the two styles of masks. Here's a high level summary
> of how I'm thinking of the topic, and what I will implement:
> Missing Data Abstraction
> There appear to be two useful ways to think about missing data that are
> worth supporting.
> 1) Unknown yet existing data
> 2) Data that doesn't exist
> In 1), an NA value causes outputs to become NA except in a small number of
> exceptions such as boolean logic, and in 2), operations treat the data as if
> there were a smaller array without the NA values.
> Temporarily Ignoring Data
> In some cases, it is useful to flag data as NA temporarily, possibly in
> several different ways, for particular calculations or testing out different
> ways of throwing away outliers. This is independent of the missing data
> abstraction, still requiring a choice of 1) or 2) above.
> Implementation Techniques
> There are two mechanisms generally used to implement missing data
> abstractions,
> 1) An NA bit pattern
> 2) A mask
> I've described a design in the NEP which can include both techniques using
> the same interface. The mask approach is strictly more general than the NA
> bit pattern approach, except for a few things like the idea of supporting
> the dtype 'NA[f8,InfNan]' which you can read about in the NEP.
> My intention is to implement the mask-based design, and possibly also
> implement the NA bit pattern design, but if anything gets cut it will be the
> NA bit patterns.
> Thanks again for all your input so far, and thanks in advance for your
> suggestions for improving this new revision of the NEP.
I'm trying to understand this part of the missing data NEP:
"While numpy.NA works to mask values, it does not itself have a dtype.
This means that returning the numpy.NA singleton from an operation
like 'arr[0]' would be throwing away the dtype, which is still
valuable to retain, so 'arr[0]' will return a zero-dimensional array
either with its value masked, or containing the NA bit pattern for the
array's dtype."
If I do something like this in Cython:
    cdef np.float64_t ai
    for i in range(n):
        ai = a[i]
        ...
Then I need to specify the type of ai, say float64 as above.
What happens when a[i] is np.NA? Is ai still a float64? If NA is a bit
pattern taken from float64 then a[i] could be float64, but if it is a
0d array then it would not be float64 and I assume I would run into
problems or have to cast.
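
To make the concern concrete, here is a small sketch in plain Python. It
does not use the NEP's API: NaN stands in for an NA bit pattern, and
today's numpy.ma stands in for a mask-based design, so the proposed
np.NA may well behave differently.

```python
import numpy as np

# Assumption: NaN plays the role of an NA bit pattern here.
# The element comes back as an ordinary float64 scalar.
a = np.array([1.0, np.nan, 3.0])
ai = a[1]
print(type(ai), np.isnan(ai))

# Assumption: numpy.ma plays the role of a mask-based design.
# Indexing a masked element returns the np.ma.masked constant,
# not a float64 scalar, so a typed assignment would need care.
m = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
mi = m[1]
print(mi is np.ma.masked)

# An unmasked element of the same array is still a float64 scalar.
print(type(m[0]))
```

In the first case a typed variable like ai keeps working; in the second
the indexing result is not a float64 at all, which is the situation I am
worried about.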
So what does all this mean for iterating over each element of an array
in Cython or C? Would I need to check the mask of element i first and
only assign to ai if the mask is True (meaning not missing)?
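
If the answer is yes, I imagine the loop would look something like the
sketch below. It is plain Python rather than Cython, and it borrows
numpy.ma (not the NEP's interface) to get a data array and a boolean
array. Note that numpy.ma marks a missing element with True, the
opposite of the convention in my question. The name sum_available is
made up for illustration.

```python
import numpy as np

def sum_available(values, missing):
    """Sum the elements of values whose missing flag is False."""
    total = 0.0
    for i in range(values.shape[0]):
        # Check the flag first; only read the element if it is available.
        if missing[i]:
            continue
        ai = values[i]  # a plain float64, safe to assign to a typed variable
        total += ai
    return total

# Stand-in for an array with missing data: numpy.ma, where True = missing.
a = np.ma.masked_array([1.0, 2.0, 4.0], mask=[False, True, False])
values = np.ma.getdata(a)        # underlying float64 data
missing = np.ma.getmaskarray(a)  # full boolean array, one flag per element
print(sum_available(values, missing))
```

The data and the flags are two separate arrays here, so the element
type never changes; the cost is one extra check per element.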