[Numpy-discussion] missing data discussion round 2
Mon Jun 27 16:03:25 CDT 2011
On Mon, Jun 27, 2011 at 12:18 PM, Matthew Brett <email@example.com> wrote:
> On Mon, Jun 27, 2011 at 5:53 PM, Charles R Harris
> <firstname.lastname@example.org> wrote:
> > On Mon, Jun 27, 2011 at 9:55 AM, Mark Wiebe <email@example.com> wrote:
> >> First I'd like to thank everyone for all the feedback you're providing,
> >> clearly this is an important topic to many people, and the discussion
> >> helped clarify the ideas for me. I've renamed and updated the NEP, then
> >> placed it into the master NumPy repository so it has a more permanent home
> >> here:
> >> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst
> >> In the NEP, I've tried to address everything that was raised in the
> >> original thread and in Nathaniel's followup 'Concepts' thread. To deal with
> >> the issue of whether a mask is True or False for a missing value, I've
> >> removed the 'mask' attribute entirely, except for ufunc-like functions
> >> np.ismissing and np.isavail which return the two styles of masks. Here's a
> >> high level summary of how I'm thinking of the topic, and what I will
> >> implement:
> >> Missing Data Abstraction
> >> There appear to be two useful ways to think about missing data that are
> >> worth supporting.
> >> 1) Unknown yet existing data
> >> 2) Data that doesn't exist
> >> In 1), an NA value causes outputs to become NA except in a small number of
> >> exceptions such as boolean logic, and in 2), operations treat the data as if
> >> there were a smaller array without the NA values.
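A minimal sketch of the two abstractions, assuming NaN as a stand-in for the proposed NA value (the NEP's NA itself is not used here, and the boolean-logic exceptions are not shown):

```python
import numpy as np

# Assumption: NaN stands in for NA; this is not the NEP's API.
a = np.array([1.0, np.nan, 3.0])

# 1) "Unknown yet existing data": the missing value propagates to the output.
propagated = np.sum(a)      # result is NaN

# 2) "Data that doesn't exist": compute as if the array were smaller.
skipped = np.nansum(a)      # sums only 1.0 and 3.0

print(propagated, skipped)
```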
> >> Temporarily Ignoring Data
> >> In some cases, it is useful to flag data as NA temporarily, possibly in
> >> several different ways, for particular calculations or testing out
> >> ways of throwing away outliers. This is independent of the missing data
> >> abstraction, still requiring a choice of 1) or 2) above.
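As a rough illustration of temporary flagging, the existing numpy.ma module can apply several different masks to the same data without modifying it. The outlier rules below are hypothetical, and numpy.ma follows the "skip" style of choice 2) rather than the NEP's design:

```python
import numpy as np

data = np.array([1.0, 2.0, 100.0, 3.0, -50.0])

# Two hypothetical outlier rules, each expressed as a temporary mask.
# The underlying data array is never modified.
drop_high = np.ma.masked_array(data, mask=data > 10)
drop_far = np.ma.masked_array(data, mask=np.abs(data) > 10)

mean_high = drop_high.mean()   # mean over 1, 2, 3, -50
mean_far = drop_far.mean()     # mean over 1, 2, 3

print(mean_high, mean_far)
```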
> >> Implementation Techniques
> >> There are two mechanisms generally used to implement missing data
> >> abstractions,
> >> 1) An NA bit pattern
> >> 2) A mask
> >> I've described a design in the NEP which can include both techniques under
> >> the same interface. The mask approach is strictly more general than the
> >> bit pattern approach, except for a few things like the idea of
> >> the dtype 'NA[f8,InfNan]' which you can read about in the NEP.
> >> My intention is to implement the mask-based design, and possibly also
> >> implement the NA bit pattern design, but if anything gets cut it will be
> >> NA bit patterns.
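The two mechanisms and their storage cost can be sketched as follows. This assumes NaN as the NA bit pattern and a one-byte-per-element boolean mask; the NEP's actual mask layout may differ:

```python
import numpy as np

n = 1_000_000
values = np.zeros(n, dtype=np.float64)

# Bit-pattern style: missingness is stored inside the value itself (here NaN).
bitpattern = values.copy()
bitpattern[::10] = np.nan
bitpattern_bytes = bitpattern.nbytes          # 8 bytes per element

# Mask style: a separate boolean array marks the missing entries.
mask = np.zeros(n, dtype=bool)
mask[::10] = True
masked_bytes = values.nbytes + mask.nbytes    # 8 + 1 bytes per element

print(bitpattern_bytes, masked_bytes)
```

Under these assumptions the mask adds a fixed one byte per element, whatever the dtype. It also works for dtypes such as integers, which have no spare bit pattern like NaN.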
> > I have the impression that the mask-based design would be easier. Perhaps
> > you could do that one first and folks could try out the API and see how they
> > like it and discover whether the memory overhead is a problem in practice.
> That seems like a risky strategy to me, as the most likely outcome is
> that people worried about memory will avoid masked arrays because they
> know they use more memory. The memory usage is predictable and we
> won't learn any more about it from use. Most of us already know if
> we're having to optimize code for memory.
> You won't get complaints, you'll just lose a group of users, who will,
> I suspect, stick to NaNs, unsatisfactory as they are.
This blade cuts both ways: we'd lose a group of users if we don't support
masking semantics, too.
That said, Travis favors doing both, so there's a good chance there will be
time for it.
> See you,