[Numpy-discussion] in the NA discussion, what can we agree on?

Benjamin Root ben.root@ou....
Wed Nov 2 19:25:15 CDT 2011


On Wed, Nov 2, 2011 at 6:37 PM, Nathaniel Smith <njs@pobox.com> wrote:

> Hi again,
>
> Okay, here's my attempt at an *uncontroversial* email!
>
> Specifically, I think it'll be easier to talk about this NA stuff if
> we can establish some common ground, and easier for people to follow
> if the basic points of agreement are laid out in one place. So I'm
> going to try and summarize just the things that we can agree about.
>
> Note that right now I'm *only* talking about what kind of tools we
> want to give the user -- i.e., what kind of problems we are trying to
> solve. AFAICT we don't have as much consensus on implementation
> matters, and anyway it's hard to make implementation decisions without
> knowing what we're trying to accomplish.
>
> 1) I think we have consensus that there are (at least) two different
> possible ways of thinking about this problem, with somewhat different
> constituencies. Let's call these two concepts "MISSING data" and
> "IGNORED data".
>
> 2) I also think we have at least a rough consensus on what these
> concepts mean, and what their supporters want from them:
>
> MISSING data:
> - Conceptually, MISSINGness acts like a property of a datum --
> assigning MISSING to a location is like assigning any other value to
> that location
> - Ufuncs and other operations must propagate these values by default,
> and there must be an option to cause them to be ignored
> - Must be competitive with NaNs in terms of speed and memory usage (or
> else people will just use NaNs)
> - Compatibility with R is valuable
> - To avoid user confusion, ideally it should *not* be possible to
> 'unmask' a missing value, since this is inconsistent with the "missing
> value" metaphor (e.g., see Wes's comment about "leaky abstractions")
> - Possible useful extension: having different classes of missing
> values (similar to Stata)
> - Target audience: data analysis with missing data, neuroimaging,
> econometrics, former R users, ...
>
> IGNORED data:
> - Conceptually, IGNOREDness acts like a property of the array --
> toggling a location to be IGNORED is kind of vaguely similar to
> changing an array's shape
> - Ufuncs and other operations must ignore these values by default, and
> there doesn't really need to be a way to propagate them, even as an
> option (though it probably wouldn't hurt either)
> - Some memory overhead is inevitable and acceptable
> - Compatibility with R neither possible nor valuable
> - Ability to toggle the IGNORED state of a location is critical, and
> should be as convenient as possible
> - Possible useful extension: having not just different types of
> ignored values, but richer ways to combine them -- e.g., the example
> of combining astronomical images with some kind of associated
> per-pixel quality scores, where one might want the 'mask' to be not
> just a boolean IGNORED/not-IGNORED flag, but an integer (perhaps a
> multi-byte integer) or even a float, and to allow these 'masks' to be
> combined in some more complex way than just logical_and.
> - Target audience: anyone who's already doing this kind of thing by
> hand using a second mask array + boolean indexing, former numpy.ma
> users, matplotlib, ...
>
> 3) And perhaps we can all agree that the biggest *un*resolved question
> is whether we want to:
> - emphasize the similarities between these two use cases and build a
> single interface that can handle both concepts, with some compromises
> - or, treat these as two mostly-separate features that can each become
> exactly what the respective constituency wants without compromise --
> but with some potential redundancy and extra code.
> Each approach has advantages and disadvantages.
>
> Does that seem like a fair summary? Anything more we can add? Most
> importantly, anything here that you disagree with? Did I summarize
> your needs well? Do you have a use case that you feel doesn't fit
> naturally into either category?
>
> [Also, I thought this might make the start of a good wiki page for
> people to reference during these discussions, but I don't seem to have
> edit rights. If other people agree, maybe someone could put it up, or
> give me access? My trac id is njs@pobox.com.]
>
> Thanks,
> -- Nathaniel
>

I want to pare this down even more.  I think the above lists make too many
unneeded extrapolations.

MISSING data:
- Conceptually, MISSINGness acts like a property of a datum --
assigning MISSING to a location is like assigning any other value to
that location
- Ufuncs and other operations must propagate these values by default,
and there must be an option to cause them to be ignored
- Assigning MISSING is destructive
- Must be competitive with NaNs in terms of speed and memory usage (or
else people will just use NaNs)
- Target audience: data analysis with missing data, neuroimaging,
econometrics, former R users, ...


- Possible useful extension: having different classes of missing
values (similar to Stata)
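
To make the MISSING bullets concrete, here is a minimal sketch that uses NaN
as a stand-in for MISSING. This is only an analogy built on existing NumPy
calls (np.sum, np.nansum), not a proposed API, and NaN only covers floating
point data:

```python
import numpy as np

# NaN is used here as a stand-in for MISSING (an assumption for illustration).
data = np.array([1.0, 2.0, 3.0, 4.0])

# Assigning MISSING is destructive: the original 2.0 is overwritten and
# cannot be recovered from the array afterwards.
data[1] = np.nan

# Default behaviour: the missing value propagates through the reduction.
propagated = np.sum(data)

# Opt-in behaviour: the missing value is skipped.
skipped = np.nansum(data)

print(propagated)  # nan
print(skipped)     # 8.0
```

The point of the sketch is the default: propagation happens unless the caller
explicitly asks for the value to be skipped.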


IGNORED data:
- Conceptually, IGNOREDness acts like a property of the array --
toggling a location to be IGNORED is kind of vaguely similar to
changing an array's shape
- Ufuncs and other operations must ignore these values by default, and
there doesn't really need to be a way to propagate them, even as an
option (though it probably wouldn't hurt either)
- Assigning IGNORE is non-destructive
- Must be competitive with np.ma for speed and memory (or else users would
just use np.ma)
- Target audience: anyone who's already doing this kind of thing by
hand using a second mask array + boolean indexing, former numpy.ma
users, matplotlib, ...


- Possible useful extension: having not just different types of
ignored values, but richer ways to combine them -- e.g., the example
of combining astronomical images with some kind of associated
per-pixel quality scores, where one might want the 'mask' to be not
just a boolean IGNORED/not-IGNORED flag, but an integer (perhaps a
multi-byte integer) or even a float, and to allow these 'masks' to be
combined in some more complex way than just logical_and.
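
As a rough illustration of the IGNORED bullets, the sketch below uses the
existing numpy.ma module. The per-pixel quality scores, the "higher score is
worse" convention, and the threshold are all made up for the example:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])
arr = np.ma.masked_array(values, mask=[False, True, False, False])

# Default behaviour: the IGNORED element is skipped by the reduction.
ignored_sum = arr.sum()

# Assigning IGNORE is non-destructive: the payload is still stored.
payload = np.ma.getdata(arr)[1]

# Toggling the IGNORED state back off makes the value visible again.
arr.mask = np.zeros(4, dtype=bool)
restored_sum = arr.sum()

# Possible extension: integer quality scores instead of a boolean mask.
# Assumption: a higher score means worse quality, and two score arrays are
# combined by taking the worse score per element rather than a logical op.
quality_a = np.array([0, 3, 1, 0], dtype=np.uint8)
quality_b = np.array([2, 0, 1, 0], dtype=np.uint8)
combined = np.maximum(quality_a, quality_b)

threshold = 2  # hypothetical cut-off: scores >= 2 are ignored
filtered = np.ma.masked_array(values, mask=combined >= threshold)

print(ignored_sum)     # 8.0
print(payload)         # 2.0
print(restored_sum)    # 10.0
print(filtered.sum())  # 7.0
```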



Then, as a third-party module developer, I can tell you that having
separate and independent ways to detect "MISSING" and "IGNORED" values would
likely make support more difficult. Third-party code would greatly benefit
from a common (or easily combinable) method of identification.
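
A sketch of what I mean by a combinable check, using only what exists today
(NaN standing in for MISSING, a numpy.ma mask standing in for IGNORED). The
helper name is_unavailable is hypothetical and is not part of NumPy:

```python
import numpy as np

def is_unavailable(a):
    """Hypothetical helper: True where an element is masked or NaN.

    Works on plain ndarrays and on numpy.ma masked arrays alike, so
    third-party code needs only one check for both concepts.
    """
    mask = np.ma.getmaskarray(a)   # all False for a plain ndarray
    data = np.ma.getdata(a)        # underlying values, mask stripped
    if np.issubdtype(data.dtype, np.inexact):
        nan = np.isnan(data)
    else:
        # Integer and boolean arrays cannot hold NaN.
        nan = np.zeros(data.shape, dtype=bool)
    return mask | nan

plain = np.array([1.0, np.nan, 3.0])
masked = np.ma.masked_array([1.0, np.nan, 3.0], mask=[True, False, False])

print(is_unavailable(plain))   # [False  True False]
print(is_unavailable(masked))  # [ True  True False]
```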

Ben Root

P.S. - I took out the phrase "compatibility with R" not as a slight against
R, but because the statement is vague.  Does it mean raw binary data format
compatibility? Some sort of ABI compatibility (can R and Python call each
other and pass data back and forth)? Rather, I find declaring R users as a
target audience *much* more important, and it allows more flexibility in
achieving that goal for both forms of data.