[Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Mark Wiebe
mwwiebe@gmail....
Sat Jun 25 15:35:54 CDT 2011
On Sat, Jun 25, 2011 at 9:44 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
> On Sat, Jun 25, 2011 at 10:25 AM, Charles R Harris
> <charlesr.harris@gmail.com> wrote:
> > On Sat, Jun 25, 2011 at 8:14 AM, Wes McKinney <wesmckinn@gmail.com>
> wrote:
> >>
> >> On Sat, Jun 25, 2011 at 12:42 AM, Charles R Harris
> >> <charlesr.harris@gmail.com> wrote:
> >> >
> >> >
> >> > On Fri, Jun 24, 2011 at 10:06 PM, Wes McKinney <wesmckinn@gmail.com>
> >> > wrote:
> >> >>
> >> >> On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith <njs@pobox.com>
> >> >> wrote:
> >> >> > On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root <ben.root@ou.edu>
> >> >> > wrote:
> >> >> >> On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith <njs@pobox.com>
> >> >> >> wrote:
> >> >> >>> This is a situation where I would just... use an array and a
> mask,
> >> >> >>> rather than a masked array. Then lots of things -- changing fill
> >> >>> values, temporarily masking/unmasking things, etc. -- come for
> >> >> >>> free,
> >> >> >>> just from knowing how arrays and boolean indexing work?
> >> >> >>
> >> >> >> With a masked array, it is "for free". Why re-invent the wheel?
> It
> >> >> >> has
> >> >> >> already been done for me.
> >> >> >
> >> >> > But it's not for free at all. It's an additional concept that has
> to
> >> >> > be maintained, documented, and learned (with the last cost, which
> is
> >> >> > multiplied by the number of users, being by far the greatest). It's
> >> >> > not reinventing the wheel, it's saying hey, I have wheels and
> axles,
> >> >> > but what I really need the library to provide is a wheel+axle
> >> >> > assembly!
> >> >>
> >> >> You're communicating my argument better than I am.
> >> >>
> >> >> >>> Do we really get much advantage by building all these complex
> >> >> >>> operations in? I worry that we're trying to anticipate and write
> >> >> >>> code
> >> >> >>> for every situation that users find themselves in, instead of
> just
> >> >> >>> giving them some simple, orthogonal tools.
> >> >> >>>
> >> >> >>
> >> >>> This is the danger, which is why I advocate retaining the
> >> >> >> MaskedArray
> >> >> >> type that would provide the high-level "intelligent" operations,
> >> >> >> meanwhile
> >> >> >> having in the core the basic data structures for pairing a mask
> >> >> >> with
> >> >> >> an
> >> >> >> array, and to recognize a special np.NA value that would act upon
> >> >> >> the
> >> >> >> mask
> >> >> >> rather than the underlying data. Users would get very basic
> >> >> >> functionality,
> >> >> >> while the MaskedArray would continue to provide the interface that
> >> >> >> we
> >> >> >> are
> >> >> >> used to.
> >> >> >
> >> >> > The interface as described is quite different... in particular, all
> >> >> > aggregate operations would change their behavior.
> >> >> >
> >> >> >>> As a corollary, I worry that learning and keeping track of how
> >> >> >>> masked
> >> >> >>> arrays work is more hassle than just ignoring them and writing
> the
> >> >> >>> necessary code by hand as needed. Certainly I can imagine that
> *if
> >> >> >>> the
> >> >> >>> mask is a property of the data* then it's useful to have tools to
> >> >> >>> keep
> >> >> >>> it aligned with the data through indexing and such. But some of
> >> >> >>> these
> >> >> >>> other things are quicker to reimplement than to look up the docs
> >> >> >>> for,
> >> >> >>> and the reimplementation is easier to read, at least for me...
> >> >> >>
> >> >> >> What you are advocating is similar to the "tried-n-true" coding
> >> >> >> practice of
> >> >> >> Matlab users of using NaNs. You will hear from Matlab programmers
> >> >> >> about how
> >> >> >> it is the greatest idea since sliced bread (and I was one of
> them).
> >> >> >> Then I
> >> >> was introduced to Numpy, and while I do sometimes still use the
> NaN
> >> >> >> approach, I realized that the masked array is a "better" way.
> >> >> >
> >> >> > Hey, no need to go around calling people Matlab programmers, you
> >> >> > might
> >> >> > hurt someone's feelings.
> >> >> >
> >> >> > But seriously, my argument is that every abstraction and new
> concept
> >> >> > has a cost, and I'm dubious that the full masked array abstraction
> >> >> > carries its weight and justifies this cost, because it's highly
> >> >> > redundant with existing abstractions. That has nothing to do with
> how
> >> >> > tried-and-true anything is.
> >> >>
> >> >> +1. I think I will personally only be happy if "masked array" can be
> >> >> implemented while incurring near-zero cost from the end user
> >> >> perspective. If what we end up with is a faster implementation of
> >> >> numpy.ma in C I'm probably going to keep on using NaN... That's why
> >> >> I'm entirely insistent that whatever design be dogfooded on
> non-expert
> >> >> users. If it's very much harder / trickier / nuanced than R, you will
> >> >> have failed.
> >> >>
> >> >
> >> > This sounds unduly pessimistic to me. It's one thing to suggest
> >> > different
> >> > approaches, another to cry doom and threaten to go eat worms. And all
> >> > before
> >> > the code is written, benchmarks run, or trial made of the usefulness
> of
> >> > the
> >> > approach. Let us see how things look as they get worked out. Mark has
> a
> >> > good
> >> > track record for innovative tools and I'm rather curious myself to see
> >> > what
> >> > the result is.
> >> >
> >> > Chuck
> >> >
> >> >
> >> > _______________________________________________
> >> > NumPy-Discussion mailing list
> >> > NumPy-Discussion@scipy.org
> >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >> >
> >> >
> >>
> >> I hope you're right. So far it seems that anyone who has spent real
> >> time with R (e.g. myself, Nathaniel) has expressed serious concerns
> >> about the masked approach. And we got into this discussion at the Data
> >> Array summit in Austin last month because we're trying to make Python
> >> more competitive with R vis-a-vis statistical and financial applications.
> >> I'm just trying to be (R)ealistic =P Remember that I very earnestly am
> >> doing everything I can these days to make scientific Python more
> >> successful in finance and statistics. One big difference with R's
> >> approach is that we care more about performance than the R community
> >> does. So maybe having special NA values will be prohibitive for that
> >> reason.
> >>
> >> Mark indeed has a fantastic track record and I've been extremely
> >> impressed with his NumPy work, so I've no doubt he'll do a good job. I
> >> just hope that you don't push aside my input-- my opinions are formed
> >> entirely based on my domain experience.
> >>
> >
> > I think what we really need to see are the use cases and work flow. The
> ones
> > that hadn't occurred to me before were memory mapped files and data
> stored
> > on disk in general. I think we may need some standard format for masked
> data
> > on disk if we don't go the NA value route.
> >
> > Chuck
> >
> >
> >
>
> Here are some things I can think of that would be affected by any changes
> here
>
> 1) Right now users of pandas can type pandas.isnull(series[5]) and
> that will yield True if the value is NA for any dtype. This might be
> hard to support in the masked regime
>
I think this would map to np.ismissing(series[5]). What you want probably
depends on whether series[5] represents a single value, a struct dtype
value, or is itself an array.
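As a point of reference, today's numpy.ma already offers a scalar check along these lines; the proposed np.ismissing does not exist yet, but np.ma.is_masked is the existing analogue. A minimal sketch:

```python
import numpy as np

# A scalar pulled out of a masked array is the np.ma.masked constant
# when its mask bit is set; np.ma.is_masked detects this, much like
# the pandas.isnull(series[5]) usage described above.
series = np.ma.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                     mask=[False, False, False, False, False, True])

print(np.ma.is_masked(series[5]))  # True  -> the value is "NA"
print(np.ma.is_masked(series[0]))  # False -> an ordinary value
```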
> 2) Functions like {Series, DataFrame}.fillna would hopefully look just
> like this:
>
> # value is 0 or some other value to fill
> new_series = self.copy()
> new_series[isnull(new_series)] = value
>
That should work fine, yes.
> Keep in mind that people will write custom NA handling logic. So they might
> do:
>
> series[isnull(other_series) & isnull(other_series2)] = val
>
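For what it's worth, this fillna-style pattern already works with today's numpy.ma; a minimal sketch (the fill `value` is illustrative):

```python
import numpy as np

# fillna-style pattern with numpy.ma: copy, then overwrite the
# masked ("NA") positions with a fill value. Assigning an ordinary
# value to a masked slot also clears its mask bit.
series = np.ma.array([1.0, -1.0, 3.0], mask=[False, True, False])
value = 0.0

new_series = series.copy()
new_series[np.ma.getmaskarray(new_series)] = value

print(float(new_series[1]))            # 0.0
print(np.ma.is_masked(new_series[1]))  # False
```

The custom-logic variant works the same way: combine the mask arrays of several series with `&` and assign through the resulting boolean index.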
> 3) Nulling / NA-ing out data is very common
>
> # null out this data up to and including date1 in these three columns
> frame.ix[:date1, [col1, col2, col3]] = NaN
>
With np.NA instead of NaN, I think it would give what you want.
>
> # But this should work fine too
> frame.ix[:date1, [col1, col2, col3]] = 0
>
Under the hood, this would be unmasking and setting the appropriate values.
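numpy.ma already behaves this way, for comparison: assigning the masked constant sets the mask, and assigning an ordinary value clears it again. A small sketch:

```python
import numpy as np

# Masking and unmasking via plain assignment in numpy.ma.
a = np.ma.array([1.0, 2.0, 3.0])

a[0] = np.ma.masked            # mask the first element ("set it to NA")
print(np.ma.is_masked(a[0]))   # True

a[0] = 0.0                     # assigning a value unmasks it again
print(np.ma.is_masked(a[0]))   # False
print(float(a[0]))             # 0.0
```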
> I'll try to think of some others. The main thing is that the NA value
> is very easy to think about and fits in naturally with how people (at
> least statistical / financial users) think about and work with data.
> If you have to say "I have to set these mask locations to True" it
> introduces additional mental effort compared with "I'll just set these
> values to NA"
>
This is exactly what I mean when I'm talking about implementation details
versus interface choices. With enough use cases like you've given here, I'm
hoping to get that interface right.
-Mark