[Numpy-discussion] missing data discussion round 2
Dag Sverre Seljebotn
Wed Jun 29 13:07:45 CDT 2011
On 06/29/2011 07:38 PM, Mark Wiebe wrote:
> On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn wrote:
> On 06/29/2011 03:45 PM, Matthew Brett wrote:
> > Hi,
> > On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe wrote:
> >> On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett wrote:
> >>> Hi,
> >>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith wrote:
> >>> ...
> >>>> (You might think, what difference does it make if you *can*
> unmask an
> >>>> item? Us missing data folks could just ignore this feature. But:
> >>>> whatever we end up implementing is something that I will have to
> >>>> explain over and over to different people, most of them not
> >>>> particularly sophisticated programmers. And there's just no
> >>>> way to explain this idea that if you store some particular
> value, then
> >>>> it replaces the old value, but if you store NA, then the old
> value is
> >>>> still there.
> >>> Ouch - yes. No question, that is difficult to explain. Well, I
> >>> think the explanation might go like this:
> >>> "Ah, yes, well, that's because in fact numpy records missing
> values by
> >>> using a 'mask'. So when you say `a = np.NA', what you mean is,
> >>> 'a._mask = np.ones(a.shape, np.dtype(bool)); a._mask = False'"
> >>> Is that fair?
> >> My favorite way of explaining it would be to have a grid of
> numbers written
> >> on paper, then have several cardboards with holes poked in them
> in different
> >> configurations. Placing these cardboard masks in front of the
> grid would
> >> show different sets of non-missing data, without affecting the
> values stored
> >> on the paper behind them.
> > Right - but here of course you are trying to explain the mask, and
> > this is Nathaniel's point, that in order to explain NAs, you have to
> > explain masks, and so, even at a basic level, the fusion of the two
> > ideas is obvious, and already confusing. I mean this:
> > a = np.NA
> > "Oh, so you just set the a value to have some missing value code?"
> > "Ah - no - in fact what I did was set an associated mask in position
> > a so that you can't any longer see the previous value of a"
> > "Huh. You mean I have a mask for every single value in order to be
> > able to blank out a? It looks like an assignment. I mean, it
> > looks just like a = 4. But I guess it isn't?"
> > "Er..."
> > I think Nathaniel's point is a very good one - these are separate
> > ideas, np.NA and np.IGNORE, and a joint implementation is bound to
> > draw them together in the mind of the user. Apart from anything
> > else, the user has to know that, if they want a single NA value in an
> > array, they have to add a mask of size array.shape, in bytes. They have
> > to know then, that NA is implemented by masking, and then the 'NA for
> > free by adding masking' idea breaks down and starts to feel like a
> > kludge.
> > The counter argument is of course that, in time, the
> implementation of
> > NA with masking will seem as obvious and intuitive, as, say,
> > broadcasting, and that we are just reacting from lack of experience
> > with the new API.
> However, no matter how used we get to this, people coming from almost
> any other tool (in particular R) will keep thinking it is
> counter-intuitive. Why set up a major semantic incompatibility that
> people then have to overcome in order to start using NumPy?
> I'm not aware of a semantic incompatibility. I believe R doesn't support
> views like NumPy does, so the things you have to do to see masking
> semantics aren't even possible in R.
Well, whether the same feature is possible or not in R is irrelevant to
whether a semantic incompatibility would exist.
Views themselves are a *major* semantic incompatibility, and are highly
confusing at first to MATLAB/Fortran/R people. However they have major
advantages outweighing the disadvantage of having to caution new users.
But there's simply no precedent anywhere for an assignment that doesn't
erase the old value for a particular input value, and the advantages
seem pretty minor (well, I think it is ugly in its own right, but that
is beside the point...)
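
For concreteness, here is a minimal sketch of the kind of behaviour I
mean. It uses the existing numpy.ma module as a stand-in, since np.NA is
only a proposal at this point; that the proposed mask-based NA would
behave the same way is an assumption, not something the NEP guarantees.

```python
import numpy as np
import numpy.ma as ma

# numpy.ma is used here only as a stand-in for the proposed mask-based NA.
a = ma.masked_array([1.0, 2.0, 3.0, 4.0])

# Ordinary assignment: the old value 2.0 is overwritten and gone.
a[1] = 20.0

# "Assigning missing": with the default (soft) mask only the mask changes,
# and the stored value survives underneath.
a[2] = ma.masked
hidden_value = a.data[2]

# Clearing the mask makes the old value visible again.
a.mask = ma.nomask
restored_value = a[2]

print(hidden_value, restored_value)
```

So storing 20.0 destroys what was there, while storing the "missing"
marker does not. That asymmetry is what I expect to have to explain to
every newcomer.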