[Numpy-discussion] missing data: semantics

Lluís xscript@gmx....
Thu Jun 30 12:46:27 CDT 2011


Ok, I think it's time to step back and reformulate the problem by
completely ignoring the implementation.

Here we have 2 "generic" concepts (i.e., applicable to R), plus another
extra concept that is exclusive to numpy:

* Assigning np.NA to an array, cannot be undone unless through explicit
  assignment (i.e., assigning a new arbitrary value, or saving a copy of
  the original array before assigning np.NA).

* np.NA values propagate by default, unless ufuncs have the "skipna =
  True" argument (or the other way around, it doesn't really matter to
  this discussion). In order to avoid passing the argument on each
  ufunc, we either have some per-array variable for the default "skipna"
  value (undesirable) or we can make a trivial ndarray subclass that
  will set the "skipna" argument on all ufuncs through the
  "_ufunc_wrapper_" mechanism.



Now, numpy has the concept of views, which adds some more goodies to the
list of concepts:

* With views, two arrays can share the same physical data, so that
  assignments to any of them will be seen by others (including NA
  values).

The creation of a view is explicitly stated by the user, so its
behaviour should not be perceived as odd (after all, you asked for a
view).

The good thing is that with views you can avoid costly array copies if
you're careful when writing into these views.



Now, you can add a new concept: local/temporal/transient missing data.

We can take an existing array and create a view with the new argument
"transientna = True".

Here, both the view and the "transientna = True" are explicitly stated
by the user, so it is assumed that she already knows what this is all
about.

The difference with a regular view is that you also explicitly asked for
local/temporal/transient NA values.

* Assigning np.NA to an array view with "transientna = True" will
  *not* be seen by any of the other views (nor the "original" array),
  but anything else will still work "as usual".

After all, this is what *you* asked for when using the "transientna =
True" argument.



To conclude, say that others *must not* care about whether the arrays
they're working with have transient NA values. This way, I can create a
view with transient NAs, set to NA some uninteresting data, and pass it
to a routine written by someone else that sets to NA elements that, for
example, are beyond certain threshold from the mean of the elements.

This would be equivalent to storing a copy of the original array before
passing it to this 3rd party function, only that "transientna", just as
views, provide some handy shortcuts to avoid copies.


My main point here is that views and local/temporal/transient NAs are
all *explicitly* requested, so that its behaviour should not appear as
something unexpected.

Is there an agreement on this?


Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


More information about the NumPy-Discussion mailing list