[Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Fri Jun 24 12:43:42 CDT 2011
On Fri, Jun 24, 2011 at 9:27 AM, Bruce Southey <firstname.lastname@example.org> wrote:
> On 06/24/2011 09:06 AM, Robert Kern wrote:
> On Fri, Jun 24, 2011 at 07:30, Laurent Gautier <email@example.com> <firstname.lastname@example.org> wrote:
> On 2011-06-24 13:59, Nathaniel Smith <email@example.com> <firstname.lastname@example.org> wrote:
> On Thu, Jun 23, 2011 at 5:56 PM, Benjamin Root<email@example.com> <firstname.lastname@example.org> wrote:
> Lastly, I am not entirely familiar with R, so I am also very curious about
> what this magical "NA" value is, and how it compares to how NaNs work.
> Although, Pierre brought up the very good point that NaNs wouldn't work
> anyway with integer arrays (and object arrays, etc.).
> Since R is designed for statistics, they made the interesting decision
> that *all* of their core types have a special designated "missing"
> value. At the R level this is just called "NA". Internally, there are
> a bunch of different NA values -- for floats it's a particular NaN,
> for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. (You never
> notice this, because R will silently cast a NA of one type into NA of
> another type whenever needed, and they all print the same.)
> Because any array can contain NA's, all R functions then have to have
> some way of handling this -- all their integer arithmetic knows that
> INT_MIN is special, for instance. The rules are basically the same as
> for NaN's, but NA and NaN are different from each other (because one
> means "I don't know, could be anything" and the other means "you tried
> to divide by 0, I *know* that's meaningless").
> That's basically it.
> -- Nathaniel
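
To make the integer case above concrete, here is a minimal sketch of a sentinel-based NA, assuming INT_MIN of a 32-bit integer is the reserved value as described for R. The names `NA_INT32`, `is_na` and `na_add` are hypothetical and exist only for this illustration; this is not R's or NumPy's actual code.

```python
# Hypothetical sketch: one reserved bit pattern stands in for NA.
INT32_MIN = -2**31
NA_INT32 = INT32_MIN  # the most negative int32 is treated as "missing"


def is_na(x):
    """Return True if x is the reserved NA sentinel."""
    return x == NA_INT32


def na_add(a, b):
    """Add two int32-like values; NA propagates the way NaN does for floats."""
    if is_na(a) or is_na(b):
        return NA_INT32
    return a + b


values = [1, NA_INT32, 3]
shifted = [na_add(v, 10) for v in values]  # the NA element stays NA
```

The point of the sketch is that every arithmetic routine has to check for the sentinel explicitly, which is what "all their integer arithmetic knows that INT_MIN is special" amounts to.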
> Would the use of R's system for expressing "missing values" be possible
> in numpy through a special flag ?
> Any given numpy array could have a boolean flag (say "na_aware")
> indicating that some of the values are representing a missing cell.
> If the exact same system is used, interaction with R (through something
> like rpy2) would be simplified and more robust.
> The alternative proposal would be to add a few new dtypes that are
> NA-aware. E.g. an nafloat64 would reserve a particular NaN value
> (there are lots of different NaN bit patterns, we'd just reserve one)
> that would represent NA. An naint32 would probably reserve the most
> negative int32 value (like R does). Using the NA-aware dtypes signals
> that you are using NA values; there is no need for an additional flag.
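
The idea of reserving one NaN bit pattern can be sketched with the standard library alone. The payload in `NA_BITS` below is an arbitrary choice made for this example, and `nafloat64` itself is only a proposed name in this thread, so treat everything here as an assumption rather than an existing NumPy feature.

```python
import math
import struct

# Hypothetical reserved pattern: a quiet NaN carrying an arbitrary payload.
NA_BITS = 0x7FF80000000007A2


def float_to_bits(x):
    """Reinterpret a Python float (IEEE 754 double) as a 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_to_float(bits):
    """Reinterpret a 64-bit integer as a Python float."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


NA_FLOAT64 = bits_to_float(NA_BITS)


def is_na(x):
    """True only for the reserved pattern, not for NaNs in general."""
    return float_to_bits(x) == NA_BITS


ordinary_nan = float("nan")
print(math.isnan(NA_FLOAT64), is_na(NA_FLOAT64), is_na(ordinary_nan))
```

Both values are NaN as far as `math.isnan` is concerned; only the bit-level comparison tells NA apart from an ordinary NaN, which is why no separate flag on the array would be needed.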
> There is a very important distinction here between a masked value and a
> missing value. In some sense, a missing value is a permanently masked value
> and may be indistinguishable from 'not a number'. But a masked value is not
> a missing value because with masked arrays the original values still exist.
> Consequently using a masked value for 'missing-ness' can be reversed by the
> user changing the mask at any time. That is really the power of masked
> arrays as you can have real missing values but also 'flag' unusual values as
The design I'm proposing uses a mask to implement missing values,
hence the usage of the terms "masked" and "unmasked" elements. The semantics
you're describing can be achieved with the missing value interpretation.
First, you take a view of your array, then give it a mask. In the view,
there will be strict missing data semantics, but the data is still
accessible through the original array.
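
A rough sketch of the "view plus mask" pattern, using the existing numpy.ma module as a stand-in since the proposed core-ndarray masks are not available. Note the assumption: numpy.ma reductions skip masked elements, which is not necessarily the strict missing-data behaviour the proposal has in mind, so only the memory-sharing aspect is being illustrated.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Wrap the same buffer in a masked array; by default no copy of the data is made.
view = np.ma.masked_array(data, mask=[False, True, False, False])

total = view.sum()   # the masked element is ignored by numpy.ma
original = data[1]   # the underlying value is still reachable via `data`

print(total, original, np.shares_memory(data, view.data))
```

Anyone holding only `view` sees the second element as masked, while code holding `data` can still read or change it.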
When people come to NumPy asking about missing values, they generally get
pointed at numpy.ma, so there is an impression out there that it's intended
for that usage.
> Virtually all software packages handle missing values, not masked values.
> So it is really, really important that you clarify what you are proposing
> because your proposal does mix these two different concepts.
> As per the missing value discussion, I would think that adding a missing
> value data type(s) for 'missing values' would be feasible and may be
> something that numpy should have. But that would not address 'masked values'
> and probably must be viewed as an independent topic and thread.
Can you describe what needed features are missing from taking a view +
adding a mask, that 'masked values' as a separate concept would have? While
different, I think a single implementation with strict semantics can provide
both perspectives when used like this or in a similar fashion.
> Below are some sources for missing values in R and SAS. SAS has 28 ways
> that a user can define numerical values as 'missing values' - not just the
> dot! While not apparently universal, SAS has missing value codes to handle
> positive and negative infinity. R does differ between missing values and
> 'not a number' which, to my knowledge, SAS does not do.
The question this raises in my mind is whether an "NA"-like object in NumPy
should have a type associated with it, or whether it should be a singleton.
Like np.NA('i8') for a missing 64-bit int, or np.NA like None but with the
specific missing value semantics. Keeping the type around would allow for
checking against casting rules.
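
To illustrate the two options, here is a toy sketch. `NAType`, `NA`, `TypedNA` and `can_assign` are hypothetical names invented for this example and are not part of NumPy; only `np.can_cast` and `np.dtype` are real NumPy calls.

```python
import numpy as np


class NAType:
    """Singleton NA: like None, but meaning 'missing', with no dtype attached."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "NA"


NA = NAType()


class TypedNA:
    """NA that remembers which dtype it is a missing value of."""

    def __init__(self, dtype):
        self.dtype = np.dtype(dtype)

    def __repr__(self):
        return f"NA({self.dtype.str!r})"


def can_assign(na, target_dtype):
    """An untyped NA fits any array; a typed NA must pass 'safe' casting."""
    if isinstance(na, NAType):
        return True
    return bool(np.can_cast(na.dtype, np.dtype(target_dtype)))


print(can_assign(NA, "i4"), can_assign(TypedNA("i8"), "f8"), can_assign(TypedNA("i8"), "i4"))
```

With the typed variant, assigning a missing 64-bit int into an int32 array can be rejected the same way a real int64 value would be, which the singleton cannot express.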
> This distinction is probably important for masked vs missing values. SAS
> uses a blank for missing character values but see the two links at the end
> for more than.
> This is for R:
> This page is a comparison to R:
> Some other SAS sources:
> Malachy J. Foley "MISSING VALUES: Everything You Ever Wanted to Know"
> 072-2011: Special Missing Values for Character Fields - SAS
Thanks for the links!