[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Bruce Southey bsouthey@gmail....
Fri Jun 24 09:27:55 CDT 2011


On 06/24/2011 09:06 AM, Robert Kern wrote:
> On Fri, Jun 24, 2011 at 07:30, Laurent Gautier<lgautier@gmail.com>  wrote:
>> On 2011-06-24 13:59,  Nathaniel Smith<njs@pobox.com>  wrote:
>>> On Thu, Jun 23, 2011 at 5:56 PM, Benjamin Root<ben.root@ou.edu>    wrote:
>>>> Lastly, I am not entirely familiar with R, so I am also very curious about
>>>> what this magical "NA" value is, and how it compares to how NaNs work.
>>>> Although, Pierre brought up the very good point that NaNs wouldn't work
>>>> anyway with integer arrays (and object arrays, etc.).
>>> Since R is designed for statistics, they made the interesting decision
>>> that *all* of their core types have a special designated "missing"
>>> value. At the R level this is just called "NA". Internally, there are
>>> a bunch of different NA values -- for floats it's a particular NaN,
>>> for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. (You never
>>> notice this, because R will silently cast a NA of one type into NA of
>>> another type whenever needed, and they all print the same.)
>>>
>>> Because any array can contain NA's, all R functions then have to have
>>> some way of handling this -- all their integer arithmetic knows that
>>> INT_MIN is special, for instance. The rules are basically the same as
>>> for NaN's, but NA and NaN are different from each other (because one
>>> means "I don't know, could be anything" and the other means "you tried
>>> to divide by 0, I *know* that's meaningless").
>>>
>>> That's basically it.
>>>
>>> -- Nathaniel
>> Would the use of R's system for expressing "missing values" be possible
>> in numpy through a special flag ?
>>
>> Any given numpy array could have a boolean flag (say "na_aware")
>> indicating that some of the values are representing a missing cell.
>>
>> If the exact same system is used, interaction with R (through something
>> like rpy2) would be simplified and more robust.
> The alternative proposal would be to add a few new dtypes that are
> NA-aware. E.g. an nafloat64 would reserve a particular NaN value
> (there are lots of different NaN bit patterns, we'd just reserve one)
> that would represent NA. An naint32 would probably reserve the most
> negative int32 value (like R does). Using the NA-aware dtypes signals
> that you are using NA values; there is no need for an additional flag.
>
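
To make the quoted idea concrete, here is a rough sketch of what
"reserve one NaN bit pattern" and "reserve the most negative int32" could
look like with today's NumPy. The names NA_BITS, NA_INT32, is_na_float64
and is_na_int32 are hypothetical, and the payload 0x7A2 is an arbitrary
choice for illustration; none of this is an existing NumPy dtype or API.

```python
import numpy as np

# Hypothetical reserved pattern: a quiet NaN with payload bits 0x7A2.
NA_BITS = np.uint64(0x7FF80000000007A2)

# Hypothetical integer NA: the most negative int32 value.
NA_INT32 = np.iinfo(np.int32).min


def is_na_float64(values):
    """True where the raw bits equal the reserved NA pattern."""
    bits = np.asarray(values, dtype=np.float64).view(np.uint64)
    return bits == NA_BITS


def is_na_int32(values):
    """True where the value equals the reserved integer NA."""
    return np.asarray(values, dtype=np.int32) == NA_INT32


# Build an array holding a regular value, an ordinary NaN, and the NA.
x = np.array([1.0, np.nan, 0.0])
x.view(np.uint64)[2] = NA_BITS  # write the NA bit pattern in place

print(np.isnan(x))        # both the NaN and the NA count as NaN
print(is_na_float64(x))   # only the reserved pattern counts as NA
print(is_na_int32([1, NA_INT32, 3]))
```

The sketch only checks stored bit patterns. Whether a NaN payload
survives arithmetic can depend on the platform, so a real NA-aware dtype
would have to handle propagation explicitly.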

There is a very important distinction here between a masked value and a 
missing value. In some sense, a missing value is a permanently masked 
value and may be indistinguishable from 'not a number'. But a masked 
value is not a missing value, because with masked arrays the original 
values still exist. Consequently, using a masked value for 
'missing-ness' can be reversed by the user changing the mask at any 
time. That is really the power of masked arrays: you can have real 
missing values but also 'flag' unusual values as missing!
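
A small sketch of that difference using numpy.ma, assuming an array in
which -999.0 is a sentinel I decide to flag. The variable names are only
illustrative.

```python
import numpy as np
import numpy.ma as ma

data = np.array([1.0, 2.0, -999.0, 4.0])

# Flag the third value as masked; the stored value itself is untouched.
masked = ma.masked_array(data, mask=[False, False, True, False])
mean_masked = masked.mean()      # computed over 1.0, 2.0 and 4.0 only

# Reverse the decision: clear the mask and the original value is used again.
masked.mask = False
mean_unmasked = masked.mean()    # now includes -999.0

# A truly missing value has no original to recover; NaN stands in for it.
missing = np.array([1.0, 2.0, np.nan, 4.0])

print(mean_masked, mean_unmasked, masked.data[2], missing[2])
```

With the mask, the flagged value can be brought back at any time. With
the NaN, whatever was originally in that cell is gone.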

Virtually all software packages handle missing values, not masked 
values. So it is really, really important that you clarify what you are 
proposing, because your proposal does mix these two different concepts.

As for the missing value discussion, I would think that adding one or 
more missing-value data types would be feasible and may be something 
that numpy should have. But that would not address 'masked values', 
which should probably be viewed as an independent topic and thread.

Below are some sources for missing values in R and SAS.  SAS has 28 ways 
that a user can define numerical values as 'missing values' - not just 
the dot! While apparently not universal, SAS has missing value codes to 
handle positive and negative infinity. R does distinguish between 
missing values and 'not a number', which, to my knowledge, SAS does not 
do. This distinction is probably important for masked vs missing values. 
SAS uses a blank for missing character values, but see the two links at 
the end for more on that.

This is for R:
http://faculty.nps.edu/sebuttre/home/S/missings.html
http://www.ats.ucla.edu/stat/r/faq/missing.htm

This is for SAS:
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001292604.htm
This page is a comparison to R:
http://support.sas.com/documentation/cdl/en/imlug/63541/HTML/default/viewer.htm#imlug_r_sect019.htm

Some other SAS sources:
Malachy J. Foley "MISSING VALUES: Everything You Ever Wanted to Know"
http://analytics.ncsu.edu/sesug/2005/TU06_05.PDF
072-2011: Special Missing Values for Character Fields - SAS
http://support.sas.com/resources/papers/proceedings11/072-2011.pdf


Bruce