[Numpy-discussion] Re: NumPy and None (null, NaN, missing)

Martin Maechler maechler at stat.math.ethz.ch
Mon Apr 10 03:29:17 CDT 2000


>>>>> "TimC" == gestalt-system-discuss-admin  <gestalt-system-discuss-admin at lists.sourceforge.net> writes:

    TimC> Date: Sun, 09 Apr 2000 01:07:13 +1000
    TimC> From: Tim Churches <tchur at bigpond.com>
    TimC> Organization: Gestalt Institute
    TimC> To: strang at nmr.mgh.harvard.edu, strang at bucky.nmr.mgh.harvard.edu,
    TimC>    gestalt-system-discuss at lists.sourceforge.net,
    TimC>    numpy-discussion at lists.sourceforge.net

    TimC> I'm a new user of NumPy so forgive me if this is a FAQ. ......

    TimC> I've been experimenting with using Gary Strangman's excellent stats.py
    TimC> functions. The speed of these functions when operating on NumPy arrays
    TimC> and the ability of NumPy to swallow very large arrays is remarkable.

    TimC> However, one deficiency I have noticed is the lack of the ability
    TimC> to represent nulls (i.e. missing values, None or NaN
    TimC> [Not-a-Number]) in NumPy arrays. Missing values commonly occur in
    TimC> real-life statistical data and although they are usually excluded
    TimC> from most statistical calculations, it is important to be able to
    TimC> keep track of the number of missing data elements and report
    TimC> this.

I'm only a recent "listener" on gestalt-system-discuss,
and have no Python experience.
I'm a member of the R core team (www.r-project.org).

In R (and even in S-plus, but almost invisibly there),
we do differentiate between
"NA" (missing / not available) and "NaN" (IEEE result of 0/0, etc.).

I'd very much like to see the two kept distinct, as they are in R.
I think our implementation is quite efficient:
NA is one particular bit pattern out of the whole set of possible NaN
bit patterns.

We use code like the following (R source, src/main/arithmetic.c):

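     /* ieee_double (defined elsewhere in the same file) is a union that
        overlays a double with two 32-bit words; hw and lw are the indices
        of the high and low word, which depend on byte order. */

     /* Build the NA value: a NaN whose low word holds the tag 1954. */
     static double R_ValueOfNA(void)
     {
         ieee_double x;
         x.word[hw] = 0x7ff00000;   /* exponent bits all ones */
         x.word[lw] = 1954;         /* nonzero low word: a NaN, tagged as NA */
         return x.value;
     }

     /* Return nonzero only if x is a NaN carrying the NA tag. */
     int R_IsNA(double x)
     {
         if (isnan(x)) {
             ieee_double y;
             y.value = x;
             return (y.word[lw] == 1954);
         }
         return 0;
     }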
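
For anyone who wants to try the idea outside the R source tree, here is a
minimal self-contained sketch. It is not the R code: the names na_value,
is_na and count_na are made up for illustration, and it copies the bits
through a uint64_t with memcpy instead of using the ieee_double union and
the hw/lw indices, which sidesteps the byte-order question. It assumes
64-bit IEEE 754 doubles.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Bit pattern taken from the R snippet: high word 0x7ff00000, low word 1954.
   The macro and function names below are hypothetical, not R's API. */
#define NA_HIGH_WORD 0x7ff00000u
#define NA_LOW_WORD  1954u

/* Build the NA value by writing the 64-bit pattern into a double. */
static double na_value(void)
{
    uint64_t bits = ((uint64_t)NA_HIGH_WORD << 32) | NA_LOW_WORD;
    double x;
    assert(sizeof x == sizeof bits);   /* assumption: 64-bit doubles */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Nonzero only for a NaN whose low 32 bits carry the NA tag. */
static int is_na(double x)
{
    uint64_t bits;
    if (!isnan(x))
        return 0;
    memcpy(&bits, &x, sizeof bits);
    return (uint32_t)(bits & 0xffffffffu) == NA_LOW_WORD;
}

/* Count the NA entries in an array of n doubles. */
static size_t count_na(const double *data, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_na(data[i]))
            count++;
    }
    return count;
}
```

Because NA is itself a NaN, isnan() is true for it as well, so code that
only cares about "not a usable number" needs no change, while code that
must report the number of missing values can call something like
count_na() to tell NA apart from an ordinary NaN.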

Martin Maechler <maechler at stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
        
    TimC> Because NumPy arrays can't represent missing data via a
    TimC> special value, it is necessary to exclude missing data elements
    TimC> from NumPy arrays and keep track of them elsewhere (in standard
    TimC> Python lists). This is messy. Also, it is quite common to use
    TimC> various imputation techniques to estimate the values of missing
    TimC> data elements - the ability to represent missing data in a NumPy
    TimC> array and then change it to an imputed value would be a real
    TimC> boon.



