[Numpy-discussion] Concepts for masked/missing data

Eric Firing efiring@hawaii....
Sat Jun 25 16:42:52 CDT 2011


On 06/25/2011 09:09 AM, Benjamin Root wrote:
>
>
> On Sat, Jun 25, 2011 at 1:57 PM, Nathaniel Smith <njs@pobox.com> wrote:
>
>     On Sat, Jun 25, 2011 at 11:50 AM, Eric Firing <efiring@hawaii.edu> wrote:
>      > On 06/25/2011 07:05 AM, Nathaniel Smith wrote:
>      >> On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
>      >>> To clarify, you're proposing for:
>      >>>
>      >>> a = np.sum(np.array([np.NA, np.NA]))
>      >>>
>      >>> 1) ->  np.NA
>      >>> 2) ->  0.0
>      >>
>      >> Yes -- and in R you actually do get NA, while in numpy.ma you
>      >> actually do get 0. I don't think this is a coincidence; I think it's
>      >
>      > No, you don't:
>      >
>      > In [2]: np.ma.array([2, 4], mask=[True, True]).sum()
>      > Out[2]: masked
>      >
>      > In [4]: np.sum(np.ma.array([2, 4], mask=[True, True]))
>      > Out[4]: masked
>
>     Huh. So in numpy.ma, sum([10, NA]) and sum([10]) are the same, but
>     sum([NA]) and sum([]) are different? Sounds to me like you should file
>     a bug on numpy.ma...
>
>
> Actually, no... I should have tested this before replying earlier:
>
>  >>> a = np.ma.array([2, 4], mask=[True, True])
>  >>> a
> masked_array(data = [-- --],
>               mask = [ True  True],
>         fill_value = 999999)
>
>  >>> a.sum()
> masked
>  >>> a = np.ma.array([], mask=[])
>  >>> a
> masked_array(data = [],
>               mask = [],
>         fill_value = 1e+20)
>  >>> a.sum()
> masked
>
> They are the same.
>
>
>     Anyway, the general point is that in R, NAs propagate, and in
>     numpy.ma, masked values are ignored (except, apparently, if all
>     values are masked). Here, I actually checked these:
>
>     Python: np.ma.array([2, 4], mask=[True, False]).sum() -> 4
>     R: sum(c(NA, 4)) -> NA
>
>
> If you want NaN behavior, then use NaNs.  If you want masked behavior,
> then use masks.

But I think that where Mark is heading is towards infrastructure that 
makes it easy and efficient to do either, as needed, case by case, line 
by line, for any dtype--not just floats.  If he can succeed, that helps 
all of us.  This doesn't have to be "R versus masked arrays", or 
beginners versus experienced programmers.
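
For concreteness, here is a minimal sketch of the two behaviors being
contrasted, using only np.nan and numpy.ma as they exist now. It is not
the proposed np.NA machinery, and the variable names are just for
illustration:

```python
import numpy as np

# Propagating behavior (floats only): one NaN makes the whole sum NaN.
propagated = np.sum(np.array([np.nan, 4.0]))

# Ignoring behavior: the masked entry is skipped by the reduction.
ignored = np.ma.array([2.0, 4.0], mask=[True, False]).sum()

# Masks are not tied to floats; an integer array has no NaN to use,
# but it can still carry a mask.
ignored_int = np.ma.array([2, 4], mask=[True, False]).sum()

# With every element masked, the result is the masked constant.
all_masked = np.ma.array([2, 4], mask=[True, True]).sum()

print(propagated)
print(ignored, ignored_int)
print(all_masked is np.ma.masked)
```

Today the choice between the two is made by picking a container (plain
ndarray with NaNs, or a masked array); the point above is that it would
be nice to make that choice per operation instead.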

Eric

>
> Ben Root