[Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Wes McKinney
wesmckinn@gmail....
Fri Jun 24 19:11:46 CDT 2011
On Fri, Jun 24, 2011 at 8:02 PM, Charles R Harris
<charlesr.harris@gmail.com> wrote:
>
>
> On Fri, Jun 24, 2011 at 5:22 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>
>> On Fri, Jun 24, 2011 at 7:10 PM, Charles R Harris
>> <charlesr.harris@gmail.com> wrote:
>> >
>> >
>> > On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett <matthew.brett@gmail.com>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root <ben.root@ou.edu>
>> >> wrote:
>> >> ...
>> >> > Again, there are pros and cons either way and I see them as very
>> >> > orthogonal and complementary.
>> >>
>> >> That may be true, but I imagine only one of them will be implemented.
>> >>
>> >> @Mark - I don't have a clear idea whether you consider the nafloat64
>> >> option to be still in play as the first thing to be implemented
>> >> (before array.mask). If it is, what kind of thing would persuade you
>> >> either way?
>> >>
>> >
>> > Mark can speak for himself, but I think things are tending towards masks.
>> > They have the advantage of one implementation for all data types, current
>> > and future, and they are more flexible since the masked data can be actual
>> > valid data that you just choose to ignore for experimental reasons.
>> >
>> > What might be helpful is a routine to import/export R files, but that
>> > shouldn't be too difficult to implement.
>> >
>> > Chuck
>> >
>> >
>> > _______________________________________________
>> > NumPy-Discussion mailing list
>> > NumPy-Discussion@scipy.org
>> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >
>> >
>>
>> Perhaps we should make a wiki page someplace summarizing pros and cons
>> of the various implementation approaches? I worry very seriously about
>> adding API functions relating to masks rather than having special NA
>> values which propagate in algorithms. The question is: will Joe Blow
>> Former R user have to understand what the mask is and how to work with
>> it? If the answer is yes, we have a problem. If it can be completely
>> hidden as an implementation detail, that's great. In R, NAs are just
>> sort of inherent: they propagate, and you deal with them when you have to
>> via the na.rm flag in functions or is.na.
>>
>
> Well, I think both of those can be pretty transparent. Could you illustrate
> some typical R usage, to wit:
>
> 1) setting a value to na
> 2) checking a value for na
>
> Other things are problematic, like checking for integer overflow. For safety
> that would be desirable, for speed not. I think that is a separate question
> however. In any case, if we do check such things we should be able to set
> the corresponding mask value in the loop, and I suppose that is the sort of
> thing you want.
>
> Chuck
>
>
I think anyone making decisions about this needs to have a pretty good
understanding of what R does. So here are some examples, but you guys
really need to spend some time with R if you have not already:
arr <- rnorm(20)
arr
[1] 1.341960278 0.757033314 -0.910468762 -0.475811935 -0.007973053
[6] 1.618201117 -0.965747088 0.386811224 0.229158237 0.987050613
[11] 1.293453170 -2.432399045 -0.247593481 -0.639769586 -0.464996583
[16] 0.720181047 0.846607030 0.486173088 -0.911247626 0.370326788
arr[5:10] = NA
arr
[1] 1.3419603 0.7570333 -0.9104688 -0.4758119 NA NA
[7] NA NA NA NA 1.2934532 -2.4323990
[13] -0.2475935 -0.6397696 -0.4649966 0.7201810 0.8466070 0.4861731
[19] -0.9112476 0.3703268
is.na(arr)
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
mean(arr)
[1] NA
mean(arr, na.rm=T)
[1] -0.01903945
arr + rnorm(20)
[1] 2.081580297 0.505050028 -0.696287035 -1.280323279 NA
[6] NA NA NA NA NA
[11] 2.166078369 -1.445271291 0.764894624 0.795890929 0.549621207
[16] 0.005215596 -0.170001426 0.712335355 -0.919671745 -0.617099818
and obviously this is OK too:
arr <- rep('wes', 10)
arr[5:7] <- NA
is.na(arr)
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
note, NA gets excluded from categorical variables (factors):
as.factor(arr)
[1] wes wes wes wes <NA> <NA> <NA> wes wes wes
Levels: wes
e.g. groupby with NA:
tapply(rnorm(10), arr, mean)
wes
-0.5271853
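For anyone who reads Python more easily than R, here is a rough sketch of the
semantics in the R session above, next to the mask-based alternative. It is
plain Python, not a proposed NumPy API. The helper names `is_na` and `mean_na`
are hypothetical, and NaN stands in for R's NA, which is an assumption (R
keeps NA and NaN distinct).

```python
import math

NA = float("nan")  # assumption: NaN stands in for R's NA


def is_na(values):
    """Elementwise missing-value test, in the spirit of R's is.na()."""
    return [math.isnan(v) for v in values]


def mean_na(values, na_rm=False):
    """Mean that propagates NA unless na_rm is set, like mean(x, na.rm=TRUE)."""
    if na_rm:
        values = [v for v in values if not math.isnan(v)]
    if not values:
        return NA
    # Any NaN left in the list makes the sum, and so the mean, NaN.
    return sum(values) / len(values)


# Approach 1: special NA values stored in the data itself.
arr = [1.5, 0.5, -1.0, 2.0]
arr[1:3] = [NA, NA]  # roughly arr[2:3] = NA in R (1-based, inclusive)

print(is_na(arr))
print(mean_na(arr))
print(mean_na(arr, na_rm=True))

# Approach 2: a separate mask; the underlying data stays intact.
data = [1.5, 0.5, -1.0, 2.0]
mask = [False, True, True, False]  # True means "ignore this element"
kept = [d for d, m in zip(data, mask) if not m]
masked_mean = sum(kept) / len(kept)
print(masked_mean)
```

In the first approach the user only ever sees NA and a flag like na_rm; in the
second, the mask is a separate object that someone has to create and carry
around, unless the library hides it completely.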