[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Matthew Brett matthew.brett@gmail....
Sat Jun 25 10:08:27 CDT 2011


Hi,

On Sat, Jun 25, 2011 at 3:44 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
...
> Here are some things I can think of that would be affected by any changes here
>
> 1) Right now users of pandas can type pandas.isnull(series[5]) and
> that will yield True if the value is NA for any dtype. This might be
> hard to support in the masked regime

But, following the NEP, I could imagine something like this:

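import numpy as np

def isnull(a):
    # validitymask is the attribute proposed in the NEP; True marks a valid value
    if a.validitymask is None:
        # no mask at all, so nothing is NA
        return np.zeros(a.shape, dtype=bool)
    return a.validitymask == False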

I suppose the return array in this case would be 0d bool.  Would that
not serve here?
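
To make the 0d case concrete, here is a rough sketch that uses the existing numpy.ma module as a stand-in, since validitymask is only a proposal at this point. Note the assumption: numpy.ma stores True where a value is missing, which is the opposite sense of a validity mask, and the series values are made up for illustration.

```python
import numpy as np
import numpy.ma as ma

def isnull(a):
    # numpy.ma marks MISSING values with True, the opposite sense of
    # the validity mask proposed in the NEP.
    return np.asarray(ma.getmaskarray(a))

# Illustrative data: the middle element is missing.
series = ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

elem_null = isnull(series[1])   # 0-d bool array for a single element
whole_null = isnull(series)     # 1-d bool array for the whole series
```

The single-element call returns a 0-d bool array, so something like `if isnull(series[5]):` would still behave the way a pandas user expects.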

> 2) Functions like {Series, DataFrame}.fillna would hopefully look just
> like this:
>
> # value is 0 or some other value to fill
> new_series = self.copy()
> new_series[isnull(new_series)] = value

With the isnull above, or maybe:

new_series = new_series.fill_masked(value)

?
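
For comparison, both spellings already have rough equivalents in numpy.ma. This is only a sketch under that assumption: fill_masked is a name I made up above, and `filled` is the numpy.ma method standing in for it.

```python
import numpy.ma as ma

# Illustrative data: the middle element is missing.
series = ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
value = 0.0

# Boolean-index form, mirroring the quoted fillna sketch.
via_index = series.copy()
via_index[ma.getmaskarray(via_index)] = value

# Single-call form; `filled` returns a plain ndarray with the
# missing entries replaced.
via_filled = series.filled(value)
```

Both leave the original series untouched and give the same values, so the choice is mostly about which one reads better.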

> Keep in mind that people will write custom NA handling logic. So they might do:
>
> series[isnull(other_series) & isnull(other_series2)] = val
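
That kind of custom logic seems to carry over, because isnull just hands back ordinary boolean arrays. A small sketch, again with numpy.ma standing in for the proposed masks and with made-up data:

```python
import numpy.ma as ma

series = ma.masked_array([1.0, 2.0, 3.0, 4.0])
other_series = ma.masked_array([1.0, 2.0, 3.0, 4.0],
                               mask=[True, True, False, False])
other_series2 = ma.masked_array([1.0, 2.0, 3.0, 4.0],
                                mask=[True, False, True, False])
val = -1.0

# Missing in BOTH other series: an ordinary boolean array.
both_null = ma.getmaskarray(other_series) & ma.getmaskarray(other_series2)
series[both_null] = val
```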

> 3) Nulling / NA-ing out data is very common
>
> # null out this data up to and including date1 in these three columns
> frame.ix[:date1, [col1, col2, col3]] = NaN

I think Mark is proposing that this:

frame.ix[:date1, [col1, col2, col3]] = np.NA

will work - maybe he can correct me if I'm wrong?
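
As a point of reference, numpy.ma already allows this style of assignment with its `masked` constant. The sketch below assumes `ma.masked` as a stand-in for the proposed np.NA, and uses positional slices rather than the label-based .ix indexing in the pandas example.

```python
import numpy as np
import numpy.ma as ma

# Illustrative 4x3 "frame" with no missing values to start with.
frame = ma.masked_array(np.arange(12, dtype=float).reshape(4, 3))

# Null out the first two rows of columns 0 and 2.
frame[:2, [0, 2]] = ma.masked

n_missing = int(ma.count_masked(frame))
```

The user writes an assignment of a missing value; the mask update happens underneath, and the stored data is left as it was.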

> I'll try to think of some others. The main thing is that the NA value
> is very easy to think about and fits in naturally with how people (at
> least statistical / financial users) think about and work with data.
> If you have to say "I have to set these mask locations to True" it
> introduces additional mental effort compared with "I'll just set these
> values to NA"

I could imagine making the API such that, in practice, you would be
thinking that you were setting the values to NA, even though you were
in fact setting a mask.

My own worry here is not about the API, but the implementation.  I'm
worried that the mask approach uses more memory, and I don't see how
we can know whether it will be faster without implementing both
approaches.
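
To put a rough number on the memory side, here is a sketch that measures the overhead of a one-byte-per-element boolean mask using numpy.ma. That storage layout is an assumption about how a mask might be kept; it says nothing about speed.

```python
import numpy as np
import numpy.ma as ma

n = 1_000_000
data = np.zeros(n, dtype=np.float64)
masked = ma.masked_array(data, mask=np.zeros(n, dtype=bool))

data_bytes = masked.data.nbytes               # 8 bytes per float64
mask_bytes = ma.getmaskarray(masked).nbytes   # 1 byte per bool
overhead = mask_bytes / data_bytes
```

For float64 the mask adds one byte for every eight bytes of data; for a one-byte dtype such as int8 the same mask would double the footprint.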

See you,

Matthew

