[Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Fri Jun 24 11:45:04 CDT 2011
On Fri, Jun 24, 2011 at 6:59 AM, Matthew Brett <firstname.lastname@example.org> wrote:
> On Fri, Jun 24, 2011 at 2:32 AM, Nathaniel Smith <email@example.com> wrote:
> > If we think that the memory overhead for floating point types is too
> > high, it would be easy to add a special case where maybe(float) used a
> > distinguished NaN instead of a separate boolean. The extra complexity
> > would be isolated to the 'maybe' dtype's inner loop functions, and
> > transparent to the Python level. (Implementing a similar optimization
> > for the masking approach would be really nasty.) This would change the
> > overhead comparison to 0% versus 12.5% in favor of the dtype approach.
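To make the quoted trade-off concrete, here is a minimal standalone
sketch (invented for illustration, not from the NEP or any patch) of
the NaN-payload idea: a quiet NaN with a distinguished bit pattern
serves as the NA signal, so an 8-byte double carries its own missing
flag (0% extra storage) instead of needing a separate mask byte (one
extra byte per 8-byte element, the 12.5% above). The NA_BITS pattern
and helper names are hypothetical:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <limits>

    // Hypothetical NA signal: a quiet NaN with a distinguished payload,
    // so ordinary NaNs produced by arithmetic are not mistaken for NA.
    static const std::uint64_t NA_BITS = 0x7FF8DEADDEADDEADULL;

    static double make_na() {
        double d;
        std::memcpy(&d, &NA_BITS, sizeof d);  // type-pun via memcpy
        return d;
    }

    static bool is_na(double d) {
        std::uint64_t bits;
        std::memcpy(&bits, &d, sizeof bits);
        return bits == NA_BITS;  // exact bit test; (d != d) matches any NaN
    }

    int main() {
        double vals[3] = {1.5, make_na(),
                          std::numeric_limits<double>::quiet_NaN()};
        for (int i = 0; i < 3; ++i)
            std::printf("vals[%d]: %s\n", i,
                        is_na(vals[i]) ? "NA" : "present");
        return 0;
    }

Note that this trick only works because doubles have spare NaN
payloads; a type like int8 has no unused bit pattern to give up, which
is the "may not be possible in general" problem quoted below.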
> Can I take this chance to ask Mark a bit more about the problems he
> sees for the dtypes with missing values? That is, the approach of
> implementing missing values with special 'maybe' dtypes. I see in
> your NEP you say 'The trouble with this approach is that it requires
> a large amount of special case code in each data type, and writing a
> new data type supporting missing data requires defining a mechanism
> for a special signal value which may not be possible in general.'
> Just to be clear, you are saying that, for each dtype, there
> needs to be some code doing:
> missing_value = dtype.missing_value
> then, in loops:
> if val[here] == missing_value:
> and the fact that 'missing_value' could be any type would make the
> code more complicated than the masked case, where the mask is always
> an array of booleans?
I'm referring to the underlying C implementations of the dtypes and any
additional custom dtypes that people create. With the masked approach, you
implement a new custom data type in C, and it automatically works with
missing data. With the custom dtype approach, you have to do a lot more
error-prone work to handle the special values in all the ufuncs.
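As a rough illustration of that difference (invented code, not actual
NumPy source; the signal value, mask convention, and function names
are all made up), compare what an integer addition inner loop has to
do under the two approaches:

    #include <cstddef>

    // Hypothetical signal value one integer dtype might reserve for NA.
    static const int MY_INT_NA = -2147483647 - 1;

    // Special-value approach: every inner loop of every ufunc, for
    // every dtype, must know that dtype's signal value and propagate
    // it by hand.
    void add_loop_special(const int* a, const int* b, int* out,
                          std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (a[i] == MY_INT_NA || b[i] == MY_INT_NA)
                out[i] = MY_INT_NA;  // repeated per dtype, per ufunc
            else
                out[i] = a[i] + b[i];
        }
    }

    // Mask approach: the NA test is the same dtype-independent byte
    // check everywhere (here, nonzero mask byte = element present),
    // so a new custom dtype needs no extra loop code for missing data.
    void add_loop_masked(const int* a, const int* b, int* out,
                         const unsigned char* mask, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (mask[i])
                out[i] = a[i] + b[i];  // masked-out slots are skipped
        }
    }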
> Nathaniel's point about reducing the storage needed for the mask to
> zero is surely significant if we want numpy to be the best choice
> for big datasets.
The mask will only be there if it's explicitly requested, so it's not taking
anything away from NumPy. If someone is dealing with data that large, their
needs likely wouldn't always be met by the particular NA conventions NumPy
chooses for the various primitive data types, so that approach isn't a clear
win either.
> You mention that it would be good to allow masking for any new dtype -
> is that a practical problem? I mean, how many people will in fact
> have the combination of a) need of masking b) need of custom dtype,
> and c) lack of time or expertise to implement masking for that type?
Well, the people who need that right now will probably look at the NumPy C
source code and give up immediately. I'd rather push the system in a
direction that makes things easier for those people, not harder. It should be
possible to define a C++ data type class with overloaded operators, then say
NPY_EXPOSE_DTYPE(MyCustomClass), which would wrap those overloaded operators
with NumPy conventions. If this were done, I suspect many people would
create custom data types.
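For instance, the kind of type such a wrapper would consume might look
like the sketch below (the NPY_EXPOSE_DTYPE macro is the hypothetical
piece being proposed, so it is left commented out; the quaternion type
is just an example of a custom dtype people ask for):

    class Quaternion {
    public:
        double w, x, y, z;
        Quaternion(double w = 0, double x = 0, double y = 0, double z = 0)
            : w(w), x(x), y(y), z(z) {}
        // Ordinary overloaded operators define the arithmetic once...
        Quaternion operator+(const Quaternion& o) const {
            return Quaternion(w + o.w, x + o.x, y + o.y, z + o.z);
        }
        Quaternion operator*(const Quaternion& o) const {  // Hamilton product
            return Quaternion(w*o.w - x*o.x - y*o.y - z*o.z,
                              w*o.x + x*o.w + y*o.z - z*o.y,
                              w*o.y - x*o.z + y*o.w + z*o.x,
                              w*o.z + x*o.y - y*o.x + z*o.w);
        }
    };

    // ...and the hypothetical macro would generate the NumPy
    // boilerplate (inner loops, casting, repr) from those operators.
    // With masks handled in the core, missing-data support would come
    // along for free:
    // NPY_EXPOSE_DTYPE(Quaternion);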
> Thanks a lot for the proposal and the discussion,