[Numpy-discussion] questions about a "complicated" user-defined dtype and the ufunc API

Travis Oliphant oliphant@enthought....
Mon Aug 23 11:20:47 CDT 2010


On Aug 22, 2010, at 4:36 PM, Nathaniel Smith wrote:

> I'm experimenting with a user-defined "enumeration" dtype -- where the
> underlying array holds a set of integers, but they (mostly) appear to
> the user as strings. (This would be potentially useful for
> representing categorical data, modeling hdf5 enumerations, etc.) So
> for each set of enumerated values, I have one Python-level object that
> stores the mapping between strings and integers (an instance of the
> 'Enum' class), and a few Python-level objects that represent each of
> the enumerated values (instances of the 'EnumValue' class).
> 
> To map this into numpy, I've defined and registered a single custom
> dtype with 'EnumValue' as its typeobj, and then when I want to create
> an array of enumerations I make a copy of this registered dtype (with
> PyArray_DescrNewFromType) and stash a reference to the appropriate
> Enum object in its 'metadata' dictionary.

> 
> Question 1: Is this overall approach -- of only calling
> PyArray_RegisterDataType once, and then sharing the resulting typenum
> among all my different dtype instances -- correct? It doesn't seem
> reasonable to register a new dtype for every different set of
> enumerations, because AFAICT this would create a memory leak (since
> you can't "unregister" a dtype).

Yes, this is the approach to take. 

> 
> Anyway, that seems to be working okay, but then I wanted to teach "=="
> about my new dtype, so that I can compare to strings and such. So, I
> need to register my comparison function with the "np.equal" ufunc.
> 
> Question 2: Am I missing something, or does the ufunc API make this
> impossible? The problem is that a "PyUFuncGenericFunction" doesn't
> have any way to find the dtypes of the arrays that it's working on.
> All of the PyArray_ArrFuncs functions take a pointer to the underlying
> ndarray as an argument, so that when working with a string or void
> array, you can find the actual dtype and figure out (e.g.) the size of
> the objects involved. But ufunc inner loops don't get that, so I guess
> it's just impossible to define a ufunc over variable-sized data, or
> data that you need to be able to see the dtype metadata to interpret?

Yes, that is currently correct. Variable-size data types don't work in ufuncs for some subtle reasons. However, the changes that allow date-times to work also fix (or will fix) this problem as a side effect.

The necessary changes to ufuncs have not been made yet, but they are planned. And yes, this would allow ufuncs to be used for string equality testing, etc.
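
To make the limitation concrete, here is a rough pure-Python model of the problem, not actual ufunc code. The names `equal_inner_loop` and `enum_equal` are hypothetical. The inner loop stands in for a PyUFuncGenericFunction: it receives only raw integer buffers, so translating a string into a code has to happen somewhere that can see the mapping stored in the dtype metadata.

```python
from array import array


def equal_inner_loop(codes, other_codes):
    """Model of a ufunc inner loop: it sees raw integers and nothing else."""
    return [a == b for a, b in zip(codes, other_codes)]


def enum_equal(codes, mapping, label):
    """Compare stored codes to a string label.

    `mapping` plays the role of the Enum object kept in the dtype metadata;
    without it the label cannot be turned into an integer code.
    """
    code = mapping.get(label)
    if code is None:
        # Assumption for this sketch: an unknown label matches nothing.
        return [False] * len(codes)
    # Broadcast the single code to the length of the stored data.
    return equal_inner_loop(codes, array("i", [code]) * len(codes))


mapping = {"red": 0, "green": 1, "blue": 2}
codes = array("i", [0, 2, 1, 2])  # underlying integer storage
result = enum_equal(codes, mapping, "blue")
```

In this model `equal_inner_loop` on its own has no way to interpret "blue"; the lookup lives in the wrapper, which is the part the current ufunc machinery has no place for.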

Thanks, 

-Travis





