[Numpy-discussion] questions about a "complicated" user-defined dtype and the ufunc API
Mon Aug 23 11:20:47 CDT 2010
On Aug 22, 2010, at 4:36 PM, Nathaniel Smith wrote:
> I'm experimenting with a user-defined "enumeration" dtype -- where the
> underlying array holds a set of integers, but they (mostly) appear to
> the user as strings. (This would be potentially useful for
> representing categorical data, modeling hdf5 enumerations, etc.) So
> for each set of enumerated values, I have one Python-level object that
> stores the mapping between strings and integers (an instance of the
> 'Enum' class), and a few Python-level objects that represent each of
> the enumerated values (instances of the 'EnumValue' class).
> To map this into numpy, I've defined and registered a single custom
> dtype with 'EnumValue' as its typeobj, and then when I want to create
> an array of enumerations I make a copy of this registered dtype (with
> PyArray_DescrNewFromType) and stash a reference to the appropriate
> Enum object in its 'metadata' dictionary.
> Question 1: Is this overall approach -- of only calling
> PyArray_RegisterDataType once, and then sharing the resulting typenum
> among all my different dtype instances -- correct? It doesn't seem
> reasonable to register a new dtype for every different set of
> enumerations, because AFAICT this would create a memory leak (since
> you can't "unregister" a dtype).
Yes, this is the approach to take.
> Anyway, that seems to be working okay, but then I wanted to teach "=="
> about my new dtype, so that I can compare to strings and such. So, I
> need to register my comparison function with the "np.equal" ufunc.
> Question 2: Am I missing something, or does the ufunc API make this
> impossible? The problem is that a "PyUFuncGenericFunction" doesn't
> have any way to find the dtypes of the arrays that it's working on.
> All of the PyArray_ArrFuncs functions take a pointer to the underlying
> ndarray as an argument, so that when working with a string or void
> array, you can find the actual dtype and figure out (e.g.) the size of
> the objects involved. But ufunc inner loops don't get that, so I guess
> it's just impossible to define a ufunc over variable-sized data, or
> data that you need to be able to see the dtype metadata to interpret?
Yes, that is currently correct: variable-sized data-types don't work in ufuncs, for some subtle reasons. However, the changes that allow date-times to work also fix (or will fix) this problem as a side effect. The necessary changes to the ufunc machinery have not been made yet, but they are planned. And yes, this would allow ufuncs to be used for string equality testing, etc.