[Numpy-discussion] questions about a "complicated" user-defined dtype and the ufunc API

Nathaniel Smith njs@pobox....
Sun Aug 22 16:36:47 CDT 2010

I'm experimenting with a user-defined "enumeration" dtype -- where the
underlying array holds a set of integers, but they (mostly) appear to
the user as strings. (This would be potentially useful for
representing categorical data, modeling hdf5 enumerations, etc.) So
for each set of enumerated values, I have one Python-level object that
stores the mapping between strings and integers (an instance of the
'Enum' class), and one Python-level object for each of the enumerated
values (instances of the 'EnumValue' class).
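
For readers without the attachment, the two-class layout described above can be sketched in pure Python roughly as follows. This is only an illustration under assumptions: the real classes live in npenum.pyx, and the attribute and method names used here (by_name, encode, decode, code) are hypothetical.

```python
# Hypothetical sketch of the Enum / EnumValue split; the names below are
# illustrative and not taken from the attached npenum.pyx.

class EnumValue:
    """One enumerated value: a name plus the integer code stored in the array."""

    def __init__(self, enum, name, code):
        self.enum = enum
        self.name = name
        self.code = code

    def __repr__(self):
        return self.name

    def __eq__(self, other):
        # Compare equal to the plain string, so value == "low" works.
        if isinstance(other, str):
            return self.name == other
        if isinstance(other, EnumValue):
            return self.enum is other.enum and self.code == other.code
        return NotImplemented

    def __hash__(self):
        return hash(self.name)


class Enum:
    """Holds the string <-> integer mapping for one set of enumerated values."""

    def __init__(self, names):
        self.values = [EnumValue(self, n, i) for i, n in enumerate(names)]
        self.by_name = {v.name: v for v in self.values}

    def encode(self, names):
        # Strings -> integer codes (what the underlying array would hold).
        return [self.by_name[n].code for n in names]

    def decode(self, codes):
        # Integer codes -> EnumValue objects (what the user would see).
        return [self.values[c] for c in codes]


e = Enum(["low", "medium", "high"])
codes = e.encode(["low", "low", "high", "medium"])
print(codes)            # [0, 0, 2, 1]
print(e.decode(codes))  # [low, low, high, medium]
```

The integer codes are what would sit in the array's memory; everything string-like is recovered through the Enum object.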

To map this into numpy, I've defined and registered a single custom
dtype with 'EnumValue' as its typeobj, and then when I want to create
an array of enumerations I make a copy of this registered dtype (with
PyArray_DescrNewFromType) and stash a reference to the appropriate
Enum object in its 'metadata' dictionary.

Question 1: Is this overall approach -- of only calling
PyArray_RegisterDataType once, and then sharing the resulting typenum
among all my different dtype instances -- correct? It doesn't seem
reasonable to register a new dtype for every different set of
enumerations, because AFAICT this would create a memory leak (since
you can't "unregister" a dtype).

Anyway, that seems to be working okay, but then I wanted to teach "=="
about my new dtype, so that I can compare to strings and such. So, I
need to register my comparison function with the "np.equal" ufunc.

Question 2: Am I missing something, or does the ufunc API make this
impossible? The problem is that a "PyUFuncGenericFunction" doesn't
have any way to find the dtypes of the arrays that it's working on.
All of the PyArray_ArrFuncs functions take a pointer to the underlying
ndarray as an argument, so that when working with a string or void
array, you can find the actual dtype and figure out (e.g.) the size of
the objects involved. But ufunc inner loops don't get that, so I guess
it's just impossible to define a ufunc over variable-sized data, or
over data that can only be interpreted by looking at the dtype
metadata?
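
To make the problem concrete, here is a pure-Python mock-up of an inner loop with the same four arguments a PyUFuncGenericFunction receives in C (args, dimensions, steps, data). It is a sketch, not numpy code: the buffers are plain bytearrays, uint32 codes are assumed as in the sample usage further down, and the names equal_loop, pack, sizes and colors are hypothetical.

```python
import struct

ITEMSIZE = 4  # assuming uint32 codes


def equal_loop(args, dimensions, steps, data):
    """Mimics the shape of a ufunc inner loop.

    args       -- raw buffers: two inputs and one output
    dimensions -- number of elements to process
    steps      -- byte stride for each buffer
    data       -- the one opaque value fixed when the loop was registered
    """
    in1, in2, out = args
    n = dimensions[0]
    for i in range(n):
        (a,) = struct.unpack_from("=I", in1, i * steps[0])
        (b,) = struct.unpack_from("=I", in2, i * steps[1])
        # Only raw integer codes are visible here; nothing says which
        # Enum (i.e. which dtype metadata) either input belongs to.
        out[i * steps[2]] = 1 if a == b else 0


def pack(codes):
    # Pack a list of integer codes into a raw uint32 buffer.
    return bytearray(struct.pack("=%dI" % len(codes), *codes))


sizes = pack([0, 0, 2, 1])   # codes from one (hypothetical) Enum
colors = pack([0, 1, 2, 1])  # codes from a different, unrelated Enum
out = bytearray(4)
equal_loop((sizes, colors, out), (4,), (ITEMSIZE, ITEMSIZE, 1), None)
print(list(out))  # [1, 0, 1, 1]
```

The loop reports matches between values of two unrelated enumerations, because all it can compare is the codes. The only side channel is the data argument, and that is fixed once per registered loop rather than per dtype instance, so it cannot carry the Enum stored in each dtype's metadata.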

This seems easy enough to fix, and would probably allow the removal of
a big pile of code in arrayobject.c that does special-case handling
for "==" on strings and void arrays. (Another side-effect of the
current special-case approach is that "==" and "np.equal" behave
differently on arrays of strings.) But is there another option?

Anyway, thanks for your help! I'll attach the code in case it helps
(and because I'm not too confident I'm getting the details right!).
Sample usage:

>>> from npenum import *; import numpy as np
>>> e = Enum(["low", "medium", "high"])
>>> a = np.array(["low", "low", "high", "medium"], dtype=e)
>>> a
array([low, low, high, medium], dtype=EnumValue)
>>> a.view(Enum.get(a).inttype)
array([0, 0, 2, 1], dtype=uint32)

-- Nathaniel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: npenum.pyx
Type: application/octet-stream
Size: 12197 bytes
Desc: not available
Url : http://mail.scipy.org/pipermail/numpy-discussion/attachments/20100822/42e086a7/attachment.obj 