[Numpy-discussion] Enum type

Travis Oliphant teoliphant@gmail....
Wed Jan 4 00:07:31 CST 2012


A categorical (or enum) type is an important dtype to add to NumPy. It would be very nice to have the option of making the categorical dtype "dynamic", so that the set of categories can grow as more data is added to or inserted into the array. This would effectively allow binning of data on insertion into the array.

Both "fixed" and "dynamic" dtypes would need to be available, because there are important use cases for both.
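
To make the fixed/dynamic distinction concrete, here is a minimal pure-Python sketch. It is not a NumPy dtype and not a proposed API: the CategoricalEncoder class and its "dynamic" flag are hypothetical names used only for illustration. A fixed encoder rejects unknown labels, while a dynamic one grows its category table as labels arrive.

```python
import numpy as np


class CategoricalEncoder:
    """Hypothetical helper (not a NumPy API): maps labels to integer codes."""

    def __init__(self, categories=(), dynamic=False):
        self.categories = list(categories)
        self.dynamic = dynamic
        self._code_of = {label: i for i, label in enumerate(self.categories)}

    def encode(self, labels):
        codes = np.empty(len(labels), dtype=np.intp)
        for i, label in enumerate(labels):
            if label not in self._code_of:
                if not self.dynamic:
                    # Fixed mode: the category set is closed.
                    raise ValueError(f"unknown category: {label!r}")
                # Dynamic mode: grow the category table on insertion.
                self._code_of[label] = len(self.categories)
                self.categories.append(label)
            codes[i] = self._code_of[label]
        return codes


fixed = CategoricalEncoder(["low", "high"])
fixed_codes = fixed.encode(["low", "high", "low"])

dynamic = CategoricalEncoder(dynamic=True)
dynamic_codes = dynamic.encode(["red", "green", "red", "blue"])
```

In the dynamic case the categories end up in first-seen order, which is one possible choice among several; a real dtype would have to settle that question.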

-Travis

On Jan 3, 2012, at 2:02 PM, Nathaniel Smith wrote:

> On Tue, Jan 3, 2012 at 9:46 AM, Ognen Duzlevski <ognen@enthought.com> wrote:
>> Hello,
>> 
>> I am playing with adding an enum dtype to numpy (to get my feet wet in
>> numpy really). I have looked at the
>> https://github.com/martinling/numpy_quaternion and I feel comfortable
>> with my understanding of adding a simple type to numpy in technical
>> terms.
> 
> Hi Ognen,
> 
> I'm in the middle of an intercontinental move, so I can't help much,
> but I'd also love to see a proper enum/categorical type in numpy, so
> here are a few notes:
> 
> - I wrote a simple cython implementation of this last year, which
> might be useful -- code attached.
> 
> - The barrier I ran into, which you'll surely run into as well, is a
> flaw in the ufunc API in numpy. Currently, ufunc inner loops do not
> have any way to access the dtype of the array they are being called
> on. For most dtypes, this isn't an issue -- the inner loop for adding
> together int32's knows that it is being called on an array of int32's,
> it doesn't need to see the dtype to figure that out. But with enums,
> each array has a different set of possible categories, and these will
> be attached to the dtype object somehow. So if you want to do, say,
> equality comparison between an enum-array and a string-array:
>  np.enumarray(["a", "b", "c"]) == ["a", "c", "b"] -> np.array([True,
> False, False])
> ...you can't actually make this work in current numpy. The solution is
> that the ufunc API needs to be changed to make dtypes somehow
> available to inner loops. (Probably by passing a pointer to the array
> object, like all the PyArray_ArrFuncs do.)
> 
> See this thread:
> http://mail.scipy.org/pipermail/numpy-discussion/2010-August/052401.html
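
A small sketch of why the inner loop needs the dtype, written with plain NumPy arrays rather than a real enum dtype. The codes/categories pairing below is an assumption about how an enum array might be stored, not an existing NumPy feature. The same integer codes give different answers depending on which category table is attached, so code that sees only the raw codes cannot decide equality against strings.

```python
import numpy as np

# Assumed storage for an enum array: integer codes plus a per-array
# category table that would live on the dtype.
codes = np.array([0, 1, 2])
categories_x = np.array(["a", "b", "c"])  # codes decode to ["a", "b", "c"]
categories_y = np.array(["a", "c", "b"])  # same codes decode to ["a", "c", "b"]

other = np.array(["a", "c", "b"])

# Decoding requires the category table; the raw codes alone are ambiguous.
result_x = categories_x[codes] == other
result_y = categories_y[codes] == other
```

Here result_x is [True, False, False] and result_y is [True, True, True], even though both arrays hold identical codes.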
> 
> - Both the statistical folk (pandas, statsmodels) and the hdf5 folk
> (pytables, h5py) have reasons to want better enum support. (Maybe
> there are other use cases too -- anyone I'm forgetting?) You should
> make sure to talk to both groups to make sure what you come up with
> will work for them.
> 
> Cheers,
> -- Nathaniel
> 
>> I am mostly a C programmer and have programmed in Python but not at
>> the level where my code would be considered "pretty" or maybe even
>> "pythonic". I know enums from C and have browsed around a few python
>> enum implementations online. Most of them use hash tables or lists to
>> associate names to numbers - these approaches just feel "heavy" to me.
>> 
>> What would be a proper "numpy approach" to this? I am looking mostly
>> for direction and advice as I would like to do the work myself :-)
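
One lighter-weight representation that can be sketched with existing NumPy calls is a sorted array of unique labels plus an array of integer codes, as returned by np.unique with return_inverse=True. This is only an illustration of the storage idea, not an enum dtype, and the variable names are made up for the example.

```python
import numpy as np

labels = np.array(["red", "green", "red", "blue", "green"])

# categories: sorted unique labels; codes: index of each label in categories.
categories, codes = np.unique(labels, return_inverse=True)

# The original labels can be recovered by indexing with the codes.
decoded = categories[codes]

# Looking up the code for a known label is a binary search on the sorted table.
green_code = int(np.searchsorted(categories, "green"))
```

Note that np.unique sorts the labels, so the codes follow sorted order rather than a user-chosen numbering as in a C enum.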
>> 
>> Any input appreciated :-)
>> Ognen
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> <npenum.pyx>


