[Numpy-discussion] draft enum NEP

Wes McKinney wesmckinn@gmail....
Sun Mar 11 18:03:17 CDT 2012


On Fri, Mar 9, 2012 at 5:48 PM, David Gowers (kampu) <00ai99@gmail.com> wrote:
> Hi,
>
> On Sat, Mar 10, 2012 at 3:25 AM, Bryan Van de Ven <bryanv@continuum.io> wrote:
>> Hi all,
>>
>> I have started working on a NEP for adding an enumerated type to NumPy.
>> It is on my GitHub:
>>
>>     https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
>>
>> It is still very rough, and incomplete in places. But I would like to
>> get feedback sooner rather than later in order to refine it. In
>> particular there are a few questions inline in the document that I would
>> like input on. Any comments, suggestions, questions, concerns, etc. are
>> very welcome.
>
> "t = np.dtype('enum', map=(n,v))"
>
> ^ Is this supposed to be indicating 'this is an enum with values
> ranging between n and v'? It could be a bit more clear.
>
> Is it possible to partially define an enum? That is, give the maximum
> and minimum values, and only some of the enumeration value:name
> mappings?
> For example, an enum where 0 means 'n/a', +n means 'Type A Object
> #(n-1)' and -n means 'Type B Object #(abs(n) - 1)'. I just want to map
> the non-scalar values, while having a way to avoid treating valid
> scalar values (eg +64) as out-of-range.
> Example of what I mean:
>
> "t = np.dtype('enum[N_A:0]', range = (-127, 127))"
> (defined values being printed as a string, undefined being printed as a number.)
>
> David

I'll have to think about this (a little brain dump here). I have many
use cases in pandas where this would be useful; they are basically
direct translations of R's factor data type. Note that, AFAICT, R
always coerces the levels (the unique values) to string type. However,
mapping back to a well-dtyped array is important, too. So the
temptation might be to do something like this:

ndarray: dtype storage type (uint32 or something)
mapping : khash with type PyObject* -> uint32

Now, one problem with this is that you want the mapping + dtype to be
invertible (otherwise you're left doing some type inference when
converting back). The way that I implement the mapping is to restrict
the labels to the integers 0 to N - 1, which makes things easier. If we
decide that having an explicit value mapping

The nice thing about this is that the same set of core algorithms can
be used to fix numpy.unique. For example you would like to be able to
do:

enum_arr = np.enum(arr)

(this seems like a reasonable API to me), and that is a direct
equivalent of R's factor function. You need to be able to pass an
explicit ordering when calling the enum/factor function. If no
ordering is specified, you should have the option to sort or not. For
example, suppose you convert an array of 1 million integers to enum
but don't particularly care about the uniques (which could be very
numerous, up to the size of the array) being ordered: there is no need
to pay N log N for large N.
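
As a rough illustration of the behavior described above, here is a
minimal pure-Python sketch. It uses a dict where the real thing would
use khash, and the function name `factorize` and its `levels` and
`sort` parameters are hypothetical, not an existing NumPy API. Codes
run from 0 to N - 1, so `levels[codes]` recovers the original array.

```python
import numpy as np

def factorize(values, levels=None, sort=False):
    """Hypothetical helper: return (codes, levels) with codes in 0..N-1."""
    values = np.asarray(values)
    if levels is not None:
        # Explicit ordering supplied by the caller.
        levels = np.asarray(levels)
        table = {v: i for i, v in enumerate(levels.tolist())}
        codes = np.array([table[v] for v in values.tolist()], dtype=np.uint32)
        return codes, levels

    # Single hash-table pass; levels come out in order of first appearance.
    table = {}
    codes = np.empty(len(values), dtype=np.uint32)
    for i, v in enumerate(values.tolist()):
        codes[i] = table.setdefault(v, len(table))
    levels = np.array(list(table), dtype=values.dtype)

    if sort:
        # Only the uniques are sorted, and only when the caller asks.
        order = np.argsort(levels)
        rank = np.empty(len(order), dtype=np.uint32)
        rank[order] = np.arange(len(order), dtype=np.uint32)
        codes = rank[codes]
        levels = levels[order]
    return codes, levels

arr = np.array(["b", "a", "b", "c", "a"])
codes, levels = factorize(arr)
roundtrip = levels[codes]  # invertible: same contents as arr
```

With `sort=False` the cost is one hash lookup per element; the sort,
when requested, touches only the unique values.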

One nice thing about khash is that it can be serialized fairly easily.

Have you looked much at how I use enum-like ideas in pandas? It would
be great if I could offload some of this data algorithmic work to
NumPy.

We will want the enum data type to integrate with text file readers:
if you "factorize as you go", you can drastically reduce the memory
usage of structured array (or pandas DataFrame) columns that hold
long-ish strings with relatively few unique values.
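
To make "factorize as you go" concrete, here is a small sketch of a
reader that assigns codes while it streams over the lines of a
one-column text file, so each distinct string is stored only once. The
function name `read_column_as_codes` is made up for illustration, and a
plain dict again stands in for khash.

```python
import io
import numpy as np

def read_column_as_codes(lines):
    """Hypothetical reader: one label per line -> (uint32 codes, levels)."""
    table = {}   # label -> code, filled in order of first appearance
    codes = []
    for line in lines:
        label = line.rstrip("\n")
        # Each row keeps only a small integer; the string lives once in table.
        codes.append(table.setdefault(label, len(table)))
    return np.array(codes, dtype=np.uint32), list(table)

# Stand-in for an open text file.
text = io.StringIO("New York\nChicago\nNew York\nNew York\nChicago\n")
city_codes, city_levels = read_column_as_codes(text)
```

The result is one uint32 per row plus one copy of each distinct label,
rather than one full string per row.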

- Wes

