[Numpy-discussion] Enum/Factor NEP (now with code)
Dag Sverre Seljebotn
Wed Jun 13 11:04:53 CDT 2012
On 06/13/2012 03:33 PM, Nathaniel Smith wrote:
> On Tue, Jun 12, 2012 at 10:27 PM, Bryan Van de Ven<email@example.com> wrote:
>> Hi all,
>> It has been some time, but I do have an update regarding this proposed
>> feature. I thought it would be helpful to flesh out some parts of a
>> possible implementation to learn what can be spelled reasonably in
>> NumPy. Mark Wiebe helped out greatly in navigating the NumPy code
>> codebase. Here is a link to my branch with this code;
>> and the updated NEP:
>> Not everything in the NEP is implemented (integral levels and natural
>> naming in particular) and some parts definitely need more fleshing out.
>> However, things currently work basically as described in the NEP, and
>> there is also a small set of tests that demonstrate current usage. A few
>> things will crash python (astype especially). More tests are needed. I
>> would appreciate as much feedback and discussion as you can provide!
> Hi Bryan,
> I skimmed over the diff:
> It was a bit hard to read since it seems like about half the changes
> in that branch are datatime cleanups or something? I hope you'll
> separate those out -- it's much easier to review self-contained
> changes, and the more changes you roll together into a big lump, the
> more risk there is that they'll get lost all together.
> From the updated NEP I actually understand the use case for "open
> types" now, so that's good :-). But I don't think they're actually
> workable, so that's bad :-(. The use case, as I understand it, is for
> when you want to extend the levels set on the fly as you read through
> a file. The problem with this is that it produces a non-deterministic
> level ordering, where level 0 is whatever was seen first in the file,
> level 1 is whatever was seen second, etc. E.g., say I have a CSV file
> I read in:
> With the scheme described in the NEP, my initial_skill dtype will have
> levels ["LOW", "HIGH"], and by skill_after_training dtype will have
> levels ["HIGH","LOW"], which means that their storage will be
> incompatible, comparisons won't work (or will have to go through some
> nasty convert-to-string-and-back path), etc. Another situation where
> this will occur is if you have multiple data files in the same format;
> whether or not you're able to compare the data from them will depend
> on the order the data happens to occur in in each file. The solution
> is that whenever we automagically create a set of levels from some
> data, and the user hasn't specified any order, we should pick an order
> deterministically by sorting the levels. (This is also what R does.
> levels(factor(c("a", "b"))) -> "a", "b". levels(factor(c("b", "a")))
> -> "a", "b".)
> I'm inclined to say therefore that we should just drop the "open type"
> idea, since it adds complexity but doesn't seem to actually solve the
> problem it's designed for.
If one wants to have an "open", hassle-free enum, an alternative would
be to cryptographically hash the enum string. I'd trust 64 bits of hash
for this purpose.
The obvious disadvantage is the extra space used, but it'd be a bit more
hassle-free compared to regular enums; you'd never have to fix the set
of enum strings and they'd always be directly comparable across
different arrays. HDF libraries etc. could compress it at the storage
layer, storing the enum mapping in the metadata.
Just a thought.
> Can you explain why you're using khash instead of PyDict? It seems to
> add a *lot* of complexity -- like it seems like you're using about as
> many lines of code just marshalling data into and out of the khash as
> I used for my old npenum.pyx prototype (not even counting all the
> extra work required to , and AFAICT my prototype has about the same
> amount of functionality as this. (Of course that's not entirely fair,
> because I was working in Cython... but why not work in Cython?) And
> you'll need to expose a Python dict interface sooner or later anyway,
> I'd think?
> I can't tell if it's worth having categorical scalar types. What value
> do they provide over just using scalars of the level type?
> Terminology: I'd like to suggest we prefer the term "categorical" for
> this data, rather than "factor" or "enum". Partly this is because it
> makes my life easier ;-):
> and partly because numpy has a very diverse set of users and I suspect
> that "categorical" will just be a more transparent name to those who
> aren't already familiar with the particular statistical and
> programming traditions that "factor" and "enum" come from.
> I'm disturbed to see you adding special cases to the core ufunc
> dispatch machinery for these things. I'm -1 on that. We should clean
> up the generic ufunc machinery so that it doesn't need special cases
> to handle adding a simple type like this.
> I'm also worried that I still don't see any signs that you're working
> with the downstream libraries that this functionality is intended to
> be useful for, like the various HDF5 libraries and pandas. I really
> don't think this functionality can be merged to numpy until we have
> affirmative statements from those developers that they are excited
> about it and will use it, and since they're busy people, it's pretty
> much your job to track them down and make sure that your code will
> solve their problems.
> Hope that helps -- it's exciting to see someone working on this, and
> you seem to be off to a good start!
> NumPy-Discussion mailing list
More information about the NumPy-Discussion