[Numpy-discussion] Enum/Factor NEP (now with code)
Bryan Van de Ven
Wed Jun 13 17:06:29 CDT 2012
On 6/13/12 1:12 PM, Nathaniel Smith wrote:
> your-branch's-base-master but not in your-repo's-master are new stuff
> that you did on your branch. Solution is just to do
> git push <your github remote name> master
> Yes, of course we *could* write the code to implement these "open"
> dtypes, and then write the documentation, examples, tutorials, etc. to
> help people work around their limitations. Or, we could just implement
> np.fromfile properly, which would require no workarounds and take less
> code to boot.
> So would a proper implementation of np.fromfile that normalized the
> level ordering.
My understanding of the impetus for the open type was concern about the
performance cost of having to make two passes over large text datasets.
We'll have to get more feedback from users here, and input from Travis, I think.
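To make the two-pass concern concrete, here is a rough sketch in plain Python/NumPy of what a reader with a "closed" dtype has to do; the function name and the int8 code width are illustrative choices on my part, not part of the proposal:

```python
import numpy as np

def read_categorical(lines):
    # Pass 1: collect the distinct levels -- this extra pass over the
    # data is exactly the cost the "open" dtype was meant to avoid.
    levels = sorted(set(lines))
    mapping = {name: code for code, name in enumerate(levels)}
    # Pass 2: encode each observation as a small integer code.
    codes = np.fromiter((mapping[x] for x in lines), dtype=np.int8)
    return levels, codes

levels, codes = read_categorical(["red", "green", "red", "blue"])
# levels == ['blue', 'green', 'red'], codes == [2, 1, 2, 0]
```

An "open" dtype would let the level set grow during a single pass, at the price of the dtype's metadata being mutable.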
> categories in their data, I don't know. But all your arguments here
> seem to be of the form "hey, it's not *that* bad", and it seems like
> there must be some actual affirmative advantages it has over PyDict if
> it's going to be worth using.
I should have been more specific about the performance concerns. Wes
summed them up, though: better space efficiency, and not having to
box/unbox native types.
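As a rough illustration of the space argument (plain NumPy only; nothing here is from the proposed dtype itself):

```python
import numpy as np

# One million categorical observations drawn from three levels.
values = ["low", "medium", "high"] * 333_334

# Object dtype: every element is a pointer to a boxed Python string.
boxed = np.array(values, dtype=object)

# Categorical-style storage: one small integer code per element,
# plus a single shared table of level names.
levels = ["low", "medium", "high"]
mapping = {name: code for code, name in enumerate(levels)}
codes = np.fromiter((mapping[v] for v in values), dtype=np.int8)

print(boxed.nbytes)  # 8 bytes per element on a 64-bit build
print(codes.nbytes)  # 1 byte per element
```

Operating on the codes array also never has to box or unbox Python objects.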
>> I think I like "categorical" over "factor" but I am not sure we should
>> ditch "enum". There are two different use cases here: I have a pile of
>> strings (or scalars) that I want to treat as discrete things
>> (categories), and: I have a pile of numbers that I want to give
>> convenient or meaningful names to (enums). This latter case was the
>> motivation for possibly adding "Natural Naming".
> So mention the word "enum" in the documentation, so people looking for
> that will find the categorical data support? :-)
I'm not sure I follow. Natural Naming seems like a great idea for people
who want something like an actual enum (i.e., a way to avoid magic
numbers). We could even imagine some nice with-hacks:
colors = enum(['red', 'green', 'blue'])
But natural naming will not work for category names that are not valid
Python identifiers ("VERY HIGH", for example). So, we could add a parameter to factor(...)
that turns on and off natural naming for a dtype object when it is created:
colors = factor(['red', 'green', 'blue'], closed=True, natural_naming=False)
colors = enum(['red', 'green', 'blue'])
I think the latter is better, not only because it is more parsimonious,
but because it also expresses intent better. Or we can just not have
natural naming at all, if no one wants it. It hasn't been implemented
yet, so that would be a snap. :) Hopefully we'll get more feedback from
users on this point as well.
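For what it's worth, here is a toy sketch of what natural naming could look like. The `enum` class below is purely illustrative, written by me for this message, and is not the proposed implementation:

```python
class enum:
    """Toy sketch: expose each category's code as an attribute."""
    def __init__(self, names):
        # Names with spaces ("VERY HIGH") could still be stored via
        # setattr, but dotted access to them would be a SyntaxError --
        # hence the objection about such category names.
        for code, name in enumerate(names):
            setattr(self, name, code)
        self._names = list(names)

colors = enum(['red', 'green', 'blue'])
print(colors.green)  # 1, instead of a bare magic number
```

The real version would of course attach this behavior to the dtype object itself rather than a standalone class.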
>>> I'm disturbed to see you adding special cases to the core ufunc
>>> dispatch machinery for these things. I'm -1 on that. We should clean
>>> up the generic ufunc machinery so that it doesn't need special cases
>>> to handle adding a simple type like this.
>> This could certainly be improved, I agree.
> I don't want to be Mr. Grumpypants here, but I do want to make sure
> we're speaking the same language: what "-1" means is "I consider this
> a show-stopper and will oppose merging any code that does not improve
> on this". (Of course you also always have the option of trying to
> change my mind. Even Mr. Grumpypants can be swayed by logic!)
Well, a few comments. The special case in array_richcompare is due to
the lack of string ufuncs. I think it would be great to have string
ufuncs, but I also think it is a separate concern and outside the scope
of this proposal. The special case in arraydescr_typename_get is there for
the same reason as the datetime special case: the need to access dtype metadata.
I don't think you are really concerned about these two, though?
That leaves the special case in
PyUFunc_SimpleBinaryComparisonTypeResolver. As I said, I chafed a bit
when I put that in. On the other hand, having dtypes with this extent of
attached metadata, and potentially dynamic metadata, is unique in NumPy.
It was simple and straightforward to add those few lines of code, and
they do not affect performance. How invasive will the changes to the core ufunc
machinery be to accommodate a type like this more generally? I took the
easy way because I was new to the numpy codebase and did not feel
confident mucking with the central ufunc code. However, maybe the
dispatch can be accomplished easily with the casting machinery. I am not
so sure, I will have to investigate. Of course, I welcome input,
suggestions, and proposals on the best way to improve this.
>> I'm glad Francesc and Wes are aware of the work, but my point was that
>> that isn't enough. So if I were in your position and hoping to get
>> this code merged, I'd be trying to figure out how to get them more
>> actively on board?
Is there some other way besides responding to and attempting to
accommodate technical needs?