[Numpy-discussion] Enum/Factor NEP (now with code)

Wes McKinney wesmckinn@gmail....
Wed Jun 13 13:54:44 CDT 2012

On Wed, Jun 13, 2012 at 2:12 PM, Nathaniel Smith <njs@pobox.com> wrote:
> On Wed, Jun 13, 2012 at 5:44 PM, Bryan Van de Ven <bryanv@continuum.io> wrote:
>> On 6/13/12 8:33 AM, Nathaniel Smith wrote:
>>> Hi Bryan,
>>> I skimmed over the diff:
>>>     https://github.com/bryevdv/numpy/compare/master...enum
>>> It was a bit hard to read since it seems like about half the changes
>>> in that branch are datetime cleanups or something? I hope you'll
>>> separate those out -- it's much easier to review self-contained
>>> changes, and the more changes you roll together into a big lump, the
>>> more risk there is that they'll get lost altogether.
>> I'm not quite sure what happened there, my git skills are not advanced
>> by any measure. I think the datetime changes are a much smaller fraction
>> than fifty percent, but I will see what I can do to separate them out in
>> the near future.
> Looking again, it looks like a lot of it is actually because when I
> asked github to show me the diff between your branch and master, it
> showed me the diff between your branch and *your repository's* version
> of master. And your branch is actually based off a newer version of
> 'master' than you have in your repository. So, as far as git and
> github are concerned, all those changes that are included in
> your-branch's-base-master but not in your-repo's-master are new stuff
> that you did on your branch. Solution is just to do
>  git push <your github remote name> master
>>> From the updated NEP I actually understand the use case for "open
>>> types" now, so that's good :-). But I don't think they're actually
>>> workable, so that's bad :-(. The use case, as I understand it, is for
>>> when you want to extend the levels set on the fly as you read through
>>> a file. The problem with this is that it produces a non-deterministic
>>> level ordering, where level 0 is whatever was seen first in the file,
>>> level 1 is whatever was seen second, etc. E.g., say I have a CSV file
>>> I read in:
>>>      subject,initial_skill,skill_after_training
>>>      1,LOW,HIGH
>>>      2,LOW,LOW
>>>      3,HIGH,HIGH
>>>      ...
>>> With the scheme described in the NEP, my initial_skill dtype will have
>>> levels ["LOW", "HIGH"], and my skill_after_training dtype will have
>>> levels ["HIGH","LOW"], which means that their storage will be
>>> incompatible, comparisons won't work (or will have to go through some
>> I imagine users using the same open dtype object in both fields of the
>> structure dtype used to read in the file, if both fields of the file
>> contain the same categories. If they don't contain the same categories,
>> they are incomparable in any case. I believe many users have this
>> simpler use case where each field is a separate category, and they want
>> to read each of them individually, on the fly. For these simple cases,
>> it would "just work". For your example, there would definitely be a
>> documentation and education issue (examples, tutorials) to avoid the
>> "gotcha" you describe.
> Yes, of course we *could* write the code to implement these "open"
> dtypes, and then write the documentation, examples, tutorials, etc. to
> help people work around their limitations. Or, we could just implement
> np.fromfile properly, which would require no workarounds and take less
> code to boot.
>>> nasty convert-to-string-and-back path), etc. Another situation where
>>> this will occur is if you have multiple data files in the same format;
>>> whether or not you're able to compare the data from them will depend
>>> on the order the data happens to occur in each file. The solution
>>> is that whenever we automagically create a set of levels from some
>>> data, and the user hasn't specified any order, we should pick an order
>>> deterministically by sorting the levels. (This is also what R does.
>>> levels(factor(c("a", "b"))) ->  "a", "b". levels(factor(c("b", "a")))
>>> ->  "a", "b".)
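The deterministic ordering described above can be sketched in NumPy today with np.unique, which returns distinct values in sorted order (an illustration of the technique, not the proposed dtype API):

```python
import numpy as np

# np.unique returns distinct values in sorted order, so the level
# set does not depend on the order values appear in the data.
a = np.array(["b", "a", "b"])
b = np.array(["a", "b", "a"])
levels_a = np.unique(a)   # ["a", "b"]
levels_b = np.unique(b)   # ["a", "b"] -- same levels, same order

# Integer codes computed against these sorted levels are then
# directly comparable across both arrays:
codes_a = np.searchsorted(levels_a, a)  # [1, 0, 1]
codes_b = np.searchsorted(levels_b, b)  # [0, 1, 0]
```

This mirrors the cited R behavior: levels(factor(...)) is sorted regardless of input order.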
>> A solution is to create the dtype object when reading in the first file,
>> and to reuse that same dtype object when reading in subsequent files.
>> Perhaps it's not ideal, but it does enable the work to be done.
> So would a proper implementation of np.fromfile that normalized the
> level ordering.
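One possible shape for that normalization step, sketched with a hypothetical helper (normalize_levels is not part of the NEP or any patch), is to remap first-appearance codes onto the sorted level ordering after the file has been read:

```python
import numpy as np

def normalize_levels(codes, levels):
    """Remap first-appearance integer codes so the levels end up in
    sorted (deterministic) order. Hypothetical helper for illustration.

    codes  -- integer indices into `levels`, as accumulated while reading
    levels -- level values in order of first appearance
    """
    order = np.argsort(levels)             # sorted position of each old level
    remap = np.empty(len(levels), dtype=np.intp)
    remap[order] = np.arange(len(levels))  # old code -> new code
    new_codes = remap[np.asarray(codes)]
    new_levels = [levels[i] for i in order]
    return new_codes, new_levels

# Levels as first seen in the file: 0 -> "LOW", 1 -> "HIGH"
codes, levels = normalize_levels([0, 0, 1], ["LOW", "HIGH"])
# levels is now ["HIGH", "LOW"] (sorted); codes remapped to match
```

Two files read this way end up with identical level orderings, so their stored codes are directly comparable.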
>>> Can you explain why you're using khash instead of PyDict? It seems to
>>> add a *lot* of complexity -- like it seems like you're using about as
>>> many lines of code just marshalling data into and out of the khash as
>>> I used for my old npenum.pyx prototype (not even counting all the
>>> extra work required), and AFAICT my prototype has about the same
>>> amount of functionality as this. (Of course that's not entirely fair,
>>> because I was working in Cython... but why not work in Cython?) And
>>> you'll need to expose a Python dict interface sooner or later anyway,
>>> I'd think?
>> I suppose I agree with the sentiment that the core of NumPy really ought
>> to be less dependent on the Python C API, not more. I also think the
>> khash API is pretty dead simple and straightforward, and the fact that
>> it is contained in a single header is attractive.  It's also quite
>> performant in time and space. But if others disagree strongly, all of
>> its uses are hidden behind the interface in leveled_dtypes.c, so it
>> could be replaced with some other mechanism easily enough.
> I'm not at all convinced by the argument that throwing in random
> redundant data types into NumPy will somehow reduce our dependence on
> the Python C API. If you have a plan to replace *all* use of dicts in
> numpy with khash, then we can talk about that, I guess. But that would
> be a separate patch, and I don't think using PyDict in this patch
> would really have any effect on how difficult that separate patch was
> to do.
> PyDict also has a very simple API -- and in fact, the comparison is
> between the PyDict API+the khash API, versus just the PyDict API
> alone, since everyone working with the Python C API already has to
> know how that works. It's also contained in effectively zero header
> files, which is even more attractive than one header file. And that
> interface in leveled_dtypes.c is the one that I was talking about
> being larger than my entire categorical dtype implementation.
> None of this means that using it is a bad idea, of course! Maybe it
> has some key advantage over PyDict in terms of memory use or
> something, for those people who have hundreds of thousands of distinct
> categories in their data, I don't know. But all your arguments here
> seem to be of the form "hey, it's not *that* bad", and it seems like
> there must be some actual affirmative advantages it has over PyDict if
> it's going to be worth using.
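For comparison, the dict-based level interning that the PyDict argument implies is only a few lines at the Python level (a sketch of the general "open dtype" accumulation pattern with a plain dict in place of khash, not the actual code from either implementation):

```python
def encode_levels(values):
    """Assign each distinct value a small integer code, in order of
    first appearance. Illustrative sketch of level interning with a
    plain dict rather than khash."""
    level_to_code = {}
    codes = []
    for v in values:
        # setdefault inserts a new code only for unseen levels
        codes.append(level_to_code.setdefault(v, len(level_to_code)))
    return codes, list(level_to_code)

codes, levels = encode_levels(["LOW", "HIGH", "LOW"])
# codes == [0, 1, 0], levels == ["LOW", "HIGH"]
```

The C-API equivalent would use PyDict_SetDefault/PyDict_GetItem the same way; the question in the thread is whether khash buys anything over that beyond avoiding Python object boxing.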
>>> I can't tell if it's worth having categorical scalar types. What value
>>> do they provide over just using scalars of the level type?
>> I'm not certain they are worthwhile either, which is why I did not spend
>> any time on them yet. Wes has expressed a desire for very broad
>> categorical types (even more than just scalar categories), hopefully he
>> can chime in with his motivations.
>>> Terminology: I'd like to suggest we prefer the term "categorical" for
>>> this data, rather than "factor" or "enum". Partly this is because it
>>> makes my life easier ;-):
>>>    https://groups.google.com/forum/#!msg/pystatsmodels/wLX1-a5Y9fg/04HFKEu45W4J
>>> and partly because numpy has a very diverse set of users and I suspect
>>> that "categorical" will just be a more transparent name to those who
>>> aren't already familiar with the particular statistical and
>>> programming traditions that "factor" and "enum" come from.
>> I think I like "categorical" over "factor" but I am not sure we should
>> ditch "enum". There are two different use cases here: I have a pi