[Numpy-discussion] Enum/Factor NEP (now with code)

Wes McKinney wesmckinn@gmail....
Wed Jun 13 13:54:44 CDT 2012


On Wed, Jun 13, 2012 at 2:12 PM, Nathaniel Smith <njs@pobox.com> wrote:
> On Wed, Jun 13, 2012 at 5:44 PM, Bryan Van de Ven <bryanv@continuum.io> wrote:
>> On 6/13/12 8:33 AM, Nathaniel Smith wrote:
>>> Hi Bryan,
>>>
>>> I skimmed over the diff:
>>>     https://github.com/bryevdv/numpy/compare/master...enum
>>> It was a bit hard to read since it seems like about half the changes
> in that branch are datetime cleanups or something? I hope you'll
>>> separate those out -- it's much easier to review self-contained
>>> changes, and the more changes you roll together into a big lump, the
>>> more risk there is that they'll get lost all together.
>>
>> I'm not quite sure what happened there, my git skills are not advanced
>> by any measure. I think the datetime changes are a much smaller fraction
>> than fifty percent, but I will see what I can do to separate them out in
>> the near future.
>
> Looking again, it looks like a lot of it is actually because when I
> asked github to show me the diff between your branch and master, it
> showed me the diff between your branch and *your repository's* version
> of master. And your branch is actually based off a newer version of
> 'master' than you have in your repository. So, as far as git and
> github are concerned, all those changes that are included in
> your-branch's-base-master but not in your-repo's-master are new stuff
> that you did on your branch. Solution is just to do
>  git push <your github remote name> master
>
>>>  From the updated NEP I actually understand the use case for "open
>>> types" now, so that's good :-). But I don't think they're actually
>>> workable, so that's bad :-(. The use case, as I understand it, is for
>>> when you want to extend the levels set on the fly as you read through
>>> a file. The problem with this is that it produces a non-deterministic
>>> level ordering, where level 0 is whatever was seen first in the file,
>>> level 1 is whatever was seen second, etc. E.g., say I have a CSV file
>>> I read in:
>>>
>>>      subject,initial_skill,skill_after_training
>>>      1,LOW,HIGH
>>>      2,LOW,LOW
>>>      3,HIGH,HIGH
>>>      ...
>>>
>>> With the scheme described in the NEP, my initial_skill dtype will have
>>> levels ["LOW", "HIGH"], and my skill_after_training dtype will have
>>> levels ["HIGH","LOW"], which means that their storage will be
>>> incompatible, comparisons won't work (or will have to go through some
>>
>> I imagine users using the same open dtype object in both fields of the
>> structure dtype used to read in the file, if both fields of the file
>> contain the same categories. If they don't contain the same categories,
>> they are incomparable in any case. I believe many users have this
>> simpler use case where each field is a separate category, and they want
>> to read them all individually, separately on the fly.  For these simple
>> cases, it would "just work". For your case example there would
>> definitely be a documentation, examples, tutorials, education issue, to
>> avoid the "gotcha" you describe.
>
> Yes, of course we *could* write the code to implement these "open"
> dtypes, and then write the documentation, examples, tutorials, etc. to
> help people work around their limitations. Or, we could just implement
> np.fromfile properly, which would require no workarounds and take less
> code to boot.
>
>>> nasty convert-to-string-and-back path), etc. Another situation where
>>> this will occur is if you have multiple data files in the same format;
>>> whether or not you're able to compare the data from them will depend
>>> on the order the data happens to occur in each file. The solution
>>> is that whenever we automagically create a set of levels from some
>>> data, and the user hasn't specified any order, we should pick an order
>>> deterministically by sorting the levels. (This is also what R does.
>>> levels(factor(c("a", "b"))) ->  "a", "b". levels(factor(c("b", "a")))
>>> ->  "a", "b".)
>>
>> A solution is to create the dtype object when reading in the first file,
>> and to reuse that same dtype object when reading in subsequent files.
>> Perhaps it's not ideal, but it does enable the work to be done.
>
> So would a proper implementation of np.fromfile that normalized the
> level ordering.
>
>>> Can you explain why you're using khash instead of PyDict? It seems to
>>> add a *lot* of complexity -- like it seems like you're using about as
>>> many lines of code just marshalling data into and out of the khash as
>>> I used for my old npenum.pyx prototype (not even counting all the
>>> extra work required), and AFAICT my prototype has about the same
>>> amount of functionality as this. (Of course that's not entirely fair,
>>> because I was working in Cython... but why not work in Cython?) And
>>> you'll need to expose a Python dict interface sooner or later anyway,
>>> I'd think?
>>
>> I suppose I agree with the sentiment that the core of NumPy really ought
>> to be less dependent on the Python C API, not more. I also think the
>> khash API is pretty dead simple and straightforward, and the fact that
>> it is contained in a single header is attractive.  It's also quite
>> performant in time and space. But if others disagree strongly, since
>> all of its uses are hidden behind the interface in leveled_dtypes.c,
>> it could be replaced with some other mechanism easily enough.
>
> I'm not at all convinced by the argument that throwing in random
> redundant data types into NumPy will somehow reduce our dependence on
> the Python C API. If you have a plan to replace *all* use of dicts in
> numpy with khash, then we can talk about that, I guess. But that would
> be a separate patch, and I don't think using PyDict in this patch
> would really have any effect on how difficult that separate patch was
> to do.
>
> PyDict also has a very simple API -- and in fact, the comparison is
> between the PyDict API+the khash API, versus just the PyDict API
> alone, since everyone working with the Python C API already has to
> know how that works. It's also contained in effectively zero header
> files, which is even more attractive than one header file. And that
> interface in leveled_dtypes.c is the one that I was talking about
> being larger than my entire categorical dtype implementation.
>
> None of this means that using it is a bad idea, of course! Maybe it
> has some key advantage over PyDict in terms of memory use or
> something, for those people who have hundreds of thousands of distinct
> categories in their data, I don't know. But all your arguments here
> seem to be of the form "hey, it's not *that* bad", and it seems like
> there must be some actual affirmative advantages it has over PyDict if
> it's going to be worth using.
>
>>> I can't tell if it's worth having categorical scalar types. What value
>>> do they provide over just using scalars of the level type?
>>
>> I'm not certain they are worthwhile either, which is why I did not spend
>> any time on them yet. Wes has expressed a desire for very broad
>> categorical types (even more than just scalar categories), hopefully he
>> can chime in with his motivations.
>>
>>> Terminology: I'd like to suggest we prefer the term "categorical" for
>>> this data, rather than "factor" or "enum". Partly this is because it
>>> makes my life easier ;-):
>>>    https://groups.google.com/forum/#!msg/pystatsmodels/wLX1-a5Y9fg/04HFKEu45W4J
>>> and partly because numpy has a very diverse set of users and I suspect
>>> that "categorical" will just be a more transparent name to those who
>>> aren't already familiar with the particular statistical and
>>> programming traditions that "factor" and "enum" come from.
>>
>> I think I like "categorical" over "factor" but I am not sure we should
>> ditch "enum". There are two different use cases here: I have a pile of
>> strings (or scalars) that I want to treat as discrete things
>> (categories), and: I have a pile of numbers that I want to give
>> convenient or meaningful names to (enums). This latter case was the
>> motivation for possibly adding "Natural Naming".
>
> So mention the word "enum" in the documentation, so people looking for
> that will find the categorical data support? :-)
>
>>> I'm disturbed to see you adding special cases to the core ufunc
>>> dispatch machinery for these things. I'm -1 on that. We should clean
>>> up the generic ufunc machinery so that it doesn't need special cases
>>> to handle adding a simple type like this.
>>
>> This could certainly be improved, I agree.
>
> I don't want to be Mr. Grumpypants here, but I do want to make sure
> we're speaking the same language: what "-1" means is "I consider this
> a show-stopper and will oppose merging any code that does not improve
> on this". (Of course you also always have the option of trying to
> change my mind. Even Mr. Grumpypants can be swayed by logic!)
>
>>> I'm also worried that I still don't see any signs that you're working
>>> with the downstream libraries that this functionality is intended to
>>> be useful for, like the various HDF5 libraries and pandas. I really
>>> don't think this functionality can be merged to numpy until we have
>>> affirmative statements from those developers that they are excited
>>> about it and will use it, and since they're busy people, it's pretty
>>> much your job to track them down and make sure that your code will
>>> solve their problems.
>>
>> Francesc is certainly aware of this work, and I emailed Wes earlier this
>> week, I probably should have mentioned that, though. Hopefully they will
>> have time to contribute their thoughts. I also imagine Travis can speak
>> on behalf of the users he has interacted with over the last several
>> years that have requested this feature that don't happen to follow
>> mailing lists.
>
> I'm glad Francesc and Wes are aware of the work, but my point was that
> that isn't enough. So if I were in your position and hoping to get
> this code merged, I'd be trying to figure out how to get them more
> actively on board?
>
> -N
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

OK, I need to spend some time on this as it will directly impact me.
Random thoughts here.

It looks like the levels can only be strings. This is too limited for
my needs. Why not support all possible NumPy dtypes? In pandas world,
the levels can be any unique Index object (note, I'm going to change
the name of the Factor class to Categorical before 0.8.0 final per
discussion with Nathaniel):

In [2]: Factor.from_array(np.random.randint(0, 10, 100))
Out[2]:
Factor:
array([6, 6, 4, 2, 1, 2, 3, 5, 1, 5, 2, 9, 2, 8, 8, 1, 5, 2, 6, 9, 2, 1, 3,
       6, 4, 4, 8, 1, 3, 1, 7, 9, 6, 4, 8, 0, 2, 9, 6, 2, 0, 6, 7, 5, 1, 7,
       8, 2, 7, 9, 7, 6, 5, 8, 3, 9, 4, 5, 0, 1, 4, 1, 8, 8, 6, 8, 0, 2, 2,
       7, 0, 9, 9, 9, 4, 6, 4, 1, 8, 6, 3, 3, 2, 5, 3, 9, 9, 0, 0, 7, 2, 1,
       6, 0, 7, 6, 6, 0, 7, 5])
Levels (10): array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [3]: Factor.from_array(np.random.randint(0, 10, 100)).levels
Out[3]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [4]: Factor.from_array(np.random.randint(0, 10, 100)).labels
Out[4]:
array([0, 4, 3, 6, 6, 0, 6, 2, 2, 6, 2, 4, 7, 4, 1, 8, 1, 4, 8, 6, 4, 5, 6,
       4, 8, 3, 9, 5, 3, 0, 4, 2, 7, 0, 1, 8, 0, 7, 8, 6, 5, 6, 1, 6, 2, 7,
       8, 5, 7, 5, 1, 5, 0, 5, 6, 5, 5, 4, 0, 3, 3, 8, 5, 1, 1, 2, 6, 7, 7,
       1, 6, 6, 4, 4, 8, 2, 1, 7, 8, 3, 7, 8, 1, 5, 0, 6, 9, 9, 9, 5, 7, 3,
       1, 2, 0, 1, 5, 6, 4, 5])

The API for constructing an enum/factor/categorical array from fixed
levels and an array of labels seems somewhat weak to me. A very common
scenario is to need to construct a factor from an array of integers
with an associated array of levels:


In [13]: labels
Out[13]:
array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7,
       1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8,
       0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1,
       0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8,
       7, 9, 7, 3, 3, 0, 4, 4])

In [14]: levels
Out[14]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [15]: Factor(labels, levels)
Out[15]:
Factor:
array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7,
       1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8,
       0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1,
       0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8,
       7, 9, 7, 3, 3, 0, 4, 4])
Levels (10): array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

What is the story for NA values (NaL?) in a factor array? I code them
as -1 in the labels, though you could use INT32_MAX or something. This
is very important in the context of groupby operations.
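To make the -1 sentinel concrete, here is a minimal sketch (hypothetical data, plain NumPy, not pandas internals) of coding and decoding with a missing-value sentinel:

```python
import numpy as np

levels = np.array(["LOW", "MED", "HIGH"])
labels = np.array([0, 2, -1, 1, -1, 2])  # -1 marks a missing value (NA)

# Decode: mask out the sentinel before indexing into the levels
mask = labels == -1
decoded = np.where(mask, "NA", levels[np.where(mask, 0, labels)])
print(decoded.tolist())  # ['LOW', 'HIGH', 'NA', 'MED', 'NA', 'HIGH']
```

A groupby over such an array would simply skip the rows whose label is -1.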

Are the levels ordered (it looks like Nathaniel already brought this up)?
It doesn't look like it. That is also necessary. You also need to be
able to sort the levels (which is a relabeling; I have lots of code in
use for this). In the context of groupby in pandas, when processing a
key (array of values) to a factor to be used for aggregating some
data, you have the option of returning an object that has the levels
as observed in the data or sorting. Sorting can obviously be very
expensive depending on the number of groups in the data
(http://wesmckinney.com/blog/?p=437). Example:

from pandas import DataFrame
from pandas.util.testing import rands
import numpy as np

df = DataFrame({'key': [rands(10) for _ in xrange(100000)] * 10,
                'data': np.random.randn(1000000)})

In [32]: timeit df.groupby('key').sum()
1 loops, best of 3: 374 ms per loop

In [33]: timeit df.groupby('key', sort=False).sum()
10 loops, best of 3: 185 ms per loop

The "factorization time" for the `key` column dominates the runtime;
the factor is computed once then reused if you keep the GroupBy object
around:

In [35]: grouped = df.groupby('key')

In [36]: timeit grouped.sum()
100 loops, best of 3: 6.05 ms per loop
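The sort-the-levels relabeling mentioned above can be sketched in a few lines of NumPy (hypothetical variable names; not pandas' actual implementation): permute the levels, then remap the old codes through the inverse permutation so decoded values are unchanged:

```python
import numpy as np

levels = np.array(["b", "c", "a"])   # levels in observed (unsorted) order
labels = np.array([0, 2, 1, 0])      # integer codes into `levels`

order = np.argsort(levels)               # permutation that sorts the levels
inverse = np.empty_like(order)
inverse[order] = np.arange(len(order))   # maps old code -> new code

new_levels = levels[order]               # ['a', 'b', 'c']
new_labels = inverse[labels]
assert list(new_levels[new_labels]) == list(levels[labels])  # same data
```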

As another example of why ordered factors matter, consider a quantile
cut (google for the "cut" function in R) function I wrote recently:


In [40]: arr = Series(np.random.randn(1000000))

In [41]: cats = qcut(arr, [0, 0.25, 0.5, 0.75, 1])

In [43]: arr.groupby(cats).describe().unstack(0)
Out[43]:
       (-4.85, -0.673]  (-0.673, 0.00199]  (0.00199, 0.677]  (0.677, 4.914]
count    250000.000000      250000.000000     250000.000000   250000.000000
mean         -1.270623          -0.323092          0.326325        1.271519
std           0.491317           0.193254          0.193044        0.490611
min          -4.839798          -0.673224          0.001992        0.677177
25%          -1.533021          -0.487450          0.158736        0.888502
50%          -1.150136          -0.317501          0.320352        1.150480
75%          -0.887974          -0.155197          0.490456        1.534709
max          -0.673224           0.001990          0.677176        4.913536

If you don't have ordered levels, then the quantiles might come out in
the wrong order depending on how the strings sort or fall out of the
hash table.
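A toy illustration of the determinism issue (plain Python, not pandas' implementation): with insertion-order levels, two inputs containing the same categories in different order get incompatible level orderings, while sorting the levels makes the order reproducible:

```python
def factorize(values, sort_levels=True):
    """Map values to integer codes; sorted levels give a deterministic order."""
    seen = {}
    for v in values:
        if v not in seen:
            seen[v] = len(seen)            # insertion order, as observed
    levels = sorted(seen) if sort_levels else list(seen)
    mapping = {lev: i for i, lev in enumerate(levels)}
    return [mapping[v] for v in values], levels

# Same categories, different observed order:
_, a = factorize(["LOW", "HIGH"], sort_levels=False)
_, b = factorize(["HIGH", "LOW"], sort_levels=False)
assert a != b                              # incompatible level orderings

_, c = factorize(["LOW", "HIGH"])
_, d = factorize(["HIGH", "LOW"])
assert c == d == ["HIGH", "LOW"]           # sorted: deterministic
```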

Nathaniel: my experience (see blog posting above for a bit more) is
that khash really crushes PyDict for two reasons: you can use it with
primitive types and avoid boxing, and secondly you can preallocate.
Its memory footprint with large hashtables is also a fraction of
PyDict. The Python memory allocator is not the problem: if you create
millions of Python objects, expect the RAM usage of the Python process
to balloon absurdly.
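The boxing overhead is easy to see from Python itself; the figures below are CPython-specific rough numbers, ignoring hash-table overhead on both sides:

```python
import sys
import numpy as np

n = 1_000_000
boxed_bytes = sys.getsizeof(0) * n                   # ~28 bytes per boxed int object
unboxed_bytes = np.empty(n, dtype=np.int64).nbytes   # 8 bytes per primitive slot
print(boxed_bytes / unboxed_bytes)                   # payload alone is several times larger
```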

Anyway, this is exciting work, assuming we get the API right and hit
all the use cases. On top of all this I am _very_ performance
sensitive so you'll have to be pretty aggressive with benchmarking
things. I have concerns about ceding control over critical
functionality that I need for pandas (which has become a large and
very important library these days for a lot of people), but as long as
the pieces in NumPy are suitably mature and robust for me to switch to
them eventually that would be great.

I'll do my best to stay involved in the discussion, though I'm
juggling a lot of things these days (e.g. I have the PyData book
deadline approaching like a freight train).

- Wes

