[Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

Wes McKinney wesmckinn@gmail....
Tue Jul 6 12:43:38 CDT 2010

On Tue, Jul 6, 2010 at 12:56 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
> On Tue, Jul 6, 2010 at 9:52 AM, Joshua Holbrook <josh.holbrook@gmail.com> wrote:
>> On Tue, Jul 6, 2010 at 8:42 AM, Skipper Seabold <jsseabold@gmail.com> wrote:
>>> On Tue, Jul 6, 2010 at 12:36 PM, Joshua Holbrook
>>> <josh.holbrook@gmail.com> wrote:
>>>> I'm kinda-sorta still getting around to building/reading the sphinx
>>>> docs for datarray. <_< Like, I've gone through them before, but it was
>>>> more cursory than I'd like. Honestly, I kinda let myself get caught up
>>>> in trying to automate the process of getting them onto github pages.
>>>> I have to admit that I didn't 100% understand the reasoning behind not
>>>> allowing integer ticks (I blame jet lag--it's a nice scapegoat). I
>>>> believe it originally had to do with what you meant if you typed, say,
>>>> A[3:"london"]; Did you mean the underlying ndarray index 3, or the
>>>> outer level "tick" 3? I think if you didn't allow integers, then you
>>>> could simply wrap your "3" in a string: A["3":"London"] so it's
>>>> probably not a deal-breaker, but I would imagine that using (a)
>>>> separate method(s) for label-based indexing may make it possible to
>>>> allow integer-datatyped labels.
>>>> Thoughts?
>>> Would you mind bottom-posting/ posting in-line to make the thread
>>> easier to follow?
>>>> --Josh
>>>> On Tue, Jul 6, 2010 at 8:23 AM, Keith Goodman <kwgoodman@gmail.com> wrote:
>>>>> On Tue, Jul 6, 2010 at 9:13 AM, Skipper Seabold <jsseabold@gmail.com> wrote:
>>>>>> On Tue, Jul 6, 2010 at 11:55 AM, Keith Goodman <kwgoodman@gmail.com> wrote:
>>>>>>> On Tue, Jul 6, 2010 at 7:47 AM, Joshua Holbrook <josh.holbrook@gmail.com> wrote:
>>>>>>>> I really really really want to work on this. I already forked datarray
>>>>>>>> on github and did some research on What Other People Have Done (
>>>>>>>> http://jesusabdullah.github.com/2010/07/02/datarray.html ). With any
>>>>>>>> luck I'll contribute something actually useful. :)
>>>>>>> I like the figure!
>>>>>>> To do label indexing on a larry you need to use lix, so lar.lix[...]
>>>>>> FYI, if you didn't see it, there are also usage docs in dataarray/doc
>>>>>> that you can build with sphinx that show a lot of the thinking and
>>>>>> examples (they spent time looking at pandas and larry).
>>>>>> One question that was asked of Wes, that I'd propose to you as well
>>>>>> Keith, is that if DataArray became part of NumPy, do you think you
>>>>>> could use it to work on top of for larry?
>>>>> This is all very exciting. I did not know that DataArray had ticks so
>>>>> I never took a close look at it.
>>>>> After reading the sphinx doc, one question I had was how firm is the
>>>>> decision to not allow integer ticks? I use int ticks a lot.
>>> I think what Josh said is right.  However, we proposed having all of
>>> the new labeled axis access pushed to a .aix (or whatever) method, so
>>> as to avoid any confusion, as the original object can be accessed just
>>> as an ndarray.  I'm not sure where this leaves us vis-a-vis ints as
>>> ticks.
>>> Skipper
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> Sorry re: posting at-top. I guess habit surpassed observation of
>> community norms for a second there. Whups!
>> My opinion on the matter is that, as a matter of "purity," labels
>> should all have the string datatype. That said, I'd imagine that
>> passing an int as an argument would be fine, due to python's
>> loosey-goosey attitude towards datatypes. :) That, or, y'know,
>> str(myint).
> Ideally (for me), the only requirement for ticks would be hashable and
> unique along any one axis. So, for example, datetime.date() could be a
> tick but a list could not be a tick (not hashable).

Gmail really needs to get its act together and enable bottom-posting
by default. Definitely an annoyance.

There are many issues at play here, so I wanted to give some of my
thoughts re: building pandas, larry, etc. on top of DataArray (or
whatever it is that makes its way into NumPy). I can put this on the
wiki, too:

1. Giving semantic information to axes (not ticks, though)

I think this is very useful but wouldn't be immediately useful in
pandas, except perhaps for moving the axis names elsewhere (they are
currently a part of the data structures and always have the same
name). I wouldn't be immediately comfortable, say, making a pandas
DataFrame a subclass of DataArray and making them implicitly
interoperable. Going back and forth between DataArray and DataFrame
*should* be an easy operation-- you could imagine using DataArray to
serialize both pandas and larry objects, for example!

2. Container for axis metadata (Axis object in datarray, Index in pandas, ...)

I would be more than happy to offload the "ordered set" data structure
onto NumPy. In pandas, Index is that container-- it's an ndarray
subclass with a handful of methods and a reverse index (e.g. if you
have ['d', 'b', 'a', 'c'] you have a dict somewhere with {'d' : 0, 'b'
: 1, ...} for O(1) lookups). I'm producing the reverse index in Cython
at object creation time-- Keith recently added the same thing (Cython)
to larry to get a speed boost, but he does it only when needed. It's
also nice to have some other convenience methods in this object, like
set operations.
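
A minimal sketch of the "ordered set plus reverse index" idea, in plain
Python rather than Cython. The class name TickIndex and its methods are
hypothetical, chosen only for illustration; this is not the pandas Index
or the datarray Axis implementation.

```python
import numpy as np


class TickIndex:
    """Hypothetical ordered-set container: ticks plus a reverse index."""

    def __init__(self, ticks):
        self.ticks = np.asarray(ticks, dtype=object)
        # Reverse index built eagerly at creation time: tick -> position.
        self._positions = {tick: i for i, tick in enumerate(self.ticks)}
        if len(self._positions) != len(self.ticks):
            raise ValueError("ticks must be unique along an axis")

    def position(self, tick):
        # O(1) lookup through the dict instead of scanning the array.
        return self._positions[tick]

    def union(self, other):
        # Order-preserving union: ticks of self first, then new ones.
        merged = dict.fromkeys(self.ticks)
        merged.update(dict.fromkeys(other.ticks))
        return TickIndex(list(merged))

    def intersection(self, other):
        # Keep the order of self, drop ticks missing from other.
        return TickIndex([t for t in self.ticks if t in other._positions])


idx = TickIndex(['d', 'b', 'a', 'c'])
other = TickIndex(['c', 'e'])
```

Building the dict lazily (only on first lookup) would be the "only when
needed" variant mentioned above.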

In pandas, there is also the DateRange class (subclass of Index, so
recognized as valid by the data structures) which has a sequence of
Python datetime objects and frequency information. IMHO this should
all go inside NumPy and leverage the datetime64 dtype. With date
ranges you can also special case set operations (e.g. union or
intersection) when the ranges overlap (in practice this can yield a
huge performance boost)!
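
To illustrate why overlapping ranges can be special-cased, here is a
rough sketch using NumPy's datetime64 dtype. It assumes both ranges are
contiguous, half-open, daily ranges; the function name
intersect_daily_ranges is hypothetical and is not part of pandas or
NumPy.

```python
import numpy as np


def intersect_daily_ranges(start_a, stop_a, start_b, stop_b):
    """Intersect two half-open daily ranges [start, stop) from endpoints.

    Because both ranges are regular, only the four endpoints need to be
    compared; no per-element hashing or sorting is required.
    """
    start = max(np.datetime64(start_a, "D"), np.datetime64(start_b, "D"))
    stop = min(np.datetime64(stop_a, "D"), np.datetime64(stop_b, "D"))
    if start >= stop:
        # The ranges do not overlap at all.
        return np.array([], dtype="datetime64[D]")
    return np.arange(start, stop, dtype="datetime64[D]")


overlap = intersect_daily_ranges("2010-07-01", "2010-07-10",
                                 "2010-07-06", "2010-07-20")
disjoint = intersect_daily_ranges("2010-07-01", "2010-07-03",
                                  "2010-07-06", "2010-07-20")
```

The general case (irregular dates) would still need an element-wise set
operation; the shortcut only applies when the frequency is known.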

I like using ndarray for the ticks because slicing produces views,
etc. (but in the current implementation in pandas slicing requires
constructing a new reverse index from scratch).

As for the acceptable type for ticks-- I am with Keith in requiring
only hashability. So to support integer ticks for completeness
DataArray probably needs a separate "access by tick" interface
(already mentioned above I believe). I saw criticism on the datarray
docs about pandas having ambiguous behavior for integer ticks-- my
view is that you have ticks so you don't have to think about "where"
things are in the data structure ;) But again datarray is a different
story-- ticks not required!
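
A small sketch of what a separate "access by tick" interface could look
like for a 1-D array, so that integer ticks and integer positions never
collide. The attribute name aix follows the suggestion earlier in the
thread; the classes Labeled1D and _TickAccessor are hypothetical and do
not reflect the actual datarray API.

```python
import datetime

import numpy as np


class _TickAccessor:
    """Hypothetical helper: indexing here always means 'by tick'."""

    def __init__(self, owner):
        self._owner = owner

    def __getitem__(self, tick):
        return self._owner.data[self._owner.positions[tick]]


class Labeled1D:
    """Hypothetical 1-D labeled array; ticks only need to be hashable."""

    def __init__(self, data, ticks):
        self.data = np.asarray(data)
        self.positions = {tick: i for i, tick in enumerate(ticks)}
        if len(self.positions) != len(self.data):
            raise ValueError("need one unique tick per element")
        self.aix = _TickAccessor(self)

    def __getitem__(self, key):
        # Plain indexing always means position, exactly as for an ndarray.
        return self.data[key]


arr = Labeled1D([10.0, 20.0, 30.0], ticks=[2, 0, 1])
by_position = arr[0]      # element at position 0
by_tick = arr.aix[0]      # element whose tick is the integer 0

day = datetime.date(2010, 7, 6)
dated = Labeled1D([1.5], ticks=[day])
```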

3. Data alignment routines

I think the fundamental data alignment routines in larry and pandas
belong in NumPy. We're both creating an integer vector in Cython and
passing that to ndarray.take. There is also the issue of missing data
handling. We should spend a little time and decide on the API for
these functions that will work for both libraries and probably write C

Here's the Cython code I'm referring to (which isn't all that pretty,
and makes assumptions guaranteed by other parts of pandas):
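
Independently of that Cython code, the general idea described in this
section (build an integer vector, pass it to ndarray.take, then deal
with missing data) can be sketched in pure Python. The function name
align_to and the choice of NaN as the fill value are assumptions made
for illustration; this is not the pandas or larry implementation.

```python
import numpy as np


def align_to(values, ticks, new_ticks):
    """Reorder `values` (labeled by `ticks`) to follow `new_ticks`."""
    positions = {tick: i for i, tick in enumerate(ticks)}
    # Integer indexer: position of each requested tick, -1 if absent.
    indexer = np.array([positions.get(t, -1) for t in new_ticks],
                       dtype=np.intp)
    # take() returns a new array; -1 temporarily picks the last element.
    out = np.asarray(values, dtype=float).take(indexer)
    # Overwrite the placeholder entries with the missing-data marker.
    out[indexer == -1] = np.nan
    return out


aligned = align_to([1.0, 2.0, 3.0], ['a', 'b', 'c'], ['c', 'x', 'a'])
```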


4. Group-by routines

Not necessarily related to DataArray but highly relevant to
statistical data structures (Skipper made a comment about this at the
BoF). Having core group by routines (see Travis's NEP:
which is not rendering correctly for me, download the RST) makes a lot
of sense rather than have all of us implement our own things.

Group-by basically comes down to solving two problems: assigning
chunks of data to groups (using some kind of mapping or function), and
doing something with those group assignments (like aggregating or
transforming-- think like group means or standardizing / zscoring
within group). Using Python dicts to store the group assignments
computed by arbitrary functions (the way pandas does it now) is often
suboptimal if you want to, say, group one ndarray by another-- I think
in most cases we can do a lot better, but it will be important to have
a very "general" group-by where performance might be a little slower.
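
The two problems (assigning groups, then aggregating) and the two
flavors (general dict-based vs. array-based) can be sketched as
follows. Both function names are hypothetical; the array-based version
uses np.unique and np.bincount as one possible approach, assuming 1-D
inputs, and is not taken from pandas, larry, or the NEP.

```python
from collections import defaultdict

import numpy as np


def group_means_general(values, ticks, key_func):
    """General but slower: an arbitrary function assigns each tick a group."""
    groups = defaultdict(list)
    for position, tick in enumerate(ticks):
        groups[key_func(tick)].append(position)
    values = np.asarray(values, dtype=float)
    return {key: float(values[pos].mean()) for key, pos in groups.items()}


def group_means_by_array(values, labels):
    """Faster special case: group one 1-D ndarray by another."""
    uniques, codes = np.unique(np.asarray(labels), return_inverse=True)
    sums = np.bincount(codes, weights=np.asarray(values, dtype=float))
    counts = np.bincount(codes)
    return uniques, sums / counts


values = np.array([1.0, 2.0, 3.0, 4.0])

# General: group by the first character of each tick.
general = group_means_general(values, ['x1', 'y1', 'x2', 'y2'],
                              key_func=lambda tick: tick[0])

# Array-based: group values by a parallel array of labels.
uniques, means = group_means_by_array(values, np.array(['a', 'b', 'a', 'b']))
```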


In any case-- if we can trim down the amount of duplicated logic
between the various libraries, I think that would be a big win
overall. I'm not sure if having "one data object to rule them all" is
something we can achieve for the moment. pandas has been developed
decidedly for statistics, econometrics, and finance which has led to
some slightly domain-specific design choices. I am fairly certain
there are a large number of users out there for whom these sorts of
tools could be hugely useful in making the switch to Python from R,
Matlab, Java, C++, etc.

- Wes
