[Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

Skipper Seabold jsseabold@gmail....
Thu Jul 8 11:39:22 CDT 2010


On Thu, Jul 8, 2010 at 12:02 PM, Rob Speer <rspeer@mit.edu> wrote:
>> While I haven't had a chance to really look in-depth at the changes
>> myself (I'm a busy man! So many mailing lists!), I so far like the
>> look and sound of them. That's just my opinion, though.
>
> If people are okay with the attribute magic, I have a proposal for more of it.
>
> In my own project where I use labeled arrays
> (http://github.com/commonsense/divisi2), I don't have labeled axes.
> But I assumed everything was 1 or 2-D, and gave the 2-D matrices
> methods like "row_named", "col_named", etc., to encourage readable
> code.
>
> With the current implementation of datarray, I could get that by
> labeling the axes "row" and "col", except the moment you transpose a
> matrix like that you get rows named "col" and columns named "row", so
> that's not the right answer.
>
> My proposal is that datarray.row should be equivalent to
> datarray.axes[0], and datarray.column should be equivalent to
> datarray.axes[1], so that you can always ask for something like
> "arr.column.named(2010)" (replace those with square brackets if you
> like).
>
> Not sure yet what the right way is to generalize this to 1-D and n-D.

I think we have to start from the nD case, even if I (and I think most
users) will tend to think in 2D.  The rest is just going to have to be
up to developers how they want users to interact with what we, the
developers, see as axes.  No end-user wants to think about the 6th
axis of the data, but I don't want to be pegged into rows and columns
thinking because I don't think it works for the below example.

Forgive me if this has already been addressed, but my question is
what happens when we have more than one "label" (not as in a labeled
axis but an observation label -- but not a tick because they're not
unique!) per, say, row axis, along with heterogeneous dtypes.  This is really the
problem that I would like to see addressed and from the BoF comments
I'm not sure this use case is going to be covered.  I'm also not sure
I expressed myself clearly enough or understood what's already
available.  For me, this is the single most common use case, and most
of what we are talking about now is just convenient slicing, which
ignores some basic and prominent concerns.  Please correct me if I'm
wrong.  I need to play more with the DataArray implementation but
haven't had time yet.

I often have data that looks like this (not really, but it gives the
idea in a general way I think).

city, month, year, region, precipitation, temperature
"Austin", "January", 1980, "South", 12.1, 65.4
"Austin", "February", 1980, "South", 24.3, 55.4
"Austin", "March", 1980, "South", 3, 69.1
....
"Austin", "December", 2009, "South", 1, 62.1
"Boston", "January", 1980, "Northeast", 1.5, 19.2
....
"Boston", "December", 2009, "Northeast", 2.1, 23.5
...
"Memphis", "January", 1980, "South", 2.1, 35.6
...
"Memphis", "December", 2009, "South", 1.2, 33.5
...

Sometimes, I want, say, to know what the average temperature is in
December.  Sometimes I want to know what the average temperature is in
Memphis.  Sometimes I want to know the average temperature in Memphis
in December or in Memphis in 1985.  If I do this with structured
arrays, most group-by type operations are at best O(n), since each
query has to scan every row.  Really this isn't feasible.
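
To make the cost concrete, here is a minimal sketch of those queries
against a NumPy structured array.  The rows are made-up values in the
shape of the table above, and mean_temperature is a hypothetical helper
written for this illustration, not part of NumPy or DataArray.  Each
condition builds a boolean mask with a full pass over one column, so
every query is linear in the number of rows.

```python
import numpy as np

# Illustrative rows only, laid out like the table above.
dtype = [("city", "U10"), ("month", "U10"), ("year", "i4"),
         ("region", "U10"), ("precipitation", "f8"), ("temperature", "f8")]
data = np.array([
    ("Austin", "January", 1980, "South", 12.1, 65.4),
    ("Austin", "December", 2009, "South", 1.0, 62.1),
    ("Boston", "January", 1980, "Northeast", 1.5, 19.2),
    ("Boston", "December", 2009, "Northeast", 2.1, 23.5),
    ("Memphis", "January", 1980, "South", 2.1, 35.6),
    ("Memphis", "December", 2009, "South", 1.2, 33.5),
], dtype=dtype)


def mean_temperature(arr, **conditions):
    """Average temperature over the rows matching every field=value pair.

    Hypothetical helper: it assumes at least one row matches.
    """
    mask = np.ones(len(arr), dtype=bool)
    for field, value in conditions.items():
        # One full scan of the column per condition.
        mask &= arr[field] == value
    return arr["temperature"][mask].mean()


december = mean_temperature(data, month="December")
memphis = mean_temperature(data, city="Memphis")
memphis_december = mean_temperature(data, city="Memphis", month="December")
```

Nothing here is indexed, so asking the same question for every city or
every month repeats the scan each time.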

An even more difficult question is what if I want descriptive
statistics on the "region" variable?  I.e., I want to know how many
observations I have for each region.  This one can wait, but is still
important for doing statistics.
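
As a rough sketch of the plain-NumPy route for that count, np.unique
with return_counts=True (available in current NumPy releases) tallies
the observations per region.  The regions array below is made-up data
standing in for the "region" column.  It sorts the labels on every
call, so it answers the question without giving a reusable grouping.

```python
import numpy as np

# Made-up stand-in for the "region" column of the table above.
regions = np.array(["South", "South", "Northeast", "Northeast", "South", "South"])

# Unique labels (sorted) and how many times each one occurs.
labels, counts = np.unique(regions, return_counts=True)

# Plain dict for easy lookup: region name -> number of observations.
region_counts = {str(label): int(count) for label, count in zip(labels, counts)}
```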

Can these use cases be covered right now by DataArray?  Pandas, larry,
divisi?  Others?  I'm having trouble thinking how it could be done
with DataArray.

Skipper

