[Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

Skipper Seabold jsseabold@gmail....
Thu Jul 8 13:41:12 CDT 2010


On Thu, Jul 8, 2010 at 1:38 PM, Lluís <xscript@gmx.net> wrote:
> Skipper Seabold writes:
>
>> On Thu, Jul 8, 2010 at 12:02 PM, Rob Speer <rspeer@mit.edu> wrote:
> [...]
>>> My proposal is that datarray.row should be equivalent to
>>> datarray.axes[0], and datarray.column should be equivalent to
>>> datarray.axes[1], so that you can always ask for something like
>>> "arr.column.named(2010)" (replace those with square brackets if you
>>> like).
>>>
>>> Not sure yet what the right way is to generalize this to 1-D and n-D.
>
>> I think we have to start from the nD case, even if I (and I think most
>> users) will tend to think in 2D.  The rest is just going to have to be
>> up to developers how they want users to interact with what we, the
>> developers, see as axes.  No end-user wants to think about the 6th
>> axis of the data, but I don't want to be pegged into rows and columns
>> thinking because I don't think it works for the below example.
>
> You could simply provide a subclass of datarray called 'table' that
> automatically labels the two (mandatory) axes as 'column' and 'row'.
>
>
> [...]
>> city, month, year, region, precipitation, temperature
>> "Austin", "January", 1980, "South", 12.1, 65.4,
>> "Austin", "February", 1980, "South", 24.3, 55.4
>> "Austin", "March", 1980, "South", 3, 69.1
>> ....
>> "Austin", "December", 2009, 1, 62.1
>> "Boston", "January", 1980, "Northeast", 1.5, 19.2
>> ....
>> "Boston","December", 2009, "Northeast", 2.1, 23.5
>> ...
>> "Memphis","January",1980, "South", 2.1, 35.6
>> ...
>> "Memphis","December",2009, "South", 1.2, 33.5
>> ...
>
>> Sometimes, I want, say, to know what the average temperature is in
>> December.  Sometimes I want to know what the average temperature is in
>> Memphis.  Sometimes I want to know the average temperature in Memphis
>> in December or in Memphis in 1985.  If I do this with structured
>> arrays, most group-by type operations are at best O(n).  Really this
>> isn't feasible.
>
> If I understood correctly, you could have 4 axes (assuming that an Axis can
> only handle a single label/variable).
>
> a = DatArray(numpy.array([...], dtype = [("precipitation", float),
>                                         ("temperature", float)]),
>             (("city", ["Austin", ...]),
>              ("month", ["January"]),
>              ...))
>
> Then, you can:
>  a.city.named("Memphis").month.named("December")["temperature"].mean()
>  a.city.named("Memphis").year.named(1985)["temperature"].mean()
>

One question at this point is whether attribute access like this has to
be coded in Python, as it currently is for recarrays.  If so, what is
the speed trade-off?
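
For concreteness, here is a minimal sketch of what Python-level attribute
access to named axes could look like. The class names (Labeled, AxisView)
and the toy values are hypothetical and are not the datarray API; the point
is only that each a.city lookup goes through __getattr__, i.e. a
Python-level call, much like field access on a recarray.

```python
import numpy as np


class AxisView:
    """Hypothetical helper: selects along one named axis by label."""

    def __init__(self, parent, name):
        self.parent = parent
        self.name = name

    def named(self, label):
        p = self.parent
        axis = p.names.index(self.name)
        pos = p.labels[axis].index(label)
        # Pick one position along `axis`; that axis is dropped from the result.
        data = np.take(p.data, pos, axis=axis)
        names = p.names[:axis] + p.names[axis + 1:]
        labels = p.labels[:axis] + p.labels[axis + 1:]
        return Labeled(data, names, labels)


class Labeled:
    """Hypothetical stand-in for a named-axis array (not the datarray API)."""

    def __init__(self, data, names, labels):
        self.data = np.asarray(data)
        self.names = list(names)
        self.labels = [list(axis_labels) for axis_labels in labels]

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, so every axis
        # access costs at least one extra Python-level call.
        if name in ("data", "names", "labels"):
            raise AttributeError(name)
        if name in self.names:
            return AxisView(self, name)
        raise AttributeError(name)


# Toy 2-D example (city x month); the values are arbitrary placeholders.
temps = np.arange(6.0).reshape(2, 3)
a = Labeled(temps, ["city", "month"],
            [["Austin", "Memphis"], ["January", "February", "December"]])
value = a.city.named("Memphis").month.named("December").data
```

Timing a.city against a plain attribute with timeit would give a rough
idea of the per-access overhead.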

> Or shorter:
>  a.named["Memphis","December"]["temperature"].mean()
>  a.named["Memphis",:,1985]["temperature"].mean()
>

Much prefer the shorter.

I also prefer "by" to "named", but this is for later...  I.e., I'm
thinking "I want you to group my data by ..., then give me temperature."
That way it's a little clearer why there are two sets of [], IMO.

> This raises the problem of non-homogeneous measurements. For example, if you had
> only a few measurements for Austin, the rest would be just NaNs to make the
> shape homogeneous.

Of course.  And I will very often have this case.  For instance, I
will have household survey data where each household has a certain id,
but there are a different number of family members.
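
A minimal sketch of the household case, assuming the ragged groups are
padded with NaN to a rectangular shape. The household ids and ages are
made up for illustration; only plain NumPy calls are used.

```python
import numpy as np

# Made-up ages for three households with 2, 4 and 1 members.
households = {101: [34.0, 36.0], 102: [41.0, 39.0, 12.0, 9.0], 103: [67.0]}

# Pad every household to the size of the largest one with NaN.
max_size = max(len(members) for members in households.values())
ages = np.full((len(households), max_size), np.nan)
for row, members in enumerate(households.values()):
    ages[row, :len(members)] = members

# NaN-aware reductions ignore the padding.
mean_age = np.nanmean(ages, axis=1)
n_members = np.count_nonzero(~np.isnan(ages), axis=1)

# Fraction of the padded array that holds no data.
wasted = np.isnan(ages).mean()
```

Here 5 of the 12 cells are padding, which is the space cost of forcing a
homogeneous shape on ragged data.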

>
> I solved this in sciexp2 with (this is not the API, but translated into a
> DatArray-like interface for clarity):
>
>  a = Data(numpy.array([...], dtype = [("precipitation", float),
>                                       ("temperature", float)]),
>             (("measurement", "@city@-@month@-@year@-@region@",
>               [{"city": "Austin", "month": "January", "year": 1980, "region": "South"},
>                ...])))

I have no idea what that's supposed to do!  What do you fill in the
"missing" data with, NaNs?

>
>  a.named[::"city == 'Memphis' && month == 'December'"]["temperature"].mean()
>  a.named[::"city == 'Memphis' && year == 1985"]["temperature"].mean()

This makes sense.

>
> But of course, this represents a tradeoff between "wasted" space and speed. The
> internals are along the lines of (using ordered dicts):
>
>  { 'city' : { 'Memphis': set(<indexes with memphis>),
>               ... },
>    'month' : { 'December': set(<indexes with december>),
>                ... },
>    ... }
>
> Which translates into:
>
>  a[intersection( d['city']['Memphis'], d['month']['December'] )]
>
> There's a less optimized path that supports arbitrary expressions (less than,
> more than or equal, etc.), but has a cost of O(n).

Wouldn't this need to be supported in any case?
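
To illustrate the two paths, here is a rough sketch, assuming one dict of
row-index sets per label column. build_index and the sample rows are
illustrative only, not sciexp2 or datarray code. An equality query combined
with "and" becomes a set intersection, while an arbitrary comparison falls
back to an O(n) boolean mask.

```python
import numpy as np

# Illustrative rows only; one entry per observation.
city = np.array(["Austin", "Memphis", "Memphis", "Boston", "Memphis"])
month = np.array(["December", "December", "January", "December", "December"])
temperature = np.array([62.1, 33.5, 35.6, 23.5, 31.5])


def build_index(column):
    """Map each distinct label to the set of row positions holding it."""
    index = {}
    for i, value in enumerate(column.tolist()):
        index.setdefault(value, set()).add(i)
    return index


index = {"city": build_index(city), "month": build_index(month)}

# Fast path: equality on indexed labels; "and" is a set intersection.
rows = sorted(index["city"]["Memphis"] & index["month"]["December"])
fast_mean = temperature[rows].mean()

# Slow path: an arbitrary expression scans every row, O(n).
mask = (city == "Memphis") & (temperature < 34.0)
slow_mean = temperature[mask].mean()
```

The index costs extra memory and has to be built once in O(n), but the
comparison in the slow path cannot be answered from it at all.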

>
>
>> An even more difficult question is what if I want descriptive
>> statistics on the "region" variable?  I.e., I want to know how many
>> observations I have for each region.  This one can wait, but is still
>> important for doing statistics.
>
> This _should_ be:
>
>  a.region.named("South").size

Sounds ok.
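
For the per-region counts, a small sketch with plain NumPy, assuming the
region labels are available as a 1-D array with one entry per observation
(the sample labels below are illustrative):

```python
import numpy as np

# One region label per observation (illustrative sample).
region = np.array(["South", "South", "Northeast", "South", "Northeast"])

# np.unique returns the sorted distinct labels and how often each occurs.
labels, counts = np.unique(region, return_counts=True)
observations = dict(zip(labels.tolist(), counts.tolist()))
```

This yields the counts for all regions in one pass rather than one query
per region.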

Skipper


>
>
> Read you,
>     Lluis
>
> --
>  "And it's much the same thing with knowledge, for whenever you learn
>  something new, the whole world becomes that much richer."
>  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>  Tollbooth
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>

