[Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

Lluís xscript@gmx....
Thu Jul 8 15:05:59 CDT 2010


Skipper Seabold writes:
[...]
>> If I understood well, you could have 4 axes (assuming that an Axis can only
>> handle a single label/variable).
>> 
>> a = DatArray(numpy.array([...], dtype = [("precipitation", float),
>>                                         ("temperature", float)]),
>>             (("city", ["Austin", ...]),
>>              ("month", ["January"]),
>>              ...))
>> 
>> Then, you can:
>>  a.city.named("Memphis").month.named("December")["temperature"].mean()
>>  a.city.named("Memphis").year.named(1985)["temperature"].mean()
>> 

> One question at this point, is if attribute access like this has to be
> coded in Python like recarrays currently?  If so, what is the speed
> trade-off.

I think so, but that's probably not that costly (sorry, no numbers). Since this
kind of access is only expected to be used by the end user, all internal
operations accessing "real" attributes still go through the fast path. It will
all depend on the optimizations applied by the Python runtime, which I simply
don't know.
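For reference, a minimal sketch of how such attribute access could be coded in
Python (hypothetical DatArray/_AxisAccessor names, not datarray's actual code).
Real attributes are found by normal lookup, so __getattr__ only pays its cost
for axis names:

```python
import numpy as np

class DatArray:
    """Toy wrapper with named axes resolved through __getattr__ (sketch)."""
    def __init__(self, data, axes):
        self.data = np.asarray(data)
        self._axes = list(axes)   # sequence of (axis_name, tick_list)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, so real
        # attributes (like self.data) keep the fast path.
        for i, (axis_name, ticks) in enumerate(self.__dict__.get('_axes', [])):
            if axis_name == name:
                return _AxisAccessor(self, i, ticks)
        raise AttributeError(name)

class _AxisAccessor:
    def __init__(self, parent, axis, ticks):
        self._parent, self._axis, self._ticks = parent, axis, ticks

    def named(self, tick):
        # Map the tick label to an integer and index the underlying array;
        # the remaining axes carry over so calls can be chained.
        idx = self._ticks.index(tick)
        sub = np.take(self._parent.data, idx, axis=self._axis)
        rest = [ax for j, ax in enumerate(self._parent._axes) if j != self._axis]
        return DatArray(sub, rest) if rest else sub

a = DatArray(np.arange(6).reshape(2, 3),
             [("city", ["Austin", "Memphis"]),
              ("month", ["Jan", "Feb", "Mar"])])
print(a.city.named("Memphis").month.named("Feb"))  # -> 4
```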


[...]
>> I solved this in sciexp2 with (this is not the API, but translated into a
>> DatArray-like interface for clarity):
>> 
>>  a = Data(numpy.array([...], dtype = [("precipitation", float),
>>                                       ("temperature", float)]),
>>             (("measurement", "@city@-@month@-@year@-@region@",
>>               [{"city": "Austin", "month": "January", "year": 1980, "region": "South"},
>>                ...])))

> I have no idea what that's supposed to do!  What do you fill in the
> "missing" data with, NaNs?

Ok, the idea is that at the "one variable per axis" extreme, you would have a
lot of holes, which in my case I fill with NaNs.
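A minimal sketch of the holes that appear at this extreme (the values are made
up):

```python
import numpy as np

cities = ["Austin", "Memphis"]
months = ["January", "December"]

# Full (city x month) grid, initially all holes:
grid = np.full((len(cities), len(months)), np.nan)

# Suppose only two of the four combinations were actually measured:
measured = {("Austin", "January"): 1.2, ("Memphis", "December"): 3.4}
for (city, month), value in measured.items():
    grid[cities.index(city), months.index(month)] = value

print(int(np.isnan(grid).sum()))  # 2 holes out of 4 cells
```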

The other extreme (the one in the example above) is "all variables on a single
axis", where _no_ NaNs will appear unless your data explicitly contains them
(and of course, you can have an arbitrary mix of the two). What I tried to show
is that the names/ticks of this axis are:

  a.named['Austin-January-1980-South']

You could accomplish this right now in datarray if you instantiated it with:

  a = Data(numpy.array([...], dtype = [("precipitation", float),
                                       ("temperature", float)]),
           labels = (("mysingleaxis", ['Austin-January-1980-South', ... ]),))

But although they are all merged into a single axis, you can still access them
separately in sciexp2 by using "filters" (that's why I used an extra item in the
tuple, and a list of dicts instead of strings):

  a.named[::"city == 'Memphis' && month == 'December'"]["temperature"].mean()
  a.named[::"city == 'Memphis' && year == 1985"]["temperature"].mean()


Another reason to have multiple variables is that the NaNs inserted to maintain
shape homogeneity are "synthetic", and thus indistinguishable from other NaNs
that might be in your original input data, unless you use a masked array or
something similar to tell them apart.
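For example, with numpy.ma a mask can record which NaNs are synthetic (a small
sketch, with made-up data):

```python
import numpy as np
import numpy.ma as ma

data = np.array([1.0, np.nan, 2.0, np.nan])  # the last NaN came with the input
synthetic = [False, True, False, False]      # the hole *we* inserted

m = ma.masked_array(data, mask=synthetic)
# The mask marks our hole; the input's own NaN survives as ordinary data.
print(ma.is_masked(m[1]), bool(np.isnan(m[3])))  # True True
```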

It is also the case that I use multiple variables on a single axis because this
is how I later organize the information in my plots, such that I can iterate
over "name/tick" and data pairs that are used as matplotlib labels, where these
names contain, for example, the set of parameters I've tested:
         
  >>> a = Data(numpy.array([...], dtype = [("time", float), ... ]),
               labels = (("configuration", "@ways@-way @size@KB",
                          [{"ways":2, "size":32},
                           {"ways":4, "size":32},
                           ...]),
                         ("benchmark", "@benchmark@",
                          [{"benchmark":"foo"},
                           ...])))
  >>> for t in a.configuration.iteritems(): print t
  ("2-way 32KB", Data(<data for all benchmarks>))
  ("4-way 32KB", Data(<data for all benchmarks>))
  ...
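For reference, the "@ways@-way @size@KB" templates can be expanded along these
lines (a sketch of the idea, not sciexp2's actual implementation):

```python
import re

template = "@ways@-way @size@KB"
points = [{"ways": 2, "size": 32}, {"ways": 4, "size": 32}]

def expand(template, variables):
    # Replace every @name@ placeholder with the variable's value.
    return re.sub(r"@(\w+)@", lambda m: str(variables[m.group(1)]), template)

labels = [expand(template, p) for p in points]
print(labels)  # ['2-way 32KB', '4-way 32KB']
```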


>> But of course, this represents a tradeoff between "wasted" space and speed. The
>> internals are on the line of (using ordered dicts):
>> 
>>  { 'city' : { 'Memphis': set(<indexes with memphis>),
>>               ... },
>>    'month' : { 'December': set(<indexes with december>),
>>                ... },
>>    ... }
>> 
>> Which translates into:
>> 
>>  a[union( d['city']['Memphis'], d['month']['december'] )]
>> 
>> There's a less optimized path that supports arbitrary expressions (less than,
>> more than or equal, etc.), but has a cost of O(n).

> Wouldn't this need to be supported in any case?

No, if you were to use a single variable per axis, this would be:

  { 'city' : { 'Memphis' : <index for memphis>,
               ... },
    ... }

which translates into:

  a[d['city']['Memphis'], d['month']['December']]
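A sketch of that structure, with hypothetical data: each tick name maps
directly to a single integer per axis, so the lookup is O(1) and no set
operations are needed.

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)          # axes: city x month
d = {"city":  {"Austin": 0, "Memphis": 1},
     "month": {"January": 0, "February": 1, "December": 2}}

# One integer per (variable, value) pair; plain ndarray indexing does the rest.
value = a[d["city"]["Memphis"], d["month"]["December"]]
print(value)  # 5.0
```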

On the other hand, my example needs an _ordered_ set (sorry, I forgot to say
that), and the union will provide the numeric indexes for the given filter.

The cost of this search is driven by the cost of the union of ordered sets,
and the non-optimized path for filters with '<=', etc. has a cost of O(n)
(relative to the length of the axis being filtered).
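Sketched with plain Python sets and made-up data (sciexp2 keeps the sets
ordered; sorting the result here stands in for that). Note that a conjunction
across different variables holds at the intersection of their index sets, while
several values of a single variable would combine by union:

```python
import numpy as np

a = np.array([10.0, 20.0, 30.0, 40.0])  # one entry per tick of the merged axis
index = {
    "city":  {"Austin": {0, 1}, "Memphis": {2, 3}},
    "month": {"January": {0, 2}, "December": {1, 3}},
}

# "city == 'Memphis' && month == 'December'": every condition must hold,
# so the matching positions are the intersection of the per-value sets.
both = sorted(index["city"]["Memphis"] & index["month"]["December"])
print(a[both])    # [40.]

# Accepting either of two months for one variable is instead a union.
either = sorted(index["month"]["January"] | index["month"]["December"])
print(a[either])  # [10. 20. 30. 40.]
```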


Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth

