[Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

Lluís xscript@gmx....
Thu Jul 8 06:13:57 CDT 2010


Rob Speer writes:

>>>> arr.country.named('Netherlands').year.named(2010)
>>>> arr.country.named('Spain').year.named(slice(1994, 2010))
>>>> arr.year.named(2006).country[0:2]

This looks too verbose to me.

As axes always have a total order, I'd go for the most compact representation
(assuming 'country' is the first axis, and 'year' the second one):

   arr['Netherlands','2010']
   arr['Spain','1994':'2010']
   arr[0:2,'2006']

This is my current implementation, which also allows for slices with mixed
integers and names everywhere.

I understand this might not be the desired default behaviour, as it requires
inspecting the type of every item in '__getitem__', which might be a
performance issue (although my current implementation tries to optimize for the
case of integer indexes).
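
A minimal sketch of this kind of mixed indexing, for illustration only. The
class 'NamedArray', its 'labels' attribute and the '_translate' helper are
hypothetical names, not the actual implementation and not datarray's API. It
assumes labels are strings (so they cannot be confused with positions) and that
a slice's stop label is exclusive, as in ordinary Python slicing:

```python
import numpy as np


class NamedArray:
    """Hypothetical wrapper that accepts integers and names in one index."""

    def __init__(self, data, labels):
        self.data = np.asarray(data)
        # One {name: position} dict per axis, in axis order.
        self.labels = [{name: i for i, name in enumerate(axis)}
                       for axis in labels]

    def _translate(self, axis, item):
        # Integers and None (open slice bounds) pass through unchanged.
        if item is None or isinstance(item, (int, np.integer)):
            return item
        if isinstance(item, slice):
            return slice(self._translate(axis, item.start),
                         self._translate(axis, item.stop),
                         item.step)
        # Anything else is treated as a name and looked up.
        return self.labels[axis][item]

    def __getitem__(self, key):
        if not isinstance(key, tuple):
            key = (key,)
        # Fast path: plain integers only, so no translation is needed.
        if all(type(k) is int for k in key):
            return self.data[key]
        return self.data[tuple(self._translate(axis, k)
                               for axis, k in enumerate(key))]


arr = NamedArray(np.arange(12).reshape(3, 4),
                 [['Netherlands', 'Spain', 'France'],
                  ['1994', '2006', '2010', '2011']])

single = arr['Netherlands', '2010']      # name, name
span = arr['Spain', '1994':'2010']       # name, slice of names
mixed = arr[0:2, '2006']                 # integer slice, name
```

The type check on every item is exactly the cost mentioned above; the fast path
only avoids the dictionary lookups, not the check itself.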

Thus, we can use something in the middle:

   arr[0,1]
   arr.names['Netherlands',2010] # I'd rather go for 'names' instead of 'ticks'
   arr.country['Spain'].year[1994:2010]

The default '__getitem__' still has full speed, but accessing the 'names'
attribute allows for access along the lines of my previous example, while still
allowing access through the axis name without requiring an explicit 'slice'.

Although this is not my preferred syntax, I think it is a good compromise, and I
could always subclass this to redirect the default '__getitem__' into
'names.__getitem__'.
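
One way this compromise could look, as a rough sketch. 'LabeledArray',
'NameIndexer' and 'NamesFirstArray' are made-up names for illustration, and the
per-axis attributes ('arr.country', 'arr.year') are left out to keep it short.
The plain '__getitem__' stays purely positional, 'names' treats every item as a
label, and the subclass redirects the default indexing to 'names':

```python
import numpy as np


class NameIndexer:
    """Label-based access, reached through a hypothetical 'names' attribute."""

    def __init__(self, data, labels):
        self._data = data
        self._labels = labels      # one {name: position} dict per axis

    def _position(self, axis, item):
        lookup = self._labels[axis]
        if isinstance(item, slice):
            start = None if item.start is None else lookup[item.start]
            stop = None if item.stop is None else lookup[item.stop]
            return slice(start, stop, item.step)
        return lookup[item]

    def __getitem__(self, key):
        if not isinstance(key, tuple):
            key = (key,)
        return self._data[tuple(self._position(axis, k)
                                for axis, k in enumerate(key))]


class LabeledArray:
    def __init__(self, data, labels):
        self.data = np.asarray(data)
        self.names = NameIndexer(
            self.data,
            [{name: i for i, name in enumerate(axis)} for axis in labels])

    def __getitem__(self, key):
        # Positional only: no per-item type inspection on the default path.
        return self.data[key]


class NamesFirstArray(LabeledArray):
    def __getitem__(self, key):
        # Redirect the default indexing to the label-based accessor.
        return self.names[key]


labels = [['Netherlands', 'Spain'], [1994, 2006, 2010]]
arr = LabeledArray(np.arange(6).reshape(2, 3), labels)
by_position = arr[0, 1]
by_name = arr.names['Netherlands', 2010]

arr2 = NamesFirstArray(np.arange(6).reshape(2, 3), labels)
redirected = arr2['Spain', 1994:2010]
```

Because 'names' never interprets an item as a position, integer labels such as
years are unambiguous there.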

Btw, I store the name-to-index translations in an ordered dict (indexed by
name), such that I can also provide an 'arr.iteritems' method that returns
tuples with the 'name/tick' and the array contents at that index. In the above
syntax, this would probably be 'arr.<axisname>.iteritems'.
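
As an illustration of the idea (not the actual code), the iteration could be
sketched as below. The function name 'iter_axis_items' and its arguments are
assumptions, and it uses the Python 3 'items()' spelling rather than
'iteritems':

```python
from collections import OrderedDict

import numpy as np


def iter_axis_items(data, name_to_index, axis=0):
    """Yield (name, sub-array) pairs along one axis, in insertion order."""
    for name, index in name_to_index.items():
        # np.take with a scalar index drops the selected axis.
        yield name, np.take(data, index, axis=axis)


countries = OrderedDict([('Netherlands', 0), ('Spain', 1)])
data = np.array([[1, 2, 3],
                 [4, 5, 6]])

pairs = list(iter_axis_items(data, countries))
```

The ordered dict matters here because the pairs come out in axis order rather
than in arbitrary hash order.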

Another feature I like is being able to translate back and forth from
names/ticks to integers, which I do through my 'Dimension.__getitem__' method
(Dimension is the equivalent of datarray's 'Axis').
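
A small sketch of such a two-way translation, under the assumption that ticks
are not integers themselves (otherwise the direction would be ambiguous). The
class body is a guess at the behaviour described, not the real 'Dimension':

```python
class Dimension:
    """Hypothetical stand-in: maps ticks to positions and back."""

    def __init__(self, name, ticks):
        self.name = name
        self._ticks = list(ticks)
        self._index = {tick: i for i, tick in enumerate(self._ticks)}

    def __getitem__(self, key):
        # An integer is read as a position and returns its tick;
        # anything else is read as a tick and returns its position.
        if isinstance(key, int):
            return self._ticks[key]
        return self._index[key]


country = Dimension('country', ['Netherlands', 'Spain'])
position = country['Spain']
tick = country[position]
```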

PS: I also have a separation between axes and their naming, meaning that I can
have a single axis with both 'country' and 'year', such that I would index with
'Netherlands-2010' (other examples do make more sense), but still be able to
access them separately (this reduces the size of the full ndarray, as there is
no need for so many NaNs to make the ndarray homogeneous in size, and it brings
the ndarray closer to how the user thinks about the structure of the data).
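
One possible reading of this, sketched with invented data and invented helper
names ('combined', 'select'): a single axis holds only the (country, year)
pairs that actually exist, so no NaN padding is needed, and each name can still
be used on its own:

```python
import numpy as np

# Only the observed (country, year) pairs are stored along one axis.
pairs = [('Netherlands', 2006), ('Netherlands', 2010), ('Spain', 2010)]
values = np.array([1.0, 2.0, 3.0])

# Combined label -> position on the single axis.
combined = {'%s-%d' % (c, y): i for i, (c, y) in enumerate(pairs)}


def select(country=None, year=None):
    """Return the values matching either name separately, or both."""
    mask = np.array([(country is None or c == country) and
                     (year is None or y == year)
                     for c, y in pairs])
    return values[mask]


one = values[combined['Netherlands-2010']]
dutch = select(country='Netherlands')
in_2010 = select(year=2010)
```

A dense 2x2 array for the same data would need a NaN for the missing
('Spain', 2006) entry.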

Read you,
     Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth

