[Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
Thu Jul 8 13:27:19 CDT 2010
On Thu, Jul 8, 2010 at 1:35 PM, Rob Speer <email@example.com> wrote:
>> Forgive me if this has already been addressed, but my question is
>> what happens when we have more than one "label" (not as in a labeled
>> axis but an observation label -- but not a tick because they're not
>> unique!) per, say, row axis and heterogeneous dtypes. This is really the
>> problem that I would like to see addressed and from the BoF comments
>> I'm not sure this use case is going to be covered. I'm also not sure
>> I expressed myself clearly enough or understood what's already
>> available. For me, this is the single most common use case, and most
>> of what we are talking about now is just convenient slicing that
>> ignores some basic and prominent concerns. Please correct me if I'm
>> wrong. I need to play more with the DataArray implementation but
>> haven't had time yet.
>> I often have data that looks like this (not really, but it gives the
>> idea in a general way I think).
>> city, month, year, region, precipitation, temperature
>> "Austin", "January", 1980, "South", 12.1, 65.4
>> "Austin", "February", 1980, "South", 24.3, 55.4
>> "Austin", "March", 1980, "South", 3, 69.1
>> "Austin", "December", 2009, "South", 1, 62.1
>> "Boston", "January", 1980, "Northeast", 1.5, 19.2
>> "Boston", "December", 2009, "Northeast", 2.1, 23.5
>> "Memphis", "January", 1980, "South", 2.1, 35.6
>> "Memphis", "December", 2009, "South", 1.2, 33.5
> Your labels are unique if you look at them the right way. Here's how I
> would represent that in a datarray:
> * axis0 = 'city', ['Austin', 'Boston', ...]
> * axis1 = 'month', ['January', 'February', ...]
> * axis2 = 'year', [1980, 1981, ...]
> * axis3 = 'region', ['Northeast', 'South', ...]
> * axis4 = 'measurement', ['precipitation', 'temperature']
> and then I'd make a 5-D datarray labeled with [axis0, axis1, axis2,
> axis3, axis4].
Yeah, this is what I was thinking I would have to do, but it's still
not clear to me (I have trouble trying to think in 5 dimensions...).
For instance, what axis holds my actual numeric data?
axis4, with a "precipitation" tick?
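To check my own understanding, here is a rough sketch in plain NumPy (this is not the datarray API; the `axes` dict and the `locate` helper are names I made up for illustration). If I read the proposal right, no axis "holds" the numbers: they sit in the cells of the 5-D array, and axis4 only selects which measurement a cell belongs to.

```python
import numpy as np

# Hypothetical tick labels, taken from the example table above.
axes = {
    "city": ["Austin", "Boston", "Memphis"],
    "month": ["January", "February", "March", "December"],
    "year": [1980, 2009],
    "region": ["Northeast", "South"],
    "measurement": ["precipitation", "temperature"],
}
names = list(axes)

# One cell per combination of ticks; NaN marks "no observation".
data = np.full([len(axes[n]) for n in names], np.nan)


def locate(**ticks):
    """Translate one tick label per axis into a positional index tuple."""
    return tuple(axes[n].index(ticks[n]) for n in names)


# The first table row becomes two cells, one per measurement tick.
data[locate(city="Austin", month="January", year=1980,
            region="South", measurement="precipitation")] = 12.1
data[locate(city="Austin", month="January", year=1980,
            region="South", measurement="temperature")] = 65.4

# Count how many cells actually carry an observation.
filled = int(np.count_nonzero(~np.isnan(data)))
```

Even with this toy set of ticks the array has 96 cells for a handful of observations, and every Austin cell under the "Northeast" tick can never be filled, which is presumably why sparseness comes up below.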
> Now I realize not everyone wants to represent their tabular data as a
> big tensor that they index every which way, and I think this is one
> thing that pandas is for.
This is kind of where I would like the divide to be between user and
developer. On top of all of this, I would like to see a __repr__ or
something that actually spits out a 2-D spreadsheet-looking
representation. It would help me stay sane I think. Fernando's nice
3D graphic can only go so far as a mental model (for me at least).
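Something along these lines is what I have in mind for the spreadsheet view. It is only a sketch in plain NumPy under my own assumptions: `to_rows` is a hypothetical helper, not anything in datarray, and missing cells are assumed to be stored as NaN.

```python
import numpy as np


def to_rows(data, axis_names, ticks):
    """Return one (tick, ..., value) tuple per non-NaN cell of an N-D array."""
    rows = []
    for index in np.ndindex(data.shape):
        value = data[index]
        if not np.isnan(value):
            # Map each integer position back to its tick label.
            labels = tuple(ticks[name][i] for name, i in zip(axis_names, index))
            rows.append(labels + (float(value),))
    return rows


# A small 3-D stand-in for the 5-D case; the ticks are illustrative only.
axis_names = ["city", "year", "measurement"]
ticks = {
    "city": ["Austin", "Boston"],
    "year": [1980, 2009],
    "measurement": ["precipitation", "temperature"],
}
data = np.full((2, 2, 2), np.nan)
data[0, 0, 0] = 12.1  # Austin, 1980, precipitation
data[1, 1, 1] = 23.5  # Boston, 2009, temperature

rows = to_rows(data, axis_names, ticks)

# Print a header line followed by one line per observed cell.
for row in [tuple(axis_names) + ("value",)] + rows:
    print("  ".join(str(field).ljust(13) for field in row))
```

This gives one row per observed cell, which is closer to the table I started from than any picture of a cube.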
> Oh, and the other problem with the 5-D datarray is that you'd probably
> want it to be sparse. This is another discussion worth having.
> I want to eventually replace the labeling stuff in Divisi with
> datarray, but sparse matrices are largely the point of using Divisi.
> So how do we make a sparse datarray?
> One answer would be to have datarray be a wrapper that encapsulates
> any sufficiently matrix-like type. This is approximately what I did in
> the now-obsolete Divisi1. Nobody liked the fact that you had to wrap
> and unwrap your arrays to accomplish anything that we hadn't thought
> of in writing Divisi. I would not recommend this route.
> The other option, which is more like Divisi2, would be to provide the
> functionality of datarray using a mixin. Then a standard dense
> datarray could inherit from (np.ndarray, Datarray), while a sparse
> datarray could inherit from (sparse.csr_matrix, Datarray), for example.
Mix-ins sound reasonable to me as long as this could easily be
accomplished. I.e., why use csr? Can you convert to and from the other
sparse formats? Are the sparse matrices reasonably stable given recent
activity? These are not rhetorical questions; I don't use sparse
matrices much.
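For what it's worth, this is how I picture the dense half of the mixin idea. It is a guess at the shape of it, not Divisi2 or datarray code: `LabeledMixin`, `DenseLabeled`, `axis_index` and `sum_over` are all hypothetical names. I left the sparse half out because I don't know what subclassing scipy's sparse types involves.

```python
import numpy as np


class LabeledMixin:
    """Hypothetical mixin: axis-name bookkeeping that does not care
    what kind of container stores the values."""

    axis_names = ()

    def axis_index(self, name):
        """Return the integer position of a named axis."""
        return self.axis_names.index(name)

    def sum_over(self, name):
        """Sum along a named axis; the result is a plain, unlabeled ndarray."""
        return np.asarray(self.sum(axis=self.axis_index(name)))


class DenseLabeled(np.ndarray, LabeledMixin):
    """Dense variant: an ndarray subclass that picks up the mixin methods."""

    def __new__(cls, values, axis_names):
        obj = np.asarray(values, dtype=float).view(cls)
        if len(axis_names) != obj.ndim:
            raise ValueError("need exactly one name per axis")
        obj.axis_names = tuple(axis_names)
        return obj

    def __array_finalize__(self, obj):
        # Views and slices copy the names from their parent. This naive
        # version does not drop names when indexing removes an axis.
        if obj is not None:
            self.axis_names = getattr(obj, "axis_names", ())


table = DenseLabeled([[12.1, 65.4], [1.5, 19.2]], ["city", "measurement"])
per_measurement = table.sum_over("city")  # collapses the "city" axis
```

If the same mixin really could sit on top of a sparse type, nothing in `LabeledMixin` itself would have to change, which I take to be the appeal over the wrap-and-unwrap approach.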