[Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

Joshua Holbrook josh.holbrook@gmail....
Wed Jul 7 10:40:52 CDT 2010


On Wed, Jul 7, 2010 at 5:52 AM, Bruce Southey <bsouthey@gmail.com> wrote:
> On 07/06/2010 01:09 PM, Gael Varoquaux wrote:
>> Just to give a data point, my research group and I would be very excited
>> at the idea of having Fernando's data arrays in Numpy. We can't offer to
>> maintain it, because we are already fairly involved in machine learning
>> and neuroimaging specific code, but we would be able to rely on it more
>> in our packages, and we love it!
>>
>> Gaël
>>
>> On Mon, Jul 05, 2010 at 11:31:02PM -0500, Jonathan March wrote:
>>
>>>     Fernando Perez proposed a NumPy enhancement, an ndarray with named axes,
>>>     prototyped as DataArray by him, Mike Trumpis, Jonathan Taylor, Matthew
>>>     Brett, Kilian Koepsell and Stefan van der Walt.
>>>
>>
>>>     At SciPy 2010 on July 1, Fernando convened a BOF (Birds of a Feather)
>>>     discussion of this proposal.
>>>
>>
>>>     The notes from this BOF can be found at:
>>>     [1]http://projects.scipy.org/numpy/wiki/NdarrayWithNamedAxes
>>>     (linked from the Plans section of [2]http://projects.scipy.org/numpy )
>>>
>>
>>>     HELP NEEDED: Fernando does not have the resources to drive the project
>>>     beyond this prototype, which already does what he needs. If this is to go
>>>     anywhere, it needs people to do the work. Please step forward.
>>>
>>
>>> References
>>>
>>
>>>     Visible links
>>>     1. http://projects.scipy.org/numpy/wiki/NdarrayWithNamedAxes
>>>     2. http://projects.scipy.org/numpy
>>>

It's 7:30am, so if I say something crazy bear with me. ;)

> This is very interesting work, especially if it can be used to extend or
> replace the current record arrays (and perhaps structured arrays).

I don't think record arrays are intended to solve quite the same
problem. I think of record arrays as arrays of tuples, whereas
datarray & friends give labels to axes and indices. In fact,
there's really no reason why you couldn't label the axes and indices
of a record array. To be honest, though, I haven't really used the
record array before, and I'm eyeing it with some suspicion. If
anyone wants to defend the poor defenseless record array, I'm all
ears!
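
To make the distinction concrete, here is a minimal sketch. The
structured-array part uses plain numpy; the axis-name lookup is a
hypothetical stand-in for illustration and is not datarray's actual API.

```python
import numpy as np

# A structured (record-style) array: a 1-d array whose elements are
# tuple-like records with named fields of possibly different dtypes.
people = np.array(
    [("alice", 31, 55.0), ("bob", 45, 80.5)],
    dtype=[("name", "U10"), ("age", "i4"), ("weight", "f8")],
)
ages = people["age"]  # field access pulls one field out of every record

# A labeled array instead attaches names to the axes of a homogeneous
# ndarray. `axis_names` and `axis_index` are hypothetical helpers.
data = np.arange(6.0).reshape(2, 3)
axis_names = ("subject", "time")


def axis_index(name):
    """Translate an axis name into its integer position."""
    return axis_names.index(name)


# Reduce over the axis called "time" without remembering it is axis 1.
mean_over_time = data.mean(axis=axis_index("time"))
```

The field names describe what is inside each element, while the axis
names describe the dimensions of the array, so the two could in
principle be combined.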

(Speaking of numpy's matrix class: if nobody uses it, why not deprecate it?)

> If it can not then you really need to make a case for yet another data
> structure. Currently we will have all these unnecessary and incompatible
> hybrids rather than a single option - competition is not good.  I really
> dislike the current impasse with numpy's Matrix class and do not wish
> this to happen again.

Sure. I think the case is pretty easy, though:  Look at all the ad-hoc
implementations of something like this elsewhere. Just off the top of
my head: Larry, pandas, datarray, metaarray (from the cookbook),
tabular, and pyDataFrame. There is clearly a lot of demand for
something like this. On the other hand, many of these solutions
(pandas and tabular in particular) have goals quite beyond just the
datatype. For example, pandas is meant for 2-d and 3-d financial data
in particular, and tabular was written to emulate a 2-d spreadsheet.
So a nice, solid, basic labeled array that's been accepted into what's
really *the* de facto standard numerical library for Python is
something that a lot of people would appreciate, and many developers
have said they would use something like datarray in numpy were it
available.

> However, I am not saying that you can not create
> another scikit, rather that there has to be some consideration if it is
> to go back into numpy/scipy.

I don't think datarray itself would best fit in a scikit, though there
are definitely some common manipulations that people would want to do
to datarrays which may fit in a scikit better than in numpy (in my
head I'm already calling it datarraytools).

> As per Wes's reply in this thread, I really do think that a set of
> specific behaviors that are expected for this new data structure need to
> be agreed upon. Currently speed should not be an issue until the basic
> functionality is covered.

I agree that premature optimization is a bad idea. Best to nail down
the features and API first.

> I think that there are at least the following
> concerns that people need to agree on:
>
> 1) Indexing especially related to slicing and broadcasting.
> 2) Joining data structures - what to do when all data structures have
> the same 'metadata' (axes, labels, dtypes) and when each of these
> differ. Also, do you allow union (so the result includes all axes,
> labels etc. present in the data structures) or intersection (keep only the
> axes and labels in common) operations?
> 3) How do you expect basic mathematical operations to work? For example,
> what does A + 1 mean if A has different data types like strings?
> 4) How should this interact with the rest of numpy?

Why not allow both unions and intersections? Just make separate
functions for them.
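
A minimal sketch of what two separate functions could look like,
operating only on the tick labels of one axis. The function names are
hypothetical, and how the data itself gets filled in or dropped is left
open here.

```python
def label_union(a, b):
    """Labels present in either sequence, in first-seen order."""
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(list(a) + list(b)))


def label_intersection(a, b):
    """Labels present in both sequences, in the order they appear in `a`."""
    in_b = set(b)
    return [label for label in a if label in in_b]


ticks_a = ["mon", "tue", "wed"]
ticks_b = ["tue", "wed", "thu"]

all_ticks = label_union(ticks_a, ticks_b)
common_ticks = label_intersection(ticks_a, ticks_b)
```

Keeping the two as distinct functions means the caller states the
intended join explicitly instead of relying on a default.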

I think the standard behavior of the datarray, assuming that indices
themselves don't get into it, should be very similar to that of the
stock ndarray. A possible exception would be when two datarrays have
the same axes and ticks but in a different order, in which case one
would either rearrange one set of axes/ticks to match or raise an error.
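
Both options for the reordered-axes case can be sketched with plain
numpy. The `align_axes` helper and its `strict` flag are hypothetical,
and only axis order (not tick order) is handled here.

```python
import numpy as np


def align_axes(data, names, target_names, strict=False):
    """Return `data` with its axes reordered to match `target_names`.

    With strict=True, any mismatch in axis order raises instead.
    """
    names = list(names)
    target_names = list(target_names)
    if names == target_names:
        return data
    if strict or sorted(names) != sorted(target_names):
        raise ValueError(
            "axes %r do not match %r" % (names, target_names)
        )
    # Result axis i must come from the source axis carrying the same name.
    order = [names.index(name) for name in target_names]
    return np.transpose(data, order)


a = np.ones((2, 3))                   # axes ("x", "y")
b = np.arange(6.0).reshape(3, 2)      # same axes, stored as ("y", "x")

# Option 1: rearrange b to match a, then operate as on a stock ndarray.
total = a + align_axes(b, ("y", "x"), ("x", "y"))

# Option 2: refuse to guess and raise.
try:
    align_axes(b, ("y", "x"), ("x", "y"), strict=True)
    raised = False
except ValueError:
    raised = True
```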

