[Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
Wed Jul 7 10:51:58 CDT 2010
Bruce Southey writes:
> 1) Indexing especially related to slicing and broadcasting.
1.1) Absolute indexing/slicing
1.2) Partial slicing
For the case of "compound" ticks (that is, merging multiple ticks into a
single one), partial slicing looks like:
a[::"subtick1 == 'subtick1value'"]
That is, I have a dict in an ndarray subclass for the 'tickvalue' -> int
translation, but tick values are built themselves with dicts with one key
for every subtick, such that the user can flatten/reshape the ndarray
subclass and merge the "subticks" into a single tick on any axis/dimension.
This reshaping operation has some complexities regarding shape homogeneity
of the result, which are handled during the reshape operation itself:
# 'a' has three "tick/metadata variables": varA, varB, varC
# a tick is built as '@varA@-@varB@-@varC@'
a["a1-b1-c1"] = 1.0
a["a1-b1-c2"] = 1.0
a["a2-b1-c1"] = 1.0
# reshape it into 2 dimensions, the first with '@varA@', the second with
# '@varB@-@varC@'
b = a.reshape(['varA'], ['varB', 'varC'])
# then, 'b' is
Data([[1.0,    # a1: b1-c1
       1.0],   # a1: b1-c2
      [1.0,    # a2: b1-c1
       nan]])  # a2: b1-c2
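That reshape can be sketched on a plain dict of compound ticks (the
`reshape_by_vars` helper is hypothetical; the real code works on the
ndarray subclass and its tick dicts):

```python
import numpy as np

def reshape_by_vars(flat, variables, row_vars, col_vars):
    """Hypothetical sketch: `flat` maps compound ticks
    ('@varA@-@varB@-@varC@') to values; split the tick variables into a
    row group and a column group, padding missing combinations with NaN."""
    def key(tick, group):
        parts = dict(zip(variables, tick.split("-")))
        return "-".join(parts[v] for v in group)
    rows = sorted({key(t, row_vars) for t in flat})
    cols = sorted({key(t, col_vars) for t in flat})
    out = np.full((len(rows), len(cols)), np.nan)
    for tick, value in flat.items():
        out[rows.index(key(tick, row_vars)),
            cols.index(key(tick, col_vars))] = value
    return out

a = {"a1-b1-c1": 1.0, "a1-b1-c2": 1.0, "a2-b1-c1": 1.0}
b = reshape_by_vars(a, ["varA", "varB", "varC"], ["varA"], ["varB", "varC"])
# b -> [[1., 1.], [1., nan]]
```

The NaN pads the missing 'a2-b1-c2' combination, matching the Data layout
above.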
> 2) Joining data structures - what to do when all data structures have
> the same 'metadata' (axes, labels, dtypes) and when each of these
> differ. Also, do you allow union (so the result includes all axes,
> labels etc. present in all data structures) or intersection (keep only the
> axes and labels in common) operations?
First, I assume two levels of semantics on the structure:
* A set of values, reached by indexing multiple axes/dimensions.
I use this for identifying experiments, where the parameters of the
experiments are spread among an arbitrary number of dimensions (see above).
* A specific value of a set of values, reached (in my case) by indexing a field
in a structured array.
I use each structure/record to encapsulate all the various outputs of a
single experiment, where structure fields can have arbitrarily different
dtypes.
What I allow right now in sciexp2 is the "union" of experiment outputs; that is,
the union of structured arrays into a single one.
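For plain structured arrays, this kind of union can be sketched with
`numpy.lib.recfunctions.merge_arrays` (the experiment field names here are
made up):

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# two output arrays for the same experiments, with different fields
out1 = np.array([(1.0,), (2.0,)], dtype=[("cycles", "f8")])
out2 = np.array([(10,), (20,)], dtype=[("misses", "i8")])

# union of the outputs into a single structured array
merged = rfn.merge_arrays((out1, out2), flatten=True)
# merged.dtype.names -> ('cycles', 'misses')
```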
On the "experiment" metadata side, I think operations should fail if metadata
differs, unless you want to "append" new experiments (in my case this is
appending new tick variable values describing the very same tick
variables). Conceptually, I do this like (def append(self, data)):
0) Check both are describing the same type of experiments (i.e., have the same
metadata variables, although different values for them)
assert self.variables() == data.variables()
1) Flatten the affected arrays (flattening is the inverse of the reshape
   example above; I have not implemented it yet, but it would be easy to do
   if speed were not a concern)
flat_self = self.flatten()
flat_data = data.flatten()
2) Concatenate the two sequences of metadata. Will fail if any metadata
   value is repeated.
res = Data(len(flat_self) + len(flat_data),
metadata=flat_self.metadata + flat_data.metadata)
res[:len(flat_self)] = flat_self
res[len(flat_self):] = flat_data
3) Reshape 'res' metadata like 'self'. This will take care of placing NaN to
homogenize the resulting structure.
4) Return res :)
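The four steps above can be sketched on flat tick/value pairs (the `append`
helper below is hypothetical and works on plain lists and arrays; steps 1
and 3 are trivial here because the data is already flat):

```python
import numpy as np

def append(self_ticks, self_values, data_ticks, data_values):
    """Sketch of steps 0-3 above on already-flat tick -> value data."""
    # 0) both sides must describe the same metadata variables; with
    #    compound '@var@-...' ticks, the variable count must match
    assert {t.count("-") for t in self_ticks} == \
           {t.count("-") for t in data_ticks}
    # 2) concatenate, failing on repeated ticks
    assert not set(self_ticks) & set(data_ticks), "repeated metadata"
    res_ticks = list(self_ticks) + list(data_ticks)
    res_values = np.concatenate([self_values, data_values])
    # 3) reshaping back (with NaN padding) would restore the original shape
    return res_ticks, res_values

ticks, values = append(["a1-b1"], np.array([1.0]),
                       ["a2-b1"], np.array([2.0]))
# ticks -> ['a1-b1', 'a2-b1'], values -> array([1., 2.])
```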
> 3) How do you expect basic mathematical operations to work? For example,
> what does A + 1 mean if A has different data types like strings?
I'd opt for forcing the user to specify which structure fields are to be
operated on (of course, assuming these really are structured arrays).
But, one thing that has been bugging me is how to operate on all fields when the
operation is compatible with all fields. For example, calculating the average of
all experiment results on a given axis.
Right now I have to calculate each of them separately, and then perform a
"union" of the resulting structured arrays (which in fact are not structured
arrays but plain ndarrays).
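For plain structured arrays, applying the same reduction to every
compatible field and rebuilding the result can be sketched like this (the
field names are made up):

```python
import numpy as np

# one record per experiment, two numeric output fields
results = np.array([(1.0, 4.0), (3.0, 8.0)],
                   dtype=[("time", "f8"), ("energy", "f8")])

# apply the same reduction to every field, then rebuild a structured scalar
means = np.array(tuple(results[name].mean() for name in results.dtype.names),
                 dtype=results.dtype)
# means["time"] -> 2.0, means["energy"] -> 6.0
```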
> 4) How should this interact with the rest of numpy?
Not sure what you mean. I already maintain metadata through all numpy
operations, except when indexing with 'numpy.newaxis', for which right now I
return a plain ndarray instead of creating a stub dimension metadata.
BTW, I stuck with the 'dimension' wording instead of 'axis' because of
'numpy.ndarray.ndim'. Maybe this should be unified with the 'axis' argument on
numeric operations, in order to use a single term for the concept.
"And it's much the same thing with knowledge, for whenever you learn
something new, the whole world becomes that much richer."
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom
Tollbooth