[Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

Lluís xscript@gmx....
Wed Jul 7 10:51:58 CDT 2010


Bruce Southey writes:
> 1) Indexing especially related to slicing and broadcasting.

1.1) Absolute indexing/slicing

     a[0], a['tickvalue']

1.2) Partial slicing

     For the case of "compound" ticks, that is, merging multiple ticks into a
     single one:

       a['subtick1value-subtick2value'] (absolute)
       a[::"subtick1 == 'subtick1value'"] (partial slicing)

     That is, I have a dict in an ndarray subclass for the 'tickvalue' -> int
     translation, but tick values are built themselves with dicts with one key
     for every subtick, such that the user can flatten/reshape the ndarray
     subclass and merge the "subticks" into a single tick on any axis/dimension.

     This reshaping operation has some complexities regarding the shape
     homogeneity of the result, which sciexp2 handles during the reshape
     operation. Example:

       # 'a' has three "tick/metadata variables": varA, varB, varC
       # a tick is built as '@varA@-@varB@-@varC@'
       a["a1-b1-c1"] = 1.0
       a["a1-b1-c2"] = 1.0
       a["a2-b1-c1"] = 1.0
       # reshape it into 2 dimensions, the first with '@varA@', the second with
       # '@varB@-@varC@'
       b = a.reshape(['varA'], ['varB', 'varC'])
       # then, 'b' is
       Data([      # a1
             [1.0, # b1-c1
              1.0] # b1-c2
            ], 
            [      # a2
             [1.0, # b1-c1
              nan] # b1-c2
            ])
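The tick-to-index translation described above can be sketched with a minimal
ndarray subclass (this is only an illustration of the idea, not the sciexp2
implementation; the class name and the '_index' attribute are hypothetical):

```python
import numpy as np

class Data(np.ndarray):
    """1-d ndarray subclass with a 'tick value' -> integer-index dict,
    so compound labels like "a1-b1-c1" work for absolute indexing."""

    def __new__(cls, values, ticks):
        obj = np.asarray(values).view(cls)
        # ticks: one tick string per position on axis 0
        obj._index = {tick: i for i, tick in enumerate(ticks)}
        return obj

    def __array_finalize__(self, obj):
        # propagate the tick dict through views and slices
        self._index = getattr(obj, "_index", {})

    def __getitem__(self, key):
        if isinstance(key, str):        # absolute tick indexing
            key = self._index[key]
        return super().__getitem__(key)

    def __setitem__(self, key, value):
        if isinstance(key, str):
            key = self._index[key]
        super().__setitem__(key, value)

a = Data(np.zeros(3), ticks=["a1-b1-c1", "a1-b1-c2", "a2-b1-c1"])
a["a1-b1-c1"] = 1.0                     # same element as a[0]
```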


> 2) Joining data structures - what to do when all data structures have 
> the same 'metadata' (axes, labels, dtypes) and when each of these 
> differ. Also, do you allow union (so the result includes all axes, 
> labels etc present in all data structures) or intersection (keep only the 
> axes and labels in common) operations?

First, I assume two levels of semantics on the structure:

 * A set of values, reached by indexing multiple axes/dimensions.
  
   I use this for identifying experiments, where the parameters of the
   experiments are spread among an arbitrary number of dimensions (see above).

 * A specific value within a set of values, (in my case) reached by indexing a
   field in a structured array.

   I use each structure/record to encapsulate all the various outputs of a
   single experiment, where structure fields can have arbitrarily different
   types.

What I allow right now in sciexp2 is the "union" of experiment outputs; that is,
the union of structured arrays into a single one.
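For the structured-array side of this union, numpy itself ships a building
block in numpy.lib.recfunctions (a sketch, not the sciexp2 code; the field
names are made up):

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Two structured arrays holding different outputs of the same experiments.
a = np.array([(1.0,), (2.0,)], dtype=[("time", "f8")])
b = np.array([(10,), (20,)], dtype=[("count", "i4")])

# "Union" of the outputs: a single structured array carrying both fields.
merged = rfn.merge_arrays((a, b), flatten=True)
```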

On the "experiment" metadata side, I think operations should fail if the
metadata differs, unless you want to "append" new experiments (in my case, this
is appending new tick variable values that describe the very same tick
variables). Conceptually, I implement this as follows (def append(self, data)):

  0) Check both are describing the same type of experiments (i.e., have the same
     metadata variables, although different values for them)

        assert self.variables() == data.variables()

  1) Flatten the affected arrays (flattening is the inverse of the reshape in
     the example above; I have not implemented it yet, but it would be easy to
     do if speed were not a concern)

        flat_self = self.flatten()
        flat_data = data.flatten()

  2) Concatenate the two sequences of metadata. Will fail if any repeated
     elements exist.

        res = np.Data(len(flat_self) + len(flat_data),
                      metadata=flat_self.metadata + flat_data.metadata)
        res[:len(flat_self)] = flat_self
        res[len(flat_self):] = flat_data

  3) Reshape 'res' metadata like 'self'. This will take care of inserting NaNs
     to homogenize the resulting structure.

  4) Return res :)
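Steps 0-2 above can be sketched in plain numpy, modelling the per-experiment
metadata as a flat list of tick labels (Data, flatten() and the reshape in
step 3 are sciexp2-specific, so they are left out; all names here are
hypothetical):

```python
import numpy as np

def append(self_values, self_ticks, data_values, data_ticks):
    # 0) same kind of experiments: ticks must be built from the same number
    #    of '-'-separated tick variables
    assert ({t.count("-") for t in self_ticks}
            == {t.count("-") for t in data_ticks})
    # 2) concatenate the values and the metadata; fail on repeated elements
    assert not set(self_ticks) & set(data_ticks), "repeated ticks"
    res = np.concatenate([self_values, data_values])
    res_ticks = list(self_ticks) + list(data_ticks)
    # 3)/4) the caller would now reshape 'res' like 'self', NaN-padding holes
    return res, res_ticks

vals, ticks = append(np.array([1.0, 1.0]), ["a1-b1", "a1-b2"],
                     np.array([2.0]), ["a2-b1"])
```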


> 3) How do you expect basic mathematical operations to work? For example, 
> what does A +1 mean if A has different data types like strings?

I'd opt for forcing the user to specify which structure fields are to be
operated on (of course, assuming these really are structured arrays).

But, one thing that has been bugging me is how to operate on all fields when the
operation is compatible with all fields. For example, calculating the average of
all experiment results on a given axis.

Right now I have to calculate each of them separately, and then perform a
"union" of the resulting structured arrays (which in fact are not structured
arrays but plain ndarrays).
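That workaround looks roughly like this (a sketch with made-up field names;
the per-field results are packed back into a structured scalar by hand):

```python
import numpy as np

# One experiment output per field; the operation (mean) is compatible
# with every field, but must be applied field by field.
a = np.array([(1.0, 10), (3.0, 30)], dtype=[("time", "f8"), ("count", "i8")])

# Compute each field separately, then "union" the results back into a
# single structured value.
means = np.array(tuple(a[name].mean() for name in a.dtype.names),
                 dtype=[(name, "f8") for name in a.dtype.names])
```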


> 4) How should this interact with the rest of numpy?

Not sure what you mean. I already maintain metadata through all numpy
operations, except when indexing with 'numpy.newaxis', for which I currently
return a plain ndarray instead of creating stub metadata for the new dimension.

BTW, I stuck with the 'dimension' wording instead of 'axis' because of
'numpy.ndarray.ndim'. Maybe this should be unified with the 'axis' argument of
numeric operations, in order to use a single wording for the concept.


Read you,
     Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth
