[Numpy-discussion] Records in scipy core

Colin J. Williams cjw at sympatico.ca
Fri Dec 2 08:52:03 CST 2005


Travis Oliphant wrote:

> Christopher Hanley wrote:
>
>> Hi Travis,
>>
>> About a year ago (summer 2004) on the numpy distribution list there 
>> was a lot of discussion of the records interface.  I will dig through 
>> my notes and put together a summary.
>>  
>>
> Thanks for the pointers.  I had forgotten about that discussion.   I 
> went back and re-read the thread.
>
> Here's a good link for others to re-read (the end of) this thread:
>
> http://news.gmane.org/find-root.php?message_id=%3cBD22BAC0.E9EB%25perry%40stsci.edu%3e 
>
>
> I think some very good points were made.  These points should be
> addressed in the context of scipy arrays, which now support records
> in a very basic way.  Because of this, we can support nested records
> of records, but how this is to be presented to the user is still an
> open question (i.e. how do you build one?).
>
> I've finally been converted to believe that the notion of records is
> very important because it speaks to how to do the basic (typeless,
> mathless) array object that will go into Python correctly.  If we can
> get the general records type done right, then all the other types are
> examples of it.
>
> Thus, I would like to revive discussion of the record object for 
> inclusion in scipy core.  I pretty much agree with the semantics that 
> Perry described in his final email (is this all implemented in 
> numarray, yet?), except I would agree with Francesc Alted that a 
> titles or labels concept should be allowed.
>
> I'm more enthusiastic about code than discussion, so I'm hoping for a 
> short-lived discussion followed by actual code.  I'm ready to do the 
> implementation this week (I've already borrowed lots of great code 
> from numarray which makes it easier), but feel free to chime in even 
> if you read this later.
>
> In my mind, the discussion about the records array is primarily a 
> discussion about the records data-type.  The way I'm thinking, the 
> scipy ndarray is a homogeneous collection of the same "thing."  The 
> big change in scipy core is that Numeric used to allow only certain 
> data types, but now the ndarray can contain an arbitrary "void" data 
> type.  You can also add data-types to scipy core.  These data-types 
> are "almost" full members of the scipy data-type community.  The 
> "almost" is because the N*N casting  matrix is not updated (this would 
> require a re-design of how casting is considered).   At some point, 
> I'd like to fix this wart and make it so that data-types can be added 
> at will -- I think if we get the record type right, I'll be able to 
> figure out how to do this.
>
> We need to add a "record" data-type to scipy.  Then, any array can be 
> of "record" type, and there will be an additional "array scalar" that 
> is what is returned when selecting a single element from the array.   
> So, a record array would simply be an array of "records" plus some
> extra stuff for dealing with the mapping from field names to actual
> segments of the array element (we may decide that this mapping is
> general enough that all scipy arrays should have the capability of
> assigning names to sub-bytes of their main data-type and a means of
> accessing those sub-bytes, in which case the subclass is unnecessary).
>
> Let me explain further:  Right now, the machinery is in place in 
> scipy_core to get and set in any ndarray (regardless of its data-type) 
> an arbitrary "field".  A "field" in this context is defined as a 
> sub-section of the basic element making up the array.   Generically 
> the sub-section is defined by an offset and a data-type or a tuple of 
> a data type and a shape (to allow sub-arrays in a record).    What I 
> understand the user to want is the binding of a name to this generic 
> sub-section descriptor.
>
> 1) Should we allow that for every scipy ndarray?  Complex data types
> have an obvious binding, but would anybody want to name the first two
> bytes of their int32 array?  I suggest holding off on this one until a
> records array is working.
>
> 2) Supposing we don't go with number 1, we need to design a record 
> data type that has this name-binding capability.
>
> The recarray class in scipy core SVN essentially just does this.
>
> Question:  How important is backwards compatibility with the old
> numarray specification?  In particular, I would go with the .fields
> access described by Perry and eliminate the .field() approach.
>
I feel that it is not particularly important.  Having a good design is 
the thing to shoot for.

>
> Thanks for reading and any comments you can make.
>
> -Travis
>
I'm not clear what the current design objective is, so I'll try to
recap, and perhaps expand on, my contributions to the referenced
discussion to set out the sort of arrangement I would like to see.

We are moving towards having a multi-dimensional array which can hold
objects of fixed size and type, the smallest being one byte (although
the void type would appear to be a collection of zero-size objects).
Most of the need, and thus the focus, is on numeric objects, ranging in
size from Int8 to Complex64.

The Record is a fixed size object containing fields.  Each field has a 
name, an optional title and data of a fixed type (perhaps including 
another record instance and maybe arrays of fixed size?).
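
As a rough illustration of such a fixed-size record, here is a minimal
sketch using the structured data types found in present-day NumPy.  That
choice is an assumption made for illustration, not the scipy core design
under discussion, and the field names (id, position, flags) are
hypothetical.

```python
import numpy as np

# Hypothetical record layout: a scalar, a fixed-size sub-array and a flag.
sample = np.dtype([
    ("id", np.int32),
    ("position", np.float64, (3,)),   # sub-array of three doubles
    ("flags", np.uint8),
])

# Each field is described by a data type and a byte offset into the element.
offsets = {name: sample.fields[name][1] for name in sample.names}

data = np.zeros(5, dtype=sample)      # five records, each of the same size
positions = data["position"]          # the sub-array field adds a trailing axis
```

Every element occupies sample.itemsize bytes, and selecting a field
yields an array with the same leading shape as the record array.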

In the example below, AddressRecord and PersonRecord would be
sub-classes of Record in which the fields are named and, optionally,
given titles.  The names would be valid Python identifiers, whereas
the title could be any Python string.
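
The distinction between names and titles can be sketched with a NumPy
structured data type, where a field may be declared as a (title, name)
pair.  This is only an illustration under the assumption that NumPy's
current dtype machinery is available; the titles chosen here are
hypothetical.

```python
import numpy as np

# Each field is declared as ((title, name), format).
address = np.dtype([
    (("Street Number", "streetNumber"), np.int32),
    (("Street Name", "streetName"), "U30"),
    (("Zip Code", "postalCode"), "U10"),
])

# For a titled field, dtype.fields[name] is (data type, offset, title).
titles = {name: address.fields[name][2] for name in address.names}
```

In this sketch the title is stored on the data type rather than on the
field's data, so it is looked up through the dtype.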

The use of attributes raises the possibility that one could have nested 
records.  For example, suppose one has an address record:

addressRecord
   streetNumber
   streetName
   postalCode
   ...

There could then be a personal record:

personRecord
   ...
   officeAddress
   homeAddress
   ...

One could address a component as rec.homeAddress.postalCode.

Suppose one has an (n, n) array of persons; then one could access the
data in the following ways:

persons[1]                            all records in the second row
persons[:,1]                          all records in the second column
persons[1, 1]                         a specific person record
persons[1, 1].homeAddress             the home address record for a specific person
persons[1, 1].homeAddress.postalCode  the postal code for a specific person
persons.homeAddress.postalCode        an (n, n) array containing all postal codes
persons.homeAddress.postalCode.title  could be 'Zip Code'
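
Most of the access patterns listed above can be tried out with a nested
structured data type viewed as a numpy.recarray, which exposes fields as
attributes.  This is a sketch under the assumption that present-day
NumPy is used; the field formats and the sample postal code are
hypothetical, and the .title attribute in the last line of the list is
not covered.

```python
import numpy as np

address = np.dtype([
    ("streetNumber", np.int32),
    ("streetName", "U30"),
    ("postalCode", "U10"),
])
person = np.dtype([
    ("name", "U30"),
    ("officeAddress", address),   # nested record
    ("homeAddress", address),     # nested record
])

n = 3
persons = np.zeros((n, n), dtype=person).view(np.recarray)

# Field views share memory with the parent array, so this fills one record.
persons.homeAddress.postalCode[1, 1] = "K1A 0B1"

row = persons[1]                                 # all records in the second row
column = persons[:, 1]                           # all records in the second column
one = persons[1, 1]                              # a single person record
home = one.homeAddress                           # nested address record
code = one.homeAddress.postalCode                # postal code of that person
all_codes = persons.homeAddress.postalCode       # (n, n) array of postal codes
```

The same data stays reachable with string keys, for example
persons["homeAddress"]["postalCode"], so attribute access is a layer
over the named fields rather than a replacement for them.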

I see no need to have the attribute 'field' and would like to avoid the
use of strings to identify a record component.  This does require that
fields be named as Python identifiers, but is this restriction a killer?

Colin W.




