[Numpy-discussion] Proposed record array behavior: the rest of the story

Perry Greenfield perry at stsci.edu
Tue Jul 20 09:05:02 CDT 2004


We now turn to the behavior of Records. Many of the current proposals had
been considered in the past but were not implemented, partly out of a 'wait
and see' attitude towards what was really necessary, and partly out of a
desire to avoid too many ways of doing the same thing before there was a
real call for them.

This proposal deals with the behavior of record array 'items', i.e., what we
call Record objects now.

The primary issues that have been raised with regard to Record behavior are
summarized as follows:

1) Items should be tuples instead of Records
2) Items should be objects, but present tuple and/or dictionary consistent
behavior.
3) Field (or column) names should be accessible as Record (and record
array) attributes.

Issue 1: Should record array items be tuples instead of Records?

Francesc Alted made this suggestion recently. Essentially the argument is
that tuples are a natural way of representing records. Unfortunately, tuples
do not provide a means of accessing fields of a record by name, but only by
number. For this reason alone, tuples don't appear to be adequate. Francesc
proposed allowing dictionary-like indexing to record arrays to facilitate
the field access to tuple entries by name. However, it seems that if
rarr is a record array, that both rarr['column 1'][2] and rarr[2]['column
1'] should work, not just the former. So the short answer is "No".

It should be noted that using tuples would force another change in current
behavior. The current Record objects are actually views into the record
array: changing a value within a Record object changes the record array.
Tuples won't allow that since tuples are not mutable. If single elements of
record arrays were set from and returned as tuples, a record could only be
changed by replacing it in its entirety.
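
To make the view-versus-tuple distinction concrete, here is a minimal
pure-Python sketch. The RecordView class and the column storage are
hypothetical stand-ins invented for illustration, not numarray code; only
the method names field() and setfield() come from the current interface.

```python
# Hypothetical column-oriented storage standing in for a record array.
columns = {"c1": [1.1, 2.2], "c2": [2, 5]}


class RecordView:
    """Illustrative view onto one row; not numarray's Record class."""

    def __init__(self, columns, row):
        self._columns = columns
        self._row = row

    def field(self, name):
        return self._columns[name][self._row]

    def setfield(self, name, value):
        # The write goes straight to the shared storage.
        self._columns[name][self._row] = value


view = RecordView(columns, 1)
view.setfield("c2", 7)  # the parent storage changes as well

# A tuple is a detached copy of row 1 and cannot be modified in place.
snapshot = tuple(col[1] for col in columns.values())
try:
    snapshot[1] = 9
except TypeError:
    tuple_is_immutable = True
else:
    tuple_is_immutable = False
```

With tuples as items, the only way to change row 1 would be to build a new
tuple and assign the whole thing back to the array.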

But his comments (as well as those of others) do point out a number of
problems with the current implementation that could be improved, and making
the Record object support tuple behaviors is quite reasonable. Hence:

Issue 2: Should record array items present tuple and/or dictionary
compatible behaviors?

The short answer is, yes, we do agree that they should. This includes many
of the proposals made including:

1) supporting all Tuple capabilities with the following differences:
    a) fields are mutable (unlike tuple items) so long as the assigned value
is coerceable to the expected type. For example the current methods of doing
so are:

>>> cell = oneRec.field(1)
>>> oneRec.setfield(1, newValue)

This proposal would allow:

>>> cell = oneRec[1]
>>> oneRec[1] = newValue

    b) slice assignments are permitted so long as it doesn't change the size
of the record (i.e., no insertion of extra items) and the items can be
assigned as permitted for a. E.g.,

>>> oneRec[2:4] = (3, 'abc')

    c) __str__ will result in a display looking like that for tuples,
__repr__ will show a Record constructor

>>> print oneRec # as is currently implemented
(1.1, 2, 'abc', 3)
>>> oneRec
Record((1.1, 2, 'abc', 3), formats=['1Float32', '1Int16', '1a3', '1Int32'],
       names=['abc', 'c2', 'xyz', 'c4'])

(note that how best to handle formats is still being thought about)
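
The tuple-like behavior in 1a to 1c can be sketched in plain Python as
follows. This is only an illustration under stated assumptions:
RecordSketch is a hypothetical class, the per-field converter callables
(float, int, str) stand in for the real format codes, and the repr omits
formats since their handling is still undecided.

```python
class RecordSketch:
    """Illustrative only; not the numarray Record implementation."""

    def __init__(self, values, converters, names):
        self._converters = list(converters)  # one callable per field
        self._names = list(names)
        self._values = [c(v) for c, v in zip(self._converters, values)]

    def __len__(self):
        return len(self._values)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return tuple(self._values[index])
        return self._values[index]

    def __setitem__(self, index, value):
        if isinstance(index, slice):
            positions = range(len(self._values))[index]
            new_values = list(value)
            if len(new_values) != len(positions):
                raise ValueError("slice assignment must not change the record size")
            # Coerce everything first so a failure leaves the record unchanged.
            coerced = [self._converters[p](v) for p, v in zip(positions, new_values)]
            for p, v in zip(positions, coerced):
                self._values[p] = v
        else:
            self._values[index] = self._converters[index](value)

    def __str__(self):
        return str(tuple(self._values))

    def __repr__(self):
        return "RecordSketch(%r, names=%r)" % (tuple(self._values), self._names)


rec = RecordSketch((1.1, 2, 'abc', 3), (float, int, str, int),
                   ('abc', 'c2', 'xyz', 'c4'))
rec[1] = "7"             # coerced to the integer 7
rec[2:4] = ('xyz', 4.0)  # same length, each value coerced per field

try:
    rec[0:2] = (1.0,)    # would shrink the record
except ValueError:
    size_change_rejected = True
else:
    size_change_rejected = False
```

The values used in the slice assignment here were chosen to match the field
types of this particular record.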

2) supporting all Dictionary capabilities with the following differences:
    a) keys and items are ordered.
    b) keys are restricted to being integers or strings only
    c) new keys cannot be dynamically added or deleted as for dictionaries
    d) no support for any other dictionary capabilities that can change the
number or names of items
    e) __str__ will not show a result looking like a dictionary (see 1c)
    f) values must meet Record object required type (or be coerceable to it)
    
For example the current

>>> cell = oneRec.field('c2')
>>> oneRec.setfield('c2', newValue)

And the proposed added indexing capability:

>>> cell = oneRec['c2']
>>> oneRec['c2'] = newValue
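
A rough sketch of the dictionary-like restrictions in 2a to 2f, again in
plain Python. NamedRecordSketch is a hypothetical class, and coercing to the
type of the value already stored in a field is an assumption standing in for
the real format check.

```python
class NamedRecordSketch:
    """Illustrative only; keys are integers or strings, and the set is fixed."""

    def __init__(self, names, values):
        self._names = list(names)
        self._values = list(values)

    def _position(self, key):
        if isinstance(key, str):
            try:
                return self._names.index(key)
            except ValueError:
                raise KeyError(key) from None
        if isinstance(key, int):
            return key
        raise TypeError("keys must be integers or strings")

    def __getitem__(self, key):
        return self._values[self._position(key)]

    def __setitem__(self, key, value):
        # Only existing fields can be set; an unknown name raises KeyError.
        pos = self._position(key)
        # Assumption: coerce to the type of the value already stored there.
        self._values[pos] = type(self._values[pos])(value)

    def __delitem__(self, key):
        raise TypeError("fields cannot be deleted from a record")

    def keys(self):
        return list(self._names)  # ordered, unlike a plain dictionary

    def items(self):
        return list(zip(self._names, self._values))


rec = NamedRecordSketch(['abc', 'c2', 'xyz', 'c4'], [1.1, 2, 'abc', 3])
rec['c2'] = 5.0  # coerced to the integer 5

try:
    rec['new'] = 1
except KeyError:
    new_key_rejected = True
else:
    new_key_rejected = False

try:
    del rec['c2']
except TypeError:
    delete_rejected = True
else:
    delete_rejected = False
```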

Issue 3: Field (or column) names should be accessible as Record (and record
array) attributes.

As much as the attribute approach has appeal for simple usage, the problems
of name collisions and of mismatches between acceptable field names and
legal attribute names strike us, as they do Russell Owen, as very
problematic. The technique of using a special attribute as Francesc suggests
(in his case, cols) that contains the field name attributes solves the name
collision problem, but not the legality issue (particularly with regard to
illegal characters, it's hard to imagine easily remembered mappings between
legal attribute representations and the actual field names). We are inclined
to pass (for now anyway) on mapping fields to attributes in any way. It
seems to us that indexing by name should be convenient enough, as well as
flexible enough to satisfy all needs. It is needed in any case, since
attributes are a clumsy way to access a field when a variable specifies the
field name (yes, one can use getattr(), but it's clumsy).
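
The collision and legality problems can be illustrated with a small check.
The helper attribute_problem and the sample field names are made up for this
example; the set of existing attributes just reuses the two method names
mentioned earlier.

```python
import keyword


def attribute_problem(field_name, existing_attributes):
    """Return a short reason why a field name fails as an attribute, or None."""
    if not field_name.isidentifier():
        return "not a legal identifier"
    if keyword.iskeyword(field_name):
        return "reserved word"
    if field_name in existing_attributes:
        return "collides with an existing attribute"
    return None


existing = {"field", "setfield"}  # method names a Record already has
names = ["c2", "column 1", "class", "field"]
problems = {name: attribute_problem(name, existing) for name in names}

# Indexing by name has neither problem and works directly with a variable.
record = {"column 1": 3.5}  # a plain dict standing in for a Record
name = "column 1"
value = record[name]
```

A 'cols'-style container attribute would remove the collision case for
'field', but 'column 1' and 'class' would still need some renaming scheme.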

*******************************************

Record array behavior changes:

1) It will be possible to assign any sequence to a record array item so long
as the sequence contains the right number of fields, and each item of the
sequence can be coerced to what the record array expects for the
corresponding field of the record. (addressing numarray feature request
928473 by Russell Owen).

I.e.,

>>> recArr[1] = (2, 3.2, 'xyz', 3)
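
The check implied by this rule might look roughly like the sketch below. The
function assign_record, the list-of-tuples storage and the converter
callables are all assumptions made for illustration, not part of numarray.

```python
def assign_record(rows, index, sequence, converters):
    """Store any sequence as row `index`, coercing each item per field."""
    values = list(sequence)
    if len(values) != len(converters):
        raise ValueError("expected %d fields, got %d"
                         % (len(converters), len(values)))
    rows[index] = tuple(conv(v) for conv, v in zip(converters, values))


converters = (int, float, str, int)  # stand-ins for the field formats
rows = [(0, 0.0, '', 0), (0, 0.0, '', 0)]

# A list works as well as a tuple, as long as the length and types fit.
assign_record(rows, 1, [2, 3.2, 'xyz', 3], converters)

try:
    assign_record(rows, 0, (1, 2.0), converters)  # too few fields
except ValueError:
    wrong_length_rejected = True
else:
    wrong_length_rejected = False
```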

2) One may assign a record to a record array so long as the record's format
matches the record format of the record array (current behavior).
3) Easier construction and initialization of recarrays with default field
values (as requested in numarray bug report 928479).
4) Support for lists of field names and formats as detailed in numarray bug
report 928488.
5) Field name indexing for record arrays. It will be possible to index
record arrays with a field name, i.e., if the index is a string, then what
will be returned is a numarray/chararray for that column. (Note that it
won't be possible to index record arrays by field number, since an integer
index already selects a record.)

I.e. Currently

>>> col = recArr.field('abc')

Can also be

>>> col = recArr['abc']

But the current

>>> col = recArr.field(1)

Cannot become

>>> col = recArr[1]

On the other hand, it will not be permitted to mix a field index with an
array index in the same brackets, e.g., rarr[10, 'column 2'] will not be
supported. Allowing indexing to have two different interpretations is a bit
worrying. But if record array items may be indexed in this manner, it seems
natural to permit the same indexing for the record array. Mixing the two
kinds of indexing in one index seems of limited usefulness in the first
place and it makes inheriting the existing indexing machinery for NDArrays
more complicated (any efficiency gains in avoiding the intermediate object
creation by using two separate index operations will likely be offset by the
slowness of handling much more complicated mixed indices). Perhaps someone
can argue for why mixing field indices with array indices is important, but
for now we will prohibit this mode of indexing.
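
As a rough illustration of the proposed dispatch, here is a pure-Python
sketch. RecArraySketch is hypothetical; it returns a plain list for a column
and a dict for a record, where the real thing would return a
numarray/chararray and a Record.

```python
class RecArraySketch:
    """Illustrative only: a string selects a column, an integer a record."""

    def __init__(self, names, rows):
        self._names = list(names)
        self._rows = [tuple(r) for r in rows]

    def __getitem__(self, index):
        if isinstance(index, str):
            pos = self._names.index(index)  # ValueError for an unknown name
            return [row[pos] for row in self._rows]
        if isinstance(index, int):
            # A dict stands in for a Record so the item can be indexed by name.
            return dict(zip(self._names, self._rows[index]))
        # Tuples such as (1, 'c2') land here: no mixed indexing.
        raise IndexError("unsupported index in this sketch: %r" % (index,))


rarr = RecArraySketch(['abc', 'c2'], [(1.1, 2), (2.2, 5)])
col = rarr['c2']          # the whole column
same = rarr[1]['c2']      # record first, then field
also_same = rarr['c2'][1] # field first, then record

try:
    rarr[1, 'c2']         # mixing both kinds in one index
except IndexError:
    mixed_rejected = True
else:
    mixed_rejected = False
```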

This does point to a possible enhancement for the field indexing, namely
being able to provide the equivalent of index arrays (e.g., a list of field
names) to generate a new  record array with a subset of fields.
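
One way to picture this enhancement, using plain lists and tuples rather
than real record arrays (select_fields is a hypothetical helper, not a
proposed API):

```python
def select_fields(names, rows, wanted):
    """Return (names, rows) restricted to the fields listed in `wanted`."""
    positions = [names.index(n) for n in wanted]  # ValueError if unknown
    subset = [tuple(row[p] for p in positions) for row in rows]
    return list(wanted), subset


names = ['abc', 'c2', 'xyz', 'c4']
rows = [(1.1, 2, 'abc', 3), (2.2, 5, 'def', 6)]

# The order of `wanted` decides the field order of the result.
sub_names, sub_rows = select_fields(names, rows, ['xyz', 'abc'])
```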

Are there any other issues that should be addressed for improving record
arrays?





