[Numpy-discussion] Data type change completed

Colin J. Williams cjw at sympatico.ca
Tue Dec 6 07:05:06 CST 2005


Travis Oliphant wrote:

> Colin J. Williams wrote:
>
>> Travis Oliphant wrote:
>>
>>>
>>> I've committed the data-type change discussed at the end of last 
>>> week to the SVN repository.  Now the concept of a data type for an 
>>> array has been replaced with a "data-descriptor".  This 
>>> data-descriptor is flexible enough to handle an arbitrary record 
>>> specification with fields that include records and arrays or arrays 
>>> of records.  While nesting may not be the best data-layout for a new 
>>> design, when memory-mapping an arbitrary fixed-record-length file, 
>>> this capability allows you to handle even the most obsure record file.
>>>
>> Does this mean that the dtype parameter is changed?  obscure??
>
>
> No, it's not changed.  The dtype parameter is still used and it is 
> still called the same thing.   It's just that what constitutes a 
> data-type has changed significantly.
>
> For example now tuples and dictionaries can be used to describe a 
> data-type.  These definitions are recursive so that whenever data-type 
> is used it means anything that can be interpreted as a data-type.  And 
> I really mean data-descriptor, but data-type is in such common usage 
> that I still use it.

This would appear to be a good step forward, but with all of the 
immutable types (int8, FloatType, TupleType, etc.) the data is stored in 
the ArrayType instance (array_data?), whereas with a dictionary it 
would appear to be necessary to store the items outside the array.  Is 
that desirable?

Even the tuple can have its content modified, as the example below shows:

 >>> a= []
 >>> b= (a, [2, 3])
 >>> b[0]
[]
 >>> b[0]=99
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: object does not support item assignment        <<<    GOOD
 >>> b[1][0]
2
 >>> b[1][0]=99
 >>> b
([], [99, 3])        <<< HERE WE CHANGE THE VALUE OF THE TUPLE
 >>>

>
> Tuple:
> ========
> (fixed-size-data-type, shape)
> (generic-size-data-type, itemsize)
> (base-type-data-type, new-type-data-type)
>
> Examples:
>
> dtype=(int32, (5,5))   ---  a 5x5 array of int32 is the description of 
> this item.
> dtype=(str, 10) --- a length-10 string

So dtype now contains both the data type of each element and the shape 
of the array?  This seems a significant change from numarray or Numeric.
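On the question of shape, a minimal sketch using present-day NumPy (an assumption that it reflects the design described here; the 2005 SVN code may differ): the shape inside the tuple describes each item, while the array keeps its own shape.

```python
import numpy as np

# (fixed-size-data-type, shape): a 5x5 block of int32 as one item.
sub = np.dtype((np.int32, (5, 5)))
print(sub.base, sub.shape, sub.itemsize)   # int32 (5, 5) 100

# (generic-size-data-type, itemsize): a length-10 byte string.
s10 = np.dtype((np.bytes_, 10))
print(s10.itemsize)                        # 10

# Used as a field, the sub-array shape belongs to each element,
# while the array keeps its own shape.
rec = np.zeros(3, dtype=[("block", np.int32, (5, 5))])
print(rec.shape)            # (3,)
print(rec["block"].shape)   # (3, 5, 5)
```

The field name "block" is made up for illustration.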

> dtype=(int16, {'real':(int8,0),'imag':(int8,4)}  --- a descriptor that 
> acts like an int16 array mathematically (in ufuncs) but has real and 
> imag fields.
>
>
This adds complexity; is there a compensating benefit?  Do all of the 
complex operations apply?
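To make the (base-type, new-type) form concrete, here is a hedged sketch in present-day NumPy, which still accepts this form although its documentation discourages it. It uses byte offsets 0 and 1 so that both int8 fields fit inside the two-byte int16; the names 'real' and 'imag' are only field labels and do not bring complex arithmetic with them.

```python
import sys
import numpy as np

# (base-type, new-type): an int16 whose two bytes are also exposed
# as int8 fields named 'real' and 'imag' (offsets 0 and 1).
dt = np.dtype((np.int16, {"real": (np.int8, 0), "imag": (np.int8, 1)}))

a = np.zeros((4, 3), dtype=dt)
a["real"] = 1
a["imag"] = 2

print(dt.itemsize)        # 2
print(a.tobytes()[:4])    # b'\x01\x02\x01\x02'

# Reinterpreted as plain int16, each element combines both bytes;
# the numeric value depends on the machine's byte order.
as_int16 = a.view(np.int16)
expected = int.from_bytes(b"\x01\x02", sys.byteorder)
print(int(as_int16[0, 0]) == expected)
```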

>
> Dictionary (defaults to a dtypechar == 'V')
> ==========

Why not clean things up by dropping typechar?  This seemed to be one of 
the warts in numarray, carried forward only for compatibility reasons.  
Could the compatibility objectives of the project not be achieved 
outside the ArrayType object, with a wrapper of some sort?

> format1:
>
> {"names": list-of-field-names,
>  "formats":  list of data-types,
>
> <optionally>
>  "offsets" : list of  start-of-the-field
>  "titles" : extra field names
> }
>
Couldn't the use of records avoid the cumbersome use of keys?

> format2 (and how it's stored internally)
>
> {key1 : (data-type1, offset1 [, title1]),
>  key2 : (data-type2, offset2 [, title2]),
>   ...
>  keyn : (data-typen, offsetn [, titlen])
> }
>
This is cleaner, but couldn't this information be contained within the 
Record instance?
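For comparison, a sketch of both dictionary formats as present-day NumPy accepts them (an assumption that this matches the design quoted above), applied to a packed fixed-length record of the kind a memory-mapped file might hold. The field names "id" and "value" are invented for the example.

```python
import struct
import numpy as np

# format1: parallel lists of names, formats and (optional) offsets.
fmt1 = np.dtype({"names": ["id", "value"],
                 "formats": ["<i4", "<f8"],
                 "offsets": [0, 4]})

# format2: one (data-type, offset) tuple per field name.
fmt2 = np.dtype({"id": ("<i4", 0), "value": ("<f8", 4)})

print(fmt1.itemsize, fmt2.itemsize)   # 12 12
print(fmt1.fields["value"][1])        # 4  (byte offset of 'value')

# Two packed 12-byte records, as they might sit in a fixed-record file.
raw = struct.pack("<id", 7, 2.5) + struct.pack("<id", 8, 3.5)
records = np.frombuffer(raw, dtype=fmt2)
print(records["id"].tolist())      # [7, 8]
print(records["value"].tolist())   # [2.5, 3.5]
```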

>
> Other objects not already covered:
> =====================
> ????
> Right now, it just passes the tp_dict of the typeobject to the 
> dictionary-conversion routine.
> I'm open for ideas here and will probably have better ideas once I see 
> what the actual record data-type (not data-descriptor but actual 
> subclass of the scipy.void data type) looks like.
>
> All of these can be used as the dtype parameter wherever it is taken 
> (of course you can't
> always do something useful with every data-descriptor).
> When an ndarray has an associated type descriptor with fields (that's 
> where the field information is
> stored),  then those fields can be accessed using string or unicode 
> keys to the getitem call.
>
I've used ArrayType in place of ndarray (or maybe it should have been 
ndbigarray?) above as it appears to be more descriptive and fits with the 
Python convention on class naming.

> Thus, you can do something like this:
>
> >>> a = ones((4,3), dtype=(int16, {'real':(int8, 0), 'imag':(int8, 1)}))
> >>> a['imag'] = 2
> >>> a['real'] = 1
> >>> a.tostring()
> '\x01\x02\x01\x02\x01\x02\x01\x02\x01\x02\x01\x02\x01\x02\x01\x02\x01\x02\x01\x02\x01\x02\x01\x02' 
>
>
Or, one could have something like:

class SmallComplex(Record):
    ''' This class typically has no instances in user code. '''
    real = (int8, )
    imag = (int8, )
    def __init__(self):
        pass
    def __new__(cls):
        pass
 >>> a = ones((4,3), dtype= SmallComplex)
 >>> a.imag = 2
 >>> a.real = 1
 >>> a.tostring()
'\x01\x02\x01\x02\x01\x02\x01\x02\x01\x02\x01\x02\x01\x02\x01\x02\x01\x02\x01\x02\x01\x02\x01\x02' 
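Attribute-style access of this kind can be tried today with np.recarray; the following is only a sketch under that assumption, not the Record class imagined above. The field names 're' and 'im' are hypothetical stand-ins, chosen because 'real' and 'imag' would collide with attributes every ndarray already has.

```python
import numpy as np

# Field names 're' and 'im' are hypothetical; 'real' and 'imag' would
# collide with attributes that every ndarray already has.
small_complex = np.dtype([("re", np.int8), ("im", np.int8)])

a = np.zeros((4, 3), dtype=small_complex).view(np.recarray)
a.im = 2
a.re = 1

print(a.re[0, 0], a.im[0, 0])   # 1 2
print(a.tobytes()[:4])          # b'\x01\x02\x01\x02'
```

The byte layout matches the string shown above: one byte for each field, twelve elements in all.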


>
> Note that there are now three distinct but interacting Python objects:
>
> 1) the N-dimensional array of a fixed itemsize.
> 2) a Python object representing one element of the array.
> 3) the data-descriptor object describing the data-type of the array.

This looks cleaner.  Perhaps 2) and 3) could be phrased a little 
differently:

2) a Python object which is one element of the array.
3) the data-descriptor object describing the data-type of the array 
element.
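The three objects can be seen side by side in a short sketch, assuming present-day NumPy and an invented two-field record:

```python
import numpy as np

# 3) the data-descriptor (field names here are made up)
descr = np.dtype([("id", np.int32), ("value", np.float64)])

# 1) the N-dimensional array of a fixed itemsize
arr = np.zeros(3, dtype=descr)

# 2) a Python object which is one element of the array
elem = arr[0]

print(isinstance(arr, np.ndarray))    # True
print(isinstance(elem, np.void))      # True
print(isinstance(descr, np.dtype))    # True
print(elem.dtype == arr.dtype)        # True
print(arr.itemsize)                   # 12
```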

>
> These three things were always there under the covers (the 
> PyArray_Descr* has been there since Numeric), and the standard Python 
> types were always filling in for number 2.  Now we are just being more 
> explicit about it.
>
> Now, all three things are present and accounted for.  I'm really quite 
> happy with the resulting infrastructure. I think it will allow some 
> really neat possibilities.
>
> I'm thinking the record array subclass will allow attribute-based 
> look-up and register a nice record type for the actual "element" of 
> the record array.
>
This is good but the major structure is the array which can have 
elements of various types such as ComplexType, NoneType, int8 or a 
variety of user defined immutable records.

Colin W.

PS  My Record sketch above needs a lot more thinking through

>
> -Travis




