[Numpy-discussion] Re: Array Metadata

Francesc Altet faltet at carabos.com
Fri Apr 1 02:01:11 CST 2005


I'm very much with the opinions of Scott. Just some remarks.

A Divendres 01 Abril 2005 06:12, Scott Gilbert va escriure:
> > __array_names__ (optional comma-separated names for record fields)
>
> I really like this idea.  Although I agree with David M. Cooke that it
> should be a tuple of names.  Unless there is a use case I'm not
> considering, it would be preferrable if the names were restricted to valid
> Python identifiers.

Ok. I was thinking on easing the life of C extension writers, but I
agree that a tuple of names should be relatively easily dealed in C as
well. However, as the __array_typestr__ would be a plain string, then
an __array_names__ being a plain string would be consistent with that.

Also, it would be worth to know how to express a record of different
shaped fields. I mean, how to represent a record like:

[array(Int32,shape=(2,3)), array(Float64,shape=(3,))]

The possibilities are:

__array_shapes__ = ((2,3),(3,))
__array_typestr__ = (i,d)

Other possibility could be an extension of the current struct format:

__array_typestr__ = "(2,3)i(3,)d"

more on that later on.

> The struct module has a portable set of typecodes.  They call it
> "standard", but it's the same thing.  The struct module let's you specify
> either standard or native.  For instance, the typecode for "standard long"
> ("=l") is always 4 bytes while a "native long" ("@l") is likely to be 4 or
> 8 bytes depending on the platform.  The __array_typestr__ codes should
> require the "standard" sizes.  There is a table at the bottom of the
> documentation that goes into detail:
>
>     http://docs.python.org/lib/module-struct.html

I fully agree with Scott here. Struct typecodes are offering a way to
approach the Python standards, and this is a good thing for many
developers that knows nothing of array packages and its different
typecodes. IMO, the set of portable set of typecodes in struct module
should only be abandoned if they cannot fulfil all the requirements of
Numeric3/numarray. But I'm pretty confident that they will eventually
do.

> The only problem with the struct module is that it's missing a few types...
> (long double, PyObject, unicode, bit).

Well, bit is not used either in Numeric/numarray and I think few
people would complain on this (they can always pack bits into bytes).
PyObject and unicode can be reduced to a sequence of bytes and some
other metadata to the array protocol can be added to complement its
meaning (say __array_str_encoding__ = "UTF-8" or similar). 

long double is the only type that should be added to struct typecodes,
but convincing the Python crew to do that should be not difficult, I
guess.

> > I also think that rather than attach < or > to the start of the
> > string it would be easier to have another protocol for endianness.
> > Perhaps something like:
> >
> > __array_endian__  (optional Python integer with the value 1 in it).
> > If it is not 1, then a byteswap must be necessary.
>
> A limitation of this approach is that it can't adequately represent
> struct/record arrays where some fields are big endian and others are little
> endian.

Having a mix of different endianess data values in the same data
record would be a bit ill-minded. In fact, numarray does not support
this: a recarray should be all little or big endian. I think that '<'
and '>' would be more than enough to represent this.

> > Bool               -- "b%d" % sizeof(bool)
> > Signed Integer     -- "i%d" % sizeof(<some int>)
> > Unsigned Integer   -- "u%d" % sizeof(<some uint>)
> > Float              -- "f%d" % sizeof(<some float>)
> > Complex            -- "c%d" % sizeof(<some complex>)
> > Object             -- "O%d" % sizeof(PyObject *)
> >          --- this would only be useful on shared memory
> > String             -- "S%d" % itemsize
> > Unicode            -- "U%d" % itemsize
> > Void               -- "V%d" % itemsize
>
> The above is a nice start at reinventing the struct module typecodes.  If
> you and Perry agree to it, that would be great.  A few additions though:

Again, I think it would be better to not get away from the struct
typecodes. But if you end doing it, well, I would like to propose a
couple of additions to the new protocol:

1.- Support shapes for record specification. I'm listing two
possibilities:

  A) __array_typestr__ = "(2,3)i(3,)d"
  
  This would be an easy extension of the struct string type definition.

  B) __array_typestr__ = ("i4","f8")
     __array_shapes__ = ((2,3),(3,))

  This is more 'à la numarray'.
  
2.- Allow nested datatypes. Although numarray does not support this
yet, I think it could be very advantageous to be able to express:

[array(Int32,shape=(5,)),[array(Int16,shape=(2,)),array(Float32,shape=(3,4))]]

i.e., the first field would be an array of ints with 6 elements, while
the second field would be actually another record made of 2 fields:
one array of short ints, and other array of simple precision floats.

I'm not sure how exactly implement this, but, what about:

  A) __array_typestr__ = "(5,)i[(2,)h(3,4)f]"
  
  B) __array_typestr__ = ("i4",("i2","f8"))
     __array_shapes__ = ((5,),((2,),(3,4))
  
Because I'm suggesting to adhere the struct specification, I prefer
option A), although I guess option B would be easier to use for
developers (even for extension developers).


> > So, what if we proposed for the Python core not something like
> > Numeric3 (which would still exist in scipy.base and be everybody's
> > favorite array :-) ), but a very minimal array object (scaled back
> > even from Numeric) that followed the array protocol and had some
> > C-API associated with it.
> >
> > This minimal array object would support 5 basic types ('bool',
> > 'integer', 'float', 'complex', 'Object').   (Maybe a void type
> > could be defined and a void "scalar" introduced (which would be
> > the bytes object)).  These types correspond to scalars already
> > available in Python and so the whole 0-dim array Python scalar
> > arguments could be ignored.
>
> I really like this idea.  It could easily be implemented in C or Python
> script.  Since half it's purpose is for documentation, the Python script
> implementation might make more sense.

Yeah, I fully agree with this also.

Cheers,

-- 
>qo<   Francesc Altet     http://www.carabos.com/
V  V   Cárabos Coop. V.   Enjoy Data
 ""





More information about the Numpy-discussion mailing list