[Numpy-discussion] Bytes Object and Metadata

Scott Gilbert xscottg at yahoo.com
Fri Mar 25 22:59:04 CST 2005


Adding metadata at the buffer object level causes problems for "view"
semantics.  Let's say that everyone agreed what "itemsize" and "itemtype"
meant:

    real_view = complex_array.real

The real_view would have to use a new buffer, since the two can't share the
old one: the buffer used in complex_array would have a typecode like
ComplexDouble and an itemsize of 16, while the buffer in real_view would
need a typecode of Double and an itemsize of 8.  If metadata is stored with
the buffer object, the same buffer object can't serve both places.

Another case would be treating a 512x512 image of 4 byte pixels as a
512x512x4 image of 1 byte RGBA elements.  Or even coercing from Signed to
Unsigned.
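
A rough sketch of the conflict, using the struct module (illustrative
only; the typecode names follow the prose above):

    import struct

    # One shared 16-byte buffer holding a single complex double.
    raw = struct.pack('dd', 1.5, 2.5)      # real part, imag part

    # complex_array's description of raw: typecode ComplexDouble,
    #   itemsize 16.
    # real_view's description of raw: typecode Double, itemsize 8,
    #   offset 0, stride 16.
    # Both descriptions refer to the same bytes, so itemsize and
    # typecode can't be attributes of the buffer itself.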


The bytes object as proposed does allow new views to be created from other
bytes objects (sharing the same memory underneath), and these views could
each have separate metadata, but then you wouldn't be able to have arrays
that used other types of buffers.  Having arrays use mmap buffers is very
useful.

The bytes object shouldn't create views from arbitrary other buffer objects
because it can't rely on the general semantics of the PyBufferProcs
interface.  The foreign buffer object might realloc and invalidate the
pointer, for instance.  The current Python "buffer" builtin creates views
like this, and the results are bad.  So creating a bytes object as a view
on the mmap object doesn't work in the general case.
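
To illustrate the failure mode with the current builtins (a sketch of
what can go wrong, not code anyone should run):

    from array import array

    a = array('d', [0.0] * 10)
    b = buffer(a)               # captures a pointer into a's storage
    a.extend([0.0] * 100000)    # the array may realloc and move
    # b may now refer to freed memory; using it is undefined behavior.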

Actually, now that I think about it, the mmap object might be safe.  I
don't believe the current implementation of mmap does any reallocing behind
the scenes, and I think the pointer stays valid for the lifetime of the
object.  If we verified that mmap is safe enough, bytes could make a
special case out of it, but then you would be locked into bytes and mmap
only.  Maybe that's acceptable...

Still, I think keeping the metadata at a different level, and having the
bytes object just be the Python way to spell a call to C's malloc will
avoid a lot of problems.  Read below for how I think the metadata stuff
could be handled.
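
In that scheme the bytes object would amount to nothing more than this
(a hypothetical spelling, since the constructor isn't pinned down here):

    # Roughly buf = malloc(n) in C: a bytes object owns a block of
    # memory and nothing else.
    n = 512 * 512 * 16
    buf = bytes(n)

    # Any itemtype/shape/strides metadata lives in whatever array
    # object wraps buf, never in buf itself.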


--- Chris Barker <Chris.Barker at noaa.gov> wrote:
> 
> There are any number of Third party extensions that could benefit from 
> being able to directly read the data in Numeric* arrays: PIL, wxPython, 
> etc. Etc. My personal example is wxPython:
> 
> At the moment, you can pass a Numeric or numarray array into wxPython, 
> and it will be converted to a wxList of wxPoints (for instance), but 
> that is done by using the generic sequence protocol, and a lot of type 
> checking. As you can imagine, that is pretty darn slow, compared to just 
> typecasting the data pointer and looping through it. Robin Dunn, quite 
> reasonably, doesn't want wxPython to depend on Numeric, so that's what 
> we've got.
> 
> My understanding of this memory object is that an extension like
> wxPython wouldn't need to know about Numeric, but could simply get
> the memory Object, and there would be enough meta-data with it to 
> typecast and loop through the data. I'm a bit skeptical about how this 
> would work. It seems that the metadata required would be the full set of 
> stuff in an array Object already:
> 
> type
> dimensions
> strides
> 
> This could be made a bit simpler by allowing only contiguous arrays, but 
> then there would need to be a contiguous flag.
> 
> To make use of this, wxPython would have to know a fair bit about 
> Numeric Arrays anyway, so that it can check to see if the data is 
> appropriate. I guess the advantage is that while the wxPython code would 
> have to know about Numeric arrays, it wouldn't have to include Numeric 
> headers or code.
> 

I think being able to traffic in N-Dimensional arrays without having to
link against the libraries is a good thing.

Several years ago, I proposed a solution to this problem.  Actually I did a
really poor job of proposing it and irritated a lot of people in the
process.  I'm embarrassed to post a link to the following thread, but here
it is anyway:

    http://aspn.activestate.com/ASPN/Mail/Message/numpy-discussion/1166013

Accept my apologies if you read the whole thing just now.  :-)  Accept my
sincere apologies if you read it at the time.


I think the proposal is still relevant today, but I might revise it a bit
as follows.  A bare minimum N-Dimensional array for interchanging data
across libraries could get by with the following attributes:

    # Create a simple record type for storing attributes
    class BearMin: pass
    bm = BearMin()

    # Set the attributes sufficient to describe a simple ndarray
    bm.buffer = <a buffer or sequence object>
    bm.shape = <a tuple of ints describing its shape>
    bm.itemtype = <a string describing the elements>

The bm.buffer and bm.shape attributes are pretty obvious.  I would suggest
that bm.itemtype borrow its typecodes from the Python struct module,
but anything that everyone agreed on would work.  (The struct module is
nice because it is already documented and supports native and portable
types of many sizes in both endians.  It also supports composite struct
types.)
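
For instance, the existing typecodes already cover the common cases:

    import struct

    struct.calcsize('d')      # native double           -> 8
    struct.calcsize('B')      # unsigned byte           -> 1
    struct.calcsize('>i')     # big-endian 4-byte int   -> 4
    struct.calcsize('4B')     # composite: four ubytes  -> 4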

Those attributes are sufficient for someone to *produce* an N-Dimensional
array that could be understood by many libraries.  Someone who *consumes*
the data would need to know a few more:

    bm.offset = <an integer offset into the buffer>
    bm.strides = <a tuple of ints for non-contiguous or Fortran arrays>

The value of bm.offset would default to zero if it wasn't present, and the
tuple bm.strides could be generated from the shape assuming a C-style
array.  Subscripting operations that returned non-contiguous views of
shared data could change bm.offset to non-zero.  Subscripting would also
affect the bm.strides, and creating a Fortran style array would require
bm.strides to be present.

You might also choose to add bm.itemsize in addition to bm.itemtype for
cases where you can describe how big the elements are but can't
sufficiently describe what the data is using the agreed-upon typecodes.
This would be uncommon.  The default for bm.itemsize would come from
struct.calcsize(bm.itemtype).
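
Putting the consumer-side defaults together (a sketch, reusing the bm
names from above):

    import struct

    def default_strides(shape, itemsize):
        # C-style (row major): the last dimension varies fastest.
        strides = []
        stride = itemsize
        for dim in reversed(shape):
            strides.insert(0, stride)
            stride *= dim
        return tuple(strides)

    itemsize = getattr(bm, 'itemsize', struct.calcsize(bm.itemtype))
    offset = getattr(bm, 'offset', 0)
    strides = getattr(bm, 'strides', default_strides(bm.shape, itemsize))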

You might also choose to add bm.complicated for when the array layout can't
be described by the shape/offset/stride combination.  For instance
bm.complicated might get used when creating views from more sophisticated
subscripting operations like index arrays or mask arrays.  Although it
looks like Numeric3 plans on making new contiguous copies in those cases.

The C implementations of arrays would only have to add getattr-like
methods, and the data could be stored very compactly.
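
On the consuming side, a library like wxPython could duck-type the
protocol without linking against anything (hypothetical helper name):

    def looks_like_ndarray(obj):
        # True if obj provides the minimal producer attributes above.
        return (hasattr(obj, 'buffer') and
                hasattr(obj, 'shape') and
                hasattr(obj, 'itemtype'))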


