[Numpy-discussion] Introduction

Scott Gilbert xscottg at yahoo.com
Thu Apr 11 21:46:02 CDT 2002


--- Perry Greenfield <perry at stsci.edu> wrote:
>
> I guess we are not sure we understand what you mean by interface.
> In particular, we don't understand why sharing the same object
> attributes (the private ones you list above) is a benefit to the
> code you are writing if you aren't also using the low level
> implementation. The above attributes are private and nothing 
> external to the Class should depend on or even know about them.
> Could you elaborate on what you mean by interface and the
> relationship between your arrays and numarrays?
>

There are several places in your code that check to see if you are working with
a valid type for NDArrays.  Currently this check consists of asking the
following questions:

   'Is it a tuple or list?'
   'Is it a scalar of some sort?'
   'Does it derive from our NDArray class?'

If any of these questions answer true, it does the right thing and moves on. 
If none of these is true, it raises an exception.

I suppose this is fine if you are only concerned about working with your own
implementation of an array type, but I hope you'll consider the following as a
minor change that opens up the possibility for other compatible array
implementations to work interoperably.

Instead have the code ask the following questions:

   'Is it a tuple or list?'
   'Is it a scalar of some sort?'
   'Does it support the attributes necessary to be like an NDArray object?'

This change is very similar to how you can pass in any Python object to the
"pickle.dump()" function, and if it supports the "write()" method it will be
called:

      >>> class WhoKnows:
      ...     def write(self, x):
      ...          print x
      >>>
      >>> import pickle
      >>>
      >>> w = WhoKnows()
      >>>
      >>> pickle.dump('some data', w)
      S'some data'
      p1
      .
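
The transcript above is Python 2. As a rough sketch of the same duck-typing idea on Python 3, something like the following should work. The class name ChunkCollector is made up for illustration; the only thing pickle.dump() needs from the object is a write() method that accepts bytes.

```python
import pickle


class ChunkCollector:
    """Hypothetical file-like object: only write() is implemented."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        # pickle hands over bytes-like data; keep a copy of each chunk.
        self.chunks.append(bytes(data))


w = ChunkCollector()
pickle.dump('some data', w)

# Reassembling the chunks gives a valid pickle of the original object.
restored = pickle.loads(b''.join(w.chunks))
print(restored)   # some data
```

pickle never checks what class it was given. It only calls write(), which is the behavior being proposed for NDArray checks.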

Until reading your response above, I didn't realize that you consider your
single underscore attributes to be totally private.  In general, I try to use a
single underscore to mean protected (meaning you can use them if you REALLY
know what you are doing), hence my confusion.  With that in mind, pretend that
I suggested the following instead:

    The specification of an NDArray is that it has the following attributes

        ndarray_buffer      - a PyObject which has PyBufferProcs
        ndarray_shape       - a tuple specifying the shape of the array
        ndarray_stride      - a tuple specifying the index multipliers
        ndarray_itemsize    - an int/long stating the size of items
        ndarray_itemtype    - some representation of type 

This would be a very minor change to your functions like inputarray(),
getNDInfo(), getNDArray(), but it would allow your UFuncs to work with other
implementations of arrays.  As an example similar to the pickle example above:

     import array
     class ScottArray:
         def __init__(self):
             self.ndarray_buffer   = array.array('d', [0]*100)
             self.ndarray_shape    = (10, 10)
             self.ndarray_stride   = (80, 8)
             self.ndarray_itemsize = 8
             self.ndarray_itemtype = 'Float64'

     import numarray

     n = numarray.numarray((10, 10), type='Float64')
     s = ScottArray()

     very_cool = numarray.add(n, s)
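
The "does it support the necessary attributes?" question could be answered by a check along these lines. This is only a sketch: is_ndarray_like() and REQUIRED_ATTRIBUTES are hypothetical names, not part of numarray, and the attribute list is the one proposed above.

```python
import array

# Attribute names taken from the proposed specification above.
REQUIRED_ATTRIBUTES = (
    'ndarray_buffer',
    'ndarray_shape',
    'ndarray_stride',
    'ndarray_itemsize',
    'ndarray_itemtype',
)


def is_ndarray_like(obj):
    """Hypothetical helper: true if obj carries every proposed attribute."""
    return all(hasattr(obj, name) for name in REQUIRED_ATTRIBUTES)


class ScottArray:
    def __init__(self):
        self.ndarray_buffer = array.array('d', [0] * 100)
        self.ndarray_shape = (10, 10)
        self.ndarray_stride = (80, 8)
        self.ndarray_itemsize = 8
        self.ndarray_itemtype = 'Float64'


print(is_ndarray_like(ScottArray()))   # True
print(is_ndarray_like([1, 2, 3]))      # False
```

A check like this would replace the "does it derive from our NDArray class?" test, so any object carrying the attributes would pass regardless of its class.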


This example is kind of silly.  I mean, why wouldn't I just use numarray for
all of my array needs?  Well, that's where my world is a little different than
yours I think.  Instead of using 'array.array()' above, there are times where
I'll need to use 'whizbang.array()' to get a different PyBufferProcs supporting
object.  Or where I'll need to work with a crazy type in one part of the code,
but I'd like to pass it to an extension that combines your types and mine.

In these cases where I need "special memory" or "special types" I could try and
get you guys to accept a patch, but this would just pollute your project and
probably annoy you in general.  A better solution is to create a general
standard mechanism for implementing NDArray types, and let me make my own.


In the above example, we could have completely different NDArray
implementations working interoperably inside of one UFunc.  It seems to me that
all it really takes to be an NDArray can be specified by a list of attributes
like the one above.  (Probably need a few more attributes to be really general:
'ndarray_endian', etc...)  In the end, NDArrays are just pointers to a buffer,
and descriptors for indexing.
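
To illustrate the "buffer plus indexing descriptors" view, here is a minimal sketch of element access using nothing but a buffer, a stride tuple, and a struct format code. The helper name get_item() is hypothetical, and the sketch assumes 8-byte doubles stored in row-major order.

```python
import array
import struct


def get_item(buffer, strides, index, fmt):
    """Read one element; the byte offset is the dot product of index and strides."""
    offset = sum(i * s for i, s in zip(index, strides))
    return struct.unpack_from(fmt, buffer, offset)[0]


data = array.array('d', range(100))   # 10 x 10 doubles, row-major
strides = (80, 8)                      # bytes per step along each axis

print(get_item(data, strides, (3, 4), 'd'))   # element 3*10 + 4 -> 34.0
```

Nothing in get_item() depends on which class owns the buffer, so it would work on any object that supports the buffer protocol.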


I don't believe this would have any significant effect on the performance of
numarray.  (The efficient fast C code still gets a pointer to work with.)
Moreover, I'd be very willing to contribute patches to make this happen.


If you agree, and we can flesh out what this "attribute interface" should be,
then I can start distributing my own array module to the engineers where I work
without too much fear that they'll be screwed once numarray is stable and they
want to mix and match.

Code always lives a lot longer than I want it to, and if I give them something
now which doesn't work with your end product, I'll have done them a disservice.


BTW: Allowing other types to fill in as NDArrays also allows other types to
implement things like slicing as they see fit (slice and copy contiguous,
slice and copy on write, slice and copy by reference, etc...).

>
> We are hoping to get numarray into the distribution [it won't be the
> end of the world for us if it doesn't happen]. I'll warn you that the
> PEP is out of date. We are likely to update it only after we feel
> we are close to having the implementation ready for consideration 
> for including into the standard distribution. I would refer to the
> actual implementation and the design notes for the time being.
>

Yeah, I recognize that the PEP is gathering dust at the moment.  I'm not having
too much trouble following through the source and design docs.  It took me a
few days to "get it", but that's probably because I'm slower than your average
bear.  :-)

Regarding the PEP, what I would like to see happen is that if we agree that the
"attribute interface" stuff above is the right way to go about things, I would
(or we would) submit a milder interim PEP specifying what those attributes are,
how they are to be interpreted, and a simple Python module implementing a
general NDArray class for consumption.  Hopefully this PEP would specify a
canonical list of type names as well.  Then we could make updates to the other
PEP if necessary.



>
> Some of the name changes are worth considering (like replacing ._byteswap
> with an endian indicator, though I find _endian completely opaque as to
> what it would mean--1 means what? little or big?). (BTW, we already have
> _itemsize). _contiguous and _aligned are things we have been considering
> changing, but I would have to think about it carefully to determine if
> they really are redundant.
> 

It's all open for discussion, but I would propose that ndarray_endian be one
of:

    '>' - big endian
    '<' - little endian

This is how the standard Python struct module specifies endian, and I've been
trying to stay consistent with the baseline when possible.
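
For reference, this is how the struct module treats those two prefixes. The sketch packs the same double both ways; the variable ndarray_endian below simply stands in for the proposed attribute.

```python
import struct

value = 1.0
big = struct.pack('>d', value)       # big endian
little = struct.pack('<d', value)    # little endian

print(big.hex())      # 3ff0000000000000
print(little.hex())   # 000000000000f03f

# A consumer could prepend the proposed attribute value to the format code.
ndarray_endian = '>'
decoded = struct.unpack(ndarray_endian + 'd', big)[0]
print(decoded)        # 1.0
```

A '>' or '<' value can be concatenated straight into a struct format string, with no lookup table and no need to remember what 1 or 0 means.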

>
> It looks like you are trying to deal with records with these "structs". 
> We deal with records (efficiently) in a completely different way. Take
> a look at the recarray module.
> 

Will definitely do.

I've called them structs simply because they borrow their format string from
the struct module that ships with Python.  I'm not hung up on the name, and I
wouldn't object to an alias.

Too early for me to tell if there is even a difference in the underlying
memory, but maybe we'll end up with 'structs' for my notion of things, and
'records' for yours.

>
> We deal with memory mapping a completely different way. It's a bit late
> for me to go into it in great detail, but we wrap the standard library
> mmap module with a module that lets us manage memory mapped files.
> This module basically memory maps an entire file and then in effect
> mallocs segments of that file as buffer objects. This allocation of
> subsets is needed to ensure that overlapping memory maps buffers
> don't happen. One can basically reserve part of the memory mapped file
> as a buffer. Once that is done, nothing else can use that part of the
> file for another buffer. We do not intend to handle memory maps as a
> way of sequentially mapping parts of the file to provide windowed views
> as your code segment above suggests. If you want a buffer that is the
> whole (large) file, you just get a mapped buffer to the whole thing.
> (Why wouldn't you?)
> 

I think the idea of taking a 500 megabyte (or 5 gigabyte) file and windowing 1
meg of actual memory at a time is pretty attractive.  Sometimes we do very large
correlations, and there just isn't enough memory to mmap the whole file (much
less two files for correlation).

Any library that doesn't want to support this business could just raise a
NotImplementedError on encountering them.

Maybe I shouldn't be calling this "memory mapping".  Even though it could be
implemented on top of mmap, truthfully I just want to support a "windowing"
interface.  If we could specify the windowing attributes and indicate the
standard usage that would be great.  Maybe:

      ndarray_window(self, offset)
      ndarray_winmin
      ndarray_winmax
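
One possible reading of those three names, sketched with plain file reads rather than mmap. The semantics here are assumptions, not a settled design: ndarray_window(offset) moves the window to start at a byte offset, and ndarray_winmin / ndarray_winmax give the half-open byte range currently held in ndarray_buffer. The class name WindowedFile and the window_size parameter are made up for illustration.

```python
import os
import tempfile


class WindowedFile:
    """Hypothetical sketch: expose a fixed-size window onto a large file."""

    def __init__(self, path, window_size):
        self._file = open(path, 'rb')
        self._window_size = window_size
        self.ndarray_window(0)

    def ndarray_window(self, offset):
        # Reposition the window so that it starts at byte `offset`.
        self._file.seek(offset)
        self.ndarray_buffer = self._file.read(self._window_size)
        self.ndarray_winmin = offset
        self.ndarray_winmax = offset + len(self.ndarray_buffer)

    def close(self):
        self._file.close()


# Demo on a small temporary file standing in for a very large one.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 4)   # 1024 bytes
    path = tmp.name

win = WindowedFile(path, window_size=100)
win.ndarray_window(300)
print(win.ndarray_winmin, win.ndarray_winmax)   # 300 400
print(win.ndarray_buffer[0])                     # 300 % 256 -> 44
win.close()
os.remove(path)
```

Only window_size bytes are held in memory at a time, which is the point when the file is far larger than available memory.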


>
> The above scheme is needed for our purposes because many of our data files
> contain multiple data arrays and we need a means of creating a numarray
> object for each one. Most of this machinery has already been implemented,
> but we haven't released it since our I/O package (for astronomical FITS
> files) is not yet at the point of being able to use it.
> 

There is a group at my company that is using FITS for some stuff.  I don't know
enough about it to comment though...


Cheers,
    -Scott






