[Numpy-discussion] A numpy accumulator...
Mon Oct 5 03:53:02 CDT 2009
A Saturday 03 October 2009 10:06:12 Christopher Barker escrigué:
> OK -- this one I'm intending to send!
> Hi all,
> This idea was inspired by a discussion at the SciPy conference, in which
> we spent a LOT of time during the numpy tutorial talking about how to
> accumulate values in an array when you don't know how big the array
> needs to be when you start.
> The "standard practice" is to accumulate in a python list, then convert
> the final result into an array. This is a good idea because Python lists
> are standard, well tested, efficient, etc.
> However, as was pointed out in that lengthy discussion, if what you are
> doing is accumulating is a whole bunch of numbers (ints, floats,
> whatever), or particularly if you need to accumulate a data type that
> plain python doesn't support, there is a lot of overhead involved: a
> python float type is pretty heavyweight. If performance or memory use is
> important, it might create issues. You can use and array.array, but it
> doesn't support all numpy types, particularly custom dtypes.
> I talked about this on the cython list (as someone asked how to do
> accumulate in cython), and a few folks thought it would be useful, so I
> put together a prototype.
> What I have in mind is very simple. It would be:
> - Only 1-d
> - Support append() and extend() methods
> - support indexing and slicing
> - Support any valid numpy dtype
> - which could even get you pseudo n-d arrays...
> - maybe it would act like an array in other ways, I'm not so sure.
> - ufuncs, etc.
> It could take the place of using python lists/arrays when you really
> want a numpy array, but don't know how big it will be until you've
> filled it.
> The implementation I have now uses a regular numpy array as the
> "buffer". The buffer is re-sized as needed with ndarray.resize(). I've
> enclosed the class, a bunch of tests (This is the first time I've ever
> really done test-driven development, though I wouldn't say that this is
> a complete test suite).
> A few notes about this implementation:
> * the name of the class could be better, and so could some of the
> method names.
> * on further thought, I think it could handle n-d arrays, as long as
> you only accumulated along the first index.
> * It could use a bunch more methods
> - deleting part of the array
> - math
> - probably anything supported by array.array would be good.
> * Robert pointed me to the array.array implementation to see how it
> expands the buffer as you append. It did tricks to get it to grow fast
> when the array is very small, then eventually to add about 1/16 of the
> used array size to the buffer. I imagine that this would gets used
> because you were likely to have a big array, so I didn't bother and
> start with a buffer at 128 elements, then add 1/4 each time you need to
> expand -- these are both tweakable attributes.
> * I'm keeping the buffer a hidden variable, and slicing and __array__
> return copies - this is so that it won't get multiple references, and
> then not be expandable.
> * I did a little simple profiling, and discovered that it's slower
> than a python list by a factor of more than 2 (for accumulating python
> ints, anyway). With a bit of experimentation, I think that's because of
> a couple factors:
> - an extra function call -- the append() method needs to then do an
> assignment to the buffer
> - Object conversion -- python lists store python objects, so the
> python int can just go right in there. with numpy, it needs to be
> converted to a C int first -- a bit if extra overhead. Though a straight
> assignment into a pre-allocated array i faster than a list.
> I think it's still an improvement for memory use.
> Maybe it would be worth writing in C or Cython to avoid some of this. In
> particular, it would be nice if you could use it in Cython, and put C
> types directly it...
> * This could be pretty useful for things like genfromtxt.
> What do folks think? is this useful? What would you change, etc?
That's interesting. I'd normally use the `resize()` method for what you want,
but indeed your approach is way more easy-to-use.
If you are looking for performance improvements, I'd have a look at the
`PyArray_Resize()` function in 'core/src/multiarray/shape.c' (trunk). It
seems to me that the zero-initialization of added memory can be skipped,
allowing for more performance for the `resize()` method (most specially for
large size increments). A new parameter (say, ``zero_init=True``) could be
added to `resize()` to specify that you don't want the memory initialized.
More information about the NumPy-Discussion