[Numpy-discussion] array constructor from generators?
zpincus at stanford.edu
Wed Apr 5 08:32:02 CDT 2006
[sorry if this comes through twice -- it seems not to have sent the first time]
> I brought this up last week and Travis was OK with it. I have it on
> my todo list, but if you are in a hurry you're welcome to do it
Sorry if that was on the list and I missed it! Hate to be adding more
noise than signal. At any rate, I'm not in a hurry, but I'd be happy
to help where I can. (Though for the next week or so I think I'm
> If you do look at it, consider looking into the '__length_hint__
> parameter that's slated to go into Python 2.5. When this is
> present, it's potentially a big win, since you can preallocate the
> array and fill it directly from the iterator. Without this, you
> probably can't do much better than just building a list from the
> iterator. What would work well would be to build a list, then steal
> its memory. I'm not sure if that's feasible without leaking a
> reference to the list though.
Can you steal its memory and then give it some dummy memory that it
can free without problems, so that the list can be deallocated
without trouble? Does anyone know if you can just give the list a
NULL pointer for its memory and then immediately decref it? free
(NULL) should always be safe, I think. (??)
> Also, with iterators, specifying dtype will make a huge difference.
> If an object has __length_hint__ and you specify dtype, then you
> can preallocate the array as I suggested above. However, if dtype
> is not specified, you still need to build the list completely,
> determine what type it is, allocate the array memory and then copy
> the values into it. Much less efficient!
How accurate is __length_hint__ going to be? It could lead to a fair
bit of special-case code for growing and shrinking the final array if
__length_hint__ turns out to be wrong -- bookkeeping that Python lists
already handle internally.
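The preallocate-and-fill strategy, including recovery from an inaccurate hint, can be sketched in pure Python (the real implementation would do this in C on raw memory; `array_from_iter` is an illustrative name, not a NumPy function, and `operator.length_hint` is the modern accessor for the `__length_hint__` protocol discussed here):

```python
import operator
import numpy as np

def array_from_iter(iterable, dtype):
    """Sketch: preallocate from __length_hint__, grow if the hint was
    too small, and trim if it was too large."""
    it = iter(iterable)
    hint = operator.length_hint(it, 0)   # falls back to 0 if no hint
    out = np.empty(max(hint, 4), dtype=dtype)
    n = 0
    for item in it:
        if n == len(out):                # hint was too small: grow ~1.5x
            out = np.resize(out, int(len(out) * 1.5) + 1)
        out[n] = item
        n += 1
    return out[:n].copy()                # trim if the hint was too large

squares = array_from_iter((i * i for i in range(10)), dtype=np.float64)
```

For what it's worth, this is essentially the shape that `numpy.fromiter` eventually took: it requires a `dtype` and accepts an optional `count` so the array can be preallocated exactly.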
If the list's memory can be stolen safely, how does this strategy sound:
- Given a generator, build it up into a list internally, and then
steal the list's memory.
- If a dtype is provided, wrap the generator with another generator
that casts the original generator's output to the correct dtype. Then
use the wrapped generator to create a list of the proper dtype, and
steal that list's memory.
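The casting-wrapper step can be sketched like so (`casting_wrapper` is an illustrative name; a C implementation would steal the resulting list's buffer rather than copying as done here):

```python
import numpy as np

def casting_wrapper(gen, dtype):
    """Wrap a generator so every value it yields is pre-cast to dtype."""
    scalar = np.dtype(dtype).type        # e.g. np.float64 for 'f8'
    for value in gen:
        yield scalar(value)

# Build the intermediate list from the wrapped generator, then hand it
# to the array constructor.
values = list(casting_wrapper((x for x in range(5)), 'f8'))
arr = np.array(values)
```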
A potential problem with stealing list memory is that it could waste
memory if the list has more bytes allocated than it is using (I'm not
sure if python lists can get this way, but I presume that they resize
themselves only every so often, like C++ or Java vectors, so most of
the time they have some allocated but unused bytes). If lists have a
squeeze method that's guaranteed not to cause any copies, or if this
can be added with judicious use of realloc, then that problem is
solvable.
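For the record, CPython lists do over-allocate on append, and this is easy to observe from Python (the exact growth pattern is an implementation detail):

```python
import sys

lst = []
sizes = []
for i in range(100):
    lst.append(i)
    sizes.append(sys.getsizeof(lst))

# getsizeof stays flat between resizes: the list keeps spare capacity,
# so most appends do not trigger a reallocation.
distinct = len(set(sizes))
```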
> Another note of caution: You are going to have to deal with
> iterators of
> iterators of iterators of.... I'm not sure that actually matters
> much; I haven't looked at PyArray_New for some time. Enjoy!
This is a good point. Numpy does fine with nested lists, but what
should it do with nested generators? I originally thought that
basically 'array(generator)' should make the exact same thing as
'array([f for f in generator])'. However, for nested generators, this
would be an object array of generators.
I'm not sure which is better -- having more special cases for
generators that make generators, or having a simple rubric like above
for how generators are treated.
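Concretely, the two behaviors under discussion look like this (shown with present-day NumPy for illustration; the constructor being designed here would have to pick one):

```python
import numpy as np

rows = [(x for x in (1, 2, 3)), (x for x in (4, 5, 6))]

# Treating inner generators as opaque objects gives a 1-D object array
# whose elements are the (unconsumed) generators themselves:
obj_arr = np.array(rows, dtype=object)

# Materializing each inner generator first gives an ordinary 2-D array:
rows2 = [(x for x in (1, 2, 3)), (x for x in (4, 5, 6))]
num_arr = np.array([list(g) for g in rows2])
```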