[Numpy-discussion] array constructor from generators?

Zachary Pincus zpincus at stanford.edu
Wed Apr 5 08:32:02 CDT 2006

[sorry if this comes through twice -- seems to have not sent the  
first time]

Hi folks,

> I brought this up last week and Travis was OK with it. I have it on  
> my todo list, but if you are in a hurry you're welcome to do it  
> instead.

Sorry if that was on the list and I missed it! Hate to be adding more  
noise than signal. At any rate, I'm not in a hurry, but I'd be happy  
to help where I can. (Though for the next week or so I think I'm  

> If you do look at it, consider looking into the '__length_hint__  
> parameter that's slated to go into Python 2.5. When this is  
> present, it's potentially a big win, since you can preallocate the  
> array and fill it directly from the iterator. Without this, you  
> probably can't do much better than just building a list from the  
> iterator. What would work well would be to build a list, then steal  
> its memory. I'm not sure if that's feasible without leaking a  
> reference to the list though.
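Roughly, the preallocate-and-fill idea could look like this in pure Python (using operator.length_hint, the Python 3 spelling of the `__length_hint__` proposal; the helper name is made up, and this simplest version assumes the hint is exact):

```python
import numpy as np
from operator import length_hint  # Python 3 spelling of __length_hint__

def fill_from_iter(it, dtype):
    """Sketch, not numpy API: preallocate from the length hint and
    fill the array directly from the iterator, with no temporary list."""
    it = iter(it)
    n = length_hint(it)          # consult the iterator's size hint
    out = np.empty(n, dtype=dtype)
    for i, value in enumerate(it):
        out[i] = value
    return out
```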

Can you steal its memory and then give it some dummy memory that it  
can free without problems, so that the list can be deallocated  
without trouble? Does anyone know if you can just give the list a  
NULL pointer for its memory and then immediately decref it?  
free(NULL) should always be safe, I think. (??)

> Also, with iterators, specifying dtype will make a huge difference.  
> If an object has __length_hint__ and you specify dtype, then you  
> can preallocate the array as I suggested above. However, if dtype  
> is not specified, you still need to build the list completely,  
> determine what type it is, allocate the array memory and then copy  
> the values into it. Much less efficient!

How accurate is __length_hint__ going to be? If it turns out to be  
wrong, we could need a fair bit of special-case code for growing and  
shrinking the final array -- code that Python lists already have.
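To make the worry concrete: treating the hint only as a starting guess means doubling the buffer when it's too low and trimming when it's too high, much as CPython lists grow. A sketch (the function name is invented):

```python
import numpy as np
from operator import length_hint

def fromiter_robust(it, dtype):
    """Sketch: use the length hint as a first guess only.  Grow the
    buffer by doubling (as CPython lists do) and trim at the end."""
    it = iter(it)
    cap = max(length_hint(it, 8), 1)   # no hint: fall back to a small default
    out = np.empty(cap, dtype=dtype)
    n = 0
    for value in it:
        if n == cap:                   # hint was too low: double
            cap *= 2
            out = np.resize(out, cap)
        out[n] = value
        n += 1
    return out[:n].copy()              # hint was too high: shrink
```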

If the list's memory can be stolen safely, how does this strategy sound:
- Given a generator, build it up into a list internally, and then  
steal the list's memory.
- If a dtype is provided, wrap the generator with another generator  
that casts the original generator's output to the correct dtype. Then  
use the wrapped generator to create a list of the proper dtype, and  
steal that list's memory.
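At the Python level the casting-wrapper strategy would look something like this (the function name is made up, and np.array stands in for the C-level "steal the list's memory" step):

```python
import numpy as np

def array_from_gen(gen, dtype=None):
    """Sketch of the strategy above: optionally wrap the generator so
    every element is cast to the target dtype, build a plain list,
    then hand it to numpy (where C code could steal the list buffer)."""
    if dtype is not None:
        py_type = np.dtype(dtype).type
        gen = (py_type(x) for x in gen)   # casting wrapper generator
    items = list(gen)                     # build the intermediate list
    return np.array(items, dtype=dtype)
```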

A potential problem with stealing list memory is that it could waste  
memory if the list has more bytes allocated than it is using. (I'm  
not sure if Python lists can get this way, but I presume that they  
resize themselves only every so often, like C++ or Java vectors, so  
most of the time they have some allocated but unused bytes.) If  
lists had a squeeze method guaranteed not to cause any copies, or if  
one could be added with judicious use of realloc, then that problem  
is solved.

> Another note of caution: You are going to have to deal with  
> iterators of iterators of iterators of.... I'm not sure if that  
> actually overly complicates matters; I haven't looked at  
> PyArray_New for some time. Enjoy!

This is a good point. Numpy does fine with nested lists, but what  
should it do with nested generators? I originally thought that  
basically 'array(generator)' should make the exact same thing as  
'array([f for f in generator])'. However, for nested generators, this  
would be an object array of generators.

I'm not sure which is better -- having more special cases for  
generators that yield generators, or having a simple rubric like the  
one above for how generators are treated.

Any thoughts?

