[Numpy-discussion] array constructor from generators?

Tim Hochberg tim.hochberg at cox.net
Wed Apr 5 12:16:11 CDT 2006


Zachary Pincus wrote:

> [sorry if this comes through twice -- seems to have not sent the  
> first time]

I've only seen it once so far, but my numpy mail seems to be coming 
through all out of order right now.

> Hi folks,
>
> tim>
>
>> I brought this up last week and Travis was OK with it. I have it on  
>> my todo list, but if you are in a hurry you're welcome to do it  
>> instead.
>
>
> Sorry if that was on the list and I missed it! Hate to be adding more  
> noise than signal. At any rate, I'm not in a hurry, but I'd be happy  
> to help where I can. (Though for the next week or so I think I'm  
> swamped...)

There was no real discussion then. I said I thought it was a good idea. 
Travis said OK. That was about it.

> tim>
>
>> If you do look at it, consider looking into the __length_hint__
>> parameter that's slated to go into Python 2.5. When this is present,
>> it's potentially a big win, since you can preallocate the array and
>> fill it directly from the iterator. Without this, you probably can't
>> do much better than just building a list from the iterator. What
>> would work well would be to build a list, then steal its memory. I'm
>> not sure if that's feasible without leaking a reference to the list,
>> though.
>
>
> Can you steal its memory and then give it some dummy memory that it
> can free without problems, so that the list can be deallocated
> without trouble? Does anyone know if you can just give the list a
> NULL pointer for its memory and then immediately decref it?
> free(NULL) should always be safe, I think. (??)

That might well work, but now I realize that using a list this way 
probably won't work out well for other reasons.

>> Also, with iterators, specifying dtype will make a huge difference.
>> If an object has __length_hint__ and you specify dtype, then you can
>> preallocate the array as I suggested above. However, if dtype is not
>> specified, you still need to build the list completely, determine
>> what type it is, allocate the array memory and then copy the values
>> into it. Much less efficient!
>
>
> How accurate is __length_hint__ going to be? It could lead to a fair
> bit of special-case code for growing and shrinking the final array if
> __length_hint__ turns out to be wrong.

See below.

> Code that Python lists already have, moreover.

If we don't know dtype up front, lists are great. All the code is there 
and we need to look at all of the elements before we know what the 
elements are anyway.

However, if you do know the dtype, the situation is different. Since
these are generators, each object they create only lasts until the
next next() call unless we hold onto it. That means that for an array
of size N, generating the whole list is going to require roughly
N*(sizeof(PyObject*) + sizeof(PyObject) + sizeof(dtype)) bytes, versus
just N*sizeof(dtype) if we're careful. I'm not sure what all of those
various sizes are, but I'm going to guess that we'd be at least
doubling our memory.

All is not lost, however. When we know the dtype, we can just use a
*Python* array (from the standard array module) to hold the data. It
works much like a list, but stores packed data.
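To make the difference concrete, here's a small standard-library
sketch (not NumPy code, just an illustration, with CPython-specific
sizes) comparing a list of float objects with the packed storage of an
array.array:

```python
import array
import sys

n = 100_000

# A list holds PyObject* slots, each pointing at a full float object.
as_list = [float(i) for i in range(n)]

# A Python array (the array module) packs the raw doubles contiguously.
as_packed = array.array('d', (float(i) for i in range(n)))

list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
packed_bytes = sys.getsizeof(as_packed)

print(list_bytes, packed_bytes)  # the packed version is several times smaller
```

The exact numbers vary by platform and Python version, but the packed
buffer is consistently a small fraction of the list's footprint.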

>
> If the list's memory can be stolen safely, how does this strategy sound:

Let me break this into two cases:

1. We don't know the dtype.

> - Given a generator, build it up into a list internally

+1

> , and then  steal the list's memory.

-0.5

I'm not sure this buys us as much as I thought initially. The list
memory is PyObject*, so this would only work on dtypes no larger than
the size of a pointer, which usually means no larger than a long. So,
basically, this would work on most of the integer types, but not the
floating point types. And it adds extra complexity to support two
different cases. I'd be inclined to start with just copying the
objects out of the list. If someone feels like it later, they can come
back and try to optimize the integer case to steal the list's memory.

Keep in mind that once we have a list, we can simply pass it to the
machinery that already exists for creating arrays from lists, making
our lives much easier.
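Case 1 could be sketched at the Python level like this -- a
hypothetical illustration of the control flow only, with the real
list-to-array machinery and type inference stubbed out:

```python
def array_from_iterable(iterable, dtype=None):
    """Sketch of case 1: dtype not known up front.

    Materialize the iterator into a plain list -- we have to see every
    element before we can infer a type anyway -- then hand the list to
    the existing list-to-array machinery (stubbed here as a plain list
    of converted values, since only the control flow matters)."""
    materialized = list(iterable)
    if dtype is None:
        # Crude stand-in for the real type inference over the elements.
        dtype = float if any(isinstance(x, float) for x in materialized) else int
    return [dtype(x) for x in materialized], dtype

values, inferred = array_from_iterable(i * i for i in range(4))
print(values, inferred)  # [0, 1, 4, 9] <class 'int'>
```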

> - If a dtype is provided, wrap the generator with another generator
> that casts the original generator's output to the correct dtype.
> Then use the wrapped generator to create a list of the proper dtype,
> and steal that list's memory.

-1. This wastes a lot of space and sort of defeats the purpose of the 
whole exercise in my mind.

2. Dtype is known.


The case where dtype is provided is more complicated, but this is the 
case we really want to support well. Actually though, I think we can 
simplify it by judicious punting.

Case 2a: the array is not 1-dimensional. Punt and fall back on the
general code above. We can determine this simply by testing the first
element: if it's not int/float/complex/whatever other scalar values we
have, fall back to case 1.

Case 2b: length_hint is not given. In this case, we build up the data
in a Python array, steal the data, possibly realloc, and we're done.

Case 2c: length_hint is given. Same as above, but preallocate the
appropriate amount of memory, growing it if length_hint lies.
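In modern Python, operator.length_hint exposes __length_hint__ (it
landed well after this discussion); assuming something like it, cases
2b/2c might look like this sketch, using a Python array as the packed
buffer:

```python
from array import array
from operator import length_hint  # wraps __length_hint__

def fill_from_iterator(iterable, typecode='d'):
    """Sketch of case 2: dtype (here, a typecode) known up front.

    Preallocate from the length hint, fill in place, grow if the hint
    was too low, and trim the excess if it was too high."""
    it = iter(iterable)
    buf = array(typecode)
    buf.frombytes(bytes(length_hint(it, 0) * buf.itemsize))  # zeroed slots
    n = 0
    for value in it:
        if n == len(buf):  # the hint lied low: grow geometrically
            buf.frombytes(bytes(max(8, len(buf)) * buf.itemsize))
        buf[n] = value
        n += 1
    del buf[n:]  # the hint lied high: trim back to the real length
    return buf

print(fill_from_iterator(iter(range(5))))             # uses the length hint
print(fill_from_iterator(x * 0.5 for x in range(3)))  # no hint: grows as needed
```

A plain generator reports no hint, so the second call exercises the
grow-and-trim path; the range iterator reports an exact hint, so the
first call fills a buffer of exactly the right size.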


>
> A potential problem with stealing list memory is that it could waste
> memory if the list has more bytes allocated than it is using (I'm not
> sure if Python lists can get this way, but I presume that they resize
> themselves only every so often, like C++ or Java vectors, so most of
> the time they have some allocated but unused bytes). If lists have a
> squeeze method that's guaranteed not to cause any copies, or if this
> can be added with judicious use of realloc, then that problem is
> obviated.

I imagine once you steal the memory, realloc would be the thing to
try. However, I don't think it's worth stealing the memory from lists.
I do think it's worth stealing the memory from Python arrays, however,
and I'm sure the same issue exists there. We'll have to look at how
deallocation for an array works. It probably uses Py_XDECREF, in which
case we can just replace the memory pointer with NULL and we'll be
fine.

OK, I just had a look at the code for the Python array object
(Modules/arraymodule.c). It looks like it'll be a piece of cake. We
can allocate it to the exact size we want if we have length_hint;
otherwise, resize only overallocates by about 6%, which isn't enough
to worry about reallocing. Stealing the data shouldn't be a problem
either: just NULL out ob_item as you suggested.
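The ob_item steal itself has to happen in C, but the zero-copy idea
can be illustrated from Python: the array's packed buffer is shareable
through the buffer protocol without any copying. A small sketch:

```python
import array

packed = array.array('d', (float(i) for i in range(4)))

# memoryview takes a view of the same packed buffer via the buffer
# protocol -- the same mechanism a consumer could use to adopt the
# data without copying it.
view = memoryview(packed)
assert view.format == 'd' and view.contiguous

packed[2] = 99.0   # mutation is visible through the view: no copy was made
assert view[2] == 99.0
view.release()     # let the array resize again afterwards
```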


Regards,

-tim

>
> robert>
>
>> Another note of caution: You are going to have to deal with
>> iterators of iterators of iterators of.... I'm not sure if that
>> actually overly complicates matters; I haven't looked at PyArray_New
>> for some time. Enjoy!
>
>
> This is a good point. Numpy does fine with nested lists, but what
> should it do with nested generators? I originally thought that
> basically 'array(generator)' should make exactly the same thing as
> 'array([f for f in generator])'. However, for nested generators, this
> would be an object array of generators.
>
> I'm not sure which is better -- having more special cases for
> generators that make generators, or having a simple rubric like the
> one above for how generators are treated.
>
> Any thoughts?
>
> Zach
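One possible rubric for Zach's question -- recursively expanding
nested generators into nested lists before handing off to the existing
list constructor -- could be sketched like this (hypothetical, not
what numpy currently does):

```python
def materialize(obj):
    """Recursively expand iterables (including generators) into nested
    lists, so array(gen) could behave like array([x for x in gen])
    even when the elements are themselves generators."""
    if isinstance(obj, (str, bytes)):
        return obj
    try:
        it = iter(obj)
    except TypeError:
        return obj  # a scalar: leave it alone
    return [materialize(x) for x in it]

nested = ((i * j for j in range(3)) for i in range(2))
print(materialize(nested))  # [[0, 0, 0], [0, 1, 2]]
```

The downside, of course, is that this silently consumes generators
that the user might have wanted stored as objects.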
>
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/numpy-discussion
>
>





