[Numpy-discussion] array constructor from generators?
tim.hochberg at cox.net
Wed Apr 5 12:16:11 CDT 2006
Zachary Pincus wrote:
> [sorry if this comes through twice -- seems to have not sent the
> first time]
I've only seen it once so far, but my numpy mail seems to be coming
through all out of order right now.
> Hi folks,
>> I brought this up last week and Travis was OK with it. I have it on
>> my todo list, but if you are in a hurry you're welcome to do it
> Sorry if that was on the list and I missed it! Hate to be adding more
> noise than signal. At any rate, I'm not in a hurry, but I'd be happy
> to help where I can. (Though for the next week or so I think I'm
There was no real discussion then. I said I thought it was a good idea.
Travis said OK. That was about it.
>> If you do look at it, consider looking into the '__length_hint__'
>> parameter that's slated to go into Python 2.5. When this is present,
>> it's potentially a big win, since you can preallocate the array and
>> fill it directly from the iterator. Without this, you probably can't
>> do much better than just building a list from the iterator. What
>> would work well would be to build a list, then steal its memory. I'm
>> not sure if that's feasible without leaking a reference to the list,
>> though.
> Can you steal its memory and then give it some dummy memory that it
> can free without problems, so that the list can be deallocated
> without trouble? Does anyone know if you can just give the list a
> NULL pointer for its memory and then immediately decref it? free
> (NULL) should always be safe, I think. (??)
That might well work, but now I realize that using a list this way
probably won't work out well for other reasons.
>> Also, with iterators, specifying dtype will make a huge difference.
>> If an object has __length_hint__ and you specify dtype, then you can
>> preallocate the array as I suggested above. However, if dtype is not
>> specified, you still need to build the list completely, determine
>> what type it is, allocate the array memory and then copy the values
>> into it. Much less efficient!
> How accurate is __length_hint__ going to be? It could lead to a fair
> bit of special case code for growing and shrinking the final array if
> __length_hint__ turns out to be wrong.
> Code that python lists already have, moreover.
If we don't know dtype up front, lists are great. All the code is there
and we need to look at all of the elements before we know what the
elements are anyway.
However, if you do know the dtype, the situation is different. Since
these are generators, the object they create may only last until the
next next() call if we don't hold onto it. That means that for an
array of size N, generating the whole list is going to require
N*(sizeof(PyObject*) + sizeof(value object) + sizeof(dtype)) bytes,
versus just N*sizeof(dtype) if we're careful. I'm not sure what all of
those various sizes are, but I'm going to guess that we'd be at least
doubling our memory use.
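For a rough sense of that overhead (a quick illustration at the Python level, not a measurement of numpy itself), sys.getsizeof can compare a list of boxed floats against the array module's packed storage:

```python
import sys
from array import array

n = 100_000
as_list = [float(i) for i in range(n)]                # boxed PyObjects
as_packed = array('d', (float(i) for i in range(n)))  # raw C doubles

# The list costs a pointer per element plus a float object per element;
# the packed array costs roughly sizeof(double) per element.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
packed_bytes = sys.getsizeof(as_packed)

print(list_bytes, packed_bytes)
```

On a typical 64-bit CPython the list route costs several times more memory per element than the packed array.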
All is not lost, however. When we know the dtype, we should just use a
*python* array (the array module) to hold the data. It works just like
a list, but it stores its values directly in a homogeneous C buffer
instead of as boxed Python objects.
> If the list's memory can be stolen safely, how does this strategy sound:
Let me break this into two cases:
1. We don't know the dtype.
> - Given a generator, build it up into a list internally, and then
> steal the list's memory.
I'm not sure this buys us as much as I thought initially. The list
memory is PyObject*, so this would only work on dtypes no larger than
the size of a pointer, usually that means no larger than a long. So,
basically this would work on most of the integer types, but not the
floating point types. And, it adds extra complexity to support two
different cases. I'd be inclined to start with just copying the objects
out of the list. If someone feels like it later, they can come back and
try to optimize the integer case to steal the list's memory.
Keep in mind that once we have a list, we can simply pass it to the
machinery that already exists for creating arrays from lists, making
our lives much easier.
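That fallback is trivial to sketch in Python-level terms (hypothetical helper name; the real work would happen in C):

```python
import numpy as np

def array_from_iter_unknown_dtype(iterable):
    # Exhaust the iterator into a list, then let the existing
    # list-conversion machinery inspect the elements and pick a dtype.
    return np.array(list(iterable))
```

For example, feeding it a generator of floats yields a float64 array, exactly as if a list had been passed.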
> - If a dtype is provided, wrap the generator with another generator
> that casts the original generator's output to the correct dtype. Then
> use the wrapped generator to create a list of the proper dtype, and
> steal that list's memory.
-1. This wastes a lot of space and sort of defeats the purpose of the
whole exercise in my mind.
2. Dtype is known.
The case where dtype is provided is more complicated, but this is the
case we really want to support well. Actually though, I think we can
simplify it by judicious punting.
Case 2a: The array is not 1-dimensional. Punt and fall back on the
general code above. We can determine this simply by testing the first
element: if it's not int/float/complex/whatever other scalar values we
have, fall back to case 1.
Case 2b: length_hint is not given. In this case, we build up the data
in a python array, steal its buffer, possibly realloc, and we're done.
Case 2c: length_hint is given. Same as above, but preallocate the
appropriate amount of memory, growing it if length_hint lies.
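A pure-Python sketch of cases 2b/2c, using operator.length_hint (the eventual Python 3 spelling of the hint) and the array module as the growable typed buffer; fromiter_sketch is a hypothetical name, not numpy's actual implementation:

```python
import operator
from array import array

def fromiter_sketch(iterable, typecode):
    """Preallocate when a length hint is available (case 2c),
    otherwise append and let the buffer grow (case 2b)."""
    it = iter(iterable)
    hint = operator.length_hint(it, 0)  # 0 when no hint is available
    buf = array(typecode)
    if hint:
        buf.extend([0] * hint)   # preallocate, then overwrite in place
        n = 0
        for x in it:
            if n < hint:
                buf[n] = x
            else:                # the hint lied low: grow
                buf.append(x)
            n += 1
        del buf[n:]              # the hint lied high: shrink
    else:
        for x in it:
            buf.append(x)
    return buf
```

A plain generator exercises the no-hint path, while e.g. a range iterator carries a hint and exercises the preallocated path.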
> A potential problem with stealing list memory is that it could waste
> memory if the list has more bytes allocated than it is using (I'm not
> sure if python lists can get this way, but I presume that they resize
> themselves only every so often, like C++ or Java vectors, so most of
> the time they have some allocated but unused bytes). If lists have a
> squeeze method that's guaranteed not to cause any copies, or if this
> can be added with judicious use of realloc, then that problem is
I imagine once you steal the memory, realloc would be the thing to try.
However, I don't think it's worth stealing the memory from lists. I do
think it's worth stealing the memory from python arrays, however, and
I'm sure that the same issue exists there. We'll have to look at how
the deallocation for an array works. It probably uses Py_XDECREF, in
which case we can just replace the memory pointer with NULL and we'll
be fine.
OK, just had a look at the code for the python array object
(Modules/arraymodule.c). Looks like it'll be a piece of cake. We can
allocate it to the exact size we want if we have length_hint, otherwise
resize only overallocates by 6%. That's not enough to worry about
reallocing. Stealing the data looks like it shouldn't be a problem
either; just NULL out ob_item as you suggested.
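The actual buffer steal has to happen in C, but the payoff is visible from Python: numpy's frombuffer builds an array that shares the python array's memory instead of copying it (a related zero-copy handoff, not the steal itself):

```python
import numpy as np
from array import array

buf = array('d', [1.0, 2.0, 3.0])
view = np.frombuffer(buf, dtype=np.float64)  # no copy: shares buf's memory

buf[0] = 9.0    # mutate the python array in place...
print(view[0])  # ...and the numpy array sees the change
```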
>> Another note of caution: You are going to have to deal with
>> iterators of iterators of iterators of.... I'm not sure if that
>> actually overly matters; I haven't looked at PyArray_New for some
>> time. Enjoy!
> This is a good point. Numpy does fine with nested lists, but what
> should it do with nested generators? I originally thought that
> basically 'array(generator)' should make the exact same thing as
> 'array([f for f in generator])'. However, for nested generators, this
> would be an object array of generators.
> I'm not sure which is better -- having more special cases for
> generators that make generators, or having a simple rubric like above
> for how generators are treated.
> Any thoughts?
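One way to make the "array([f for f in generator])" rubric concrete for the nested case is to expand iterators recursively before handing off to the list machinery (expand is a hypothetical helper illustrating the question, not a proposed API):

```python
import collections.abc
import numpy as np

def expand(obj):
    # Recursively turn iterators into lists, so nested generators
    # end up looking like nested lists to the array constructor.
    if isinstance(obj, collections.abc.Iterator):
        return [expand(item) for item in obj]
    return obj

rows = (iter(range(r, r + 3)) for r in range(2))
arr = np.array(expand(rows))  # a 2x3 array rather than an object array
```

Without the recursive expansion, the outer comprehension alone would leave an array whose elements are themselves generators.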