[Numpy-discussion] Ticket review: #848, leak in PyArray_DescrFromType
Michael Abbott
michael@araneidae.co...
Fri Jul 18 06:15:05 CDT 2008
I'm afraid this is going to be a long one, and I don't see any good way to
cut down the quoted text either...
Charles, I'm going to plea with you to read what I've just written and
think about it. I'm trying to make the case as clear as I can. I think
the case is actually extremely simple: the existing @name@_arrtype_name
code is broken.
On Thu, 17 Jul 2008, Charles R Harris wrote:
> On Tue, Jul 15, 2008 at 9:28 AM, Michael Abbott <michael@araneidae.co.uk>
> wrote:
> > > --- numpy/core/src/scalartypes.inc.src (revision 5411)
> > > +++ numpy/core/src/scalartypes.inc.src (working copy)
> > > @@ -1925,19 +1925,30 @@
> > > goto finish;
> > > }
> > >
> > > + Py_XINCREF(typecode);
> > > arr = PyArray_FromAny(obj, typecode, 0, 0, FORCECAST, NULL);
> > > - if ((arr==NULL) || (PyArray_NDIM(arr) > 0)) return arr;
> > > + if ((arr==NULL) || (PyArray_NDIM(arr) > 0)) {
> > > + Py_XDECREF(typecode);
> > > + return arr;
> > > + }
> > > robj = PyArray_Return((PyArrayObject *)arr);
> > >
> > > finish:
> > > - if ((robj==NULL) || (robj->ob_type == type)) return robj;
> > > + if ((robj==NULL) || (robj->ob_type == type)) {
> > > + Py_XDECREF(typecode);
> > > + return robj;
> > > + }
> > > /* Need to allocate new type and copy data-area over */
> > > if (type->tp_itemsize) {
> > > itemsize = PyString_GET_SIZE(robj);
> > > }
> > > else itemsize = 0;
> > > obj = type->tp_alloc(type, itemsize);
> > > - if (obj == NULL) {Py_DECREF(robj); return NULL;}
> > > + if (obj == NULL) {
> > > + Py_XDECREF(typecode);
> > > + Py_DECREF(robj);
> > > + return NULL;
> > > + }
> > > if (typecode==NULL)
> > > typecode = PyArray_DescrFromType(PyArray_@TYPE@);
> > > dest = scalar_value(obj, typecode);
> > > src = scalar_value(robj, typecode);
> > > Py_DECREF(typecode);
> >
> > Ahah. That DECREF balances the original PyArray_DescrFromType, or maybe
> > the later call ... and of course this has to happen on *ALL* return paths.
> > If we now take a closer look at the patch we can see that it's doing two
> > separate things:
> >
> > 1. There's an extra Py_XINCREF to balance the ref count lost to
> > PyArray_FromAny and ensure that typecode survives long enough;
> >
> > 2. Every early return path has an extra Py_XDECREF to balance the creation
> > of typecode.
> >
> > I rest my case for this patch.
>
> I still haven't convinced myself of this.
Seriously?
Please bear with me for a bit then.
Let me try and go back to basics. Reference counting discipline should be
very simple (I did post up some links on my Wed, 9 Jul 2008 08:36:27
posting on this list). Let me try and capture this in the following
principles:
1. Each routine is wholly responsible for the reference counts it creates.
This responsibility can be fulfilled by:
1.a Decrementing the reference count.
1.b Returning the object to the caller (which accounts for exactly one
reference count).
1.c Explicitly placing the object in another reference counted structure
(similarly this accounts for another reference count, and of course
creates some extra obligations which I won't discuss)
1.d Pass the object to a routine which indulges in "reference count
theft". In some sense this can be thought of as an example of 1.c,
and this is how PyList_SetItem and PyTuple_SetItem behave.
2. Reference counts are created by:
2.a Calling a routine which returns an object.
2.b Explicitly incrementing the reference count.
3. When the known reference count on an object reaches zero the object
must be treated as lost (and any further reference to it will be
undefined). In particular, a routine cannot make any assumptions about
reference counts it has not created.
I need to make a bit of a digression on the subject of reference count
theft. I am only aware of three routines that do this:
PyList_SetItem, PyTuple_SetItem and PyArray_FromAny
The first two are effectively assigning the reference into a preexisting
structure, so can be regarded as instances of principle 1.c. The last is
a disaster area. I know (by inspecting the code) that PyArray_FromAny may
choose to erase its typecode argument (by decrementing the reference), but
I can't tell (without digging deeper than I care to go) whether this only
occurs on the NULL return branch.
However, this is a bit of a red herring. If we recognise that
PyArray_FromAny steals the reference count, the remaining analysis will go
through.
A further note on 1.c: when an object is placed in another reference
counted structure its reference count should normally be incremented --
case 1.c is really just an elision of this increment with the decrement
under obligation 1.a. This normal case can be seen half way through
PyArray_Scalar.
Ok, preliminaries over: now to look at the code. Here is a caricature of
the routine @name@_arrtype_new with the important features isolated:
1 PyArray_Descr *typecode = NULL;
2 if (...) goto finish; // _WORK@work@
3 typecode = PyArray_DescrFromType(...);
4 if (...) { ...typecode...; goto finish; }
5 arr = PyArray_FromAny(..., typecode, ...);
6 if (...) return arr;
7 finish:
8 if (...) return robj;
9 if (...) return NULL;
10 if (typecode == NULL) typecode = PyArray_DescrFromType(...);
11 ... scalar_value(..., typecode);...
12 Py_DECREF(typecode);
So let's go through this line by line.
1. We start with NULL typecode. Fine.
2. A hidden goto in a macro. Sweet. Let's have more code like this.
3. Here's the reference count we're responsible for.
4. If obj is NULL we use the typecode
5. otherwise we pass it to PyArray_FromAny.
6. The first early return
7. All paths (apart from 6) come together here.
So at this point let's take stock. typecode is in one of three states:
NULL (path 2, or if creation failed), allocated with a single reference
count (path 4), or lost (path 5). This is not good.
LET ME EMPHASISE THIS: the state of the code at the finish label is
dangerous and simply broken.
The original state at the finish label is indeterminate: typecode has
either been lost by passing it to PyArray_FromAny (in which case we're not
allowed to touch it again), or else it has reference count that we're
still responsible for.
There seems to be a fantasy expressed in a comment in a recent update to
this routine that PyArray_Scalar has stolen a reference, but fortunately a
quick code inspection (of arrayobject.c) quickly refutes this frightening
possibility.
So, the only way to fix the problem at (7) is to unify the two non-NULL
cases. One answer is to add a DECREF at (4), but we see at (11) that we
still need typecode at (7) -- so the only solution is to add an extra
ADDREF just before (5). This then of course sadly means that we also need
an extra DECREF at (6).
PLEASE don't suggest moving the ADDREF until after (6) -- at this point
typecode is lost and may have been destroyed, and relying on any
possibility to the contrary is a recipe for continued screw ups.
The rest is easy. Once we've established the invariant that typecode is
either NULL or has a single reference count at (7) then the two early
returns at (8) and (9) unfortunately need to be augmented with DECREFs.
And we're done.
Responses to your original comments:
> By the time we hit finish, robj is NULL or holds a reference to typecode
> and the NULL case is taken care of up front.
robj has nothing to do with the lifetime management of typecode, the only
issue is the early return. After the finish label typecode is either NULL
(no problem) or else has a single reference count that needs to be
accounted for.
> Later on, the reference to typecode might be decremented,
That *might* is at the heart of the problem. You can't be so cavalier
about managing references.
> perhaps leaving robj crippled, but in that case robj itself is marked
> for deletion upon exit.
Please ignore robj in ths discussion, it's beside the point.
> If the garbage collector can handle zero reference counts I think
> we are alright.
No, no, no. This is nothing to do with the garbage collector. If we
screw up our reference counts here then the garbage collector isn't going
to dig us out of the hole.
> I admit I haven't quite followed all the subroutines and
> macros, which descend into the hazy depths without the slightest bit of
> documentation, but at this point I'm inclined to leave things alone unless
> you have a test that shows a leak from this source.
Part of my point is that proper reference count discipline should not
require any descent into subroutines (except for the very nasty case of
reference theft, which I think is generally agreed to be a bad thing).
As for the test case, try this one (you'll need a debug build):
import numpy
import sys
refs = 0
r = range(100)
refs = sys.gettotalrefcount()
for i in r: float32()
print sys.gettotalrefcount() - refs
You should get one leak per float32() call.
I've just noticed, looking at the latest revision of the code, that
somebody seems under the misapprehension that PyArray_Scalar steals
reference counts. Fortunately this is not true. Maybe this is part of
the confusion?
More information about the Numpy-discussion
mailing list