Strange and hard to reproduce crash

Travis Oliphant oliphant.travis at ieee.org
Mon Oct 23 18:05:53 CDT 2006


Fernando Perez wrote:
> On 10/23/06, Travis Oliphant <oliphant.travis at ieee.org> wrote:
>   
>> Fernando Perez wrote:
>>     
>>> Hi all,
>>>
>>> two colleagues have been seeing occasional crashes from very
>>> long-running code which uses numpy.  We've now gotten a backtrace from
>>> one such crash, unfortunately it uses a build from a few days ago:
>>>
>>>       
>> This looks like a reference-count problem on the data-type objects
>> (probably one of the builtin ones is trying to be released).  The
>> reference count problem is probably hard to track down.
>>
>> A quick fix is to not allow the built-ins to be "freed" (the attempt
>> should never be made, but if it is, then we should just incref the
>> reference count and continue rather than die).
>>
>> Ideally, the reference count problem should be found, but otherwise
>> I'll just insert some print statements if the attempt is made, but not
>> actually do it as a safety measure.
>>     
>
> If you point me to the right place in the sources, I'll be happy to
> add something to my local copy, rebuild numpy and rerun with these
> print statements in place.
>   

I've placed them in SVN (r3384):

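arraydescr_dealloc needs to do something like:

/* fields == Py_None is used here to spot a builtin data-type */
if (self->fields == Py_None) {
    fprintf(stderr, "attempt to free builtin data-type %d (%c)\n",
            self->type_num, self->type);
    Py_INCREF(self);   /* put the reference back instead of freeing */
    return;
}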

Most likely there is a missing Py_INCREF() before some call that uses 
the data-type object (and consumes its reference count). Do you have 
any Pyrex code? (It's harder to get it right with Pyrex.)

> I realize this is probably a very difficult problem to track down, but
> it really sucks to run a code for 4 days only to have it explode at
> the end.  Right now this is starting to be a serious problem for us as
> we move our codes into large production runs, so I'm willing to put in
> the necessary effort to track it down, though I'll need some guidance
> from our gurus.
>   

Tracking the reference count of the built-in data-type objects should 
not be too difficult.  First, figure out which one is causing problems 
(if you still have the gdb traceback, then go up to the 
arraydescr_dealloc function and look at self->type_num and self->type).

Then, put print statements throughout your code for the reference count 
of this data-type object.

Something like,

sys.getrefcount(numpy.dtype('float'))

would be enough at a looping point in your code.
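
For instance, here is a minimal sketch of such a check, assuming a Python 3 interpreter and that the body of your loop can be wrapped in a callable. The names watch_refcount, dtype_refcount and step are hypothetical helpers made up for this illustration, not part of numpy. It only reports when the count changes, so a slow leak or a slow drain shows up as a short list of lines rather than one line per iteration.

```python
import sys
import numpy


def dtype_refcount(name="float"):
    """Return the current reference count of the data-type object for `name`.

    The number includes the temporary reference held by the call itself,
    so only changes between readings are meaningful.
    """
    return sys.getrefcount(numpy.dtype(name))


def watch_refcount(step, n_iter, name="float", log=print):
    """Call step(i) n_iter times and log each change in the refcount.

    `step` is a hypothetical stand-in for one pass of the long-running loop.
    Returns a list of (iteration, refcount) pairs; the first entry uses
    iteration -1 and holds the count before the loop started.
    """
    history = [(-1, dtype_refcount(name))]
    for i in range(n_iter):
        step(i)
        count = dtype_refcount(name)
        if count != history[-1][1]:
            log("iteration %d: refcount of dtype(%r) went %d -> %d"
                % (i, name, history[-1][1], count))
            history.append((i, count))
    return history


if __name__ == "__main__":
    held = []  # keeping arrays alive lets any refcount change show up

    def step(i):
        # Stand-in workload: allocate a small array and keep it around.
        held.append(numpy.zeros(3))

    watch_refcount(step, 5)
```

A count that keeps dropping while the number of live arrays stays the same would point at the call that is consuming a reference it does not own.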

Good luck,

-Travis






