Strange and hard to reproduce crash

Fernando Perez fperez.net at gmail.com
Mon Oct 23 16:39:42 CDT 2006


Hi all,

Two colleagues have been seeing occasional crashes in very
long-running code that uses numpy.  We've now gotten a backtrace from
one such crash; unfortunately, it comes from a build a few days old:

In [3]: numpy.__version__
Out[3]: '1.0b5.dev3097'

In [4]: scipy.__version__
Out[4]: '0.5.0.2180'

Because it takes so long to get the code to crash (several days of
100% CPU usage), I can't produce a new backtrace right now, but I'll be
happy to restart the same run with a current SVN build if necessary,
and post the results in a few days.

In the meantime, here's a gdb backtrace we were able to get by setting
MALLOC_CHECK_ to 2 and running the python process from within gdb:

Program received signal SIGABRT, Aborted.
[Switching to Thread 1073880896 (LWP 26280)]
0x40000402 in __kernel_vsyscall ()
(gdb) bt
#0  0x40000402 in __kernel_vsyscall ()
#1  0x0042c7d5 in raise () from /lib/tls/libc.so.6
#2  0x0042e149 in abort () from /lib/tls/libc.so.6
#3  0x0046b665 in free_check () from /lib/tls/libc.so.6
#4  0x00466e65 in free () from /lib/tls/libc.so.6
#5  0x005a4ab7 in PyObject_Free () from /usr/lib/libpython2.3.so.1.0
#6  0x403f6336 in arraydescr_dealloc (self=0x40424020) at arrayobject.c:10455
#7  0x403fab3e in PyArray_FromArray (arr=0xe081cb0, newtype=0x40424020, flags=0)
    at arrayobject.c:7725
#8  0x403facc3 in PyArray_FromAny (op=0xe081cb0, newtype=0x0, min_depth=0,
    max_depth=0, flags=0, context=0x0) at arrayobject.c:8178
#9  0x4043bc45 in PyUFunc_GenericFunction (self=0x943a660, args=0xa9dbf2c,
    mps=0xbfc83730) at ufuncobject.c:906
#10 0x40440a04 in ufunc_generic_call (self=0x943a660, args=0xa9dbf2c)
    at ufuncobject.c:2742
#11 0x0057d607 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
#12 0x0057d6d4 in PyObject_CallFunction () from /usr/lib/libpython2.3.so.1.0
#13 0x403eabb6 in PyArray_GenericBinaryFunction (m1=Variable "m1" is not available.
) at arrayobject.c:3296
#14 0x0057b7e1 in PyNumber_Check () from /usr/lib/libpython2.3.so.1.0
#15 0x0057c1e0 in PyNumber_Multiply () from /usr/lib/libpython2.3.so.1.0
#16 0x005d16a3 in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
#17 0x005d509e in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0
#18 0x005d3d8f in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
#19 0x005d509e in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0
#20 0x00590e2e in PyFunction_SetClosure () from /usr/lib/libpython2.3.so.1.0
#21 0x0057d607 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
#22 0x00584d98 in PyMethod_New () from /usr/lib/libpython2.3.so.1.0
#23 0x0057d607 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
#24 0x005b584c in _PyObject_SlotCompare () from /usr/lib/libpython2.3.so.1.0
#25 0x005aec2c in PyType_IsSubtype () from /usr/lib/libpython2.3.so.1.0
#26 0x0057d607 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
#27 0x005d2b7f in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
#28 0x005d509e in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0
#29 0x005d3d8f in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
#30 0x005d509e in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0
#31 0x005d3d8f in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
#32 0x005d497b in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
#33 0x005d497b in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
#34 0x005d497b in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
#35 0x005d509e in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0
#36 0x005d5362 in PyEval_EvalCode () from /usr/lib/libpython2.3.so.1.0
#37 0x005ee817 in PyErr_Display () from /usr/lib/libpython2.3.so.1.0
#38 0x005ef942 in PyRun_SimpleFileExFlags () from /usr/lib/libpython2.3.so.1.0
#39 0x005f0994 in PyRun_AnyFileExFlags () from /usr/lib/libpython2.3.so.1.0
#40 0x005f568e in Py_Main () from /usr/lib/libpython2.3.so.1.0
#41 0x080485b2 in main ()

# End of BT.
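
For anyone who wants to try a similar setup, here is a minimal sketch of
how such a run can be prepared from Python.  The helper name
build_gdb_invocation and the script name long_run.py are made up for
illustration; the only pieces taken from the run above are
MALLOC_CHECK_=2 and the use of gdb.  It only builds the command line and
environment, and doesn't launch anything by itself:

```python
import os


def build_gdb_invocation(script, python_exe="python", base_env=None):
    """Return (argv, env) for running `script` under gdb with glibc heap checks.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    # Copy the environment so the caller's mapping is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # MALLOC_CHECK_=2 asks glibc to abort() as soon as it detects heap
    # corruption; that abort is the SIGABRT seen in the backtrace.
    env["MALLOC_CHECK_"] = "2"
    # `gdb --args PROG ARGS...` hands everything after PROG to the program.
    argv = ["gdb", "--args", python_exe, script]
    return argv, env


if __name__ == "__main__":
    argv, env = build_gdb_invocation("long_run.py")  # placeholder script name
    print("MALLOC_CHECK_=%s %s" % (env["MALLOC_CHECK_"], " ".join(argv)))
    # To actually launch it: subprocess.call(argv, env=env), then type
    # `run` at the (gdb) prompt and `bt` once the process stops.
```

Keeping the environment change local to the child process means the
checked allocator only affects the run being debugged.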

This code is running on a Fedora Core 3 box, with python 2.3.4 and
numpy/scipy compiled using gcc 3.4.4.

I realize that it's extremely difficult to help with so little
information, but unfortunately we have no small test that can
reproduce the problem.  Only our large research codes, when running
for multiple days on a single run, cause this.  Even very intensive
uses of the same code that last only a few hours never show it.

This code is a long-running iterative algorithm, so it's basically
applying the same (complex) loop over and over until convergence,
using numpy and scipy pretty extensively throughout.

If super Travis (or anyone else) can have a Eureka moment from the
above backtrace, that would be fantastic.  If there's any other
information you think I may be able to provide, I'll be happy to do my
best.

Cheers,

f
