[IPython-dev] Performance sanity check: 7.21s to scatter 5000X1000 float array to 7 engines

Anand Patil anand.prabhakar.patil@gmail....
Sat Jan 12 17:57:25 CST 2008


Hi Fernando,

> What platform are you on?  I've just wasted a few hours chasing the
> same behavior you're seeing, only to find out that openmpi on ubuntu
> gutsy is completely #$@^* broken!!!

Yep, it was Ubuntu Gutsy. Thanks for tracking this down. As
frustrating as it is to find that a core package is unreliable, I'm
glad to know the problem is not just me. This isn't the first problem
I've had with Ubuntu packages, MPI in particular: as far as I could
tell, the LAM package was missing mpirun altogether. I was pretty mad at
Ubuntu for a while... but then I looked up what Ubuntu means out of
curiosity, http://en.wikipedia.org/wiki/Ubuntu_%28philosophy%29 , and
was like 'Way for an operating system to be!'

I ran into a rough spot compiling mpi4py, related to 'osc pt2pt', and
managed to figure out by Googling that the Subversion head of mpi4py
fixes the problem... but then had to search harder to find the actual
Subversion repository, in a newsgroup message from 2006. When the
mpi4py page finishes migrating, it would be helpful to include a clear
link to the repository.

When I did get mpi4py up and running, I ran this script:

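from numpy import *
# Test payload: a 100-by-10 array of float64 ones
C=ones((100,10),dtype=float)

from ipython1 import *
import ipython1.kernel.api as kernel
# Connect to the controller on localhost and reset the engines
rc = kernel.RemoteController(('127.0.0.1',10105))
rc.resetAll()
# Import mpi4py and numpy on every engine
rc.executeAll('from mpi4py import MPI as mpi')
rc.executeAll('from numpy import *')
# Push C to engine 0, then have it do a blocking send of C to MPI rank 1
rc.push(0,C=C)
rc.execute(0,'mpi.COMM_WORLD.Send(C,1)')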

When C was 10 by 10, the last line sent it like a champ, but when C
was 10 by 100 or larger the ipengines hung altogether and I had to
kill them with SIGKILL (though, mysteriously, the log says they
received SIGTERM). Most of the log follows this email; I lost the top
part because I was running the engines in a screen session.
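
For scale, here is a minimal sketch of the payload sizes involved. It is
not a diagnosis. A blocking MPI send of a small message can complete
eagerly even when no receive has been posted, while a larger one
typically waits for the receiver, and the script above never posts a
receive on the destination rank. EAGER_LIMIT_BYTES, payload_bytes and
fits_eager are hypothetical names chosen for illustration; the 4096-byte
threshold is an assumption, since the real limit depends on the MPI
implementation, the transport, and its configuration.

```python
# Rough payload arithmetic for the array shapes mentioned in this thread.

FLOAT64_BYTES = 8  # numpy's dtype=float is a 64-bit (8-byte) float

# ASSUMPTION: illustrative threshold only, not a measured or documented value.
EAGER_LIMIT_BYTES = 4096


def payload_bytes(shape, itemsize=FLOAT64_BYTES):
    """Return the raw data size in bytes of a 2-D array with this shape."""
    rows, cols = shape
    return rows * cols * itemsize


def fits_eager(shape, limit=EAGER_LIMIT_BYTES):
    """True if the array's data would fit under the assumed eager limit."""
    return payload_bytes(shape) <= limit


if __name__ == "__main__":
    for shape in [(10, 10), (10, 100), (5000, 1000)]:
        print(shape, payload_bytes(shape), "bytes, under assumed limit:",
              fits_eager(shape))
```

Under that assumed threshold the 10 by 10 array (800 bytes) falls below
the limit and the 10 by 100 array (8000 bytes) does not, which would be
consistent with what I saw, but I have not confirmed that this is the
cause.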

I've found an alternative solution to my parallelization problem based
on OpenMP. It's less nice than an IPython-based solution would be,
assuming data passing were fast enough, but I can move forward on my
own now if you want to drop this case and move on to other things. If
you'd like to pursue the bug, on the other hand, I'd be happy to keep
iterating.

Cheers,
Anand


Log from ipengines:

mpirun -n 7 ipengine --mpi=mpi4py

...

2008/01/12 16:39 -0700 [-] Log opened.
2008/01/12 16:39 -0700 [-] MPI started with rank = 2 and size = 7
2008/01/12 16:39 -0700 [-] Log opened.
2008/01/12 16:39 -0700 [-] MPI started with rank = 3 and size = 7
2008/01/12 16:39 -0700 [-] Log opened.
2008/01/12 16:39 -0700 [-] MPI started with rank = 4 and size = 7
2008/01/12 16:39 -0700 [-] Log opened.
2008/01/12 16:39 -0700 [-] MPI started with rank = 5 and size = 7
2008/01/12 16:39 -0700 [-] Log opened.
2008/01/12 16:39 -0700 [-] MPI started with rank = 6 and size = 7
2008/01/12 16:39 -0700 [-] Starting factory
<ipython1.kernel.enginepb.PBEngineClientFactory object at 0x10b9610>
2008/01/12 16:39 -0700 [-] Starting factory
<ipython1.kernel.enginepb.PBEngineClientFactory object at 0x10b9610>
2008/01/12 16:39 -0700 [-] Starting factory
<ipython1.kernel.enginepb.PBEngineClientFactory object at 0x10b9610>
2008/01/12 16:39 -0700 [-] Starting factory
<ipython1.kernel.enginepb.PBEngineClientFactory object at 0x10b9610>
2008/01/12 16:39 -0700 [-] Starting factory
<ipython1.kernel.enginepb.PBEngineClientFactory object at 0x10b9610>
2008/01/12 16:39 -0700 [-] Starting factory
<ipython1.kernel.enginepb.PBEngineClientFactory object at 0x10b9610>
2008/01/12 16:39 -0700 [Broker,client] got ID: 7
2008/01/12 16:39 -0700 [Broker,client] got ID: 0
2008/01/12 16:39 -0700 [Broker,client] got ID: 1
2008/01/12 16:39 -0700 [Broker,client] got ID: 2
2008/01/12 16:39 -0700 [Broker,client] got ID: 5
2008/01/12 16:39 -0700 [Broker,client] got ID: 3
2008/01/12 16:39 -0700 [Broker,client] got ID: 4
[hokhmah:21241] [0,0,0]-[0,1,1] mca_oob_tcp_msg_recv: readv failed:
Connection reset by peer (104)
2008/01/12 16:43 -0700 [-] Received SIGTERM, shutting down.
2008/01/12 16:43 -0700 [Broker,client] Stopping factory
<ipython1.kernel.enginepb.PBEngineClientFactory object at 0x10b9610>
2008/01/12 16:43 -0700 [-] Main loop terminated.
2008/01/12 16:43 -0700 [-] Received SIGTERM, shutting down.
2008/01/12 16:43 -0700 [Broker,client] Stopping factory
<ipython1.kernel.enginepb.PBEngineClientFactory object at 0x10b9610>
2008/01/12 16:43 -0700 [-] Main loop terminated.
[hokhmah:21245] *** Process received signal ***
[hokhmah:21245] Signal: Segmentation fault (11)
[hokhmah:21245] Signal code: Address not mapped (1)
[hokhmah:21245] Failing at address: 0x100
[hokhmah:21245] [ 0] /lib/libpthread.so.0 [0x2adace056100]
[hokhmah:21245] [ 1] /usr/local/lib/openmpi/mca_pml_ob1.so [0x2adadcac0202]
[hokhmah:21245] [ 2]
/usr/local/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x797)
[0x2adadd0d7e57]
[hokhmah:21245] [ 3]
/usr/local/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x2a)
[0x2adadcccc16a]
[hokhmah:21245] [ 4]
/usr/local/lib/libopen-pal.so.0(opal_progress+0x4a) [0x2adad7976c4a]
[hokhmah:21245] [ 5]
/usr/local/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_wait+0x1a)
[0x2adad94302ca]
[hokhmah:21245] [ 6]
/usr/local/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x36d)
[0x2adad94340cd]
[hokhmah:21245] [ 7]
/usr/local/lib/libopen-rte.so.0(mca_oob_recv_packed+0x33)
[0x2adad773a553]
[hokhmah:21245] [ 8]
/usr/local/lib/openmpi/mca_gpr_proxy.so(orte_gpr_proxy_put+0x21a)
[0x2adad9843cda]
[hokhmah:21245] [ 9]
/usr/local/lib/libopen-rte.so.0(orte_smr_base_set_proc_state+0x2d9)
[0x2adad7750a79]
[hokhmah:21245] [10]
/usr/local/lib/libmpi.so.0(ompi_mpi_finalize+0x13f) [0x2adad74abc1f]
[hokhmah:21245] [11] /usr/lib/python2.5/site-packages/mpi4py/_mpi.so
[0x2adad7267295]
[hokhmah:21245] [12] /usr/bin/python(Py_Finalize+0x135) [0x4ab225]
[hokhmah:21245] [13] /usr/bin/python [0x4aacec]
[hokhmah:21245] [14] /usr/bin/python(PyErr_PrintEx+0x19a) [0x4aaeea]
[hokhmah:21245] [15] /usr/bin/python(PyRun_SimpleFileExFlags+0x107) [0x4ab6f7]
[hokhmah:21245] [16] /usr/bin/python(Py_Main+0x935) [0x414725]
[hokhmah:21245] [17] /lib/libc.so.6(__libc_start_main+0xf4) [0x2adace909b44]
[hokhmah:21245] [18] /usr/bin/python [0x413c69]
[hokhmah:21245] *** End of error message ***
2008/01/12 16:43 -0700 [-] Received SIGTERM, shutting down.
2008/01/12 16:43 -0700 [Broker,client] Stopping factory
<ipython1.kernel.enginepb.PBEngineClientFactory object at 0x10b9610>
2008/01/12 16:43 -0700 [-] Main loop terminated.
2008/01/12 16:43 -0700 [-] Received SIGTERM, shutting down.
2008/01/12 16:43 -0700 [Broker,client] Stopping factory
<ipython1.kernel.enginepb.PBEngineClientFactory object at 0x10b9610>
2008/01/12 16:43 -0700 [-] Main loop terminated.
2008/01/12 16:43 -0700 [-] Received SIGTERM, shutting down.
2008/01/12 16:43 -0700 [Broker,client] Stopping factory
<ipython1.kernel.enginepb.PBEngineClientFactory object at 0x10b9610>
2008/01/12 16:43 -0700 [-] Main loop terminated.
2008/01/12 16:43 -0700 [-] Received SIGTERM, shutting down.
2008/01/12 16:43 -0700 [Broker,client] Stopping factory
<ipython1.kernel.enginepb.PBEngineClientFactory object at 0x10b9610>
2008/01/12 16:43 -0700 [-] Main loop terminated.
7 processes killed (possibly by Open MPI)

