[IPython-User] Parallel Engines Failing To Register With MPIEngineSetLauncher
Mon Nov 19 21:47:45 CST 2012
For the record, this is indirectly related to an issue I logged on the IPython GitHub issue tracker (https://github.com/ipython/ipython/issues/2589) regarding the creation of large numbers of cluster engines through MPI, using MPIEngineSetLauncher. Although the original issue has been solved (thanks, Min!), the problems I mentioned in my final reply remain. At this stage, I'm not sure whether they constitute a bug or just user error - hence, I'm bringing it to the mailing list. :)
Some technical background: I'm trying to run a large number of distinct tasks (environmental models) in parallel over a portion of a compute cluster. I'm running IPython 0.13.1 (on top of Python 2.7.2) in a SuSE Linux environment, and creating engines using mpiexec from OpenMPI 1.4.3. The compute cluster itself is composed of about 150 nodes, each of which contains 12 cores; all communications go over an InfiniBand interconnect. Access to cores/nodes is handled through a Torque batch system, which wraps user code up in cpusets. All nodes see a shared GPFS filesystem. I can give more technical details if necessary, but I don't know whether the precise specs are all that relevant.
I'm starting the controller with ipcontroller, and once it's up and running, spooling up the engines with ipcluster engines -n <number of available cores>. There are no obvious connectivity problems - generally speaking, the engines that mpiexec launches on nodes remote from the controller can talk to it. However, it is often the case that not all of the engines manage to register successfully on startup, and the ones that fail end up hanging around as orphaned MPI processes. This is only apparent in the logs as entries of the type "[IPControllerApp] registration::purging stalled registration: 75" and similar. This occurs transiently and relatively unpredictably - I once managed to spool up 600 engines without a problem, and I've asked for 36 engines only to have 10 of them fail to register - but it occurs with greater frequency as the number of engines increases. It seems likely to me that this is related to the flood of registration requests that happens when all the MPI processes come online nigh-simultaneously, but maybe I'm completely misunderstanding the situation.
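To make the symptom concrete, here is a minimal sketch of how the client side could detect an incomplete registration before any work is submitted. The helper name `wait_for_engines` and its parameters are hypothetical; it only assumes a zero-argument callable that returns the currently registered engine ids (with IPython.parallel that could be `lambda: rc.ids` for a connected `Client` named `rc`).

```python
import time


def wait_for_engines(get_ids, expected, timeout=120.0, poll=1.0,
                     sleep=time.sleep, clock=time.time):
    """Poll until `expected` engines are registered or `timeout` seconds pass.

    get_ids is a zero-argument callable returning the registered engine ids
    (assumption: with IPython.parallel, something like `lambda: rc.ids`).
    Returns the sorted ids on success, raises RuntimeError on timeout.
    """
    deadline = clock() + timeout
    while True:
        ids = list(get_ids())
        if len(ids) >= expected:
            return sorted(ids)
        if clock() >= deadline:
            raise RuntimeError("only %d of %d engines registered"
                               % (len(ids), expected))
        sleep(poll)
```

This does not fix the stalled registrations; it only turns a silent shortfall into an explicit error that the driver script can act on.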
Apart from the obvious annoyance of not making effective use of allocated resources, this is an issue for my code. Each model is a distinct task that will take anywhere between a few tens of seconds and a few tens of minutes to run, so the problem should be quite amenable to the kind of task-based parallelism that IPython's LoadBalancedView offers (although there's no reason I couldn't use a DirectView instead, for what it's worth). However, model computation happens in phases, and at the start of each phase, a set of base data (which will likely range up to the hundreds of megabytes) needs to be synchronized across the remote namespaces of the engines. Handing this data out through the hub using, e.g., a push to a DirectView is a potentially serious bottleneck, particularly for large numbers of engines (indeed, it regularly crashes engines and/or the hub when I'm running more than 100 engines); a much better solution is to push the data to a single engine, and then use MPI (through mpi4py) to broadcast the data from there across the engines. This works well, is stable, and is significantly (i.e. orders of magnitude) faster than individual pushes. But if any one of the MPI processes has failed to register with the hub, it cannot be told to take part in the broadcast, and since the broadcast is a collective call over the whole MPI universe, all the engines deadlock unrecoverably. More broadly, the inability to rely on full registration of the MPI universe spells danger for the use of any MPI collective operation.
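As a rough illustration of the pattern described above, and of a guard against the deadlock, here is a sketch. Both function names are hypothetical. `broadcast_from_root` is what each engine would run with `comm` set to mpi4py's `MPI.COMM_WORLD`; `missing_ranks` is a client-side check that compares the MPI world size against the ranks reported by the registered engines.

```python
def missing_ranks(world_size, registered_ranks):
    """Return the MPI ranks that have no registered engine.

    An empty list means every MPI process is reachable through the hub,
    so a collective call can be issued to all of them.
    """
    return sorted(set(range(world_size)) - set(registered_ranks))


def broadcast_from_root(comm, data=None, root=0):
    """Engine-side collective: every rank must call this.

    Only the root's `data` is sent; other ranks pass None and receive
    the root's object as the return value of comm.bcast.
    """
    payload = data if comm.Get_rank() == root else None
    return comm.bcast(payload, root=root)


# Hypothetical client-side usage with IPython.parallel (not executed here):
#   view = rc[:]
#   view.execute("from mpi4py import MPI; comm = MPI.COMM_WORLD; "
#                "rank = comm.Get_rank(); size = comm.Get_size()", block=True)
#   ranks = view.pull("rank", block=True)
#   size = view.pull("size", block=True)[0]
#   if missing_ranks(size, ranks):
#       raise RuntimeError("unregistered MPI processes; bcast would hang")
```

The check only avoids entering a collective that cannot complete; the orphaned processes still occupy their cores.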
So: does anyone have any suggestions for what I could be doing wrong, or a way in which I might fix this? (If it is a process flood, perhaps it would be useful to introduce some random jitter into startup times?) Happy to provide more information if necessary.
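For what it's worth, the jitter idea could look something like the sketch below. Everything here is an assumption: `start_with_jitter` is a hypothetical wrapper, `start` stands for whatever callable actually launches an engine, and I don't know where (or whether) such a hook could be attached in the existing launchers.

```python
import random
import time


def start_with_jitter(start, max_delay=10.0, sleep=time.sleep, rng=random):
    """Sleep for a random delay in [0, max_delay] seconds, then call start().

    `start` is a hypothetical zero-argument callable that launches one engine.
    Returns the delay that was used, which is handy for logging.
    """
    delay = rng.uniform(0.0, max_delay)
    sleep(delay)
    start()
    return delay
```

Each MPI process would pick its own delay, spreading the registration requests over roughly `max_delay` seconds instead of sending them all at once.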
Cheers in advance,