[IPython-dev] Using IPython Cluster with SGE -- help needed

Andreas Hilboll lists@hilboll...
Mon Aug 5 08:45:08 CDT 2013

On 04.08.2013 16:20, Matthieu Brucher wrote:
> Hi,
> I guess we may want to start with the ipython documentation on this
> topic: http://ipython.org/ipython-doc/stable/parallel/parallel_process.html
> Cheers,
> 2013/8/4 Andreas Hilboll <lists@hilboll.de>:
>> Hi,
>> I would like to use IPython for calculations on our cluster. It's a
>> total of 11 compute + 1 management nodes (all running Linux), and we're
>> using SGE's qsub to submit jobs. The $HOME directory is shared via NFS
>> between all the nodes.
>> Even after reading the documentation, I'm unsure about how to get things
>> running. I assume that I'll have to execute ``ipcluster -n 16`` on all
>> compute nodes (they have 16 cores each). I'd have the ipython shell
>> (notebook won't work due to firewall restrictions I cannot change) on
>> the management node. But how does the management node know about the
>> kernels which are running on the compute nodes and waiting for a job?
>> And how can I tell the management node that it shall use qsub to submit
>> the jobs to the individual kernels?
>> As I think this is a common use case, I'd be willing to write up a nice
>> tutorial about the setup, but I fear I need some help from you guys to
>> get things running ...
>> Cheers,
>> -- Andreas.
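For reference, the setup on that page boils down to creating a parallel
profile and pointing both launchers at SGE. A minimal sketch (the shorthand
'SGE' is expanded by IPython to the SGEControllerLauncher and
SGEEngineSetLauncher classes; the profile name below is just this cluster's):

# created once with: ipython profile create --parallel --profile=nexus_py2.7
# then, in ~/.ipython/profile_nexus_py2.7/ipcluster_config.py:
c = get_config()
c.IPClusterStart.controller_launcher_class = 'SGE'    # submit ipcontroller through qsub
c.IPClusterEngines.engine_launcher_class = 'SGE'      # submit the engine set through qsub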

Okay, thanks to the good docs, I was able to start a cluster:

(test_py27)hilboll@login:~> ipcluster start --profile=nexus_py2.7 -n 12
2013-08-05 15:26:04.264 [IPClusterStart] Using existing profile dir:
2013-08-05 15:26:04.272 [IPClusterStart] Starting ipcluster with
2013-08-05 15:26:04.273 [IPClusterStart] Creating pid file:
2013-08-05 15:26:04.273 [IPClusterStart] Starting Controller with
2013-08-05 15:26:04.289 [IPClusterStart] Job submitted with job id: '60'
2013-08-05 15:26:05.289 [IPClusterStart] Starting 12 Engines with
2013-08-05 15:26:05.306 [IPClusterStart] Job submitted with job id: '61'
2013-08-05 15:26:35.351 [IPClusterStart] Engines appear to have started
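Since $HOME is shared via NFS, the controller's connection files land in the
profile directory where every node can see them, so registration can be
checked from a plain IPython shell on the management node. A quick sketch,
using the IPython 1.x namespace:

from IPython.parallel import Client

rc = Client(profile='nexus_py2.7')  # reads ipcontroller-client.json from the profile dir
print(rc.ids)                       # one id per registered engine; 12 expected here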

However, using qstat, I can only see one job in the queue, which is the
controller:

hilboll@login:~> qstat
job-ID  prior    name     user     state  submit/start at      queue                slots  ja-task-ID
    60  0.57500  ipython  hilboll  r      08/05/2013 15:26:06  all.q@login.cluster  1

I used the following job template:

c.SGEEngineSetLauncher.batch_template = '''#!/bin/bash
#$ -N ipython       #- job name (optional)
#$ -q all.q         #- use the queue 'all.q'
#$ -S /bin/bash     #- required!
#$ -V               #- export the environment of the current shell
#$ -j y             #- merge STDOUT and STDERR
#$ -o log_ipython_{n}.log

source /hb/hilboll/local/anaconda/bin/activate test_py27
mpiexec -n {n} ipengine --profile-dir={profile_dir}
'''

If I use a 'blank' ``ipengine --profile-dir={profile_dir}`` instead of
the mpiexec call, I get exactly two jobs in the queue, one for the
controller and one for the first engine.

My naive understanding would be that exactly {n} jobs get submitted via
the SGEEngineSetLauncher. Is my expectation wrong?
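If the SGEEngineSetLauncher submits the engines as a single SGE array job,
i.e. with a ``#$ -t 1-{n}`` directive, then all {n} engines share one job id,
and the individual tasks only become visible with ``qstat -g d``. A template
that runs one plain ipengine per array task would then look roughly like the
following (a sketch; the ``-t`` directive and the ``$TASK_ID`` in the log
name are assumptions about the launcher's array-job convention, not tested
on this cluster):

c.SGEEngineSetLauncher.batch_template = '''#!/bin/bash
#$ -N ipengine
#$ -q all.q
#$ -S /bin/bash
#$ -V
#$ -j y
#$ -t 1-{n}                       #- one array task per engine
#$ -o log_ipengine_$TASK_ID.log

source /hb/hilboll/local/anaconda/bin/activate test_py27
ipengine --profile-dir={profile_dir}
'''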

In the logfile, I see the following message 12 times:

2013-08-05 15:26:09.038 [IPEngineApp] Registration timed out after 2.0
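Two configuration knobs that look related to this message, as a sketch of
things to try rather than a confirmed fix (both traits exist in IPython's
parallel configuration, but whether they are the problem here is an
assumption):

# profile_nexus_py2.7/ipcontroller_config.py
c = get_config()
c.HubFactory.ip = '*'         # listen on all interfaces, not only loopback,
                              # so engines on the compute nodes can reach the hub

# profile_nexus_py2.7/ipengine_config.py
c = get_config()
c.EngineFactory.timeout = 30  # allow more than the default few seconds to register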

Any help resolving this issue is greatly appreciated :)


-- Andreas.
