[IPython-dev] Using IPython Cluster with SGE -- help needed

Matthieu Brucher matthieu.brucher@gmail....
Mon Aug 5 09:02:47 CDT 2013


Hi,

I don't know why the registration did not complete. Is your home
folder the same on all nodes and on the login node?
You won't see 12 jobs: you asked for 12 engines, and they are all
submitted in a single SGE job, inside which the 12 engines are started
by ``mpiexec -n 12``. This is the standard way of using batch
schedulers: ask for some cores, then run an MPI application on those cores.
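For reference, this is roughly what pointing ipcluster at SGE looks like in ``ipcluster_config.py`` -- a minimal sketch, with class and trait names as in the IPython 1.x parallel docs (verify them against your installed version):

```python
# ipcluster_config.py -- minimal SGE setup (sketch, not a drop-in file)
c = get_config()

# Launch both the controller and the engine set through qsub.
c.IPClusterStart.controller_launcher_class = 'SGE'
c.IPClusterEngines.engine_launcher_class = 'SGE'

# The engine set is submitted as ONE qsub job; its batch template is
# responsible for starting all {n} engines (e.g. via mpiexec -n {n}).
c.SGEEngineSetLauncher.queue = 'all.q'
```

With this in place, ``ipcluster start -n 12`` submits exactly two jobs: one for the controller and one for the whole engine set.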

You can also try to submit additional engines now that the controller
is up and running. Check that the configuration files are present and
readable.
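Concretely: the engines read the connection info from the JSON files in the profile's ``security/`` directory (``ipcontroller-engine.json`` in particular), so those must be visible and readable from the compute nodes. If the engines then time out while registering, two settings are worth trying -- a sketch only, and the trait names are from the IPython 1.x parallel config, so check them against your version:

```python
# ipcontroller_config.py (sketch): let engines on other nodes reach the hub
c = get_config()
c.HubFactory.ip = '*'   # listen on all interfaces, not just loopback

# ipengine_config.py (sketch): give slow-starting engines more time
# before "Registration timed out" is reported
#   c.EngineFactory.timeout = 30
```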

Cheers,


2013/8/5 Andreas Hilboll <lists@hilboll.de>:
> Am 04.08.2013 16:20, schrieb Matthieu Brucher:
>> Hi,
>>
>> I guess we may want to start with the ipython documentation on this
>> topic: http://ipython.org/ipython-doc/stable/parallel/parallel_process.html
>>
>> Cheers,
>>
>> 2013/8/4 Andreas Hilboll <lists@hilboll.de>:
>>> Hi,
>>>
>>> I would like to use IPython for calculations on our cluster. It's a
>>> total of 11 compute + 1 management nodes (all running Linux), and we're
>>> using SGE's qsub to submit jobs. The $HOME directory is shared via NFS
>>> between all the nodes.
>>>
>>> Even after reading the documentation, I'm unsure about how to get things
>>> running. I assume that I'll have to execute ``ipcluster -n 16`` on all
>>> compute nodes (they have 16 cores each). I'd have the ipython shell
>>> (notebook won't work due to firewall restrictions I cannot change) on
>>> the management node. But how does the management node know about the
>>> kernels which are running on the compute nodes and waiting for a job?
>>> And how can I tell the management node that it shall use qsub to submit
>>> the jobs to the individual kernels?
>>>
>>> As I think this is a common use case, I'd be willing to write up a nice
>>> tutorial about the setup, but I fear I need some help from you guys to
>>> get things running ...
>>>
>>> Cheers,
>>>
>>> -- Andreas.
>>> _______________________________________________
>>> IPython-dev mailing list
>>> IPython-dev@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/ipython-dev
>>
>>
>>
>
> Okay, thanks to the good docs, I was able to start a cluster:
>
> (test_py27)hilboll@login:~> ipcluster start --profile=nexus_py2.7 -n 12
> 2013-08-05 15:26:04.264 [IPClusterStart] Using existing profile dir:
> u'/gpfs/hb/hilboll/.config/ipython/profile_nexus_py2.7'
> 2013-08-05 15:26:04.272 [IPClusterStart] Starting ipcluster with
> [daemon=False]
> 2013-08-05 15:26:04.273 [IPClusterStart] Creating pid file:
> /gpfs/hb/hilboll/.config/ipython/profile_nexus_py2.7/pid/ipcluster.pid
> 2013-08-05 15:26:04.273 [IPClusterStart] Starting Controller with
> SGEControllerLauncher
> 2013-08-05 15:26:04.289 [IPClusterStart] Job submitted with job id: '60'
> 2013-08-05 15:26:05.289 [IPClusterStart] Starting 12 Engines with
> SGEEngineSetLauncher
> 2013-08-05 15:26:05.306 [IPClusterStart] Job submitted with job id: '61'
> 2013-08-05 15:26:35.351 [IPClusterStart] Engines appear to have started
> successfully
>
> However, using qstat, I can only see one job in the queue, which is the
> controller:
>
> hilboll@login:~> qstat
> job-ID  prior   name       user         state submit/start at     queue                              slots ja-task-ID
> ----------------------------------------------------------------------------------------------------------------------
>      60 0.57500 ipython    hilboll      r     08/05/2013 15:26:06 all.q@login.cluster                    1
>
>
> I used the following job template:
>
> c.SGEEngineSetLauncher.batch_template = '''#!/bin/bash
> #$ -N ipython #- job name (optional)
> #$ -q all.q #- use the queue 'all.q'
> #$ -S /bin/bash #- required!
> #$ -V #- export the current shell's environment (paths)
> #$ -j y #- merge STDOUT and STDERR
> #$ -o log_ipython_{n}.log
>
> source /hb/hilboll/local/anaconda/bin/activate test_py27
> mpiexec -n {n} ipengine --profile-dir={profile_dir}
> '''
>
> If I use a 'blank' ``ipengine --profile-dir={profile_dir}`` instead of
> the mpiexec call, I get exactly two jobs in the queue, one for the
> controller and one for the first engine.
>
> My naive understanding would be that exactly {n} jobs get submitted via
> the SGEEngineSetLauncher. Is my expectation wrong?
>
> In the logfile, I get this here, 12 times:
>
> 2013-08-05 15:26:09.038 [IPEngineApp] Registration timed out after 2.0
> seconds
>
> Any help resolving this issue is greatly appreciated :)
>
> Cheers,
>
> -- Andreas.



-- 
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
Music band: http://liliejay.com/

