[IPython-user] IPython LSF support

eklavyaa eklavyaa@gmail....
Thu Sep 8 17:37:59 CDT 2011


This approach worked quite well, except for one caveat discussed below.


To summarize, the only change I now have in ipcluster_config.py from the
default is

    c.IPClusterStart.engine_launcher_class = 'LSFEngineSetLauncher'

The ipcontroller_config.py edit remains the same:

    c.HubFactory.ip = '*'

(I have called this profile lsf2.)
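
(A profile with these config files can be created with IPython's profile
machinery; if memory serves, the 0.11 invocation is something like

    ipython profile create --parallel --profile=lsf2

which generates ipcluster_config.py, ipcontroller_config.py, and
ipengine_config.py under ~/.ipython/profile_lsf2.)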

So the controller gets started locally on my dedicated node (which shares a
common filesystem with the LSF nodes), and when the engines start on the LSF
nodes, they can see the controller. I tested an example and everything seems
to be working well.
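
For reference, a minimal smoke test against such a profile, assuming IPython
0.11's IPython.parallel client API, might look like:

    from IPython.parallel import Client
    import os

    rc = Client(profile='lsf2')        # reads the profile's connection file
    print rc.ids                       # ids of the engines that registered
    dview = rc[:]                      # a DirectView on all engines
    print dview.apply_sync(os.getpid)  # one pid per LSF engine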



However, if I try to start the engines separately using "ipcluster engines"
or "ipengine", things don't seem to work. Here's what I am doing.

1. First start the controller + engines using

    ipcluster start --n=2 --profile=lsf2

(Note that the behavior below is the same even if I start the controller
alone using `ipcontroller --profile=lsf2`.)

This results in the controller starting locally and 2 engines starting on
the LSF nodes.

2. Now add more engines using

    ipcluster engines --profile=lsf2 --n=2

--> This results in the engines starting *locally* and not on the LSF
nodes. Here's the log:

(vpython-272) username@machinename:~/.ipython$ ipcluster engines
--profile=lsf2 --n=2
[IPClusterEngines] Using existing profile dir:
u'/home/unix/username/.ipython/profile_lsf2'
[IPClusterEngines] IPython cluster: started
[IPClusterEngines] Starting engines with [daemon=False]
[IPClusterEngines] Starting 2 engines
[IPClusterEngines] Process
'/home/unix/username/work/software/vpython-272/bin/python2.7' started: 11997
[IPClusterEngines] Starting LocalEngineSetLauncher:
['/home/unix/username/work/software/vpython-272/bin/python2.7',
u'/home/unix/username/work/software/vpython-272/lib/python2.7/site-packages/IPython/parallel/apps/ipengineapp.py',
'--log-to-file', '--log-level=20',
u'--profile-dir=/home/unix/username/.ipython/profile_lsf2']
[IPClusterEngines] Process
'/home/unix/username/work/software/vpython-272/bin/python2.7' started: 11998
[IPClusterEngines] Process 'engine set' started: [None, None]
[IPClusterEngines] [IPEngineApp] Using existing profile dir:
u'/home/unix/username/.ipython/profile_lsf2'
[IPClusterEngines] [IPEngineApp] Using existing profile dir:
u'/home/unix/username/.ipython/profile_lsf2'

3. Try adding more engines using --engines=LSFEngineSetLauncher
--> This submits jobs to LSF, but the engines time out, I guess because I
have not specified the profile. However, adding --profile=lsf2 doesn't help
either.

(vpython-272)username@machinename:~$ ipcluster engines
--engines=LSFEngineSetLauncher --n=2                                
 
[IPClusterEngines] Using existing profile dir:
u'/home/unix/username/.ipython/profile_default'
[IPClusterEngines] IPython cluster: started
[IPClusterEngines] Starting engines with [daemon=False]
[IPClusterEngines] Starting 2 engines
[IPClusterEngines] Starting 2 engines with LSFEngineSetLauncher: ['bsub',
u'./lsf_engines']
[IPClusterEngines] adding job array settings to batch script
[IPClusterEngines] Writing instantiated batch script: ./lsf_engines
[IPClusterEngines] Job submitted with job id: '8833039'
[IPClusterEngines] Process 'bsub' started: '8833039'
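
One possible explanation, assuming IPython 0.11's app layout where
IPClusterStart subclasses IPClusterEngines: `ipcluster engines` reads its
launcher from c.IPClusterEngines.engine_launcher_class, so a value set only
on c.IPClusterStart applies to `ipcluster start` but is invisible to the
engines subcommand. If that is right, the fix would be to set the launcher
on the base class in ipcluster_config.py so that both invocations see it:

    # sketch: configure the launcher where both `ipcluster start` and
    # `ipcluster engines` will pick it up
    c.IPClusterEngines.engine_launcher_class = 'LSFEngineSetLauncher'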


Please let me know how I can add engines using an existing profile (such as
lsf2 above).

Thanks!

-E




 


On Thu, Sep 8, 2011 at 13:37, eklavyaa <eklavyaa@gmail.com> wrote:

>
> Based on MinRK's suggestion, I tried increasing the timeout for engine
> registration via c.EngineFactory.timeout, and this seems to be a temporary
> fix to the problem.
>
> However, the issue here is that there is no guarantee that the controller
> node will get allocated first (even if that job is submitted first),
> especially if I request a large number of nodes. This is true in most
> cases for LSF / SGE / PBS. Further, it is typically the case that the
> controller needs to run much longer than the engines (since not all
> engines get allocated right away), and getting long jobs on LSF is
> difficult.
>


> I have a dedicated node on which I can locally start a controller. This
> node shares a common filesystem with the LSF nodes, on which I intend to
> start the engines. I am guaranteed that the controller process will start
> instantly since it's a dedicated node, so if I set
> c.EngineFactory.timeout=10, I can be almost certain that the timeout issue
> will not occur.
>
> Does this make sense? If so, please let me know (or point me to the
> documentation) how I can set up a controller locally, and engines on LSF
> nodes.
>

Yes, this makes perfect sense, and is in fact how I run most often.  This is
precisely why the Controller and Engine launchers are separate.  There is no
need for the ControllerLauncher to match that of the engines, and the
default (Local) launcher is often just fine with any/all of the engine
launchers.  If you are starting the controller on a node in the same cluster
(or at least on the same filesystem and reachable on the network), then
simply leaving the ControllerLauncher as the default local launcher should
be the only change necessary.
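
In config terms, that profile's ipcluster_config.py might reduce to a single
line; the commented line below is only to emphasize that the controller
launcher stays at its Local default:

    # engines go to LSF; the controller stays local
    c.IPClusterEngines.engine_launcher_class = 'LSFEngineSetLauncher'
    # c.IPClusterStart.controller_launcher_class stays at its Local default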

Extra note:

In fact, there is not even any need to start the controller with ipcluster.
 All `ipcluster start` does is run `ipcontroller` once, and `ipengine` n
times.  The various launchers simply wrap these one-line calls in extremely
basic batch files that know how to start and stop jobs.  If you want to
run ipcontroller manually (especially useful for debug output), then you can
skip the controller-launching step of ipcluster with `ipcluster engines`,
which is identical to `ipcluster start`, only omitting the startup of a
controller.  This allows you to run LSF engines on a cluster, with a
controller that is arbitrarily elsewhere on the internet.
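
Concretely, that two-step workflow might look like the following (the engine
count is just an example, and this assumes the engine launcher is set on
c.IPClusterEngines as sketched above, so that the engines subcommand sees
it):

    # on the dedicated node: run the controller by hand, with console output
    ipcontroller --profile=lsf2

    # then, as many times as needed, submit batches of engines to LSF
    ipcluster engines --profile=lsf2 --n=2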

-MinRK

