[IPython-User] IPython LSF support

MinRK benjaminrk@gmail....
Thu Sep 8 17:45:36 CDT 2011


You are running into the fact that IPClusterStart is a subclass of
IPClusterEngines.

Setting `IPClusterEngines.engine_launcher_class` will set the launcher for
both `ipcluster engines` and `ipcluster start`, but setting
`IPClusterStart.engine_launcher_class` will only set it for `ipcluster
start`.
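
As a sketch in ipcluster_config.py (using the launcher name from your config; the commented line shows the narrower alternative):

```python
# ipcluster_config.py -- `c` is provided by the IPython config loader.

# Applies to both `ipcluster start` and `ipcluster engines`,
# because IPClusterStart is a subclass of IPClusterEngines:
c.IPClusterEngines.engine_launcher_class = 'LSFEngineSetLauncher'

# Applies only to `ipcluster start`; `ipcluster engines` keeps
# its default (local) engine launcher:
# c.IPClusterStart.engine_launcher_class = 'LSFEngineSetLauncher'
```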

-MinRK

On Thu, Sep 8, 2011 at 15:37, eklavyaa <eklavyaa@gmail.com> wrote:

>
> This approach worked quite well, except for one caveat discussed below.
>
>
> To summarize, now the only change I have in ipcluster_config.py from the
> default is
> c.IPClusterStart.engine_launcher_class = 'LSFEngineSetLauncher'
> The ipcontroller_config.py edit remains the same:
> c.HubFactory.ip = '*'
> (I have called this profile lsf2)
>
> So the controller gets started locally on my dedicated node (which shares a
> common filesystem with the LSF nodes) and when the engines start on the LSF
> nodes, they can see the controller. I tested out an example and everything
> seems to be working well.
>
>
>
> However, if I try to do it by starting the engines separately using
> "ipcluster engines" or "ipengine", things don't seem to work. Here's what I
> am doing.
>
> 1. First start controller + engines using
> ipcluster start --n=2 --profile=lsf2
>
> (Note that the behavior below is the same even if I start the controller
> alone using ipcontroller --profile=lsf2)
>
> This results in the controller starting locally and 2 engines starting on
> the LSF nodes.
>
> 2. Now add more engines using
> ipcluster engines --profile=lsf2 --n=2
>
> --> This results in the engines starting *locally* and not on the LSF
> nodes. Here's the log:
>
> (vpython-272) username@machinename:~/.ipython$ ipcluster engines
> --profile=lsf2 --n=2
> [IPClusterEngines] Using existing profile dir:
> u'/home/unix/username/.ipython/profile_lsf2'
> [IPClusterEngines] IPython cluster: started
> [IPClusterEngines] Starting engines with [daemon=False]
> [IPClusterEngines] Starting 2 engines
> [IPClusterEngines] Process
> '/home/unix/username/work/software/vpython-272/bin/python2.7' started:
> 11997
> [IPClusterEngines] Starting LocalEngineSetLauncher:
> ['/home/unix/username/work/software/vpython-272/bin/python2.7',
>
> u'/home/unix/username/work/software/vpython-272/lib/python2.7/site-packages/IPython/parallel/apps/ipengineapp.py',
> '--log-to-file', '--log-level=20',
> u'--profile-dir=/home/unix/username/.ipython/profile_lsf2']
> [IPClusterEngines] Process
> '/home/unix/username/work/software/vpython-272/bin/python2.7' started:
> 11998
> [IPClusterEngines] Process 'engine set' started: [None, None]
> [IPClusterEngines] [IPEngineApp] Using existing profile dir:
> u'/home/unix/username/.ipython/profile_lsf2'
> [IPClusterEngines] [IPEngineApp] Using existing profile dir:
> u'/home/unix/username/.ipython/profile_lsf2'
>
> 3. Try adding more engines using --engines=LSFEngineSetLauncher
> --> This submits jobs to LSF but times out, I guess because I have not
> specified the profile. However, adding --profile=lsf2 doesn't help either.
>
> (vpython-272)username@machinename:~$ ipcluster engines
> --engines=LSFEngineSetLauncher --n=2
>
> [IPClusterEngines] Using existing profile dir:
> u'/home/unix/username/.ipython/profile_default'
> [IPClusterEngines] IPython cluster: started
> [IPClusterEngines] Starting engines with [daemon=False]
> [IPClusterEngines] Starting 2 engines
> [IPClusterEngines] Starting 2 engines with LSFEngineSetLauncher: ['bsub',
> u'./lsf_engines']
> [IPClusterEngines] adding job array settings to batch script
> [IPClusterEngines] Writing instantiated batch script: ./lsf_engines
> [IPClusterEngines] Job submitted with job id: '8833039'
> [IPClusterEngines] Process 'bsub' started: '8833039'
>
>
> Please let me know how I can add engines using the existing profile
> (such as lsf2 above).
>
> Thanks!
>
> -E
>
>
>
>
>
>
>
> On Thu, Sep 8, 2011 at 13:37, eklavyaa <eklavyaa@gmail.com> wrote:
>
> >
> > Based on MinRK's suggestion, I tried increasing the timeout for engine
> > registration by increasing
> > c.EngineFactory.timeout and this seems to be a temporary fix to the
> > problem.
> >
> > However, the issue here is that there is no guarantee that the controller
> > node will get allocated first (even if that job is submitted first),
> > especially if I request a large number of nodes. This is true in most
> > cases for LSF / SGE / PBS. Further, the controller typically needs to run
> > much longer than the engines (since not all engines get allocated right
> > away), and getting long jobs on LSF is difficult.
> >
>
>
> > I have a dedicated node on which I can locally start a controller. This
> > node has a common filesystem with the LSF nodes, on which I intend to
> > start the engines. I am guaranteed that the controller process will start
> > instantly since it's a dedicated node, so if I set
> > c.EngineFactory.timeout=10, I can be almost certain that the timeout
> > issue will not take place.
> >
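> > For reference, a sketch of that setting in the profile's
> > ipengine_config.py (the value 10 is just the one discussed above):

```python
# ipengine_config.py -- `c` is provided by the IPython config loader.

# Seconds an engine waits to register with the controller before
# giving up; safe to keep small when the controller is already up:
c.EngineFactory.timeout = 10
```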
> > Does this make sense? If so, please let me know (or point me to the
> > documentation) how I can set up a controller locally, and engines on
> > LSF nodes.
> >
>
> Yes, this makes perfect sense, and is in fact how I run most often.  This
> is precisely why the Controller and Engine launchers are separate.  There
> is no need for the ControllerLauncher to match that of the engines, and
> the default (Local) launcher is often just fine with any/all of the engine
> launchers.  If you are starting the controller on a node on the same
> cluster (or at least on the same filesystem and accessible on the network),
> then simply leaving the ControllerLauncher as the default local launcher
> should be the only change necessary.
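> A sketch of that minimal configuration in the lsf2 profile's
> ipcluster_config.py (the commented controller line just spells out
> the default, which needs no setting at all):

```python
# ipcluster_config.py -- `c` is provided by the IPython config loader.

# Submit engines to LSF (via bsub):
c.IPClusterEngines.engine_launcher_class = 'LSFEngineSetLauncher'

# Controller: no setting needed -- the default local launcher runs
# ipcontroller on the machine where you invoke ipcluster, i.e.:
# c.IPClusterStart.controller_launcher_class = 'LocalControllerLauncher'
```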
>
> Extra note:
>
> In fact, there is no need even to start the controller with ipcluster.
>  All `ipcluster start` does is run `ipcontroller` once, and `ipengine` n
> times.  The various launchers simply wrap these one-line calls in
> extremely basic batch files that know how to start/stop jobs.  If you
> want to run ipcontroller manually (especially useful for debug output),
> you can skip the controller-launching step of ipcluster with `ipcluster
> engines`, which is identical to `ipcluster start`, only omitting the
> startup of a controller.  This allows you to run LSF engines on a
> cluster, with a controller that is arbitrarily elsewhere on the internet.
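> A manual equivalent might look like this (a sketch, reusing the lsf2
> profile and engine count from your example above):

```shell
# On the dedicated node (shares a filesystem with the LSF nodes):
ipcontroller --profile=lsf2

# From any machine that sees the same profile directory: submit
# engines to LSF; no controller is started by this command.
ipcluster engines --profile=lsf2 --n=2
```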
>
> -MinRK
>
>
>
> _______________________________________________
> IPython-User mailing list
> IPython-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/ipython-user
>