[IPython-User] IPython LSF support

Johann Cohen-Tanugi johann.cohentanugi@gmail....
Fri Sep 9 04:16:24 CDT 2011


Hi there, sorry for entering the discussion late. I am the author of the 
little bit of code that was necessary to run on LSF, and I am very happy 
to see that, with suggestions from Min, someone else managed to make 
good use of it.
Indeed, in my mind it is quite natural to have the controller on a 
dedicated node; so far I have only run into issues when sending it to 
the batch system as well.

I will follow the traffic more closely in case you have other questions, 
but I know much less than Min about the system, and the LSF code I added 
was really quite trivial to implement, given the fantastic work on 
IPython upstream.

best,
Johann

On 09/08/2011 11:06 PM, MinRK wrote:
>
>
> On Thu, Sep 8, 2011 at 13:37, eklavyaa <eklavyaa@gmail.com> wrote:
>
>
>     Based on MinRK's suggestion, I increased the engine registration
>     timeout via c.EngineFactory.timeout, and this seems to be a
>     temporary fix for the problem.
>
>     However, the issue is that there is no guarantee that the controller
>     node will get allocated first (even if that job is submitted first),
>     especially if I request a large number of nodes. This is true in most
>     cases for LSF / SGE / PBS. Further, the controller typically needs to
>     run much longer than the engines (since not all engines get allocated
>     right away), and getting long jobs on LSF is difficult.
>
>
>
>     I am thinking of working around this problem using the following
>     approach.
>     Please let me know whether this makes sense.
>
>     I have a dedicated node on which I can start a controller locally.
>     This node shares a filesystem with the LSF nodes on which I intend to
>     start the engines. The controller process is guaranteed to start
>     instantly since it's a dedicated node, so if I set
>     c.EngineFactory.timeout=10, I can be almost certain that the timeout
>     issue will not occur.
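
A minimal sketch of that setting, assuming it is placed in the profile's
ipengine_config.py (the value 10 is the one proposed above; the engine log
further down in this thread shows the 2.0 second value it replaces):

```python
# ipengine_config.py: sketch only; the file placement is an assumption
c = get_config()

# Seconds an engine waits for the controller to answer its
# registration request before giving up.
c.EngineFactory.timeout = 10
```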
>
>     Does this make sense? If so, please let me know (or point me to the
>     documentation) how I can set up a controller locally and engines on
>     LSF nodes.
>
>
> Yes, this makes perfect sense, and is in fact how I run most often. 
>  This is precisely why the Controller and Engine launchers are 
> separate.  There is no need for the ControllerLauncher to match that 
> of the engines, and the default (Local) launcher is often just fine 
> with any/all of the engine launchers.  If you are starting the 
> controller on a node on the same cluster (or at least same filesystem 
> and accessible on the network), then simply leaving the 
> ControllerLauncher as the default local launcher should be the only 
> change necessary.
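
A sketch of the corresponding ipcluster_config.py, assuming the lsf profile
used elsewhere in this thread. The class name LocalControllerLauncher is
assumed to be the "default (Local) launcher" referred to above, so that line
only spells out the default and could be omitted:

```python
# ipcluster_config.py: sketch, not a tested configuration
c = get_config()

# Controller runs as a local process on the dedicated node
# (assumed to be the default launcher).
c.IPClusterStart.controller_launcher_class = 'LocalControllerLauncher'

# Engines are still submitted to the batch system through LSF.
c.IPClusterStart.engine_launcher_class = 'LSFEngineSetLauncher'
```

With this in place, `ipcluster start --n=2 --profile=lsf` would be run on the
dedicated node itself. The c.HubFactory.ip = '*' setting quoted further down
is presumably still needed so that engines on other nodes can reach the
controller.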
>
> Extra note:
>
> In fact, there is not even any need to start the controller with 
> ipcluster.  All `ipcluster start` does is run `ipcontroller` once, and 
> `ipengine` n times.  The various launchers simply wrap these one-line 
> calls in extremely basic batch-files with knowledge of how to 
> start/stop jobs.  If you want to run ipcontroller manually (especially 
> useful for debug output), then you can skip the controller-launching 
> step of ipcluster with `ipcluster engines`, which is identical to 
> `ipcluster start`, only omitting the startup of a controller.  This 
> allows you to run LSF engines on a cluster, with a controller that is 
> arbitrarily elsewhere on the internet.
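
The manual sequence described above could be scripted roughly as follows.
This is only a sketch: wait_for_fresh_file, engine_command, and launch are
hypothetical helper names, the connection-file path is the one shown in the
engine logs later in this thread, and it assumes the controller rewrites that
file on every start. Waiting for the file before submitting engines is an
added precaution against engines trying to register too early, not something
the launchers are known to do themselves.

```python
import os
import subprocess
import time


def wait_for_fresh_file(path, newer_than, timeout=60.0, poll=0.5):
    """Return True once `path` exists with mtime >= `newer_than`.

    Returns False if that has not happened within `timeout` seconds.
    """
    deadline = time.time() + timeout
    while True:
        try:
            if os.path.getmtime(path) >= newer_than:
                return True
        except OSError:
            pass  # file has not been written yet
        if time.time() >= deadline:
            return False
        time.sleep(poll)


def engine_command(n, profile):
    """Build the `ipcluster engines` call mentioned in the thread."""
    return ["ipcluster", "engines", "--n=%d" % n, "--profile=%s" % profile]


def launch(n, profile="lsf", timeout=60.0):
    """Start a local controller, then submit `n` engines (hypothetical)."""
    url_file = os.path.expanduser(
        "~/.ipython/profile_%s/security/ipcontroller-engine.json" % profile
    )
    # Subtract a second to allow for coarse file timestamp resolution.
    started = time.time() - 1.0
    controller = subprocess.Popen(["ipcontroller", "--profile=%s" % profile])
    if not wait_for_fresh_file(url_file, started, timeout=timeout):
        controller.terminate()
        raise RuntimeError("controller did not write %s in time" % url_file)
    subprocess.call(engine_command(n, profile))
    return controller
```

Calling launch(2) on the dedicated node would start the controller there and
then hand two engines to the configured engine launcher. A stale connection
file from an earlier run is ignored because of the timestamp comparison.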
>
> -MinRK
>
>
>
>     -E
>
>
>
>     [resending since message was apparently not accepted the first time.]
>
>     It seems to work with the following edits to the configuration
>
>     The config files were created using
>     ipython profile create --parallel --profile=lsf
>
>     ipcluster_config.py edits:
>     c.IPClusterStart.controller_launcher_class = 'LSFControllerLauncher'
>     c.IPClusterStart.engine_launcher_class = 'LSFEngineSetLauncher'
>
>     ipcontroller_config.py edits:
>     c.HubFactory.ip = '*'
>
>
>     I was able to start the controller and engine using
>     ipcluster start --n=2 --profile=lsf
>
>     However, the engines would occasionally time out. Here are the
>     logs of two engines from the same session:
>
>     $ cat ipengine.e.8694278
>     [IPEngineApp] Using existing profile dir:
>     u'/home/unix/username/.ipython/profile_lsf'
>     [IPEngineApp] Loading url_file
>     u'/home/unix/shsingh/.ipython/profile_lsf/security/ipcontroller-engine.json'
>     [IPEngineApp] Registering with controller at tcp://X.X.X.X:36875
>     [IPEngineApp] Completed registration with id 0
>     [IPEngineApp] Engine Interrupted, shutting down...
>
>
>     $ cat ipengine.e.8694300
>     [IPEngineApp] Using existing profile dir:
>     u'/home/unix/username/.ipython/profile_lsf'
>     [IPEngineApp] Loading url_file
>     u'/home/unix/shsingh/.ipython/profile_lsf/security/ipcontroller-engine.json'
>     [IPEngineApp] Registering with controller at tcp://X.X.X.X:36875
>     [IPEngineApp] Registration timed out after 2.0 seconds
>
>     The first one was able to register, while the second one was not
>     and thus timed out.
>
>     My guess is that it has something to do with the engines getting
>     allocated to a node before the controller, but I can't verify this.
>
>     Note that this time-out issue happens most, but not all, of the time.
>
>     Any ideas on what might be going wrong?
>
>     -E
>
>
>
>     Yes, 0.11 has LSF support.  Just use the
>     LSFEngineSetLauncher/LSFControllerLauncher,
>     the same as you would with PBS or SGE.
>
>     It is basically untested, as the author of the LSF launchers is the
>     only one to have tested it, to our knowledge, so please let us know
>     about shortcomings, etc.  I should scan through the docs to make sure
>     they aren't out of sync.
>
>     -MinRK
>
>
>
>
>     --
>     View this message in context:
>     http://old.nabble.com/IPython-LSF-support-tp32375842p32426952.html
>     Sent from the IPython - User mailing list archive at Nabble.com.
>
>     _______________________________________________
>     IPython-User mailing list
>     IPython-User@scipy.org
>     http://mail.scipy.org/mailman/listinfo/ipython-user
>
>
>