[IPython-User] [IPython-user] IPython LSF support

eklavyaa eklavyaa@gmail....
Thu Sep 8 15:37:17 CDT 2011


Based on MinRK's suggestion, I tried increasing the timeout for engine
registration by raising c.EngineFactory.timeout, and this seems to be a
temporary fix for the problem.
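
For reference, a minimal sketch of that change. It assumes the setting goes in
the ipengine_config.py of the lsf profile (the file name is an assumption
here); the value 10 is only illustrative, and the 2.0-second default is
inferred from the engine log quoted later in this thread.

```python
# ipengine_config.py (assumed location: ~/.ipython/profile_lsf/)
# get_config() is supplied by IPython's config loader when it reads this file;
# this fragment is not meant to be run as a standalone script.
c = get_config()

# Seconds an engine waits for the controller to answer its registration
# request before giving up. The logs in this thread show 2.0 as the value
# in effect before the change.
c.EngineFactory.timeout = 10
```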

However, the issue here is that there is no guarantee that the controller
node will get allocated first (even if that job is submitted first),
especially if I request a large number of nodes. This is true in most
cases for LSF / SGE / PBS. Further, the controller typically needs to run
much longer than the engines (since not all engines get allocated right
away), and getting long jobs on LSF is difficult.


I am thinking of working around this problem using the following approach.
Please let me know whether this makes sense.

I have a dedicated node on which I can locally start a controller. This node
has a common filesystem with the LSF nodes, on which I intend to start the
engines. I am guaranteed that the controller process will start instantly
since it's a dedicated node, so if I set c.EngineFactory.timeout=10, I can be
almost certain that the timeout issue will not occur.

Does this make sense? If so, please let me know (or point me to the
documentation) how I can set up a controller locally and engines on
LSF nodes.
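
One possible shape for such a setup, as a sketch only and not a confirmed
answer from the thread: keep the LSF launcher for the engines and switch the
controller to a local launcher. The name 'LocalControllerLauncher' is an
assumption based on the launcher naming used elsewhere in this thread, so
check it against the installed IPython version before relying on it.

```python
# ipcluster_config.py for the lsf profile (a sketch under stated assumptions)
# get_config() is supplied by IPython's config loader, not by this file.
c = get_config()

# Assumption: start the controller as a plain local process on the dedicated
# node where ipcluster is run, instead of submitting it to LSF.
c.IPClusterStart.controller_launcher_class = 'LocalControllerLauncher'

# As in the configuration quoted in this thread: submit the engines through LSF.
c.IPClusterStart.engine_launcher_class = 'LSFEngineSetLauncher'
```

With this arrangement the c.HubFactory.ip = '*' setting in
ipcontroller_config.py, quoted further down, would presumably still be needed
so that engines on other nodes can reach the controller.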


-E



[resending since message was apparently not accepted the first time.]

It seems to work with the following edits to the configuration.

The config files were created using
ipython profile create --parallel --profile=lsf

ipcluster_config.py edits:
c.IPClusterStart.controller_launcher_class = 'LSFControllerLauncher' 
c.IPClusterStart.engine_launcher_class = 'LSFEngineSetLauncher' 

ipcontroller_config.py edits:
c.HubFactory.ip = '*'


I was able to start the controller and engines using
ipcluster start --n=2 --profile=lsf

However, the engines would occasionally time out. Here are the logs of two
engines from the same session:

$ cat ipengine.e.8694278
[IPEngineApp] Using existing profile dir:
u'/home/unix/username/.ipython/profile_lsf'
[IPEngineApp] Loading url_file
u'/home/unix/shsingh/.ipython/profile_lsf/security/ipcontroller-engine.json'
[IPEngineApp] Registering with controller at tcp://X.X.X.X:36875
[IPEngineApp] Completed registration with id 0
[IPEngineApp] Engine Interrupted, shutting down...


$ cat ipengine.e.8694300
[IPEngineApp] Using existing profile dir:
u'/home/unix/username/.ipython/profile_lsf'
[IPEngineApp] Loading url_file
u'/home/unix/shsingh/.ipython/profile_lsf/security/ipcontroller-engine.json'
[IPEngineApp] Registering with controller at tcp://X.X.X.X:36875
[IPEngineApp] Registration timed out after 2.0 seconds

The first one was able to register, while the second one was not and
therefore timed out.
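
When many engines are submitted, checking each log by hand gets tedious. The
following is a minimal sketch of a helper that classifies an engine log by the
two messages quoted above. The function name and the return format are
hypothetical, and the patterns are taken only from the log lines in this
message, so other IPython versions may word them differently.

```python
import re
import sys

# Patterns copied from the engine log lines quoted in this message.
REGISTERED = re.compile(r"Completed registration with id (\d+)")
TIMED_OUT = re.compile(r"Registration timed out after ([\d.]+) seconds")


def classify_engine_log(text):
    """Return ("registered", id), ("timed_out", seconds) or ("unknown", None)."""
    match = REGISTERED.search(text)
    if match:
        return ("registered", int(match.group(1)))
    match = TIMED_OUT.search(text)
    if match:
        return ("timed_out", float(match.group(1)))
    return ("unknown", None)


if __name__ == "__main__":
    # Log file paths are given on the command line.
    for path in sys.argv[1:]:
        with open(path) as handle:
            print(path, classify_engine_log(handle.read()))
```

Saved under a name of your choice, it can be run with the engine log files
(such as ipengine.e.8694278) as arguments to see which engines registered and
which timed out.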

My guess is that it is something to do with the fact that the engines might
be getting allocated to a node before the controller, but can't verify this.

Note that this timeout issue happens most, but not all, of the time.

Any ideas on what might be going wrong?

-E



Yes, 0.11 has LSF support.  Just use the
LSFEngineSetLauncher/LSFControllerLauncher,
the same as you would with PBS or SGE.

It is basically untested, as the author of the LSF launchers is the only one
to have tested it, to our knowledge, so please let us know about
shortcomings, etc. I should scan through the docs to make sure they aren't
out of sync.

-MinRK




-- 
View this message in context: http://old.nabble.com/IPython-LSF-support-tp32375842p32426952.html


