[IPython-User] Getting setup on a remote cluster w/ Sun Grid Engine.

MinRK benjaminrk@gmail....
Tue Nov 15 12:42:42 CST 2011


On Tue, Nov 15, 2011 at 09:57, Ariel Rokem <arokem@gmail.com> wrote:

>
>
> On Mon, Nov 14, 2011 at 7:45 PM, MinRK <benjaminrk@gmail.com> wrote:
>
>>
>>
>> On Mon, Nov 14, 2011 at 19:10, Ariel Rokem <arokem@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> Following up on this thread, I am trying to get this working with SGE
>>> on our local cluster (thankfully, everyone is away at a conference, so I
>>> have the cluster pretty much to myself - a good week for experimenting...).
>>>
>>> I updated my fork from ipython/master this afternoon and followed the
>>> instructions below. I am getting the following behavior:
>>>
>>> celadon:~  $ipcluster start --n=10 --profile=sge
>>> [IPClusterStart] Using existing profile dir:
>>> u'/home/arokem/.config/ipython/profile_sge'
>>> [IPClusterStart] Starting ipcluster with [daemon=False]
>>> [IPClusterStart] Creating pid file:
>>> /home/arokem/.config/ipython/profile_sge/pid/ipcluster.pid
>>> [IPClusterStart] Starting PBSControllerLauncher: ['qsub',
>>> u'./sge_controller']
>>> [IPClusterStart] adding job array settings to batch script
>>> ERROR:root:Error in periodic callback
>>> Traceback (most recent call last):
>>>   File "/usr/lib64/python2.7/site-packages/zmq/eventloop/ioloop.py",
>>> line 423, in _run
>>>     self.callback()
>>>   File
>>> "/home/arokem/usr/local/lib/python2.7/site-packages/IPython/parallel/apps/ipclusterapp.py",
>>> line 497, in start_controller
>>>     self.controller_launcher.start()
>>>   File
>>> "/home/arokem/usr/local/lib/python2.7/site-packages/IPython/parallel/apps/launcher.py",
>>> line 1022, in start
>>>     return super(SGEControllerLauncher, self).start(1)
>>>   File
>>> "/home/arokem/usr/local/lib/python2.7/site-packages/IPython/parallel/apps/launcher.py",
>>> line 936, in start
>>>     self.write_batch_script(n)
>>>   File
>>> "/home/arokem/usr/local/lib/python2.7/site-packages/IPython/parallel/apps/launcher.py",
>>> line 925, in write_batch_script
>>>     script_as_string = self.formatter.format(self.batch_template,
>>> **self.context)
>>>   File "/usr/lib64/python2.7/string.py", line 545, in format
>>>     return self.vformat(format_string, args, kwargs)
>>>   File "/usr/lib64/python2.7/string.py", line 549, in vformat
>>>     result = self._vformat(format_string, args, kwargs, used_args, 2)
>>>   File
>>> "/home/arokem/usr/local/lib/python2.7/site-packages/IPython/utils/text.py",
>>> line 652, in _vformat
>>>     obj = eval(field_name, kwargs)
>>>   File "<string>", line 1, in <module>
>>> NameError: name 'n' is not defined
>>> [IPClusterStart] Starting 10 engines
>>> [IPClusterStart] Starting 10 engines with SGEEngineSetLauncher: ['qsub',
>>> u'./sge_engines']
>>> [IPClusterStart] adding job array settings to batch script
>>> [IPClusterStart] Writing instantiated batch script: ./sge_engines
>>> [IPClusterStart] Job submitted with job id: '430658'
>>> [IPClusterStart] Process 'qsub' started: '430658'
>>> [IPClusterStart] Engines appear to have started successfully
>>>
>>> It looks like something goes wrong (the NameError), but the jobs still
>>> get submitted: for a brief moment, qmon acknowledges a job array with
>>> that id, but it disappears from qmon almost immediately (somehow gets
>>> deleted?). When I then try to initialize a parallel.Client with the
>>> "sge" profile in an ipython session, I get a "TimeoutError: Hub
>>> connection request timed out". I also tried initializing ipcluster with
>>> the default profile and ran some computations, and I got roughly the
>>> expected 7-fold speed-up (on an 8-core machine), so some things do work.
>>> Does anyone have any idea what is going wrong with the SGE setup?
>>>
>>
>> This is a horrible typo that crept in when I did some reorganization in
>> the launchers.  Should be fixed in master.
>>
>
> Yes - fixed. I don't see that NameError anymore. Thanks!
>
>
>
>> The TimeoutError in the client generally means that the controller isn't
>> running, or at least isn't where the connection files claimed it to be.
>>
>>
> OK - I think the controller really was not there before. Now it is being
> started, but I am still having trouble getting my engines to persist on
> SGE. I see them get created through qmon, as well as the ipcontroller,
> but then the engine jobs are almost immediately deleted from the "running
> jobs" list. The controller job persists, and when I initialize a client I
> no longer get a TimeoutError, but rather a client object with an empty
> ids list. Is that still a problem with the connection files? Are those
> the ones under ~/.config/ipython/profile_sge/security?
>

It could be, or it could be an issue of the engines giving up too soon, if
the controller isn't ready for them.  What is the output of the engine jobs?

The most likely cases (a sketch of where each of these settings goes
follows the list):

1. The engines start before the controller has written the connection
files. They will wait up to `IPEngineApp.wait_for_url_file` seconds for the
file to exist (default 5s), then give up.
2. Same as 1., but old connection files exist, so the connection info will
be stale. This is normally addressed by setting
`IPControllerApp.reuse_files=True`, but I'm not sure that works when the
controller is started by SGE, where it won't consistently be on the same
host. You may want to manually empty the security dir
(IPYTHON_DIR/profile_sge/security) prior to starting the cluster, to
prevent this case.
3. The connection info is wrong - the Controller is not listening on the
right interface, or is listening on localhost only (the default). This is
controlled by `HubFactory.ip`.
4. Regular timeout (controlled by `Engine.timeout`) - the connection info
is correct, but the controller does not respond promptly. This value may
need to be larger when `reuse_files=True` and the controller/engines start
simultaneously or out of order.
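
For reference, a minimal sketch of where those settings live (the values
are just examples, and the exact class names can differ a bit between
IPython versions - check `ipcontroller --help-all` / `ipengine --help-all`
on your install):

# IPYTHON_DIR/profile_sge/ipengine_config.py
c.IPEngineApp.wait_for_url_file = 30   # case 1: wait longer for the JSON file
c.EngineFactory.timeout = 30           # case 4: the Engine.timeout above

# IPYTHON_DIR/profile_sge/ipcontroller_config.py
c.HubFactory.ip = '0.0.0.0'            # case 3: listen on all interfaces
c.IPControllerApp.reuse_files = True   # case 2 (note the caveat above)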

The easiest solution to all of this is usually to increase
IPClusterStart.delay, the delay (in seconds) between starting the
Controller and starting the engines when you do `ipcluster start`. This is
less effective with SGE, where the time between calling `ipcluster start`
and the jobs actually starting on nodes can be hours, so a few seconds of
delay in submitting the batch jobs may have no effect. It should be
sufficient, though, if your queue is clear and jobs start right away.
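
As a sketch, that is just one line in
IPYTHON_DIR/profile_sge/ipcluster_config.py (10 seconds here is only an
example value):

c.IPClusterStart.delay = 10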

Depending on your sysadmin, it may make sense to *not* start the Controller
with SGE, and only entrust SGE with the engines.  This gives you more
control over the order of events.  There is no need *in general* for your
Controller and Engines to use the same launchers.  There is a reason they
are separate config variables.
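
A sketch of that split, again in ipcluster_config.py (the
'LocalControllerLauncher' name is the usual local default here - adjust it
if your version spells it differently):

c.IPClusterStart.controller_launcher_class = 'LocalControllerLauncher'
c.IPClusterEngines.engine_launcher_class = 'SGEEngineSetLauncher'

With that, `ipcluster start --profile=sge` runs the controller directly on
the login node and only submits the engine jobs to SGE.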

-MinRK


>>> Thanks,
>>>
>>> Ariel
>>>
>>>
>>>
>>>
>>> On Wed, Aug 24, 2011 at 3:07 PM, MinRK <benjaminrk@gmail.com> wrote:
>>>
>>>> On Wed, Aug 24, 2011 at 15:05, Dharhas Pothina
>>>> <Dharhas.Pothina@twdb.state.tx.us> wrote:
>>>> >
>>>> > I was able to start the engines and they were submitted to the queue
>>>> > properly, but I do not have a JSON file in the corresponding security
>>>> > folder. Do I need to do something to generate it?
>>>>
>>>> The JSON file is written by ipcontroller, so it will only show up
>>>> after the controller has started.
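>>>>
>>>> As a minimal sketch, you can poll for it before connecting a client
>>>> (the path below just assumes the default profile layout used elsewhere
>>>> in this thread - adjust it for your IPYTHON_DIR):
>>>>
>>>> import os, time
>>>>
>>>> json_file = os.path.expanduser(
>>>>     '~/.config/ipython/profile_sge/security/ipcontroller_client.json')
>>>> for _ in range(60):               # wait up to about a minute
>>>>     if os.path.exists(json_file):
>>>>         break
>>>>     time.sleep(1)
>>>> else:
>>>>     raise RuntimeError('connection file never appeared: ' + json_file)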
>>>>
>>>> >
>>>> > - dharhas
>>>> >
>>>> >>>> MinRK <benjaminrk@gmail.com> 8/24/2011 4:44 PM >>>
>>>> > On a login node on the cluster:
>>>> >
>>>> > # create profile with default parallel config files, called sge
>>>> > [login] $> ipython profile create sge --parallel
>>>> >
>>>> > Edit IPYTHON_DIR/profile_sge/ipcontroller_config.py, adding the line:
>>>> >
>>>> > c.HubFactory.ip = '0.0.0.0'
>>>> >
>>>> > to instruct the controller to listen on all interfaces.
>>>> >
>>>> > Edit IPYTHON_DIR/profile_sge/ipcluster_config.py, adding the lines:
>>>> >
>>>> > c.IPClusterEngines.engine_launcher_class = 'SGEEngineSetLauncher'
>>>> > c.IPClusterStart.controller_launcher_class = 'SGEControllerLauncher'
>>>> >
>>>> > to instruct ipcluster to use SGE to launch both the engines and the
>>>> > controller.
>>>> >
>>>> > # optional: specify a queue for all SGE jobs:
>>>> > c.SGELauncher.queue = 'short'
>>>> >
>>>> > At this point, you can start 10 engines and a controller with:
>>>> >
>>>> > [login] $> ipcluster start -n 10 --profile=sge
>>>> >
>>>> > Now the only file you will need to connect to the cluster will be in:
>>>> >
>>>> > IPYTHON_DIR/profile_sge/security/ipcontroller_client.json
>>>> >
>>>> > Just move that file around, and you will be able to connect clients.
>>>> > To connect from a laptop, you will probably need to specify a login
>>>> > node as the ssh server when you do:
>>>> >
>>>> > from IPython import parallel
>>>> >
>>>> > rc = parallel.Client('/path/to/ipcontroller_client.json',
>>>> > sshserver='you@login.mycluster.etc')
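>>>> >
>>>> > Once connected, a quick sanity check (a minimal sketch; it only
>>>> > assumes the client above connected and some engines registered):
>>>> >
>>>> > print(rc.ids)                  # one id per registered engine
>>>> > view = rc[:]                   # a DirectView on all engines
>>>> > print(view.apply_sync(lambda: 'hello from an engine'))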
>>>> >
>>>> > -MinRK
>>>> >
>>>> >
>>>> > On Wed, Aug 24, 2011 at 13:18, Dharhas Pothina
>>>> > <Dharhas.Pothina@twdb.state.tx.us> wrote:
>>>> >> Hi All,
>>>> >>
>>>> >> We have managed to parallelize one of our spatial interpolation
>>>> >> scripts very easily with the new ipython parallel. Thanks for
>>>> >> developing such a great tool, it was fairly easy to get working. Now
>>>> >> we are trying to set things up to run on our internal cluster, and
>>>> >> I'm having difficulty understanding how to configure things.
>>>> >>
>>>> >> What I would like to do is have ipython running on a local machine
>>>> >> (Windows & Linux) connect to the cluster, request some nodes through
>>>> >> SGE, and run the computation. I'm not quite getting what goes where
>>>> >> from the documentation.
>>>> >>
>>>> >> I think I understood the PBS example, but I'm still not
>>>> >> understanding where I would put the connection information to log
>>>> >> into the cluster. I would really appreciate a step-by-step of what
>>>> >> files need to be where, and any example config files for an SGE
>>>> >> setup.
>>>> >>
>>>> >> thanks,
>>>> >>
>>>> >> - dharhas