[IPython-User] Running multiple ipclusters on remote cluster w/ Sun Grid Engine.

MinRK benjaminrk@gmail....
Wed Sep 14 17:34:18 CDT 2011


On Wed, Sep 14, 2011 at 14:32, Dharhas Pothina <
Dharhas.Pothina@twdb.state.tx.us> wrote:

>
>  It may be more a case of differing nomenclature. To me, a profile/profile
> name is something you set up once that applies to a class of things, i.e.
> within SGE we have a parallel environment (or profile) called mpich, and when
> we tell any script to use that particular parallel environment it sets
> things up a certain way. When you actually submit a job to SGE using that
> profile, it gets a jobid, which is what you can use to track or kill the
> actual job.
>

There is not a 1:1 correspondence of jobid to IPython cluster.  The
controller may or may not be run via SGE, and you can have an arbitrary
number of SGE jobs corresponding to engines.  Only in the case of a non-SGE
controller and a single group of SGE engines is there 1 jobid per cluster.
An SGE controller+engines will have two job IDs, and if you add/remove
engines over time, there can be zero-to-many job IDs associated with the
cluster, and the active job ids are a function of time.  The *only* constant
is the controller (which, again, may or may not have a job id at all), and
the controller will ultimately become resumable, so even its job id / pid
cannot be assumed to be constant.


>
>  The 1-1 correspondence makes sense if you plan to have the ipcluster
> running continuously on a certain number of cluster nodes and keep
> connecting and disconnecting with local IPython clients.
>

In fact, it makes sense for *all* cases that don't include running multiple
simultaneous clusters with identical configuration.


>  To me the use case that makes sense is different. We submit a job to run
> on a certain number of nodes, and after the job is completed the nodes
> are released for other non-IPython runs, like our Fortran hydro models. In
> that case the 'profile' is what tells it how to submit a job to the SGE
> queue etc., and the job-id or controller-id is what we use to run or kill
> the job. Maybe the --controller-id flag could be an optional parameter.
>

There is a bit of mismatch in design goals in IPython profiles, due to their
evolution.  The entire profile system in IPython was developed for the
purpose of consolidating the information about configuring and connecting to
a single cluster instance (including repeated runs, but never simultaneous).
 This has been expanded and adopted by IPython as a whole, for managing
configurations and runtime files, and has come to mean something slightly
different as a result.  The parallel code has not been changed to consider
these ideas yet.

I think the restriction that a cluster is a singleton per-profile will
remain, unless you specify a new cluster_id *for each additional cluster*.
The benefits of this assumption are far too great not to make it the
default.


>
>  Another feature request is some way of knowing when the engines have all
> started up; depending on how busy the cluster's SGE queue is, the engines
> may not start up immediately. Right now, I'm using a while loop that checks
> for the presence of the json file every 5 seconds. This works but seems
> inelegant.
>

Yes, this would certainly be useful.  Right now, there is no notion of a
queued state for jobs, but it could conceivably be added (pull requests are
welcome!).  I should note that polling for the JSON file only detects when
the *controller* is running, and has nothing to do with engines.  Engines do
not necessarily write any files.
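If what you actually want is for your local script to block until enough engines have registered, the client itself can tell you: `Client.ids` lists the engines the controller knows about. A minimal polling sketch (the profile name and engine count below are just placeholders for your setup):

```python
import time

def wait_for(predicate, timeout=300, poll=5):
    """Poll `predicate` every `poll` seconds until it returns True
    or `timeout` seconds elapse; return whether it succeeded."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(poll)
    return False

# Against a live cluster this would be used roughly like:
#
#   from IPython.parallel import Client
#   rc = Client(profile='mysge')          # connects via the controller's JSON file
#   wait_for(lambda: len(rc.ids) >= 8)    # block until 8 engines have registered
```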

SGE (as with all batch systems) already provides you with queue monitoring
tools - there's no need to poll the filesystem, as you can just use qstat
directly to see when engines have started.
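For example, a small wrapper around qstat could watch the engine jobs. This sketch assumes the default qstat column layout (job-ID in the first field, state in the fifth, two header lines); check your site's output before relying on it:

```python
import subprocess

def qstat_states(qstat_output):
    """Parse plain `qstat` output into a {job_id: state} dict.
    Skips the two header lines and reads the first and fifth columns."""
    states = {}
    for line in qstat_output.splitlines()[2:]:
        fields = line.split()
        if len(fields) >= 5:
            states[fields[0]] = fields[4]
    return states

def engines_running(job_ids):
    """True once every submitted engine job is in the running ('r') state."""
    out = subprocess.check_output(['qstat']).decode()
    states = qstat_states(out)
    return all(states.get(j) == 'r' for j in job_ids)
```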


>
>  let me know if this use case makes sense or if I'm missing something in
> the way these features were designed to be used.
>

I think this use case does make sense, but we are just running into the
issue that ipcluster is not meant to solve every problem.  People wrongly
assume that ipcluster is the only (or even primary) way to start an IPython
cluster, when it is in fact only a convenient way to start *simple*
clusters; as a result, I think it is used far more often than it should be.

ipcluster is intended as an extremely basic launcher.  Its purpose is to
handle the simple cases of starting zero-to-one controller and one-to-many
engines in various environments.  *All it does* is start these other
processes with a bit of abstraction regarding what starting/stopping means
with respect to qsub, mpi, etc.  It was never meant to handle every case,
and never will.  Writing your own scripts that call ipengine/ipcontroller
directly and submit via qsub will frequently be a better solution than
ipcluster.
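As a rough sketch of what that can look like, here is one way to generate and submit an SGE array job of engines (the job name, directives, and profile name are assumptions to adapt to your site's queue configuration):

```python
import subprocess

def engine_job_script(profile, n_engines):
    """Build a minimal SGE array-job script that starts one ipengine
    per array task; adjust the #$ directives for your queue."""
    return """#!/bin/sh
#$ -N ipengine
#$ -t 1-%d
#$ -cwd
ipengine --profile=%s
""" % (n_engines, profile)

def submit(script_text):
    """Pipe a job script to qsub and return qsub's output (the job id line)."""
    proc = subprocess.Popen(['qsub'], stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate(script_text.encode())
    return out.decode()

# Start ipcontroller locally (or submit it the same way), then:
#   submit(engine_job_script('mysge', 8))
```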

It is not at all difficult to replicate the subset of what ipcluster does
for your environment with *much* simpler code that would ultimately be more
useful and controllable for you.

-MinRK


>
>  - dharhas
>
>
> >>> MinRK <benjaminrk@gmail.com> 9/14/2011 2:12 PM >>>
>
>
>   On Wed, Sep 14, 2011 at 11:13, Fernando Perez <fperez.net@gmail.com>
>  wrote:
>
>>  Hi Dharhas,
>>
>> On Wed, Sep 14, 2011 at 6:59 AM, Dharhas Pothina
>>
>> <Dharhas.Pothina@twdb.state.tx.us> wrote:
>>
>> > I ended up writing a script that connected to the cluster, made a copy
>> > of an already created profile with a new unique name, started ipcluster,
>> > waited till the json file was created, retrieved the json file for use
>> > in a local client, ran my script, and then cleaned up afterwards.
>> >
>> > This seems to be working fairly well except when the local script exits
>> > because of an error. In that case, I need to log in and stop the
>> engines,
>> > clean up files etc manually.
>>
>>   OK. We probably should remove the assumption of a 1 to 1 mapping
>> between profiles and running clusters, but that will require a fair
>> bit of reorganization of code that uses that assumption, so I'm glad
>> you found a solution for now.
>>
>
>   Yes, it's a pretty big deal that the only thing engines and clients need
> to know to connect to a cluster is the profile name. That is lost entirely
> if we allow multiple clusters with a single profile, since profile name
> becomes ambiguous. We would then need to add a second layer of specification
> for which controller to use within a given profile, e.g.:
>
>
>   ipengine --profile=mysge --controller-id=12345
>
>
>   I think I could add support for exactly this without much code change at
> all, though.
>
>
>   Feature Request opened on GitHub:
> https://github.com/ipython/ipython/issues/794
>
>
>> Cheers,
>>
>> f
>>
>
>
> _______________________________________________
> IPython-User mailing list
> IPython-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/ipython-user
>
>