[IPython-User] Limitations (?) of ipython SGE support

Michael Hanke michael.hanke@gmail....
Wed Jan 19 16:45:04 CST 2011


Hi,

we also started playing with ipcluster for Python-based computing and
found it difficult to integrate into our Condor pool. Most problems seem
to stem from the fact that the acquisition of resources is detached from
the actual usage or demand. Users tend to allocate more resources than
they need -- "to be sure". While an idle ipengine doesn't eat much CPU
time, that becomes a problem when users reuse engines across multiple
sessions: memory consumption accumulates and constantly occupies
resources that could be put to better use for other jobs -- total
utilization of the cluster goes down.

If I force users to request only a small number of engines, they have
to wait longer for their results, even when the cluster is not at 100%
load at that time. ipengine jobs also have to be protected from being
killed by an incoming job from a user with higher priority, which
somewhat undermines the fairshare idea: ipcluster users tend to end up
with low priority because they have constantly running processes.

If ipcontroller could spawn new engines on demand and kill stale ones
that have been idle for a while (or whose original client has detached),
that would help.
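To illustrate what I mean, here is a rough sketch of the policy such a
reaper could implement. This is purely hypothetical code -- the helpers
engine_idle_seconds() and engine_job_ids() do not exist in ipcontroller;
the sketch only shows the idea, with SGE's qdel as the kill mechanism:

    import subprocess
    import time

    IDLE_LIMIT = 30 * 60  # kill engines idle for more than 30 minutes

    def reap_stale_engines(controller):
        """Hypothetical reaper loop: ask the controller which engines
        have been idle for too long and remove their queue jobs."""
        while True:
            for engine_id, job_id in controller.engine_job_ids():
                # engine_idle_seconds() would have to be provided by
                # ipcontroller -- it keeps no such statistic today.
                if controller.engine_idle_seconds(engine_id) > IDLE_LIMIT:
                    # Return the slot to SGE so other jobs can use it.
                    subprocess.call(['qdel', str(job_id)])
            time.sleep(60)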

(please also see my comment below)

On Mon, Jan 17, 2011 at 09:33:47PM -0800, Brian Granger wrote:
> > Thanks for your reply. It seems that interactive use is your main goal.
> > However, wouldn't it be better to submit every task separately using qsub
> > (instead of submitting ipengines)?
> 
> There are a couple of reasons for doing this (just one long task):
> 
> * Our scheduler has much lower latency and overhead than that of SGE.

but it also lacks proper resource allocation. There is no way for me to
declare that my ipengine job will spawn 8 processes, occupy a whole
node, and conflict with other running jobs; to the batch system it
initially looks like a single-CPU process. It also doesn't negotiate
resources with other concurrent requests, i.e. ipclusters run by other
users at the same time.
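For comparison, SGE lets a job declare its demands at submission time,
so the scheduler can place it accordingly. A minimal sketch (assuming
the site has an 'smp' parallel environment configured by the admin):

    import subprocess

    # Declare up front that this engine will use 8 cores, so SGE can
    # schedule it onto a node with 8 free slots instead of treating it
    # as a single-CPU process. '-b y' submits ipengine as a binary.
    subprocess.call(['qsub', '-pe', 'smp', '8', '-b', 'y', 'ipengine'])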

In general it seems that it doesn't scale well beyond the set of
machines where I can control, monitor, and resolve resource conflicts
myself. Systems like Condor and SGE are powerful frameworks for exactly
this type of resource management. It would be great if IPython could
integrate with them more tightly.

Regarding Condor [0]: The authors have expressed their interest in this
use case, and would be willing to work on a better integration.

> * The ipython engines have persistent namespaces.  Thus each task can
> read/write from/to that namespace and subsequent tasks will be able to
> see those changes.  This is a huge difference if you need to do some
> lengthy initialization before doing the tasks.  Keeping things in
> memory is a huge benefit.

That is true, but there is also the problem of a growing memory
footprint, as outlined above, because the engines are not coupled to
any particular task. One would need to clean up the namespace manually
-- and our users don't do that.
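For reference, this is roughly what users would have to remember to do
by hand. A sketch only: the names are from the new ZMQ-based
IPython.parallel interface (still in development at the time, so they
may differ), and load_big_dataset() is a made-up placeholder:

    from IPython.parallel import Client

    rc = Client()
    dview = rc[:]  # a DirectView on all engines

    # Do the lengthy initialization once; it stays in the persistent
    # engine namespaces for all subsequent tasks.
    dview.execute('data = load_big_dataset()')  # hypothetical loader

    # ... run many tasks against 'data' ...

    # The manual cleanup step: wipe the engine namespaces so the
    # memory is returned before the engines are reused.
    dview.clear(block=True)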


> But,
> 
> For really long running tasks (long enough to run into the queue time
> limits), using ipython doesn't make a lot of sense.  But this is a
> usage case we should be able to cover better.  Will have to think
> about that at some point.

I don't think it is only about long-running tasks. I believe it is more
about managing shared resources in an environment with potentially
conflicting demands from multiple users. I fear that one would have to
reimplement a full-blown SGE or Condor to handle that properly -- or
integrate with existing solutions.

Michael

[0] http://www.cs.wisc.edu/condor/


-- 
Michael Hanke
http://mih.voxindeserto.de

