[IPython-User] Limitations (?) of ipython SGE support

Brian Granger ellisonbg@gmail....
Fri Jan 21 12:18:17 CST 2011


> we also started playing with ipcluster for python-based computing and
> found it difficult to integrate into our Condor pool. Most problems seem
> to come from the fact that acquisition of resources is detached from the
> actual usage or demand.

Yes, that is a very good way of summarizing the model.  It is worth
mentioning that interactivity almost implies this.  People
using something interactively tend to start and stop their work

> Users tend to allocate more resources than they
> need -- "to be sure". While an idle ipengine doesn't eat much CPU time,
> that becomes a problem when users reuse them for multiple "sessions".
> Memory consumption accumulates and constantly occupies resources that
> could be better used for other jobs -- total utilization of the cluster
> goes down.

Yes, and I think there are two issues here:

* Having IPython support non-interactive workloads by better
integrating with the job schedulers.
* Figuring out how to efficiently schedule interactive IPython jobs in
a way that keeps cluster utilization high.

> If I force users to only use a small amount of engines, they have to
> wait longer for their results, maybe despite the cluster not being at
> 100% load at a particular time. ipengine jobs also have to be protected
> to not get killed by another incoming job from a user with higher
> priority, which somewhat invalidates the fairshare idea. ipcluster users
> tend to have low priority, because they have constantly running
> processes.

One idea is to have dedicated, smaller clusters or multicore machines
to handle the interactive ipython jobs.  Another option is to
completely get rid of existing job schedulers and create something
that is more focused on this type of workload (ha, ha, how easy to
say, hard to do...).  But the most realistic set of options involves
IPython getting better at working with existing schedulers.

> If ipcontroller could spawn new engines on demand and kill stale ones if not
> used for a while (or when the original client detached), that would
> help.

Yes, that is definitely something we have thought about.  It would
require a bit of reworking how ipcontroller works, but it is not
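
To make the "kill stale engines" idea concrete, here is a minimal sketch
of an idle-timeout policy.  It is not how ipcontroller works: EnginePool,
touch, stale and reap are hypothetical names, and the kill callback
stands in for whatever would actually stop an engine (for example,
deleting its scheduler job).

```python
import time


class EnginePool:
    """Hypothetical bookkeeping for engines; not part of IPython."""

    def __init__(self, idle_timeout, clock=time.monotonic):
        self.idle_timeout = idle_timeout  # seconds an engine may sit unused
        self.clock = clock                # injectable so the policy can be tested
        self.last_used = {}               # engine id -> time of last activity

    def touch(self, engine_id):
        """Record that an engine was started or has just run a task."""
        self.last_used[engine_id] = self.clock()

    def stale(self):
        """Return the ids of engines idle for longer than idle_timeout."""
        now = self.clock()
        return sorted(e for e, t in self.last_used.items()
                      if now - t > self.idle_timeout)

    def reap(self, kill):
        """Call kill(engine_id) for each stale engine, then forget it."""
        reaped = self.stale()
        for engine_id in reaped:
            kill(engine_id)
            del self.last_used[engine_id]
        return reaped
```

A controller could call reap() periodically and call touch() whenever it
hands a task to an engine; spawning new engines on demand would be the
mirror image of this and is not shown.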

> (please also see my comment below)
> On Mon, Jan 17, 2011 at 09:33:47PM -0800, Brian Granger wrote:
>> > Thanks for your reply. It seems that interactive use is your main goal.
>> > However, wouldn't it be better to submit every task separately using qsub
>> > (instead of submitting ipengines)?
>> There are a couple of reasons for doing this (just one long task):
>> * Our scheduler has much lower latency and overhead than that of SGE.
> but it also lacks proper resource allocation. I cannot say that my
> ipengine job will now spawn 8 processes that will use all of a node and
> conflict with other running jobs, because initially it looked like a
> single CPU process. It also doesn't negotiate resources with other
> concurrent requests, i.e. other running ipclusters of other users.

Yep, it is its own little universe, a scheduler within a scheduler or so.

> In general it seems that it doesn't scale well beyond the set of
> machines that I can control myself and monitor myself and handle
> resource conflicts myself. Systems like Condor and SGE are powerful
> frameworks to achieve this type of resource management. It would be
> great if IPython could better integrate with them.

I agree.

> Regarding Condor [0]: The authors have expressed their interest in this
> use case, and would be willing to work on a better integration.

Michael, could you put us in touch with them?

>> * The ipython engines have persistent namespaces.  Thus each task can
>> read/write from/to that namespace and subsequent tasks will be able to
>> see those changes.  This is a huge difference if you need to do some
>> lengthy initialization before doing the tasks.  Keeping things in
>> memory is a huge benefit.
> That is true, but there are also problems with growing memory footprint
> as outlined above, because the workers are not at all coupled to any
> task. One would need to perform manual cleanups of the namespace -- our
> users don't do that.

Yes, many types of tasks don't need the persistent namespaces and for
those, it is a liability.
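
The trade-off can be illustrated in plain Python, without IPython.  In
this sketch FakeEngine is a hypothetical stand-in for an engine: a single
dict that outlives individual tasks, plus the kind of manual cleanup
mentioned above.  It only mimics the behavior and is not the engine's
actual implementation.

```python
class FakeEngine:
    """Stand-in for an engine: one namespace shared by all tasks."""

    def __init__(self):
        self.namespace = {}

    def run(self, source):
        # Every task executes in the same dict, so names persist.
        exec(source, self.namespace)

    def reset(self, keep=()):
        # Manual cleanup: drop everything except the names listed in keep.
        kept = {k: self.namespace[k] for k in keep if k in self.namespace}
        self.namespace.clear()
        self.namespace.update(kept)


engine = FakeEngine()
engine.run("table = [i * i for i in range(1000)]")  # lengthy init, done once
engine.run("result = sum(table)")                   # later task reuses table
engine.run("scratch = list(table) * 100")           # leftovers accumulate
engine.reset(keep=("table",))                       # free all but the init data
```

After reset(), table is still available to later tasks while result and
scratch are gone; without that call, the namespace only ever grows.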

Lots to think about.  Min, do you have any thoughts?



>> But, for really long running tasks (long enough to run into the queue
>> time limits), using ipython doesn't make a lot of sense.  But this is
>> a usage case we should be able to cover better.  Will have to think
>> about that at some point.
> I don't think it is only about long running tasks. I believe it is more
> about managing shared resources in an environment with potentially
> conflicting demand of multiple users. I fear that one would have to
> reimplement a full-blown SGE or Condor to be able to handle that
> properly -- or integrate with existing solutions.
> Michael
> [0] http://www.cs.wisc.edu/condor/
> --
> Michael Hanke
> http://mih.voxindeserto.de

Brian E. Granger, Ph.D.
Assistant Professor of Physics
Cal Poly State University, San Luis Obispo
