[IPython-User] Large Parallel Runs

MinRK benjaminrk@gmail....
Thu Aug 30 19:09:19 CDT 2012


On Thu, Aug 30, 2012 at 3:06 PM, Constantine Evans <cevans@evanslabs.org> wrote:

> Hello everyone,
>
> I'm currently having some difficulty with ipcontroller seeming to
> choke when given too many tasks on too large a cluster, and was
> wondering whether anyone else had experienced this.
>
> I'm using a cluster with the following configuration:
> * ipcontroller running on one machine, with 7 ipengines
> * ipengines running on 19 other machines with between 2 and 8
> instances per machine (1 per core), all connecting to ipcontroller via
> ssh. There are 72 ipengines in total.
> * the client running on my laptop and connected via ssh. My laptop is
> also one of the 19 machines
>
> On this setup, giving 1600 tasks seemed to work relatively well.
>
> However, giving it 16000 of the same tasks doesn't seem to be working.
> With perhaps only a few hundred tasks completed, queue_status() is
> only telling me about 4500 tasks unassigned at this point, at least
> twenty minutes after starting. Tasks are completing very slowly, and
> most of the ipengines seem to be idle.
>
> The only thing using significant CPU is ipcontroller, which is taking
> up 100% of its core. It doesn't seem to be using significant memory,
> however (<300MB).
>
> Has anyone else run into limitations like these? Is there some way
> around them? Do I simply have a bad configuration, or is there
> something more fundamental that might be wrong here?
>

Unfortunately, it can be any or all of the above.

A few questions:

1. IPython version (git hash if tracking git)
2. HubFactory.db_class, if specified
3. TaskScheduler.hwm (assuming you are using load-balancing), if specified
4. What is the nature of your tasks (data arguments / return values,
relationship between tasks, characteristic time, etc.)?

Due to the use of multiprocessing, the Hub and all Schedulers will identify
themselves as 'ipcontroller'. If you look at all of these processes ordered
by PID, can you tell which one is using 100% CPU?

If it's not the first, then it's likely the task scheduler choking on the
long queue.
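
One way to do that check is sketched below: list every process whose command
line mentions 'ipcontroller', ordered by PID, with its CPU share. It assumes
a POSIX-style `ps` that accepts `-e -o pid,pcpu,args`; the helper names
`parse_ps` and `controller_processes` are made up for this example and are
not part of IPython.

```python
import subprocess


def parse_ps(output, needle="ipcontroller"):
    """Return (pid, cpu_percent, command) tuples for matching processes, sorted by PID."""
    rows = []
    for line in output.splitlines()[1:]:  # skip the header row printed by ps
        parts = line.split(None, 2)       # pid, %cpu, then the full command line
        if len(parts) < 3 or needle not in parts[2]:
            continue
        rows.append((int(parts[0]), float(parts[1]), parts[2]))
    return sorted(rows)                   # tuples sort by their first field, the PID


def controller_processes():
    """Run ps (assumed POSIX-style) and keep only the ipcontroller processes."""
    out = subprocess.check_output(
        ["ps", "-e", "-o", "pid,pcpu,args"], universal_newlines=True
    )
    return parse_ps(out)


if __name__ == "__main__":
    for pid, cpu, cmd in controller_processes():
        print("{:>7} {:>6.1f}%  {}".format(pid, cpu, cmd))
```

Run it on the machine hosting ipcontroller while the queue is stalled; the
row with the high CPU figure is the process to compare against the first PID.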

Two options might help alleviate this:

1. Increase TaskScheduler.hwm to something like 100 (that's the number of
tasks that are allowed to be assigned to a given engine), or you can try it
with `0`, which means that tasks will be greedily assigned to engines as
fast as possible.
2. Depending on what features you use, you might try using the pure zmq
scheduler with:

    TaskScheduler.scheme_name = 'pure'

The pure zmq scheduler does not support any of the advanced features of the
Python scheduler (fault tolerance, retries, dependencies, etc.), but it is
extremely lightweight.
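
As a rough sketch, both options could go in ipcontroller_config.py in your
profile directory. This assumes the usual config-file mechanism, where
IPython supplies get_config() when it loads the file, so the fragment is not
meant to be run on its own. Pick the values that suit your workload; the ones
here are only the ones mentioned above.

```python
# ipcontroller_config.py (fragment; loaded by ipcontroller, not run directly)
c = get_config()

# Option 1: allow more outstanding tasks per engine.
# 100 is the example value from above; 0 means assign greedily.
c.TaskScheduler.hwm = 100

# Option 2: switch to the pure zmq scheduler. Leave this commented out if you
# rely on fault tolerance, retries, or dependencies.
# c.TaskScheduler.scheme_name = 'pure'
```

The controller needs to be restarted for config-file changes to take effect.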



>
> Regards,
> Constantine Evans
>