[IPython-dev] DAG Dependencies
Thu Oct 28 14:33:00 CDT 2010
On Thu, Oct 28, 2010 at 11:55 AM, MinRK <firstname.lastname@example.org> wrote:
> On Thu, Oct 28, 2010 at 10:46, Brian Granger <email@example.com> wrote:
>> On Thu, Oct 28, 2010 at 12:57 AM, MinRK <firstname.lastname@example.org> wrote:
>> > Hello,
>> > In order to test/demonstrate arbitrary DAG dependency support in the new
>> > ZMQ
>> > Python scheduler, I wrote an example using NetworkX, as Fernando
>> > suggested.
>> > It generates a random DAG with a given number of nodes and edges, runs a
>> > set
>> > of empty jobs (one for each node) using the DAG as a dependency graph,
>> > where
>> > each edge represents a job depending on another.
>> > It then validates the results, ensuring that no job ran before its
>> > dependencies, and draws the graph, with nodes arranged in X according to
>> > time, which means that all arrows must point to the right if the
>> > time-dependencies were met.
>> Very impressive demo and test. Here is a very significant benchmark
>> we could do with this...
>> 1. Make each node do a time.sleep(rand_time) where rand_time is a
>> random time interval over some range of times.
>> 2. For a DAG of such tasks, you can calculate the fastest possible
>> parallel execution time by finding the shortest path through the DAG,
>> where, by shortest path, I mean the path where the sum of rand_time's
>> on that path is the smallest.
> It's actually slightly more complicated than that, because T_best should
> actually be the *longest* path from a root to any terminus. Remember that a
> node depends on all its parents, so the longest path is the earliest start
> time for a given node. Since It's a DAG, there can't be any loops that
> would mess up the length of your route. It would be shortest if I set
> mode='any' on the dependencies, and even then, T_best would be the *longest*
> path of the collection of shortest paths from each root to each node.
> It's been a long time since the DAG unit of my Automata and Languages
> course, but I'll put this together.
Absolutely, my brain was thinking longest, but it came out shortest.
>> Call that time T_best. By analyzing
>> the DAG, you can also tell the number of engines required to acheive
>> that T_best. We can also calculate things like the parallel and
>> serial fraction of the DAG to find the max speedup.
>> 3. Run that same DAG on 1, 2, 4, 8, ... engines to see how close we
>> can get to T_best and the max_speedup.
>> This would be a very rigorous way of testing the system over a variety
>> of different types of loads.
>> > It happily handles pretty elaborate (hundreds of edges) graphs.
>> That is quite impressive, but what is the limitation? It should be
>> able to do 1000s or more of edges right?
> The limitation is that my tasks take O(1 sec) to run, and I didn't want to
> wait for several minutes for the test to complete :). There should be no
> problem internally with millions of tasks and dependencies. The performance
> limitation will be the set methods used to check whether a dependency has
> been met and the memory footprint of the sets themselves. Quickly checking
> with %timeit, the set checks with 100k dependencies and 1M msg_ids to check
> against still only take ~5ms on my laptop (200ns for 10 dependencies and 10M
> msg_ids to check). My interactive session starts running into memory issues
> with 100M ids, so that's something to consider.
OK, this makes sense.
>> > Too bad I didn't have this done for today's Py4Science talk.
>> Yes, defiinitely, that would have been "epic" as my teenage son would say.
>> > Script can be found here:
>> > http://github.com/minrk/ipython/blob/newparallel/examples/demo/dagdeps.py
>> > -MinRK
>> Brian E. Granger, Ph.D.
>> Assistant Professor of Physics
>> Cal Poly State University, San Luis Obispo
Brian E. Granger, Ph.D.
Assistant Professor of Physics
Cal Poly State University, San Luis Obispo
More information about the IPython-dev