[IPython-dev] DAG Dependencies
Thu Oct 28 13:55:18 CDT 2010
On Thu, Oct 28, 2010 at 10:46, Brian Granger <firstname.lastname@example.org> wrote:
> On Thu, Oct 28, 2010 at 12:57 AM, MinRK <email@example.com> wrote:
> > Hello,
> > In order to test/demonstrate arbitrary DAG dependency support in the new
> > Python scheduler, I wrote an example using NetworkX, as Fernando
> > It generates a random DAG with a given number of nodes and edges, runs a
> > of empty jobs (one for each node) using the DAG as a dependency graph,
> > each edge represents a job depending on another.
> > It then validates the results, ensuring that no job ran before its
> > dependencies, and draws the graph, with nodes arranged in X according to
> > time, which means that all arrows must point to the right if the
> > time-dependencies were met.
> Very impressive demo and test. Here is a very significant benchmark
> we could do with this...
> 1. Make each node do a time.sleep(rand_time) where rand_time is a
> random time interval over some range of times.
> 2. For a DAG of such tasks, you can calculate the fastest possible
> parallel execution time by finding the shortest path through the DAG,
> where, by shortest path, I mean the path where the sum of rand_time's
> on that path is the smallest.
It's actually slightly more complicated than that, because T_best should
actually be the *longest* path from a root to any terminus. Remember that a
node depends on all its parents, so the longest path is the earliest start
time for a given node. Since It's a DAG, there can't be any loops that
would mess up the length of your route. It would be shortest if I set
mode='any' on the dependencies, and even then, T_best would be the *longest*
path of the collection of shortest paths from each root to each node.
It's been a long time since the DAG unit of my Automata and Languages
course, but I'll put this together.
> Call that time T_best. By analyzing
> the DAG, you can also tell the number of engines required to acheive
> that T_best. We can also calculate things like the parallel and
> serial fraction of the DAG to find the max speedup.
> 3. Run that same DAG on 1, 2, 4, 8, ... engines to see how close we
> can get to T_best and the max_speedup.
> This would be a very rigorous way of testing the system over a variety
> of different types of loads.
> > It happily handles pretty elaborate (hundreds of edges) graphs.
> That is quite impressive, but what is the limitation? It should be
> able to do 1000s or more of edges right?
The limitation is that my tasks take O(1 sec) to run, and I didn't want to
wait for several minutes for the test to complete :). There should be no
problem internally with millions of tasks and dependencies. The performance
limitation will be the set methods used to check whether a dependency has
been met and the memory footprint of the sets themselves. Quickly checking
with %timeit, the set checks with 100k dependencies and 1M msg_ids to check
against still only take ~5ms on my laptop (200ns for 10 dependencies and 10M
msg_ids to check). My interactive session starts running into memory issues
with 100M ids, so that's something to consider.
> > Too bad I didn't have this done for today's Py4Science talk.
> Yes, defiinitely, that would have been "epic" as my teenage son would say.
> > Script can be found here:
> > -MinRK
> Brian E. Granger, Ph.D.
> Assistant Professor of Physics
> Cal Poly State University, San Luis Obispo
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the IPython-dev