[IPython-dev] Suggestions for implementing parallel algorithms?

Albert Strasheim fullung at gmail.com
Mon Nov 13 18:23:48 CST 2006


Hello all

On Thu, 09 Nov 2006, Brian Granger wrote:

> This sounds like a nice application of IPython1.  Fernando and I have
> had a number of talks about this exact issue lately.  One thing I
> should say before all else.  In spite of there being years of research
> in parallel computing and algorithms, there is basically no research
> on *interactive* parallel computing.  I am not sure you need your
> application to be interactive, but if you do, at some level, you are
> in new territory.

My application probably doesn't fall into the interactive category at 
present. But some speaker verification and identification systems could 
definitely be used interactively.

> With that said, there are some things to keep in mind.
> 
> First, it is important to realize that all the previous work done on
> parallel algorithm/application development still applies even though
> the result can be used interactively.  For example, if you need to
> move data around between the ipython engines, you should still use MPI
> - and all the guidelines for using MPI still apply.

Okay. This makes sense.
 
> The new and interesting question is really "how and where do you want
> to interact with the parallel application as a human user?"  There are
> many models of how you would want your application abstracted for
> interactive usage.  And at some level, you may want to have the
> interactive API very different from the underlying computational model
> and parallel algorithm.  You may want to hide the parallelism or you
> may find it better to show it explicitly in the API.

Interesting. I see what you're saying. For speaker verification 
systems, you could have an interactive parallel application in the case 
where you provide a system for a bunch of operators to investigate 
phone calls with. The operators might be wearing trenchcoats. ;-)
 
> In my own work, I have tended to factor my application into units that
> can perform the basic computational tasks for both the serial and
> parallel versions of the code.  I then use these as building blocks to
> build the parallel and serial version.  If the low level components
> are factored well, the high level algorithm is typically very short
> and I don't mind maintaining both a serial and a parallel version.

What is still unclear to me is how you call these components. I'm 
trying to avoid putting too much code in strings that get passed to 
executeAll and the like. If you have some function foo that does N 
things on each node, how do you set things up so that you don't have to 
write N lines in N strings passed to N executeAll calls? Presumably one 
would want to just write executeAll('foo(a,b,c)') with foo written as a 
normal Python function.

Something like this:

In [17]: f = lambda x: x
In [18]: rc.pushAll(f=f)
In [19]: rc.executeAll('f(10)')

However, lambda functions don't pickle:

Object cannot be serialized:  f Can't pickle <type 'function'>: 
attribute lookup __builtin__.function failed
Out[18]: False

or something like this:

In [36]: rc.getIDs()
Out[36]: [0, 1]

In [37]: def f(x): return 10*x
   ....:

In [38]: rc.pushAll(f=f)
Out[38]: True

However, this nukes the engines:

exceptions.AttributeError: 'module' object has no attribute 'f'

If I distribute a module containing the functions I want to execute to 
all the nodes beforehand, the engines might be able to unpickle the 
function. I still have to try this.
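
Roughly what I have in mind (completely untested; mymod is a 
hypothetical module that has been copied onto every engine's PYTHONPATH 
beforehand, and I'm assuming pullAll is the inverse of pushAll):

# mymod.py, identical on the client and on every engine:
#     def f(x):
#         return 10*x

rc.executeAll('import mymod')          # engines can now resolve mymod.f
rc.executeAll('result = mymod.f(10)')  # run it remotely
print rc.pullAll('result')             # fetch the results back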

I wonder if it would work with functions declared inside instance 
methods. Where would the engines find these functions? My thinking here 
is that you could write parallel algorithms like this:

class ParallelFoo:
    def parallelop(self, data, controller):
        def something_to_execute_everywhere(x):
            return 10*x
        controller.scatterAll('x', data)
        controller.pushAll(f=something_to_execute_everywhere)
        controller.executeAll('f(x)')
 
Make sense? Any suggestions for doing this kind of thing?
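
One workaround I might try is shipping the function's source text 
instead of a pickled function object, since executeAll should happily 
exec a def statement. A rough, untested sketch:

import inspect, textwrap

class ParallelFoo:
    def parallelop(self, data, controller):
        def f(x):
            return 10*x
        # ship the source text rather than a pickled function object
        src = textwrap.dedent(inspect.getsource(f))
        controller.executeAll(src)              # defines f on every engine
        controller.scatterAll('x', data)
        controller.executeAll('y = map(f, x)')  # apply f to each engine's slice
        return controller.gatherAll('y')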

> For many things I do like the scatterAll/executeAll/gatherAll style of
> computation - it is extremely lightweight and easy to implement.  The
> one thing to be careful of though is to not use this approach when MPI
> is more appropriate.  Testing the scaling of your application will
> quickly reveal if there are problems like this.

For my application, getting the data to the engines probably needs some 
thought. Prior to training the world model with K-means and GMM EM, I 
need to spread tens to hundreds of hours of speech across the engines.

Correct me if I'm wrong, but since the client and controller aren't 
part of the MPI universe, I have to use something like scatterAll here?
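
That is, if I understand the API correctly, a round trip like:

rc.scatterAll('x', range(100))           # client -> controller -> engines
rc.executeAll('y = [10*i for i in x]')   # each engine works on its slice
y = rc.gatherAll('y')                    # engines -> controller -> client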

On my client I would typically have a directory with a few hundred 
files that I want to distribute to the engines. A quick 'n dirty 
implementation could read all these files into memory and scatter them 
to the engines.
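
Something like this, say (the data directory and the train_on function 
are made up):

import os

datadir = '/data/speech'                    # hypothetical
names = os.listdir(datadir)
blobs = [open(os.path.join(datadir, n), 'rb').read() for n in names]
rc.scatterAll('blobs', blobs)       # everything flows through client memory
rc.executeAll('train_on(blobs)')    # hypothetical engine-side function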

This approach will run into problems when my dataset becomes bigger 
than the memory on my client machine (reasonably likely) and is 
probably going to be reasonably (very?) slow anyway.

Next idea: scatter filenames (or some other kind of data ID) and let 
the engines query a central server for the data, maybe via HTTP or FTP 
(or maybe do something that exposes a PyTables database to the 
network). Alternatively, keep all the data on each engine machine. My 
data is probably too big for this approach.
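
The fetch-by-name variant might look roughly like this, reusing the 
names list from above (the data server URL is made up):

rc.scatterAll('names', names)    # only the (small) filename strings move
rc.executeAll('import urllib')
fetch = "blobs = [urllib.urlopen('http://dataserver/' + n).read() for n in names]"
rc.executeAll(fetch)             # each engine fetches its own slice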

Next idea after that: since I have a bunch of machines with disk space 
to spare, I could spread multiple copies of my dataset across these 
machines. Then I could still scatter names and let the engines do some 
kind of lookup to find machines that have the data they are looking 
for, then fetch it from one of those machines at random. Basically the 
previous idea + load balancing.

Next idea after that: in a separate process on each engine machine, run 
something akin to a BitTorrent client that downloads files from a 
torrent as the local engine needs them. When the system starts up, the 
client machine could seed the data until at least one copy of each file 
is distributed across the engine machines.

Next idea after that: figure out a way to prefer scattering of data to 
kernels that already have the data available. I think the BOINC folks 
call this concept locality scheduling:

http://boinc.berkeley.edu/sched_locality.php

Other idea: instead of all this network I/O, keep a subset of the data 
on each engine machine and use locality scheduling to ensure that only 
machines that have certain data get work related to that data scattered 
to them. At this point, scatter probably isn't the right word anymore.
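
As a rough sketch of what I mean, assuming the controller also has 
per-engine push/execute counterparts to the *All methods (which I 
haven't checked):

# client-side catalogue of which engine holds which files (made up)
locality = {0: ['a.wav', 'b.wav'],
            1: ['c.wav', 'd.wav']}

for engine_id, names in locality.items():
    rc.push(engine_id, names=names)                 # assumed per-engine push
    rc.execute(engine_id, 'train_on_local(names)')  # hypothetical function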

How fast this stuff is going to be... we'll see. :-)

Do you guys have large datasets to deal with? Any thoughts on doing this 
kind of thing?

> >I'm hoping I can avoid this duplication. My first idea is to make
> >something like a LocalController that implements IPython1's
> >IController interface in a way that makes sense for single node
> >operation. This way, I can implement my algorithm once in terms of
> >IController operations, test it easily, and later, by simply setting
> >a controller property on an instance of the class implementing the
> >algorithm, decide whether it runs on a single node or in parallel.
> 
> I had not thought of that before, but it does make sense.  It is sort
> of similar to building objects that hide whether the object is being
> used in a parallel/serial context.  It is surely worth trying this
> approach, but I am not sure how it would turn out in your case.

I'd like to explore this idea further if I can figure out the "right" 
way for algorithms to tell the engines how to do a piece of work from 
inside a method.

> I don't know if this helps, but I would love to see what you end up
> trying and what you find most useful - I am curious about all these
> things myself.

Thanks for your inputs. Much appreciated.

Cheers,

Albert

