[IPython-dev] pushAll and executeAll speed issue

Brian Granger ellisonbg.net at gmail.com
Thu Nov 9 00:14:41 CST 2006

> I'm doing some experiments with IPython1 on Windows. I'm using DeinoMPI's
> GUI on top of mpi4py to start two kernels on my Core 2 Duo system. They
> connect to the controller without problems. I make a RemoteController and
> off I go. This is too cool.

Yes, cool, but it is also dangerous.....read on...

> (By the way: as I understand it, MPI is currently only being used to start
> the kernels. From what I can see, it is not yet being used inside things
> like pushAll, etc. Is this the next step to use the MPI functions if they
> are available? Anybody working on this? If not, maybe I can try my hand at
> it. Any hints or suggestions on where to start appreciated.)

There are two ways one might want to use MPI:

1.  To send data between engines.  This is already supported through
the use of various Python MPI bindings such as mpi4py.  Our docs have
some details on how to get started with this, but it can be subtle
depending on your MPI implementation, etc.  Please feel free to bug us
about this.
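For concreteness, here is a hedged sketch of what engine-to-engine MPI
might look like.  These are code strings the client would send to the
engines (e.g. via executeAll), not code run in the client itself; the
names `rc` and `partial_sum` and the exact commands are illustrative
assumptions, not the documented API.

```python
# Hypothetical sketch: code strings a client might send to the engines
# so they exchange data directly over MPI.  Assumes mpi4py is importable
# on every engine; 'rc' stands for a RemoteController instance.
setup_cmd = "from mpi4py import MPI; comm = MPI.COMM_WORLD"
sum_cmd = "total = comm.allreduce(partial_sum)"  # engine-to-engine reduction

# In a live session one would do something like:
#   rc.executeAll(setup_cmd)
#   rc.executeAll(sum_cmd)
# Here we just sanity-check that the snippets are valid Python.
for src in (setup_cmd, sum_cmd):
    compile(src, "<engine>", "exec")
```

The point is that the data movement happens entirely on the engine
side, over MPI; the client only ships a few short strings.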

2.  To send data between the client (your Python script) and the
engines.  At first this may seem attractive, but it is not really what
you want to do.  The reason is that your IPython/Python script would
then also have to be part of the MPI universe.  That makes it
impossible to do many important things:

 - Run your script on your laptop, but run the engines/controller on a
remote cluster.
 - Disconnect/reconnect to/from the controller.
 - Allow multiple users to connect to the controller.

For these reasons, we feel that it is a "feature" that MPI is
not used to send data between the client (your Python script or an
interactive session) and the engines.

With that said, you shouldn't be sending lots of data between the
client and engines anyway - see below for more discussion of this...

> Anyway, back to my experiment. The script I'm using is attached.
> It starts off with some local benchmarks running on a single core. I get:
> cPickle dumps time 4.95649055955
> cPickle loads time 2.39126201794
> local map time 10.3841574837
> Then I push to all and time it. Here I get:
> total pushAll time 70.6287321433
> I was expecting something proportional to dumpsTime+N*loadsTime (if I
> understand correctly that it is serially pushing out the data to each of the
> N kernels) or maybe even less. I noticed that this section gets dramatically
> slower even with a relatively small increase in data size. While this is
> busy, I see Python using up all the CPU on one core as expected, but I also
> see that the Windows "System" process is very busy.
> Next up, the timing over the executeAll call that does the map:
> executeAll time for map 139.186242386
> but the kernels print:
> engine map time 12.8248084603
> engine map time 11.3900467797
> which is only slightly more than the single core time. However, I have to
> wonder where the heck the rest of the time went?! :-)
> Do you guys see similar timings on Linux? Any thoughts on where my time is
> going? Could it be Twisted? I tried with the latest from SVN and the
> official 2.4.0 release for Windows.

The current incarnation of your script has a few problems:

1.  It creates multiple copies of a large object across 4 processes on
your system (client, controller, and 2 engines).  Even though the
object is "not that big", when you have multiple copies around, the
memory usage gets out of control.  Fernando was seeing his 2 GB
system nearly maxed out.
It is even worse than just 4 copies, though.  If you go through your
script you will see that the client and each engine hold 2-3 copies of
the object.  For certain periods of time, the controller can also have
multiple copies.  Most of these problems (for objects the size of
yours) would go away if the controller and engines were all running on
different systems.
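Back-of-the-envelope arithmetic shows how quickly this multiplies (the
50 MB figure below is purely illustrative, not measured from your
script):

```python
# Illustrative only: assume the pushed object pickles to ~50 MB.
obj_mb = 50
processes = 4             # client, controller, and 2 engines
copies_per_process = 2.5  # the 2-3 copies mentioned above, averaged
total_mb = obj_mb * processes * copies_per_process
print(total_mb)  # 500.0 -> half a gigabyte resident for one object
```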

2.  There is no real parallelism being expressed.  This is because
pushAll sends the entire object to each engine.  Did you mean to
use scatterAll?  That partitions a sequence and sends one partition
to each engine.
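The partitioning scatterAll performs is along these lines (a rough
pure-Python sketch of the idea, not IPython1's actual implementation):

```python
# Sketch of scatterAll-style partitioning: split a sequence into n
# contiguous chunks whose sizes differ by at most one element.
def partition(seq, n):
    size, rem = divmod(len(seq), n)
    chunks, start = [], 0
    for i in range(n):
        stop = start + size + (1 if i < rem else 0)
        chunks.append(seq[start:stop])
        start = stop
    return chunks

print(partition(list(range(10)), 3))
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

So each engine receives roughly 1/n of the data instead of a full copy.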

3.  One of the golden rules of parallel computing is "never send
something over the wire that can easily be recreated in place."  The
following code is much better:

rc.execute(0, 'a = N.arange(0, 50000)')
rc.execute(1, 'a = N.arange(50000, 100000)')
rc.executeAll('b = a*a')
result = rc.gatherAll('b')    # Now gather the result locally

With this you have full parallelism, and only three small code strings
(not the actual objects) are sent out to the engines.  Code like this
will scale to many processes.
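As a local sanity check of what that snippet computes, plain Python can
stand in for the two engines (no controller needed):

```python
# Pure-Python stand-in for the two engines above: each builds its own
# slice of the range locally, squares it, and the gather step
# concatenates the partial results in engine order.
a0 = [x * x for x in range(0, 50000)]        # engine 0: b = a*a
a1 = [x * x for x in range(50000, 100000)]   # engine 1: b = a*a
gathered = a0 + a1                           # gatherAll('b')
assert gathered == [x * x for x in range(100000)]
```

Each engine recreated its slice in place; nothing but the result ever
crossed the wire.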

We have really tried to make it easy to send objects between the
client and engines.  But that ease can be a trap.  By this, I mean it
is *very* easy (I have done it many times with IPython1 :)) to get
carried away with how easy it is to send things around and forget
about what is really happening under the hood.

With that said, we are very aware of the need to improve how IPython
handles these cases.  There are lots of optimizations we can and need
to do, and we also need to provide some protections that prevent crazy
things from being done too easily.  We have our work cut out for us,
and having users like you hammering on things makes our job easier.
