[IPython-User] parallel computing with ipython: reproducibility and intermediate results

Christoph Groth cwg@falma...
Wed Oct 19 04:00:11 CDT 2011


Dear ipythonistas,

my current workflow for running (mostly python-written) simulations
on a cluster involves starting and controlling them from a unix
shell, evaluating the results with scripts written in sh, awk, and
python, and visualizing the data mostly with gnuplot.

I'm considering consolidating this with the help of ipython.  Using a
single environment for coding and running the simulations as well as
analyzing the results could simplify things a lot.

I read the relevant parts of the ipython manual and I like its
approach to parallel computing.  However, I wonder how to achieve two
(in my opinion) crucial goals: reproducibility and availability of
intermediate results.  I'm very interested to learn how others deal
with these, especially in the context of python and ipython.

What follows is a description of the two features of my current workflow
which I would like to move over to ipython.

*** Reproducibility ***

My simulations are run as shell commands (with subcommands and
parameters).  The simulations potentially consume input (on stdin) and
produce output (on stdout).  The output depends only on the input and
the parameters to the command.

The output of each command begins with a few lines of comments which
include all the information necessary for re-doing the calculation:
all parameters, the RNG seed used, and the software version of the
simulation.

In this way, by simply keeping the output file (together with the
input file, if there is one), I'm able to reproduce any computation
at any time in the future.
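
To illustrate, here is a minimal sketch of how such a self-documenting
header can be written (the helper and all the names in it are made up
for the example; they are not my actual code):

    import sys, time

    VERSION = "mysim 1.4"    # hypothetical version string

    def write_header(out, params, seed):
        """Write comment lines that make the run reproducible."""
        out.write("# %s\n" % VERSION)
        out.write("# date: %s\n" % time.strftime("%Y-%m-%d %H:%M:%S"))
        out.write("# seed: %d\n" % seed)
        for name, value in sorted(params.items()):
            out.write("# %s = %r\n" % (name, value))

    write_header(sys.stdout, {"size": 64, "temperature": 0.5}, seed=12345)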

I like this automatic documenting of runs a lot.  It has proven to be
extremely useful many times.

Logging complete ipython sessions seems to be a possible solution.  But
if the commands only depend on their input (and not on any global
state), a better solution should be possible.  I could imagine having a
"Result" type along with functions which consume and produce results.  A
result would remember its history and it would be possible to save it to
disk.
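
A rough sketch of what I have in mind (all names invented for the
example; run_simulation stands in for the real computation):

    import pickle, random

    def run_simulation(params, seed):
        # Stand-in for the actual simulation.
        rng = random.Random(seed)
        return rng.gauss(params["mu"], params["sigma"])

    class Result(object):
        """A value together with a record of how it was computed."""
        def __init__(self, value, history):
            self.value = value
            self.history = history    # nested record of the computation

        def save(self, fname):
            with open(fname, "wb") as f:
                pickle.dump(self, f)

    def simulate(params, seed):
        return Result(run_simulation(params, seed),
                      ("simulate", params, seed))

    def average(results):
        value = sum(r.value for r in results) / len(results)
        return Result(value, ("average", [r.history for r in results]))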

*** Inspecting intermediate results ***

My simulations are often embarrassingly parallel and involve calculating
the same quantities many times (with different RNG seeds) and averaging
the results.  Often, it's difficult to tell in advance how much
averaging will be necessary or possible in the available time.  Also,
it's crucial to be able to look at partially averaged results as soon
as possible.

With my current workflow, the outputs of the various runs are
accumulated in text files.  Each non-averaged result is typically
represented by a single line of the form

<param0> <param1> ... <result0> <result1> ...

I have a script which makes it very easy to average all the results with
matching parameters.  I can run this script once _some_ results are
available and look at the preliminary averaged results.
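
In python, the core of such an averaging script might look like this
(a sketch; the number of parameter columns, n_params, is passed in):

    from collections import defaultdict

    def average_lines(lines, n_params):
        """Group lines by their first n_params columns, average the rest."""
        groups = defaultdict(list)
        for line in lines:
            if line.startswith("#") or not line.strip():
                continue                  # skip comment and blank lines
            fields = line.split()
            key = tuple(fields[:n_params])
            groups[key].append([float(x) for x in fields[n_params:]])
        for key, rows in sorted(groups.items()):
            means = [sum(col) / len(col) for col in zip(*rows)]
            print(" ".join(list(key) + ["%g" % m for m in means]))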

If I were using ipython's parallel "map", I don't see any way to look
at the results before the whole calculation has finished.  Before
writing my own version of map (one which would support a "peek"
method), I'd like to ask how others solve this problem.
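
For what it's worth, a crude workaround I can imagine, assuming the
IPython.parallel API of 0.11 (apply_async and AsyncResult.ready())
and the run_simulation stand-in from above, which would have to be
defined on the engines:

    from IPython.parallel import Client

    rc = Client()
    view = rc.load_balanced_view()

    params = {"mu": 0.0, "sigma": 1.0}
    # Submit one task per RNG seed, keeping the individual AsyncResults.
    async_results = [view.apply_async(run_simulation, params, seed)
                     for seed in range(1000)]

    def peek():
        """Average whatever results have arrived so far."""
        done = [ar.get() for ar in async_results if ar.ready()]
        if done:
            print("%d of %d done, running mean: %g"
                  % (len(done), len(async_results), sum(done) / len(done)))

But this loses map's convenience, so I'd be happy to hear of a better
way.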

Christoph


