[IPython-user] usage of Task's depend

John Swinbank swinbank@transientskp....
Wed Apr 8 07:57:02 CDT 2009


Hello,

I've been enjoying working on some parallel data processing using
StringTasks. However, I've come upon a question; any words of wisdom
would be gratefully received!

I want to process some data files using a series of ipengines spread
across a cluster. The catch is that not all of the data will exist on
each node, and, indeed, the disposition isn't well known in advance.
Obviously, I'd like to process each file only on a node on which it
exists.

My hope was to achieve this using the "depend" argument to StringTask.
I've populated the engine properties with a list of the files available
to that engine, and tried something like the following:

  def check_for_file(name):
      return lambda props: name in props['filenames']

  for filename in filename_list:
      task = client.StringTask(
          "result = process_file(filename)",
          push=dict(filename=filename),
          pull="result",
          depend=check_for_file(filename)
      )
      tc.run(task)

Of course, this fails:

  ValueError: Sorry, cannot pickle code objects with closures

(Somewhat optimistically, I also tried something very similar using
functools.partial, but partial objects also aren't pickleable!)
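
The underlying limitation can be reproduced with the standard library alone. The following is a minimal sketch, assuming plain pickle rather than IPython's own serialisation, so the exception raised differs from the ValueError quoted above. The names file_is_present and can_pickle are hypothetical helpers introduced only for illustration.

```python
import pickle


def check_for_file(name):
    # Returns a lambda that closes over `name`; it has no importable
    # name, so pickle cannot serialise it by reference.
    return lambda props: name in props['filenames']


def file_is_present(props, name):
    # Hypothetical alternative: a plain module-level function that
    # receives the filename as an explicit second argument.
    return name in props['filenames']


def can_pickle(obj):
    # Report whether the standard pickle module accepts `obj`.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


closure_ok = can_pickle(check_for_file("a.dat"))
plain_ok = can_pickle((file_is_present, ("a.dat",)))
print("closure picklable:", closure_ok)
print("function plus separate argument picklable:", plain_ok)
```

The closure is rejected because the filename lives inside the function object itself. Passing the filename as data alongside a plain function avoids that, which is the idea behind the workaround described next.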

I've worked around this for now by redefining StringTask to accept
another argument ("depend_args"), and changing its check_depend method
to run:

  if self.depend is not None:
      return self.depend(properties, self.depend_args)

I can then pass in the filename when setting up the task and examine it
when checking dependencies.
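
To make the shape of that workaround concrete, here is a rough, self-contained sketch. DependArgsTask is a hypothetical stand-in class, not IPython's StringTask, and has_file and the example property dictionaries are likewise invented for illustration. Only the body of check_depend follows the snippet above; the fallback of returning True when no dependency is set is an assumption.

```python
class DependArgsTask(object):
    """Hypothetical stand-in for the modified task class (not IPython's)."""

    def __init__(self, expression, depend=None, depend_args=None):
        self.expression = expression
        self.depend = depend            # plain function, no closure needed
        self.depend_args = depend_args  # extra data handed to `depend`

    def check_depend(self, properties):
        # Same logic as the snippet above: forward the extra arguments.
        if self.depend is not None:
            return self.depend(properties, self.depend_args)
        # Assumption: a task without a dependency may run on any engine.
        return True


def has_file(properties, filename):
    # Dependency check: is the wanted file listed in the engine properties?
    return filename in properties['filenames']


task = DependArgsTask(
    "result = process_file(filename)",
    depend=has_file,
    depend_args="obs1.dat",
)

# Invented engine properties for two nodes.
on_node_a = task.check_depend({'filenames': ['obs1.dat', 'obs2.dat']})
on_node_b = task.check_depend({'filenames': ['obs3.dat']})
print(on_node_a, on_node_b)
```

Because has_file is an ordinary top-level function and the filename is carried as a separate attribute, nothing here relies on a closure.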

Naively, it seems like allowing more arguments to the depend function
could be useful in a range of situations. Is there a smarter way of
going about things than my workaround?

Thanks for any suggestions!

John

