[IPython-user] usage of Task's depend

Brian Granger ellisonbg.net@gmail....
Wed Apr 8 11:10:04 CDT 2009


John,

The current design of the task dependency system and properties has a
number of flaws, and unfortunately you are running into all of them.

But, I think I have good news.  I think I have figured out a way of
handling this that is much more robust and doesn't have the
limitations the current approach does.

I will try to make some changes to IPython itself, but for now I think
you can try the approach yourself.  Here is the idea...

* Completely forget about the properties and depends stuff.  Forget it exists.

* Test for a dependency in your string task itself.  There, you have
full access to your engine's namespace.

* If a task doesn't have the right dependencies, just raise an
exception.  Something like this:

t = """
if name not in files:
    raise Exception('dependency not met')
else:
    pass  # do the real work here
"""

* The task will then fail if the dependencies are not met.  To get the
task to reschedule itself, just specify a number of task retries.  You
just need to allow enough retries that the task eventually runs on an
engine that has the dependency.
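
To make the fail-and-retry idea concrete, here is a toy sketch in plain
Python.  It does not use IPython at all: the engines are simulated as
dictionaries, and the names engines, TASK and run_with_retries, as well
as the file names, are made up for illustration.  The real scheduler
may pick engines differently; this only shows why enough retries
eventually land the task on an engine that has the file.

```python
import itertools

# Toy stand-ins for engines: each one is just a namespace listing the
# files available locally. All names here are hypothetical.
engines = [
    {"filenames": ["a.dat"]},
    {"filenames": ["b.dat"]},
    {"filenames": ["c.dat", "a.dat"]},
]

# The task checks its own dependency and raises if it is not met.
TASK = """
if filename not in filenames:
    raise Exception('dependency not met')
else:
    result = 'processed ' + filename
"""


def run_with_retries(task, push, retries, engine_cycle):
    """Run the task on successive engines, at most 1 + retries times.

    Returns (result, number_of_attempts) on success and raises
    RuntimeError if every attempt fails.
    """
    for attempt in range(1 + retries):
        namespace = dict(next(engine_cycle))  # copy the engine namespace
        namespace.update(push)                # like push=dict(...)
        try:
            exec(task, namespace)
        except Exception:
            continue  # this engine lacks the dependency; try the next
        return namespace["result"], attempt + 1
    raise RuntimeError("task failed on every attempt")


outcome = run_with_retries(
    TASK, {"filename": "c.dat"}, retries=2,
    engine_cycle=itertools.cycle(engines),
)
print(outcome)
```

Here only the third simulated engine holds c.dat, so the task is
rejected twice and succeeds on its third attempt.  With retries=1 the
same call would give up before reaching that engine, which is why the
retry count has to be generous when you do not know where the data is.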

Eventually I will make a custom exception type that you can raise in
this situation, like "TaskRejectError."  I will also eventually remove
the dependency/properties stuff.  But for now, this should get you
going.  Let me know if you have other questions.

Cheers,

Brian


On Wed, Apr 8, 2009 at 5:57 AM, John Swinbank <swinbank@transientskp.org> wrote:
> Hello,
>
> I've been enjoying working on some parallel data processing using
> StringTasks. However, I've come upon a question; any words of wisdom
> would be gratefully received!
>
> I want to process some data files using a series of ipengines spread
> across a cluster. The catch is that not all of the data will exist on
> each node, and, indeed, the disposition isn't well known in advance.
> Obviously, I'd like to process each file only on a node on which it
> exists.
>
> My hope was to achieve this using the "depend" argument to StringTask.
> I've populated the engine properties with a list of the files available
> to that engine, and tried something like the following:
>
>  def check_for_file(name):
>      return lambda props: name in props['filenames']
>
>  for filename in filename_list:
>      task = client.StringTask(
>          "result = process_file(filename)",
>          push=dict(filename=filename),
>          pull="result",
>          depend=check_for_file(filename)
>      )
>      tc.run(task)
>
> Of course, this fails:
>
>  ValueError: Sorry, cannot pickle code objects with closures
>
> (Somewhat optimistically, I also tried something very similar using
> functools.partial, but partial objects also aren't pickleable!)
>
> I've worked around this for now by redefining StringTask to accept
> another argument ("depend_args"), and changed its check_depend method to
> run:
>
>  if self.depend is not None:
>      return self.depend(properties, self.depend_args)
>
> I can then pass in the filename when setting up the task and examine it
> when checking dependencies.
>
> Naively, it seems like allowing more arguments to the depend function
> could be useful in a range of situations. Is there a smarter way of
> going about things than my workaround?
>
> Thanks for any suggestions!
>
> John

