[IPython-user] usage of Task's depend
Brian Granger
ellisonbg.net@gmail....
Wed Apr 8 11:10:04 CDT 2009
John,
The current design of the task dependency system and properties has a
number of flaws and, unfortunately, you are running into all of them.
But, I think I have good news. I think I have figured out a way of
handling this that is much more robust and doesn't have the
limitations the current approach does.
I will try to make some changes to IPython itself, but for now I think
you can try the approach yourself. Here is the idea...
* Completely forget about the properties and depends stuff. Forget it exists.
* Test for a dependency in your string task itself. There, you have
full access to your engine's namespace.
* If a task doesn't have the right dependencies, just raise an
exception. Something like this:
t = """
if name not in files:
    raise Exception('dependency not met')
else:
    pass  # do the real work
"""
* The task will then fail if the dependencies are not met. To get the
task to reschedule itself, just specify a number of task retries. You
will just need to give sufficient retries that the task will
eventually be run on an engine that has the dependency.
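
To make the retry idea concrete, here is a small self-contained sketch. It does not use IPython at all: run_with_retries is a hypothetical stand-in for the task scheduler, the dictionaries in engines stand in for each engine's namespace, and files and process_file are assumed names that would already have to exist on the real engines. It only illustrates why a task that raises on a missing dependency eventually succeeds when given enough retries.

```python
import random

# Task source as it might be passed to a StringTask. `name`, `files` and
# `process_file` are assumed to already exist in the engine's namespace.
TASK_SOURCE = """
if name not in files:
    raise Exception('dependency not met')
else:
    result = process_file(name)
"""


def run_on_engine(source, engine_namespace, name):
    """Execute the task source in a copy of one (simulated) engine namespace."""
    ns = dict(engine_namespace)
    ns["name"] = name
    exec(source, ns)
    return ns["result"]


def run_with_retries(source, engines, name, retries, rng):
    """Hypothetical scheduler: pick a random engine, retry if the task raises."""
    last_error = None
    for _ in range(retries + 1):
        engine = rng.choice(engines)
        try:
            return run_on_engine(source, engine, name)
        except Exception as exc:
            last_error = exc
    raise last_error


def process_file(name):
    """Placeholder for the real per-file work."""
    return "processed " + name


# Two simulated engines, each seeing a different set of files.
engines = [
    {"files": ["a.dat"], "process_file": process_file},
    {"files": ["b.dat"], "process_file": process_file},
]

if __name__ == "__main__":
    rng = random.Random(0)
    print(run_with_retries(TASK_SOURCE, engines, "b.dat", retries=50, rng=rng))
```

With only one engine in ten holding a given file, a small retry count would often be exhausted before the task reaches the right engine, so the retry count should scale with the number of engines.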
Eventually I will make a custom exception type that you can raise in
this situation, like "TaskRejectError." I will also eventually remove
the dependency/properties stuff. But for now, this should get you
going. Let me know if you have other questions.
Cheers,
Brian
On Wed, Apr 8, 2009 at 5:57 AM, John Swinbank <swinbank@transientskp.org> wrote:
> Hello,
>
> I've been enjoying working on some parallel data processing using
> StringTasks. However, I've come upon a question; any words of wisdom
> would be gratefully received!
>
> I want to process some data files using a series of ipengines spread
> across a cluster. The catch is that not all of the data will exist on
> each node, and, indeed, the disposition isn't well known in advance.
> Obviously, I'd like to process each file only on a node on which it
> exists.
>
> My hope was to achieve this using the "depend" argument to StringTask.
> I've populated the engine properties with a list of the files available
> to that engine, and tried something like the following:
>
>     def check_for_file(name):
>         return lambda props: name in props['filenames']
>
>     for filename in filename_list:
>         task = client.StringTask(
>             "result = process_file(filename)",
>             push=dict(filename=filename),
>             pull="result",
>             depend=check_for_file(filename)
>         )
>         tc.run(task)
>
> Of course, this fails:
>
> ValueError: Sorry, cannot pickle code objects with closures
>
> (Somewhat optimistically, I also tried something very similar using
> functools.partial, but partial objects also aren't pickleable!)
>
> I've worked around this for now by redefining StringTask to accept
> another argument ("depend_args"), and changed its check_depend method to
> run:
>
>     if self.depend is not None:
>         return self.depend(properties, self.depend_args)
>
> I can then pass in the filename when setting up the task and examine it
> when checking dependencies.
>
> Naively, it seems like allowing more arguments to the depend function
> could be useful in a range of situations. Is there a smarter way of
> going about things than my workaround?
>
> Thanks for any suggestions!
>
> John
> _______________________________________________
> IPython-user mailing list
> IPython-user@scipy.org
> http://mail.scipy.org/mailman/listinfo/ipython-user
>