[IPython-User] Thread-safety of IPython.kernel.client
Søren Gammelmark
gammelmark@phys.au...
Thu Aug 19 02:50:24 CDT 2010
Hi everyone
First of all, thank you for an extremely useful tool in IPython and
its ability to help with cluster computing!
To what degree is IPython.kernel.client thread-safe (i.e. safe in the
sense of the threading module)? I have a problem when I run several
threads, each of which sends commands to an individual ipengine from
a Queue.Queue. It seems like one of the engines is getting the same
commands twice. From the log I have something like this for ipengine id 2:
2010-08-18 19:24:36+0200 [-] Performing reset on 2
2010-08-18 19:24:36+0200 [-] Performing reset on 2
2010-08-18 19:24:36+0200 [-] Performing push on 2
2010-08-18 19:24:36+0200 [-] Performing push on 2
2010-08-18 19:24:36+0200 [-] Performing execute on 2
2010-08-18 19:24:36+0200 [-] Performing execute on 2
The other engines, by contrast, go through a single reset-push-execute
cycle (which is consistent with my program):
2010-08-18 19:31:19+0200 [-] Performing reset on 3
2010-08-18 19:31:19+0200 [-] Performing push on 3
2010-08-18 19:31:19+0200 [-] Performing execute on 3
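One way to check how often this duplication happens over a long run is to count identical log entries. The following is only a minimal sketch, assuming the log lines have exactly the format shown above; the function name find_duplicate_commands and the sample list are made up for illustration. Because the timestamps have one-second resolution, two legitimate cycles on the same engine within one second would also be flagged.

```python
import re
from collections import Counter

# Matches lines such as:
#   2010-08-18 19:24:36+0200 [-] Performing reset on 2
# (format assumed from the log excerpt in this post)
LOG_LINE = re.compile(
    r"^(?P<stamp>\S+ \S+) \[-\] Performing (?P<op>\w+) on (?P<engine>\d+)$"
)


def find_duplicate_commands(lines):
    """Return {(timestamp, op, engine_id): count} for entries seen more than once."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            key = (match["stamp"], match["op"], int(match["engine"]))
            counts[key] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Illustrative input: the two excerpts quoted in this post.
sample = [
    "2010-08-18 19:24:36+0200 [-] Performing reset on 2",
    "2010-08-18 19:24:36+0200 [-] Performing reset on 2",
    "2010-08-18 19:24:36+0200 [-] Performing push on 2",
    "2010-08-18 19:24:36+0200 [-] Performing push on 2",
    "2010-08-18 19:24:36+0200 [-] Performing execute on 2",
    "2010-08-18 19:24:36+0200 [-] Performing execute on 2",
    "2010-08-18 19:31:19+0200 [-] Performing reset on 3",
    "2010-08-18 19:31:19+0200 [-] Performing push on 3",
    "2010-08-18 19:31:19+0200 [-] Performing execute on 3",
]

duplicates = find_duplicate_commands(sample)
for (stamp, op, engine), n in sorted(duplicates.items()):
    print(f"{stamp}: {op} on engine {engine} logged {n} times")
```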
I suspect that this messes up the pull I have to do later (if I reset
before the pull, I cannot get the data back). Another, possibly
related, issue is a QueueCleared exception. The odd thing in these
cases is that the system reports an exception from a call to e.g. 'push',
but the QueueCleared it names is for the 'pull' (and similarly for pull/execute):
one or more exceptions from call to method: push
[Engine Exception]QueueCleared: 'pull' ('filename',) {}
[Engine Exception]
...
one or more exceptions from call to method: pull
[Engine Exception]QueueCleared: 'execute' ('filename = task.run()',) {}
[Engine Exception]
No traceback available
Does this make any sense or do you need more information? For the
record, the problem only arises when running on multiple nodes. I have
tested the programs on my own machine (with multiple cores and
ipengines), where it seems to work without problems. The problem also
occurred only three times during a 12-hour run (and late in the run at
that), so it is not very systematic. Therefore I have no idea where to
start investigating this.
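For reference, the threading pattern described in this post can be sketched roughly as follows, with one lock guarding every call on the shared client. This is a sketch under stated assumptions, not the actual program: FakeEngineClient is a hypothetical stand-in that only records calls and is not the IPython.kernel.client API, the "result = x * 2" task is invented, and the code is written for Python 3 (where Queue.Queue is queue.Queue). Whether the real client needs such a lock is exactly the open question.

```python
import queue
import threading


class FakeEngineClient:
    """Hypothetical stand-in for a multi-engine client (NOT the IPython API)."""

    def __init__(self):
        self.calls = []        # (operation, engine_id), in arrival order
        self._namespaces = {}  # one private namespace per engine id

    def reset(self, engine_id):
        self.calls.append(("reset", engine_id))
        self._namespaces[engine_id] = {}

    def push(self, engine_id, values):
        self.calls.append(("push", engine_id))
        self._namespaces[engine_id].update(values)

    def execute(self, engine_id, source):
        self.calls.append(("execute", engine_id))
        exec(source, self._namespaces[engine_id])

    def pull(self, engine_id, name):
        self.calls.append(("pull", engine_id))
        return self._namespaces[engine_id][name]


def worker(engine_id, client, client_lock, tasks, results):
    """Drain the task queue, running one full cycle per task on one engine."""
    while True:
        try:
            x = tasks.get_nowait()
        except queue.Empty:
            return
        # Assumption: the client is not thread-safe, so only one thread
        # may talk to it at a time. This serializes all engine traffic.
        with client_lock:
            client.reset(engine_id)
            client.push(engine_id, {"x": x})
            client.execute(engine_id, "result = x * 2")
            value = client.pull(engine_id, "result")
        results.put((x, value))


def run_all(task_values, engine_ids):
    """Start one worker thread per engine and collect {task: result}."""
    client = FakeEngineClient()
    client_lock = threading.Lock()
    tasks, results = queue.Queue(), queue.Queue()
    for value in task_values:
        tasks.put(value)
    threads = [
        threading.Thread(target=worker,
                         args=(eid, client, client_lock, tasks, results))
        for eid in engine_ids
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    collected = {}
    while not results.empty():
        x, value = results.get()
        collected[x] = value
    return collected, client.calls


collected, calls = run_all(range(6), engine_ids=[0, 1, 2])
print(collected)
```

Holding the lock for the whole reset-push-execute-pull cycle removes all parallelism, so this is meant as a diagnostic rather than a fix: if the duplicated commands and QueueCleared errors disappear with the lock in place, concurrent use of the client is a likely cause.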
Hope you can help
Søren Gammelmark
P.S. If you are wondering why I do not use the TaskClient, it is because
I would like to do extra postprocessing on the results after the tasks
are finished, i.e. transferring data between files and networks. If you
know of an obvious way to do this with less chance of error I would be
quite interested.