[IPython-dev] Parallel computing segfault behavior

MinRK benjaminrk@gmail....
Mon Feb 3 19:40:43 CST 2014


Supervisord might be the easiest way to relaunch engines, but you can run a
Client and watch for engine unregistration notifications, and when they
come, just start a new engine in its place.
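
A rough, untested sketch of that watcher idea (it just polls Client.ids in a
loop rather than hooking the notification socket directly; the profile name
and the ipengine command line are placeholders for whatever your cluster
actually uses):

import time
from subprocess import Popen
from IPython.parallel import Client

rc = Client(profile="something-parallel-here")
known = set(rc.ids)          # engine ids currently registered

while True:
    time.sleep(5)
    current = set(rc.ids)    # Client.ids reflects (un)registration notifications
    for engine_id in known - current:
        print("engine %s went away, starting a replacement" % engine_id)
        Popen(["ipengine", "--profile=something-parallel-here"])
    known = current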


On Wed, Jan 29, 2014 at 9:48 AM, Drain, Theodore R (392P) <
theodore.r.drain@jpl.nasa.gov> wrote:

>  I'd be interested in an automatic restart capability as well.  We have
> some very long running jobs where a loss of one or more engines might be
> a problem.  Could you outline what you mean by an "extra watcher"?  Is that
> just a Client object that polls the engine id's to see if they change (I
> assume UUID's would be needed, not the simple integer id's)?
>
> Thanks,
> Ted
>
>  ------------------------------
> *From:* ipython-dev-bounces@scipy.org [ipython-dev-bounces@scipy.org] on
> behalf of Min RK [benjaminrk@gmail.com]
> *Sent:* Wednesday, January 29, 2014 9:29 AM
> *To:* IPython developers list
> *Cc:* IPython developers list
> *Subject:* Re: [IPython-dev] Parallel computing segfault behavior
>
>
>
> On Jan 29, 2014, at 5:56, Patrick Fuller <patrickfuller@gmail.com> wrote:
>
>  Thanks for that code! It's good to know that the remaining cores are
> still working and that the results are all recoverable.
>
>  One last question: each segfault offlines an engine, which means that
> the cluster slows down and eventually crashes as the number of segfaults
> approaches the number of ipengines. Should the controller instead start new
> engines to take the place of killed ones?
>
>
>  No, engine management is done by the user at this point; the controller
> never starts an engine. If you want to monitor the cluster and bring up
> replacement engines, this is not hard to do with an extra watcher (or by
> starting engines with supervisord, etc.).
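>
> Something like the following supervisord stanza is what I have in mind; the
> profile name comes from Pat's example and the process count is a
> placeholder, so adjust both for your setup:
>
> [program:ipengine]
> command=ipengine --profile=something-parallel-here
> process_name=%(program_name)s_%(process_num)02d
> numprocs=4
> autorestart=true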
>
>
>  Thanks,
> Pat
>
> On Tuesday, January 28, 2014, MinRK <benjaminrk@gmail.com> wrote:
>
>>
>>
>> On Tue, Jan 28, 2014 at 5:04 PM, Patrick Fuller <patrickfuller@gmail.com>
>> wrote:
>>
>>> ...the difference being that this would require starting a new engine on
>>> each segfault
>>>
>>>
>>> On Tuesday, January 28, 2014, Patrick Fuller <patrickfuller@gmail.com>
>>> wrote:
>>>
>>>> I guess my question is more along the lines of: should the cluster
>>>> continue on to complete the queued jobs (as it would if the segfaults were
>>>> instead python exceptions)?
>>>
>>>
>>  I see what you mean - the generator halts when it sees an exception, so
>> it's inconvenient to get the successes while ignoring the failures. I
>> guess we could add separate methods that iterate through only the
>> successful results.
>>
>>  As far as task submission goes, it does indeed do what you seem to
>> expect, so it's just viewing the results where there is an issue.
>>
>>  Here is an example <http://nbviewer.ipython.org/gist/minrk/8680688> of
>> iterating through only the successful results of a map that segfaults.
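>>
>>  That notebook isn't reproduced here, but the pattern is roughly the
>> following (an untested sketch that leans on AsyncResult.msg_ids and
>> Client.get_result, and reuses Pat's segfaulty_function from further down
>> in this thread):
>>
>> from random import random
>> from IPython.parallel import Client
>>
>> rc = Client(profile="something-parallel-here")
>> view = rc.load_balanced_view()
>> amr = view.map_async(segfaulty_function, [random() for _ in range(100)])
>> amr.wait()                       # block until every task has finished or failed
>>
>> ok, failed = [], []
>> for msg_id in amr.msg_ids:       # one message per submitted chunk
>>     ar = rc.get_result(msg_id)
>>     try:
>>         ok.extend(ar.get())      # a map chunk returns a list of per-item results
>>     except Exception as e:       # typically EngineError if the engine segfaulted
>>         failed.append((msg_id, e))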
>>
>>  -MinRK
>>
>>
>>
>> On Tuesday, January 28, 2014, MinRK <benjaminrk@gmail.com> wrote:
>>
>> I get an EngineError when an engine dies running a task:
>>
>>  http://nbviewer.ipython.org/gist/minrk/8679553
>>
>>  I think this is the desired behavior.
>>
>>
>> On Tue, Jan 28, 2014 at 2:18 PM, Patrick Fuller <patrickfuller@gmail.com> wrote:
>>
>>  Hi,
>>
>> Has there been any discussion around how ipython parallel handles
>> segfaulting?
>>
>> To make this question more specific, the following code will cause some
>> workers to crash. All results will become unreadable (or at least
>> un-iterable), and future runs require a restart of the cluster. Is this
>> behavior intended, or is it just something that hasn’t been discussed?
>>
>> from IPython.parallel import Client
>> from random import random
>>
>> def segfaulty_function(random_number, chance=0.25):
>>     if random_number < chance:
>>         # write past the end of a one-byte ctypes buffer until the engine segfaults
>>         import ctypes
>>         i = ctypes.c_char('a')
>>         j = ctypes.pointer(i)
>>         c = 0
>>         while True:
>>             j[c] = 'a'
>>             c += 1
>>         return j  # never reached; the loop above crashes the process
>>     else:
>>         return random_number
>>
>> view = Client(profile="something-parallel-here").load_balanced_view()
>> results = view.map(segfaulty_function, [random() for _ in range(100)])
>> for i, result in enumerate(results):
>>     print i, result
>>
>> Backstory: Recently I’ve been working with a large Monte Carlo library
>> that segfaults for, like, no reason at all. It’s due to some weird
>> underlying random number issue and happens once every 5-10 thousand runs. I
>> currently have each worker spin out a child process to isolate the
>> occasional segfault, but this seems excessive. (I'm also trying to fix the
>> source of the segfaults, but debugging is a slow process.)
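>>
>> For reference, the isolation is roughly this kind of thing (a simplified
>> sketch with no timeout handling; run_isolated and _call_in_child are just
>> illustrative names):
>>
>> from multiprocessing import Process, Queue
>>
>> def _call_in_child(queue, func, args):
>>     queue.put(func(*args))
>>
>> def run_isolated(func, *args):
>>     # run func in a throwaway process so a segfault only kills the child
>>     queue = Queue()
>>     child = Process(target=_call_in_child, args=(queue, func, args))
>>     child.start()
>>     child.join()
>>     if child.exitcode != 0:   # negative exit code means killed by a signal (SIGSEGV etc.)
>>         raise RuntimeError("child died with exit code %s" % child.exitcode)
>>     return queue.get()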
>>
>> Thanks,
>> Pat
>>
>> _______________________________________________
>> IPython-dev mailing list
>> IPython-dev@scipy.org
>> http://mail.scipy.org/mailman/listinfo/ipython-dev
>>
>>
> _______________________________________________
> IPython-dev mailing list
> IPython-dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/ipython-dev
>
>

