[SciPy-dev] cow: 'Connection reset by peer' timeout problem?

Simon Saubern simon.saubern at molsci.csiro.au
Wed Jan 29 19:34:11 CST 2003


I'm re-posting this message here as I didn't get any replies on the 
scipy-users list:

I've been using cow to try out some distributed calculations. 
Everything works fine if I use a subset of my data, but when I use 
the full set I get "error: (10054, 'Connection reset by peer')" 
messages on the master unit (see below for full output).

I can operate on larger and larger subsets, but once the slaves take 
more than about 4 minutes to complete a task, the above error appears 
at the master.

That is, connections are established (confirmed using netstat), 
processing occurs on the slaves and keeps going, but the master times 
out after about 4min.

Is this a 'keep alive' problem? If so, how can I extend the timeout period?
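
One way to test the keep-alive theory is to switch on TCP keep-alive for the socket that carries the master/slave connection. The sketch below only shows the standard-library call on a fresh socket. Whether cow exposes its underlying socket, and whether keep-alive has any effect on the 4-minute cutoff, are assumptions I have not verified; enable_keepalive is a hypothetical helper name.

```python
import socket

def enable_keepalive(sock):
    """Turn on TCP keep-alive probes for an already-created socket."""
    # SO_KEEPALIVE is a standard socket-level option; the probe interval
    # itself is controlled by the operating system, not by this call.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock

# Demonstration on a fresh, unconnected TCP socket (not a cow connection).
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s)
keepalive_on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
s.close()
```

In a real test this would have to be applied to the socket that cow actually uses, on both the master and the slaves.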

The setup:
10 x slave + master, all Win2K SP-3
Python 2.2.2
latest scipy binary for Win

cowname['data'] = data    # a list 35000 long
lendata = range(7000)     # just use a subset
bessy = None
while not bessy:
    bessy = cowname.loop_code(
        'do something;do something;calc=function(data[x])',
        loop_var='x', inputs={'x': lendata}, returns=['calc'])
    # bessy gets processed here

'data' is quite large and takes a while to transfer over the network. 
But by doing it once and looping over the index, I minimize network 
movements. The 'python' process on each slave uses about 85MB.

Increasing 'lendata' eventually causes the 'Connection reset by peer' 
message to appear.
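
A possible workaround, sketched here under assumptions, is to split the index list into smaller chunks so that no single loop_code call keeps the slaves busy for longer than about 4 minutes. The helper names chunk_indices and run_in_chunks are hypothetical, the loop_code arguments are copied from the call above, and I make no assumption about what loop_code returns beyond collecting each result in a list.

```python
def chunk_indices(indices, chunk_size):
    """Split a sequence of indices into consecutive lists of at most chunk_size."""
    indices = list(indices)
    return [indices[i:i + chunk_size]
            for i in range(0, len(indices), chunk_size)]

def run_in_chunks(cluster, indices, chunk_size):
    """Call cluster.loop_code once per chunk and collect the raw results.

    'cluster' is assumed to be a cow cluster object that already holds
    'data' (as in cowname['data'] = data); the code string is a placeholder.
    """
    results = []
    for chunk in chunk_indices(indices, chunk_size):
        out = cluster.loop_code('calc=function(data[x])',
                                loop_var='x',
                                inputs={'x': chunk},
                                returns=['calc'])
        results.append(out)
    return results
```

For example, run_in_chunks(cowname, range(35000), 5000) would issue seven shorter calls instead of one long one. The chunk size would need tuning so that each call stays below the observed time limit.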

Any pointers welcomed.

---------------error output


   File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\cow.py", line 823, in loop_code
     return self.loop_send_recv(package,loop_data,loop_var)
   File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\cow.py", line 847, in loop_send_recv
     results = self._send_recv(package,addendums)
   File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\cow.py", line 345, in _send_recv
     self.last_results = self._recv()
   File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\cow.py", line 303, in _recv
     results.append(worker.recv())
   File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\sync_cluster.py", line 404, in recv
     package = self.channel.read()
   File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\sync_cluster.py", line 164, in read
     x = self.rfile.read()
   File "c:\Program Files\Python22\lib\socket.py", line 228, in read
     new = self._sock.recv(k)
error: (10054, 'Connection reset by peer')
>>>
------------
-- 

Cheers,

Simon


