[IPython-User] Advice re parallelizing many quick calculations
Junkshops
junkshops@gmail....
Thu Jun 28 16:59:20 CDT 2012
Hi all,
Total noob to the iPython notebook and parallelization here, but so far
it's incredibly cool. My only minor gripe at this point is that it's
difficult to transfer notebook contents to other media formats, such as
this email (unless I'm missing a feature somewhere or there's some
clever way I haven't discovered). A download as txt or rtf feature would
be handy.
Anyway, I've been reading the docs and playing with parallelization of a
simple routine, but so far I've only managed to make it much slower, so
I'm hoping someone might be able to give me some advice. I'm worried
that perhaps the calculations are so simple, no matter how I parallelize
it the costs of un/packing and transporting the objects might dwarf the
calcs.
I'm running ipcluster and the notebook server on a VM with gobs of
processors and memory and very few users, and connecting FF7 to the nb
server via an ssh tunnel. Startup commands are thus:
-bash-4.1$ ipcluster start -n 4 &
-bash-4.1$ ipython notebook --pylab inline --no-browser &
Here's what I've tried so far:
In [1]: from IPython.parallel import Client
import time
In [2]: c = Client()
lview = c.load_balanced_view()
dview = c[:]
print lview, dview
<LoadBalancedView None> <DirectView [0, 1, 2, 3]>
In [7]: data[-10:]
Out[7]:
[['40', '20', '0.82', '0.99', '0.4', '7', '2569'],
['40', '25', '0.82', '0.99', '0.4', '7', '2569'],
['40', '10', '0.82', '0.99', '0.4', '8', '2569'],
['40', '15', '0.82', '0.99', '0.4', '8', '2569'],
['40', '20', '0.82', '0.99', '0.4', '8', '2569'],
['40', '25', '0.82', '0.99', '0.4', '8', '2569'],
['40', '10', '0.82', '0.99', '0.4', '9', '2569'],
['40', '15', '0.82', '0.99', '0.4', '9', '2569'],
['40', '20', '0.82', '0.99', '0.4', '9', '2569'],
['40', '25', '0.82', '0.99', '0.4', '9', '2569']]
In [8]: len(data)
Out[8]: 1000000
In [9]: testdata = data[:100000]
For the purposes of testing reduce dataset size. Results on the smaller
set more or less scale when the full set is processed.
The following function could be improved for speed but for the point of
this test it doesn't matter.
In [10]:
def scaleArray(expt):
return [int(expt[0]) / 5 - 2,
int(expt[1]) / 5 - 2,
(float(expt[2]) - 0.82) * 50,
(float(expt[3]) - 0.91) * 100,
float(expt[4]) * 10,
int(expt[5]),
int(expt[6])
]
In [11]: t = time.clock()
map(scaleArray, testdata)
time.clock() - t
Out[11]: 1.2399999999999993
In [12]: t = time.clock()
dview.map(scaleArray, testdata)
time.clock() - t
Out[12]: 19.82
In [13]: t = time.clock()
dview.map_sync(scaleArray, testdata)
time.clock() - t
Out[13]: 19.989999999999998
In [14]: t = time.clock()
asr = dview.map_async(scaleArray, testdata)
asr.wait()
time.clock() - t
Out[14]: 21.949999999999996
In [92]: t = time.clock()
lview.map(scaleArray, data)
time.clock() - t
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
*snip*
The above command ran for ~9 hours with no results (vs ~10 min for a
direct view) on the full data set, so not bothering with the 10x smaller
set.
In [15]:
def scaleArray2(data):
scaled = []
for expt in data:
scaled.append([int(expt[0]) / 5 - 2,
int(expt[1]) / 5 - 2,
(float(expt[2]) - 0.82) * 50,
(float(expt[3]) - 0.91) * 100,
float(expt[4]) * 10,
int(expt[5]),
int(expt[6])
])
return scaled
def chunks(l, n):
return [l[i:i+n] for i in range(0, len(l), n)]
In [16]: chsize = len(testdata) / len(c.ids)
In [17]: t = time.clock()
map(scaleArray2, chunks(testdata,chsize))
time.clock() - t
Out[17]: 1.8100000000000023
In [18]: t = time.clock()
dview.map(scaleArray2, chunks(testdata,chsize))
time.clock() - t
Out[18]: 16.810000000000002
In [19]: t = time.clock()
dview.map_sync(scaleArray2, chunks(testdata,chsize))
time.clock() - t
Out[19]: 17.060000000000002
In [20]: t = time.clock()
asr = dview.map_async(scaleArray2, chunks(testdata,chsize))
asr.wait()
time.clock() - t
Out[20]: 19.099999999999994
In [21]: t = time.clock()
dc = chunks(testdata, chsize)
for i in range(0, len(dc)):
print "Submitting job to engine", i, "at", time.clock()
c[i].apply(scaleArray2, dc[i])
time.clock() - t
Submitting job to engine 0 at 123.88
Submitting job to engine 1 at 127.45
Submitting job to engine 2 at 130.98
Submitting job to engine 3 at 134.55
Out[21]: 15.319999999999993
In [25]: t = time.clock()
dc = chunks(testdata, chsize)
asrlist = []
for i in range(0, len(dc)):
print "Submitting job to engine", i, "at", time.clock()
asrlist.append(c[i].apply_async(scaleArray2, dc[i]))
for i in range(0, len(asrlist)):
asrlist[i].wait()
print "Job done on engine", i, "at", time.clock()
time.clock() - t
Submitting job to engine 0 at 222.45
Submitting job to engine 1 at 226.43
Submitting job to engine 2 at 230.41
Submitting job to engine 3 at 234.38
Job done on engine 0 at 249.41
Job done on engine 1 at 249.41
Job done on engine 2 at 249.41
Job done on engine 3 at 249.41
Out[25]: 26.969999999999999
Either I'm doing something very wrong (probably) or this type of problem
just isn't amenable to parallelization (which I don't *think* should be
the case, especially when the data set is just being broken into four
chunks and sent to four engines - but maybe the cost of pickling and
transporting all that data dwarfs the cost of running the calculations).
Anyway. If you haven't deleted this overly long email yet, I'd very much
appreciate any advice you can give me.
Cheers, Gavin
More information about the IPython-User
mailing list