[Numpy-discussion] Profiling numpy ? (parts written in C)
Francesc Altet
faltet at carabos.com
Wed Dec 20 06:08:34 CST 2006
On Wednesday 20 December 2006 03:36, David Cournapeau wrote:
> Francesc Altet wrote:
> > On Tuesday 19 December 2006 08:12, David Cournapeau wrote:
> >> Hi,
> >>
> >> Following the discussion on clip and other functions which *may* be
> >> slow in numpy, I would like to know if there is a way to easily profile
> >> numpy, ie functions which are written in C.
> >> For example, I am not sure to understand why a function like take(a,
> >> b) with a a double 256x4 array and b a 8000x256 int array takes almost
> >> 200 ms on a fairly fast CPU; in the source code, I can see that numpy
> >> uses memmove, and I know memmove to be slower than memcpy. Is there an
> >> easy way to check that this is coming from memmove (case in which
> >> nothing much can be done to improve the situation I guess), and not from
> >> something else ?
>
> Concerning the memmove vs memcpy: the problem I was speaking about is in
> another function (numpy take), where the problem is much bigger speed wise.
Oops. Yes, you are right. Out of curiosity, I've looked into this as
well and created a small script to benchmark take(), which is listed
at the end of this message.
I've used a = double 256x4 and b = int 80x256 (my b is quite a bit
smaller than yours, mainly because of the time it takes to generate
with my current approach; by the way, if anybody knows an easy way to
do a cartesian product with numpy, please tell me), but I don't think
this will change the overall pattern of the timings.
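As an aside on the cartesian-product question: one possible numpy-only
sketch (the helper name cartesian2 and the repeat/tile approach here
are my own illustration, not part of the benchmark script below):

```python
import numpy

def cartesian2(x, y):
    # Pair every element of x with every element of y:
    # repeat each x entry len(y) times, and cycle y len(x) times.
    xi = numpy.repeat(x, len(y))
    yi = numpy.tile(y, len(x))
    return numpy.column_stack((xi, yi))

pairs = cartesian2(numpy.array([0, 1, 2]), numpy.array([10, 20]))
# pairs has shape (6, 2): one row per (x, y) combination
```

This avoids the Python-level loops of the cartesian() helper in the
script at the end.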
With it, here is the cProfile for take(a,b) (1000 iterations):
2862 function calls in 6.907 CPU seconds

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1000    6.679    0.007    6.679    0.007  {method 'take' of 'numpy.ndarray' objects}
        1    0.138    0.138    0.138    0.138  {numpy.core.multiarray.array}
        1    0.059    0.059    6.907    6.907  prova.py:30(bench_take)
and here the output from oprofile for the same benchmark:
Profiling through timer interrupt
samples % image name symbol name
1360 83.0281 libc-2.3.6.so memmove
167 10.1954 multiarray.so PyArray_TakeFrom
11 0.6716 multiarray.so .plt
and, if we replace memmove by memcpy:
Profiling through timer interrupt
samples % image name symbol name
1307 82.0980 libc-2.3.6.so memcpy
178 11.1809 multiarray.so PyArray_TakeFrom
13 0.8166 multiarray.so .plt
So, again, it seems we have the same pattern as with the .clip()
method: the bottleneck is the call to memmove (the same goes for
memcpy) for every element of the array a that has to be taken.
A workaround for speeding this up (as suggested by Travis in his
excellent book) is to replace:
c = take(a, b)
by
c = a.flat[b]
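A quick sanity check (mine, not from the original message) that both
spellings index the flattened array identically:

```python
import numpy

a = numpy.random.randn(256, 4)                   # same shape as the benchmark's 'a'
b = numpy.random.randint(0, a.size, (80, 256))   # flat indices into 'a'

c1 = numpy.take(a, b)   # axis=None (the default) indexes the flattened array
c2 = a.flat[b]          # fancy indexing through the flat iterator
assert (c1 == c2).all()
```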
Here are the cProfile results for the version using fancy indexing:
862 function calls in 3.455 CPU seconds
ncalls tottime percall cumtime percall filename:lineno(function)
1 3.300 3.300 3.455 3.455 prova.py:30(bench_take)
1 0.132 0.132 0.132 0.132 {numpy.core.multiarray.array}
257 0.018 0.000 0.018 0.000 {map}
which is about 2x faster than the take() approach.
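To reproduce the comparison on another machine, a timeit-based sketch
(my own; the absolute numbers will of course differ):

```python
import timeit

setup = """
import numpy
a = numpy.random.randn(256, 4)
b = numpy.random.randint(0, a.size, (80, 256))
"""
# Time 1000 iterations of each spelling, as in the benchmark script.
t_take = timeit.timeit('numpy.take(a, b)', setup=setup, number=1000)
t_flat = timeit.timeit('a.flat[b]', setup=setup, number=1000)
print('take: %.3f s, flat: %.3f s' % (t_take, t_flat))
```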
The oprofile output for the fancy indexing approach:
samples % image name symbol name
426 53.2500 multiarray.so iter_subscript
277 34.6250 multiarray.so DOUBLE_copyswap
9 1.1250 python2.5 PyString_FromFormatV
seems to tell us that memmove/memcpy are not called at all, and that
the DOUBLE_copyswap function is used instead. This is only apparent,
though, because if we look at the code of DOUBLE_copyswap (found in
arraytypes.inc.src):
@fname@_copyswap (void *dst, void *src, int swap, void *arr)
{
    if (src != NULL)    /* copy first if needed */
        memcpy(dst, src, sizeof(@type@));

[where the numpy code generator replaces @fname@ with DOUBLE and @type@ with double]
we see that memcpy is called under the hood (I don't know why oprofile
no longer detects this call).
After looking at the function, and remembering what Charles Harris
said in a previous message about the convenience of using a simple
type-specific assignment, I've ended up replacing the memcpy. Here is
the patch:
--- numpy/core/src/arraytypes.inc.src   (revision 3487)
+++ numpy/core/src/arraytypes.inc.src   (working copy)
@@ -997,11 +997,11 @@
 }
 static void
-@fname@_copyswap (void *dst, void *src, int swap, void *arr)
+@fname@_copyswap (@type@ *dst, @type@ *src, int swap, void *arr)
 {
     if (src != NULL)    /* copy first if needed */
-        memcpy(dst, src, sizeof(@type@));
+        *dst = *src;
     if (swap) {
         register char *a, *b, c;
and after this, timings seem to improve a bit. With cProfile:
862 function calls in 3.251 CPU seconds
Ordered by: internal time, call count
ncalls tottime percall cumtime percall filename:lineno(function)
1 3.092 3.092 3.251 3.251 prova.py:31(bench_take)
1 0.135 0.135 0.135 0.135 {numpy.core.multiarray.array}
257 0.018 0.000 0.018 0.000 {map}
which is around 6% faster. With oprofile:
samples % image name symbol name
525 64.7349 multiarray.so iter_subscript
186 22.9346 multiarray.so DOUBLE_copyswap
8 0.9864 python2.5 PyString_FromFormatV
so DOUBLE_copyswap now seems around 50% faster (186 samples vs. 277),
thanks to the type-specific assignment trick.
It seems to me that the above patch is safe and, besides, the complete
numpy test suite passes (in fact, it runs around 6% faster), so
perhaps it would be worth applying. Along the same lines, it would be
good to do an overhaul of the NumPy code to find other places where
this trick can be applied.
There is still a long way to go toward an optimal replacement for the
take() method (can iter_subscript be further optimised?), but this may
help a bit.
Cheers,
--------------------------------------------------------------------------
import numpy

niter = 1000
d0, d1 = (256, 4)
#d0, d1 = (6, 4)

def generate_data_2d(d0, d1):
    return numpy.random.randn(d0, d1)

def cartesian(Lists):
    import operator
    if Lists:
        result = map(lambda I: (I,), Lists[0])
        for lst in Lists[1:]:
            curr = []
            for item in lst:
                new = map(operator.add, result, [(item,)]*len(result))
                curr[len(curr):] = new
            result = curr
    else:
        result = []
    return result

def generate_data_indices(d0, d1, N1, N2):
    iaxis = numpy.random.randint(0, d0, N1)
    jaxis = numpy.random.randint(0, d1, N2)
    return numpy.array(cartesian([iaxis.tolist(), jaxis.tolist()]))

def bench_take():
    a = generate_data_2d(d0, d1)
    b = generate_data_indices(d0, d1, 80, 256)
    #b = generate_data_indices(d0, d1, 8, 2)
    for i in range(niter):
        #c = numpy.take(a, b)
        c = a.flat[b]  # equivalent using fancy indexing

if __name__ == '__main__':
    # test take
    import pstats
    import cProfile as prof
    profile_wanted = False
    if not profile_wanted:
        bench_take()
    else:
        profile_file = 'take.prof'
        prof.run('bench_take()', profile_file)
        stats = pstats.Stats(profile_file)
        stats.strip_dirs()
        stats.sort_stats('time', 'calls')
        stats.print_stats()
--------------------------------------------------------------------------
--
>0,0< Francesc Altet http://www.carabos.com/
V V Cárabos Coop. V. Enjoy Data
"-"