[Numpy-discussion] numpy.concatenate slower than slice copying

Francesc Alted faltet@pytables....
Thu Aug 19 19:35:37 CDT 2010


2010/8/18, Zbyszek Szmek <zbyszek@in.waw.pl>:
> thank you for your detailed answer. It seems that memcpy, which should
> always be faster than memmove, is sometimes slower! What happens is
> that the slice assignment calls memmove(), which calls
> _wordcopy_fwd_aligned() [1],
> which is apparently faster than memcpy() [2].
>
> [1]
> http://www.eglibc.org/cgi-bin/viewcvs.cgi/trunk/libc/string/wordcopy.c?rev=77&view=auto
> [2]
> http://www.eglibc.org/cgi-bin/viewcvs.cgi/trunk/libc/sysdeps/x86_64/memcpy.S?rev=11186&view=markup
>
> I guess that you're not seeing the difference because I'm using an
> amd64-specific memcpy written in assembly, and you're using an i586
> implementation.  I've tried to reproduce the problem in a C program,
> but there memcpy is always much faster than memmove, as it should be.
>
> I've verified that the difference between memcpy and memmove is the
> problem by patching array_concatenate to always use memmove:
> diff --git a/numpy/core/src/multiarray/multiarraymodule.c
> b/numpy/core/src/multiarray/multiarraymodule.c
> index de63f33..e7f8643 100644
> --- a/numpy/core/src/multiarray/multiarraymodule.c
> +++ b/numpy/core/src/multiarray/multiarraymodule.c
> @@ -437,7 +437,7 @@ PyArray_Concatenate(PyObject *op, int axis)
>      data = ret->data;
>      for (i = 0; i < n; i++) {
>          numbytes = PyArray_NBYTES(mps[i]);
> -        memcpy(data, mps[i]->data, numbytes);
> +        memmove(data, mps[i]->data, numbytes);
>          data += numbytes;
>      }
>
> which gives the same speedup as using the slice assignment:
> zbyszek@ameba ~/mdp/tmp % python2.6 del_cum3.py numpy 10000 1000 10 10
> problem size: (10000x1000) x 10 = 10^8
> 0.814s  <----- without the patch
>
> zbyszek@ameba ~/mdp/tmp %
> PYTHONPATH=/var/tmp/install/lib/python2.6/site-packages python2.6
> del_cum3.py numpy 10000 1000 10 10
> problem size: (10000x1000) x 10 = 10^8
> 0.637s  <----- with the stupid patch

Ok.  So it is pretty clear that the culprit is the poor performance of
memcpy on your platform.  If you can confirm this, it would be nice if
you could report it to the memcpy maintainer of the glibc project.
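One way to check it directly from Python is to time the two code paths
side by side.  A minimal sketch (array sizes and repeat counts are just
illustrative and smaller than your 10^8-element run, so the absolute
numbers will differ):

import numpy as np
from timeit import timeit

# ten chunks of 1000x1000 float64
chunks = [np.ones((1000, 1000)) for _ in range(10)]

def with_concatenate():
    # goes through PyArray_Concatenate, i.e. the memcpy path
    return np.concatenate(chunks)

def with_slice_assignment():
    # preallocate with numpy.empty and copy chunk by chunk; per the
    # discussion above, this assignment ends up in memmove
    out = np.empty((1000 * 10, 1000))
    start = 0
    for c in chunks:
        out[start:start + len(c)] = c
        start += len(c)
    return out

print("concatenate:      %.3fs" % timeit(with_concatenate, number=10))
print("slice assignment: %.3fs" % timeit(with_slice_assignment, number=10))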

> Probably the architecture (and thus the glibc implementation) is more
> important than the operating system. But the problem is very much
> dependent on the size of the arrays, so probably on alignment and
> other details.

Yes.  But if memmove is faster than memcpy, then I'd say that
something is wrong with memcpy.  Another possibility is that the
malloc used in `numpy.concatenate` is different from the malloc used
in `numpy.empty`, and that they return memory blocks with different
alignments; that could explain the difference in performance too
(although this possibility seems remote, IMO).
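That second hypothesis is at least easy to check: the raw data pointers
of both result buffers can be inspected from Python and compared against
a given boundary.  A quick sketch (the 16-byte boundary is only an
example; use whatever boundary your memcpy implementation cares about):

import numpy as np

chunks = [np.ones(1000000) for _ in range(10)]

a = np.concatenate(chunks)   # buffer allocated inside PyArray_Concatenate
b = np.empty(a.shape)        # buffer allocated by numpy.empty

for name, arr in (('concatenate', a), ('empty', b)):
    # __array_interface__['data'][0] is the address of the first element
    ptr = arr.__array_interface__['data'][0]
    print("%-12s 0x%x  (offset from 16-byte boundary: %d)"
          % (name, ptr, ptr % 16))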

>> Now the new method (carray) with compression level 1 (note the new
>> parameter at the end of the command line):
>>
>> faltet@ubuntu:~/carray$ PYTHONPATH=. python bench/concat.py carray
>> 1000000 10 3 1
>> problem size: (1000000) x 10 = 10^7
>> time for concat: 0.186s
>> size of the final container: 5.076 MB
>
> This looks very interesting! Do you think it would be possible to
> automatically 'guess' whether such compression makes sense and just
> use it behind the scenes as 'decompress-on-write'? I'll try to do
> some benchmarking tomorrow...

I'd say that, on relatively new processors (i.e. processors with
around 3 MB of cache and a couple of cores or more), carray would in
general be faster than a pure ndarray approach in most cases.  But
indeed, benchmarking is the best way to tell.
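As for automatically guessing whether compression makes sense: one
cheap heuristic would be to compress a small sample of the data first
and only switch to a compressed container when the ratio looks
worthwhile.  A rough sketch, using zlib as a stand-in for the Blosc
compressor that carray actually uses (the 1 MB sample size and the 0.9
threshold are arbitrary choices):

import zlib
import numpy as np

def compression_pays_off(arr, sample_bytes=1 << 20, threshold=0.9):
    # compress a leading sample of the raw bytes at the fastest level
    # and check whether the result is clearly smaller than the original
    raw = arr.ravel().view(np.uint8)[:sample_bytes].tostring()
    return len(zlib.compress(raw, 1)) < threshold * len(raw)

regular = np.arange(1e6)            # integer-valued doubles: compress well
noise = np.random.rand(int(1e6))    # random doubles: compress poorly

print(compression_pays_off(regular))   # most likely True
print(compression_pays_off(noise))     # most likely False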

Cheers,

-- 
Francesc Alted

