[Numpy-discussion] numpy speed question

Francesc Alted faltet@pytables....
Fri Nov 26 12:03:03 CST 2010


On Thursday 25 November 2010 11:13:49, Jean-Luc Menut wrote:
> Hello all,
> 
> I have a little question about the speed of numpy vs IDL 7.0. I did a
> very simple little check by computing just a cosine in a loop. I was
> quite surprised to see an order of magnitude of difference between
> numpy and IDL; I would have thought that for such a basic function,
> the speed would be approximately the same.
> 
> I suppose that some of the difference may come from the default data
> type of 64 bits in numpy and 32 bits in IDL. Is there a way to change
> the numpy default data type (without recompiling)?
> 
> And I'm not an expert at all, so maybe there is a better explanation,
> like a better use of the several CPU cores by IDL?

As others have already pointed out, you should make sure that you use 
numpy.cos with arrays in order to get good performance.
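
To make the point about arrays concrete, here is a minimal sketch that
compares a Python-level loop over math.cos with a single numpy.cos call
on the whole array. The array size and the helper names (cos_loop,
cos_vectorized) are illustrative assumptions, not taken from the
original test, and the timings will differ from machine to machine.

```python
import math
import timeit

import numpy as np

N = 100_000  # illustrative size, chosen so the slow loop finishes quickly
i = np.arange(N, dtype=np.float64)


def cos_loop(x):
    """Call math.cos once per element from Python (the slow path)."""
    out = np.empty_like(x)
    for k in range(x.size):
        out[k] = math.cos(2 * math.pi * x[k] / 100.0)
    return out


def cos_vectorized(x):
    """Let numpy.cos run over the whole array in compiled code."""
    return np.cos(2 * math.pi * x / 100.0)


loop_result = cos_loop(i)
vec_result = cos_vectorized(i)

# Both approaches should agree to within floating-point rounding.
assert np.allclose(loop_result, vec_result)

# Best of 3 single runs for each variant.
t_loop = min(timeit.repeat(lambda: cos_loop(i), number=1, repeat=3))
t_vec = min(timeit.repeat(lambda: cos_vectorized(i), number=1, repeat=3))
print(f"loop: {t_loop * 1e3:.1f} ms, vectorized: {t_vec * 1e3:.1f} ms")

# A 32-bit array can be requested explicitly; the result stays 32-bit.
i32 = i.astype(np.float32)
assert cos_vectorized(i32).dtype == np.float32
```

The last two lines touch on the data type question: the sketch asks for
float32 explicitly per array rather than relying on any global default.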

I don't know whether IDL uses multiple cores or not, but if you are 
looking for ultimate performance, you can always use Numexpr, which 
does make use of them.  For example, using a machine with 8 cores 
(with hyperthreading), we have:

>>> from math import pi
>>> import numpy as np
>>> import numexpr as ne
>>> i = np.arange(1e6)
>>> %timeit np.cos(2*pi*i/100.)
10 loops, best of 3: 85.2 ms per loop
>>> %timeit ne.evaluate("cos(2*pi*i/100.)")
100 loops, best of 3: 8.28 ms per loop
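
The session above relies on IPython's %timeit. For a plain script, a
rough equivalent can be sketched with the standard timeit module. This
is only an illustration: it assumes Numexpr is installed (and skips the
comparison otherwise), passes the variables through an explicit
local_dict, and the repeat counts are arbitrary choices.

```python
import timeit
from math import pi

import numpy as np

try:
    import numexpr as ne  # optional; the comparison is skipped without it
except ImportError:
    ne = None

i = np.arange(1e6)
expected = np.cos(2 * pi * i / 100.0)

# Best of 3 batches of 5 calls, reported per call.
t_np = min(timeit.repeat(lambda: np.cos(2 * pi * i / 100.0),
                         number=5, repeat=3)) / 5
print(f"numpy:   {t_np * 1e3:.2f} ms per loop")

if ne is not None:
    # Hand the operands to Numexpr explicitly, so the lookup does not
    # depend on which frame evaluate() happens to be called from.
    names = {"i": i, "pi": pi}
    result = ne.evaluate("cos(2*pi*i/100.)", local_dict=names)
    assert np.allclose(result, expected)

    t_ne = min(timeit.repeat(
        lambda: ne.evaluate("cos(2*pi*i/100.)", local_dict=names),
        number=5, repeat=3)) / 5
    print(f"numexpr: {t_ne * 1e3:.2f} ms per loop")
else:
    print("numexpr is not installed; skipping the comparison")
```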

If you don't have a machine with a lot of cores but still want good 
performance, you can link Numexpr against Intel's VML (Vector Math 
Library).  For example, using Numexpr+VML with only one core (on 
another machine):

>>> %timeit np.cos(2*pi*i/100.)
10 loops, best of 3: 66.7 ms per loop
>>> ne.set_vml_num_threads(1)
>>> %timeit ne.evaluate("cos(2*pi*i/100.)")
100 loops, best of 3: 9.1 ms per loop

which also gives a pretty good speedup.  Curiously, Numexpr+VML does 
not make good use of multiple cores in this case:

>>> ne.set_vml_num_threads(2)
>>> %timeit ne.evaluate("cos(2*pi*i/100.)")
10 loops, best of 3: 14.7 ms per loop

I don't really know why Numexpr+VML takes more time with 2 threads than 
with only one, but it is probably because Numexpr needs better 
fine-tuning in combination with VML :-/
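
One way to look into this kind of scaling is to time the same
expression for several thread counts. The sketch below is a hedged
illustration, not a diagnosis of the result above: it varies Numexpr's
own worker threads through set_num_threads (not the VML thread count
used in the sessions above), the helper name time_by_threads is
hypothetical, and it assumes Numexpr is installed.

```python
import timeit
from math import pi

import numpy as np

try:
    import numexpr as ne  # optional; nothing is timed without it
except ImportError:
    ne = None


def time_by_threads(expr, names, thread_counts, number=5, repeat=3):
    """Return {n_threads: best seconds per call} for a Numexpr expression."""
    timings = {}
    for n in thread_counts:
        previous = ne.set_num_threads(n)  # returns the earlier setting
        try:
            runs = timeit.repeat(
                lambda: ne.evaluate(expr, local_dict=names),
                number=number, repeat=repeat)
            timings[n] = min(runs) / number
        finally:
            ne.set_num_threads(previous)  # restore the earlier setting
    return timings


timings = {}
if ne is not None:
    names = {"i": np.arange(1e6), "pi": pi}
    print("linked against VML:", ne.use_vml)
    timings = time_by_threads("cos(2*pi*i/100.)", names, thread_counts=(1, 2))
    for n, t in sorted(timings.items()):
        print(f"{n} thread(s): {t * 1e3:.2f} ms per loop")
else:
    print("numexpr is not installed; nothing to time")
```

If the 2-thread figure comes out worse than the 1-thread one here as
well, the array may simply be too small for the threading overhead to
pay off on that machine; trying a larger array is a cheap next check.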

-- 
Francesc Alted

