[Numpy-discussion] Benchmark on record arrays
Francesc Alted
faltet@pytables....
Thu May 28 12:25:42 CDT 2009
On Wednesday 27 May 2009 17:31:20 Nicolas Rougier wrote:
> Hi,
>
> I've written a very simple benchmark on recarrays:
>
> import numpy, time
>
> Z = numpy.zeros((100,100), dtype=numpy.float64)
> Z_fast = numpy.zeros((100,100), dtype=[('x',numpy.float64),
> ('y',numpy.int32)])
> Z_slow = numpy.zeros((100,100), dtype=[('x',numpy.float64),
> ('y',numpy.bool)])
>
> t = time.clock()
> for i in range(10000): Z*Z
> print time.clock()-t
>
> t = time.clock()
> for i in range(10000): Z_fast['x']*Z_fast['x']
> print time.clock()-t
>
> t = time.clock()
> for i in range(10000): Z_slow['x']*Z_slow['x']
> print time.clock()-t
>
>
> And got the following results:
> 0.23
> 0.37
> 3.96
>
> Am I right in thinking that the last case is quite slow because of some
> memory misalignment between float64 and bool or is there some machinery
> behind that makes things slow in this case ? Should this be mentioned
> somewhere in the recarray documentation ?
Yes, I can reproduce your results, and I must admit that a 10x slowdown is a
lot. However, I think that this mostly affects small record arrays (i.e.
those that fit in the CPU cache), and mainly in benchmarks (precisely because
they fit well in cache). You can simulate a more real-life scenario by defining
a large recarray that does not fit in the CPU cache. For example:
In [17]: Z = np.zeros((1000,1000), dtype=np.float64) # 8 MB object
In [18]: Z_fast = np.zeros((1000,1000), dtype=[('x',np.float64),
('y',np.int64)]) # 16 MB object
In [19]: Z_slow = np.zeros((1000,1000), dtype=[('x',np.float64),
('y',np.bool)]) # 9 MB object
In [20]: x_fast = Z_fast['x']
In [21]: timeit x_fast * x_fast
100 loops, best of 3: 5.48 ms per loop
In [22]: x_slow = Z_slow['x']
In [23]: timeit x_slow * x_slow
100 loops, best of 3: 14.4 ms per loop
So, the slowdown is less than 3x, which is a more reasonable figure. If you
need optimal speed when operating on unaligned columns, you can use numexpr.
Here is an example of what you can expect from it:
In [24]: import numexpr as nx
In [25]: timeit nx.evaluate('x_slow * x_slow')
100 loops, best of 3: 11.1 ms per loop
So, the slowdown is just 2x instead of 3x, which is near optimal for the
unaligned case.
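If you want to check whether a given column really is unaligned, NumPy reports it in the array flags. The following is only a minimal sketch, assuming Python 3 and a recent NumPy (hence `np.bool_` rather than the `np.bool` spelling used above). The `align=True` layout and the explicit copy are not part of the timings in this message; they are just two standard ways of getting an aligned 'x' column, at the cost of extra memory.

```python
import numpy as np

# Packed layout (the default): an 8-byte float followed by a 1-byte bool,
# so each record takes 9 bytes and most 'x' values start at odd addresses.
packed = np.dtype([('x', np.float64), ('y', np.bool_)])
Z_slow = np.zeros((100, 100), dtype=packed)

# Padded layout: align=True adds padding the way a C compiler would,
# so every 'x' value starts at a properly aligned address.
padded = np.dtype([('x', np.float64), ('y', np.bool_)], align=True)
Z_padded = np.zeros((100, 100), dtype=padded)

x_slow = Z_slow['x']      # a view into the packed records
x_padded = Z_padded['x']  # a view into the padded records

# Record size, strides of the column view, and the alignment flag
print(packed.itemsize, x_slow.strides, x_slow.flags.aligned)
print(padded.itemsize, x_padded.strides, x_padded.flags.aligned)

# Copying the column out yields a contiguous, aligned float64 array
x_copy = x_slow.copy()
print(x_copy.flags.aligned, x_copy.flags.c_contiguous)
```

Whether padding or copying pays off depends on how often the column is reused: the copy is only worth it if you operate on it several times.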
Numexpr also seems to help for small recarrays that fit in cache (i.e. for
benchmarking purposes ;) :
# Create a 160 KB object
In [26]: Z_fast = np.zeros((100,100), dtype=[('x',np.float64),('y',np.int64)])
# Create a 90 KB object (9 bytes per record)
In [27]: Z_slow = np.zeros((100,100), dtype=[('x',np.float64),('y',np.bool)])
In [28]: x_fast = Z_fast['x']
In [29]: timeit x_fast * x_fast
10000 loops, best of 3: 20.7 µs per loop
In [30]: x_slow = Z_slow['x']
In [31]: timeit x_slow * x_slow
10000 loops, best of 3: 149 µs per loop
In [32]: timeit nx.evaluate('x_slow * x_slow')
10000 loops, best of 3: 45.3 µs per loop
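For anyone who wants to repeat these measurements outside IPython, here is a rough sketch of a standalone script using the standard `timeit` module. It assumes Python 3 and a recent NumPy; the helper name `best_time` and the loop counts are arbitrary choices for illustration, and numexpr is left out so that the script only needs NumPy. The ratios you get will depend on your CPU and NumPy version, so do not expect the exact figures above.

```python
import timeit
import numpy as np

# Same two layouts as in the session above: 16-byte records with an
# aligned 'x' column, and packed 9-byte records with an unaligned one.
aligned_dt = np.dtype([('x', np.float64), ('y', np.int64)])
packed_dt = np.dtype([('x', np.float64), ('y', np.bool_)])

def best_time(shape, dtype, number=100, repeat=3):
    """Return the best per-call time (seconds) of x * x on the 'x' column."""
    x = np.zeros(shape, dtype=dtype)['x']
    timings = timeit.repeat(lambda: x * x, number=number, repeat=repeat)
    return min(timings) / number

# (shape, loops per measurement): fewer loops for the large arrays
cases = [((100, 100), 1000), ((1000, 1000), 10)]

for shape, number in cases:
    t_fast = best_time(shape, aligned_dt, number=number)
    t_slow = best_time(shape, packed_dt, number=number)
    print("%s  aligned: %.3g s  unaligned: %.3g s  ratio: %.2f"
          % (shape, t_fast, t_slow, t_slow / t_fast))
```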
Hope that helps,
--
Francesc Alted