[Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
Francesc Alted
faltet@pytables....
Wed Jan 21 05:41:34 CST 2009
On Tuesday 20 January 2009, Andrew Collette wrote:
> Works much, much better with the current svn version. :) Numexpr now
> outperforms everything except the "simple" technique, and then only
> for small data sets.
Correct. This is because of the cost of parsing the expression and
initializing the virtual machine. However, as soon as the size of the
operands exceeds the cache of your processor, you start to see
the improvement in performance.
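
To illustrate why operand size matters: plain NumPy materializes a full-size temporary for every sub-expression, while a blocked evaluation keeps its temporaries small enough to stay in cache. The following is a minimal NumPy-only sketch of that idea, not numexpr's actual implementation; the helper name `chunked_eval` and the block size of 4096 elements are assumptions made for illustration.

```python
import numpy as np

def chunked_eval(a, b, c, block=4096):
    """Evaluate 63 + (a*b) + (c**2) + b block by block (hypothetical helper)."""
    out = np.empty(a.shape, dtype=np.result_type(a, b, c))
    # `out` is freshly allocated and contiguous, so reshape(-1) is a view
    # and writes through `fo` land in `out`.
    fa, fb, fc, fo = a.reshape(-1), b.reshape(-1), c.reshape(-1), out.reshape(-1)
    for start in range(0, fo.size, block):
        s = slice(start, start + block)
        # Every temporary created here holds at most `block` elements.
        fo[s] = 63 + (fa[s] * fb[s]) + (fc[s] ** 2) + fb[s]
    return out

a = np.random.rand(100, 100, 100)
b = np.random.rand(100, 100, 100)
c = np.random.rand(100, 100, 100)

result = chunked_eval(a, b, c)
simple = 63 + (a * b) + (c ** 2) + b   # full-size temporaries
```

Both paths compute the same values; only the size of the intermediate arrays differs. The per-block Python loop has its own overhead, so the best block size depends on the machine.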
> Along the lines you mentioned I noticed that simply changing from a
> shape of (100*100*100,) to (100, 100, 100) results in nearly a factor
> of 2 worse performance, a factor which seems constant when changing
> the size of the data set.
Sorry, but I cannot reproduce this. When using the expression:
"63 + (a*b) + (c**2) + b"
I get on my machine (Core2@3 GHz, running openSUSE Linux 11.1):
1000000 f8 (average of 10 runs)
Simple: 0.0278068065643
Numexpr: 0.00839750766754
Chunked: 0.0266514062881
(100, 100, 100) f8 (average of 10 runs)
Simple: 0.0277318000793
Numexpr: 0.00851640701294
Chunked: 0.0346593856812
and these are the expected results (i.e. no change in performance due to
multidimensional arrays). Even for larger arrays, I don't see anything
unexpected:
10000000 f8 (average of 10 runs)
Simple: 0.334054994583
Numexpr: 0.110022115707
Chunked: 0.29678030014
(100, 100, 100, 10) f8 (average of 10 runs)
Simple: 0.339299607277
Numexpr: 0.111632704735
Chunked: 0.375299096107
Can you tell us which platform you are using?
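
For reference, one way to check that a flat array of 100*100*100 elements and its (100, 100, 100) reshape really share the same memory is sketched below. It uses only NumPy, and the variable names are illustrative, not taken from the benchmark script.

```python
import numpy as np

flat = np.arange(100 * 100 * 100, dtype="f8")
cube = flat.reshape(100, 100, 100)

# Reshaping a contiguous array returns a view on the same buffer.
same_buffer = np.shares_memory(flat, cube)
is_contiguous = cube.flags["C_CONTIGUOUS"]
same_bytes = cube.tobytes() == flat.tobytes()

print(same_buffer, is_contiguous, same_bytes)
```

If all three checks hold, any timing difference between the two shapes comes from how the code iterates over (or copies) the array, not from the data layout.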
> Is this related to the way numexpr handles
> broadcasting rules? It would seem the memory contents should be
> identical for these two cases.
>
> Andrew
>
> On Tue, Jan 20, 2009 at 6:13 AM, Francesc Alted <faltet@pytables.org> wrote:
> > On Tuesday 20 January 2009, Andrew Collette wrote:
> >> Hi Francesc,
> >>
> >> Looks like a cool project! However, I'm not able to achieve the
> >> advertised speed-ups. I wrote a simple script to try three
> >> approaches to this kind of problem:
> >>
> >> 1) Native Python code (i.e. will try to do everything at once
> >> using temp arrays)
> >> 2) Straightforward numexpr evaluation
> >> 3) Simple "chunked" evaluation using array.flat views. (This
> >> solves the memory problem and allows the use of arbitrary Python
> >> expressions).
> >>
> >> I've attached the script; here's the output for the expression
> >> "63 + (a*b) + (c**2) + sin(b)"
> >> along with a few combinations of shapes/dtypes. As expected,
> >> using anything other than "f8" (double) results in a performance
> >> penalty. Surprisingly, it seems that using chunks via array.flat
> >> results in similar performance for f8, and even better performance
> >> for other dtypes.
> >
> > [clip]
> >
> > Well, there were two issues there. The first one is that when
> > transcendental functions are used (like sin() above), the
> > bottleneck is on the CPU instead of memory bandwidth, so numexpr
> > speedups are not as high as usual. The other issue was an actual
> > bug in the numexpr code that forced a copy of all multidimensional
> > arrays (I normally only use unidimensional arrays for doing
> > benchmarks). This has been fixed in trunk (r39).
> >
> > So, with the fix on, the timings are:
> >
> > (100, 100, 100) f4 (average of 10 runs)
> > Simple: 0.0426136016846
> > Numexpr: 0.11350851059
> > Chunked: 0.0635252952576
> > (100, 100, 100) f8 (average of 10 runs)
> > Simple: 0.119254398346
> > Numexpr: 0.10092959404
> > Chunked: 0.128384995461
> >
> > The speed-up is now a mere 20% (for f8), but at least it is not
> > slower. With the patches that Georg recently contributed for using
> > Intel's VML, the acceleration is a bit better:
> >
> > (100, 100, 100) f4 (average of 10 runs)
> > Simple: 0.0417867898941
> > Numexpr: 0.0944641113281
> > Chunked: 0.0636183023453
> > (100, 100, 100) f8 (average of 10 runs)
> > Simple: 0.120059680939
> > Numexpr: 0.0832288980484
> > Chunked: 0.128114104271
> >
> > i.e. the speed-up is around 45% (for f8).
> >
> > Moreover, if I get rid of the sin() function and use the expression:
> >
> > "63 + (a*b) + (c**2) + b"
> >
> > I get:
> >
> > (100, 100, 100) f4 (average of 10 runs)
> > Simple: 0.0119329929352
> > Numexpr: 0.0198570966721
> > Chunked: 0.0338240146637
> > (100, 100, 100) f8 (average of 10 runs)
> > Simple: 0.0255623102188
> > Numexpr: 0.00832500457764
> > Chunked: 0.0340095996857
> >
> > which has a 3.1x speedup (for f8).
> >
> >> FYI, the current tar file (1.1-1) has a glitch related to the
> >> VERSION file; I added to the bug report at google code.
> >
> > Thanks. Will focus on that asap. Mmm, seems like there is enough
> > material for another release of numexpr. I'll try to do it soon.
> >
> > Cheers,
> >
> > --
> > Francesc Alted
> > _______________________________________________
> > Numpy-discussion mailing list
> > Numpy-discussion@scipy.org
> > http://projects.scipy.org/mailman/listinfo/numpy-discussion
--
Francesc Alted