[Numpy-discussion] Re : Alternative to record array

Jean-Baptiste Rudant boogaloojb@yahoo...
Fri Jan 2 12:47:11 CST 2009

Thank you for everything, it works fine and it is very helpful.


Jean-Baptiste Rudant

From: Francesc Alted <faltet@pytables.org>
To: Discussion of Numerical Python <numpy-discussion@scipy.org>
Sent: Tuesday, 30 December 2008, 16:34:27
Subject: Re: [Numpy-discussion] Alternative to record array

On Tuesday 30 December 2008, Francesc Alted wrote:
> On Monday 29 December 2008, Jean-Baptiste Rudant wrote:
> The difference between both approaches is that the row-wise
> arrangement is more efficient when data is iterated by row (record),
> while the column-wise one is more efficient when data is iterated by
> column.  This is why you are seeing the increase of 4x in performance.
> Incidentally, by looking at both data arrangements, I'd expect an
> increase of just 2x (the stride count is 2 in this case), but I
> suspect that there are hidden copies during the increment operation
> for the record array case.

As I was mystified by this difference in speed, I kept investigating, 
and I think I have an explanation for the gap between the expected and 
the observed speed-up of the in-place increment operator on a recarray 
field.  After looking at the numpy code, it turns out that the 
following statement:

data.ages += 1

is more or less equivalent to:

a = data.ages
a[:] = a + 1

i.e. a temporary is created (to hold the result of 'a + 1') and then 
assigned to the 'ages' column.  Since memory copies are the bottleneck 
in this sort of operation, creating the temporary introduces a 
slowdown of 2x (due to the strided column) and the assignment accounts 
for the additional 2x (4x in total).  
However, the following idiom:

a = data.ages
a += 1

effectively removes the need for the temporary copy and is 2x faster 
than the original "data.ages += 1".  This can be seen in the following 
simple benchmark:

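import timeit
import numpy

count = 10**7  # number of records (10 million)
ages = numpy.random.randint(0, 100, count)
weights = numpy.random.randint(1, 200, count)
data = numpy.rec.fromarrays((ages, weights), names='ages,weights')

# v0: in-place increment written directly on the recarray field
timer = timeit.Timer('data.ages += 1', 'from __main__ import data')
print("v0-->", timer.timeit(number=10))
# v1: explicit temporary, then assignment back into the field
timer = timeit.Timer('a=data.ages; a[:] = a + 1', 'from __main__ import data')
print("v1-->", timer.timeit(number=10))
# v2: bind the field view to a name first, then increment in place
timer = timeit.Timer('a=data.ages; a += 1', 'from __main__ import data')
print("v2-->", timer.timeit(number=10))
# v3: in-place increment on the plain contiguous array
timer = timeit.Timer('ages += 1', 'from __main__ import ages')
print("v3-->", timer.timeit(number=10))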

which produces the following output on my laptop:

v0--> 2.98340201378
v1--> 3.22748112679
v2--> 1.5474319458
v3--> 0.809724807739
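
For readers who want to check the "strided column" part of the explanation, here is a minimal sketch on a tiny array. It assumes both fields are 8-byte integers (I set dtype=int64 explicitly, which the benchmark above does not do), and the variable names are only for illustration:

```python
import numpy as np

# Small stand-in for the benchmark data; int64 is set explicitly so the
# byte counts below do not depend on the platform's default integer.
ages = np.arange(5, dtype=np.int64)
weights = np.arange(5, dtype=np.int64) + 100
data = np.rec.fromarrays((ages, weights), names='ages,weights')

col = data.ages                     # a view on the 'ages' field, not a copy
field_stride = col.strides[0]       # bytes between consecutive ages inside the records
plain_stride = ages.strides[0]      # bytes between consecutive ages in the plain array
shares = np.shares_memory(col, data)

col += 1                            # in-place update through the view
updated = data.ages.tolist()        # the record array reflects the change

print(field_stride, plain_stride, shares, updated)
```

With two 8-byte fields per record, the field view has to jump over the neighbouring 'weights' value at every step, which is the stride count of 2 mentioned in the quoted text.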

As a final comment, I suppose that in-place operators (+=, -=...) could 
be optimized for recarray columns in numpy, but I don't think it is 
worth the effort:  when really high performance is needed for operating 
on individual columns, a column-wise arrangement is best.
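
As a rough illustration of what a column-wise arrangement could look like in plain numpy, the sketch below keeps one contiguous array per field in a dict. The 'columns' container is a hypothetical name of mine, not a numpy API, and this is only one of several possible layouts:

```python
import numpy as np

count = 1000
rng = np.random.default_rng(0)

# Hypothetical column-wise container: one contiguous array per field.
columns = {
    'ages': rng.integers(0, 100, count),
    'weights': rng.integers(1, 200, count),
}

before = columns['ages'].copy()
columns['ages'] += 1                # in-place update on a contiguous array

# A record array can still be assembled on demand (this copies the data).
records = np.rec.fromarrays(list(columns.values()), names=','.join(columns))
```

The price of this layout is that row-wise access needs the extra assembly step shown on the last line.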


Francesc Alted