[Numpy-discussion] RE: default axis for numarray
eric jones
eric at enthought.com
Tue Jun 11 15:01:02 CDT 2002
> From: cbarker at localhost.localdomain [mailto:cbarker at localhost.localdomain]
>
> eric jones wrote:
> > The default axis choice influences how people choose to lay out their
> > data in arrays. If the default is to sum down columns, then users lay
> > out their data so that this is the order of computation.
>
> This is absolutely true. I definitely choose my data layout so that the
> various rank-reducing operators do what I want. Another reason to have
> consistency. So I don't really care which way is default, so the default
> might as well be the better-performing option.
>
> Of course, compatibility with previous versions is helpful too... arrrgg!
>
> What kind of a performance difference are we talking here anyway?
Guess I ought to test instead of just saying it is so... I ran the
following test, summing 200 sets of 10000 numbers. I expected a speed-up
of about 2, but I didn't get it -- the two layouts are pretty much the
same speed on my machine (more on that later).
C:\WINDOWS\system32>python
ActivePython 2.2.1 Build 222 (ActiveState Corp.) based on
Python 2.2.1 (#34, Apr 15 2002, 09:51:39) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from Numeric import *
>>> import time
>>> a = ones((10000,200),Float) * arange(10000)[:,NewAxis]
>>> b = ones((200,10000),Float) * arange(10000)[NewAxis,:]
>>> t1 = time.clock();x=sum(a,axis=0);t2 = time.clock();print t2-t1
0.0772411018719
>>> t1 = time.clock();x=sum(b,axis=-1);t2 = time.clock();print t2-t1
0.079615705348
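For comparison, the same benchmark can be sketched in modern NumPy (note: `np.sum`, `np.newaxis`, and `timeit` are today's spellings, not the Numeric API used in the session above):

```python
import timeit
import numpy as np

# Same shapes as in the session: 200 sums of 10000 numbers each,
# laid out once with the long axis first and once with it last.
a = np.ones((10000, 200)) * np.arange(10000)[:, np.newaxis]
b = np.ones((200, 10000)) * np.arange(10000)[np.newaxis, :]

t_cols = timeit.timeit(lambda: a.sum(axis=0), number=20)
t_rows = timeit.timeit(lambda: b.sum(axis=-1), number=20)
print("axis=0  over (10000, 200):", t_cols)
print("axis=-1 over (200, 10000):", t_rows)

# Whatever the timings, both layouts compute the same 200 sums.
assert np.allclose(a.sum(axis=0), b.sum(axis=-1))
```

The absolute numbers will differ from the 2002 timings, but the comparison between the two layouts is the point.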
I also tried FFT, and did see a difference -- a speed-up of 1.5x or more:
>>> q = ones((1024,1024),Float)
>>> t1 = time.clock();x = FFT.fft(q,axis=0);t2 = time.clock();print t2-t1
0.907373143793
>>> t1 = time.clock();x = FFT.fft(q,axis=-1);t2 = time.clock();print t2-t1
0.581641800843
>>> .907/.581
1.5611015490533564
Same in SciPy:
>>> from scipy import *
>>> a = ones((1024,1024),Float)
>>> import time
>>> t1 = time.clock(); q = fft(a,axis=0); t2 = time.clock();print t2-t1
0.870259488287
>>> t1 = time.clock(); q = fft(a,axis=-1); t2 = time.clock();print t2-t1
0.489512214541
>>> t1 = time.clock(); q = fft(a,axis=0); t2 = time.clock();print t2-t1
0.849266317367
>>> .849/.489
1.7361963190184049
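A hypothetical re-check with today's `numpy.fft` (not the Numeric `FFT` module used above, though both take the same `axis` argument): transforming the slow axis gives the same values as transposing, transforming the contiguous last axis, and transposing back -- which is roughly what a library has to do internally, and where the axis-dependent cost comes from.

```python
import numpy as np

q = np.ones((1024, 1024))

# FFT down axis 0 (the strided direction for a C-ordered array)...
f0 = np.fft.fft(q, axis=0)

# ...equals: transpose, FFT along the now-contiguous last axis, transpose back.
f1 = np.fft.fft(q.T, axis=-1).T

assert np.allclose(f0, f1)
```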
So why is sum() the same speed for both cases? I don't know. I wrote
a quick C program that loops much the way Numeric's does, and I saw
about a factor-of-4 improvement by summing rows instead of columns:
C:\home\eric\wrk\axis_speed>gcc -O2 axis.c
C:\home\eric\wrk\axis_speed>a
summing rows (sec): 0.040000
summing columns (sec): 0.160000
pass
These numbers are more like what I expected to see in the Numeric tests,
but they are strange when compared to the Numeric timings -- the row sum
is twice as fast as Numeric while the column sum is twice as slow.
Because all the work is done in C and we're summing reasonably long
arrays, the Numeric and C versions should be roughly the same speed.
I can understand why summing rows is twice as fast in my C routine --
the Numeric loop code is not going to win awards for being optimal.
What I don't understand is why the column summation is twice as slow in
my C code as in Numeric. This should not be. I've posted it below in
case someone can enlighten me.
I think in general you should see a speed-up of 1.5x+ when summing
over the "faster" axis. This holds true for fft in Python and my sum
in C. As to why I don't see it in Numeric's sum(), I'm not sure. It is
certainly true that non-strided (contiguous) access makes the best use
of cache and is *usually* faster.
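The cache argument can be made concrete with array strides (shown here with modern NumPy attributes like `ndarray.strides` and `np.ascontiguousarray`; Numeric used the same row-major layout):

```python
import numpy as np

# C-contiguous (10000, 200) float64 array: each row of 200 elements
# is adjacent in memory, so stepping down a column jumps 200*8 bytes.
a = np.ones((10000, 200))
print(a.strides)   # (1600, 8)

# Summing along axis 0 reads one 8-byte element per 1600-byte jump,
# so most of each cache line fetched is wasted. Re-laying the data so
# the summed axis is last makes the inner loop stream through memory.
b = np.ascontiguousarray(a.T)   # shape (200, 10000), rows contiguous
print(b.strides)   # (80000, 8)

assert np.allclose(a.sum(axis=0), b.sum(axis=-1))
```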
eric
--------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    double *a, *sum1, *sum2;
    int i, j, si, sj, ind, I, J;
    int small = 200, big = 10000;
    clock_t t1, t2;

    /* Layout 1: "small" rows of "big" elements; element (i,j) lives at
       i*si + j*sj, so each row of 10000 doubles is contiguous. */
    I = small;
    J = big;
    si = big;   /* stride between rows */
    sj = 1;     /* stride within a row */
    a = (double*)malloc(I * J * sizeof(double));
    sum1 = (double*)malloc(small * sizeof(double));
    sum2 = (double*)malloc(small * sizeof(double));

    /* set memory */
    for (i = 0; i < I; i++)
    {
        ind = si * i;
        for (j = 0; j < J; j++)
        {
            a[ind] = (double)j;
            ind += sj;
        }
    }

    t1 = clock();
    for (i = 0; i < I; i++)
    {
        sum1[i] = 0;
        ind = si * i;
        for (j = 0; j < J; j++)
        {
            sum1[i] += a[ind];   /* unit stride: streams through memory */
            ind += sj;
        }
    }
    t2 = clock();
    printf("summing rows (sec): %f\n", (t2 - t1) / (float)CLOCKS_PER_SEC);

    /* Layout 2: swap the strides and sum down the columns. Note that
       with si = 1 each column of 10000 doubles is now the contiguous
       direction, so the inner loop below is also unit-stride. */
    I = big;
    J = small;
    sj = big;
    si = 1;

    /* set memory (strided writes) */
    for (i = 0; i < I; i++)
    {
        ind = si * i;
        for (j = 0; j < J; j++)
        {
            a[ind] = (double)i;
            ind += sj;
        }
    }

    /* Note: the version originally posted started the timer *before*
       the strided initialization loop above, which likely inflated the
       column timing reported in this message. */
    t1 = clock();
    for (j = 0; j < J; j++)
    {
        sum2[j] = 0;
        ind = sj * j;
        for (i = 0; i < I; i++)
        {
            sum2[j] += a[ind];
            ind += si;
        }
    }
    t2 = clock();
    printf("summing columns (sec): %f\n",
           (t2 - t1) / (float)CLOCKS_PER_SEC);

    for (i = 0; i < small; i++)
    {
        if (sum1[i] != sum2[i])
            printf("failure %d, %f %f\n", i, sum1[i], sum2[i]);
    }
    printf("pass %f\n", sum1[0]);

    free(a);
    free(sum1);
    free(sum2);
    return 0;
}