# [Numpy-discussion] On loop and broadcasting (again)

David Cournapeau david at ar.media.kyoto-u.ac.jp
Thu Oct 5 08:37:33 CDT 2006

Hi,

The email from Albert made me look again at some surprising results I
got a few months ago when starting my first "serious" numpy project. I
noticed that when computing multivariate Gaussian densities, centering
the data was more expensive than everything else, including
exponentiation. Now that I have some experience with numpy, and
following the previous discussion, I tried the following script:

import numpy as N
from numpy.random import randn

def center_broadcast(X, mu):
    return X - mu

def center_broadcast2(X, mu):
    return N.transpose(X.T - mu.T)

def center_repmat(X, mu):
    n   = X.shape[0]
    return X - N.repmat(mu, n, 1)

def center_outer(X, mu):
    n   = X.shape[0]
    return X - N.outer(N.ones(n), mu)

def center_dot(X, mu):
    n   = X.shape[0]
    return X - N.dot(N.ones((n, 1)), mu[N.newaxis, :])

def center_manual(X, mu):
    return X - mu

def bench():
    n       = int(1e5)
    d       = 30
    niter   = 10

    X   = randn(n, d)
    mu  = randn(d)
    va  = randn(d) ** 2

    mur = N.repmat(mu, n, 1)

    for i in range(niter):
        y0  = center_broadcast(X, mu)
        y1  = center_broadcast2(X, mu)
        y2  = center_repmat(X, mu)
        y3  = center_dot(X, mu)
        y4  = center_outer(X, mu)
        y5  = center_manual(X, mur)

if __name__ == '__main__':
    import hotshot, hotshot.stats
    profile_file    = 'storage.prof'
    prof    = hotshot.Profile(profile_file, lineevents=1)
    prof.runcall(bench)
    prof.close()
    stats   = hotshot.stats.load(profile_file)
    stats.sort_stats('cumulative').print_stats(20)

If X is an array containing n samples of dimension d, and mu an array
with d elements, I want to compute x - mu for each row x of X. The idea
is to compare the timing of "manual broadcasting" using repmat vs.
automatic broadcasting:
- center_broadcast expects mu to be a d-element array, and uses
automatic broadcasting
- center_broadcast2 expects the same args, but transposes to force a
C-like storage
- center_repmat expects mu to be a d-element array, and uses repmat to
build a (n, d) array with n times the same row
- center_outer and center_dot implement the repmat manually using a
matrix product (under matlab, this was much faster on large matrices)
- center_manual expects mu to be already in (n, d) form with n times
the same row: this is to see the cost of the subtraction itself.
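As a quick sanity check (not part of the original timing script), the variants above should all produce the same centered array; the sketch below assumes a current numpy, where the old numpy.repmat was renamed numpy.tile:

```python
# Sanity check: all centering strategies agree numerically.
# Assumes modern numpy; np.tile replaces the old numpy.repmat.
import numpy as np

n, d = 1000, 30
X = np.random.randn(n, d)
mu = np.random.randn(d)

y_broadcast = X - mu                                    # automatic broadcasting
y_tile      = X - np.tile(mu, (n, 1))                   # explicit "repmat"
y_outer     = X - np.outer(np.ones(n), mu)              # outer-product expansion
y_dot       = X - np.dot(np.ones((n, 1)), mu[np.newaxis, :])  # matrix product

assert np.allclose(y_broadcast, y_tile)
assert np.allclose(y_broadcast, y_outer)
assert np.allclose(y_broadcast, y_dot)
print("all variants agree")
```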

The results are the following:

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
10    4.204    0.420    4.204    0.420
10    0.475    0.048    3.449    0.345  storage_email.py:18(center_outer)
10    2.964    0.296    2.964    0.296  /usr/lib/python2.4/site-packages/numpy/core/numeric.py:227(outer)
10    2.767    0.277    2.768    0.277
10    0.485    0.049    1.288    0.129  storage_email.py:14(center_repmat)
10    1.217    0.122    1.227    0.123  storage_email.py:22(center_dot)
11    0.883    0.080    0.883    0.080  /usr/lib/python2.4/site-packages/numpy/lib/shape_base.py:522(repmat)
10    0.457    0.046    0.457    0.046  storage_email.py:26(center_manual)
 5    0.137    0.027    0.137    0.027  /usr/lib/python2.4/site-packages/numpy/core/fromnumeric.py:375(sum)
20    0.020    0.001    0.020    0.001  /usr/lib/python2.4/site-packages/numpy/core/numeric.py:527(ones)
31    0.000    0.000    0.000    0.000  /usr/lib/python2.4/site-packages/numpy/core/numeric.py:125(asarray)
10    0.000    0.000    0.000    0.000  /usr/lib/python2.4/site-packages/numpy/core/fromnumeric.py:116(transpose)

As you can see, there is quite a difference between the implementations:
using repmat is the fastest, with the repmat call itself taking about
twice as much time as the subtraction. Replacing repmat with dot does
not give any advantage (I guess the underlying implementation uses the
same core C functions?). Is it expected that automatic broadcasting is
so much more expensive? And why does using transpose give a 40% speed
improvement? What makes automatic broadcasting so expensive compared to
repmat?
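One rough way to reproduce this comparison on a current numpy (where repmat has become tile) is to time the two strategies directly; absolute numbers will of course differ from the 2006 figures above:

```python
# Timing sketch: automatic broadcasting vs. subtracting a pre-tiled array.
# Assumes modern numpy (np.tile instead of the old numpy.repmat).
import timeit
import numpy as np

n, d = 100000, 30
X = np.random.randn(n, d)
mu = np.random.randn(d)
mur = np.tile(mu, (n, 1))   # (n, d) array with n copies of mu

# Best of 3 runs of 10 subtractions each.
t_bcast = min(timeit.repeat(lambda: X - mu, number=10, repeat=3))
t_tiled = min(timeit.repeat(lambda: X - mur, number=10, repeat=3))
print("broadcast: %.4fs, pre-tiled: %.4fs" % (t_bcast, t_tiled))
```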

For what it's worth, the subtraction itself is as fast as in matlab,
whereas the repmat is twice as slow (this quick script is not meant as
a comparison against matlab, of course, but I just wanted to check how
matlab behaves in those situations),

David