# [Numpy-discussion] Rookie problems - Why is C-code much faster?

Tim Hochberg tim.hochberg at cox.net
Tue Feb 21 11:12:04 CST 2006

```
Mads Ipsen wrote:

>On Tue, 21 Feb 2006, Zachary Pincus wrote:
>
>
>
>>
>>The game with numpy, just as with Matlab or any other interpreted
>>numeric environment, is to try to push as much of the looping down
>>into the C code as you can. This is because, as you now know,
>>compiled C can loop much faster than interpreted Python.
>>
>>A simple example for averaging 1000 (x,y,z) points:
>>
>>print data.shape
>>(1000, 3)
>># bad: explicit for loop in python
>>avg = numpy.zeros(3, numpy.float_)
>>for i in data: avg += i
>>avg /= 1000.0
>>
>># good: implicit for loop in C
>>avg = numpy.add.reduce(data, axis = 0)
>>avg /= 1000.0
>>
>>So why not do the calculations in parallel, operating on entire
>>vectors of points at one time? Then the looping is "pushed down" into
>>compiled C code. Or if you're really lucky, it's pushed all the way
>>down to the vector math units on your cpu if you have a good BLAS or
>>whatever installed.
>>
>>Zach Pincus
>>
>>Program in Biomedical Informatics and Department of Biochemistry
>>Stanford University School of Medicine
>>
>>
>>
>>
>>
>>>2. Here is the main loop for finding all possible pair distances,
>>>   which corresponds to a loop over the upper triangular part of a
>>>   square matrix
>>>
>>>     # Loop over all particles
>>>     for i in range(n-1):
>>>        dx = x[i+1:] - x[i]
>>>        dy = y[i+1:] - y[i]
>>>
>>>        dx -= box*rint(dx/box)
>>>        dy -= box*rint(dy/box)
>>>
>>>        r2 = dx**2 + dy**2      # square of dist. between points
>>>
>>>where x and y contain the positions of the particles. A naive
>>>implementation in C is
>>>
>>>
>>>    // loop over all particles
>>>    for (int i=0; i<n-1; i++) {
>>>        for (int j=i+1; j<n; j++) {
>>>            dx = x[j] - x[i];
>>>            dy = y[j] - y[i];
>>>
>>>            dx -= box*rint(dx/box);
>>>            dy -= box*rint(dy/box);
>>>
>>>            r2 = dx*dx + dy*dy;
>>>        }
>>>    }
>>>
>>>For n = 2500 particles, i.e. 3123750 particle pairs, the C loop is
>>>approximately 10 times faster than the Python/Numeric counterpart.
>>>This is of course not satisfactory.
>>>
>>>Are there any things I am doing completely wrong here, basic
>>>approaches completely misunderstood, misuses etc?
>>>
>>>Any suggestions, guidelines, hints are most welcome.
>>>
>>>Best regards,
>>>
>>>
>>>
>>>+---------------------------------+-------------------------+
>>>| Dept. of Chemistry              | phone:     +45-35320220 |
>>>| H.C.Ørsted Institute            | fax:       +45-35320322 |
>>>| Universitetsparken 5            |                         |
>>>| DK-2100 Copenhagen Ø, Denmark   | mpi at osc.kiku.dk         |
>>>+---------------------------------+-------------------------+
>>>
>>>
>>>
>
>I agree completely with your comments. But, as you can see, the
>innermost part of the loop has already been removed and replaced with
>numpy slices. It's hard for me to see how to compress the outer loop
>as well, since it determines the ranges for the inner loop. Unless
>there is some fancy slice notation that allows you to loop over a
>triangular part of a matrix, i.e.
>
>    x[i] = sum(A[i, i+1:])
>
>meaning x[i] = sum of the elements in the i'th row of A, using only
>elements from position i+1 up to n.
>
>Of course, there is the possibility of hardcoding this in C and then
>making it available as a Python module. But I don't want to do that
>before I am sure there isn't a numpy way out of this.
>
>Let me know, if you have any suggestions.
>

Can you explain a little more about what you are trying to calculate?
The bit about subtracting off box*rint(dx/box) is a little odd. It
almost seems like you should be able to do something with fmod, but I
admit that I'm not sure how.
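
For what it's worth, here is a quick, untested sketch of what that
rint expression does: it wraps each pair separation into the
minimum-image range [-box/2, box/2], the usual periodic-boundary
trick. A mod-based equivalent does exist (the test values below are
made up for illustration):

import numpy as np

box = 10.0
dx = np.array([-14.0, -6.0, -1.0, 1.0, 6.0, 14.0])   # made-up separations

wrapped = dx - box * np.rint(dx / box)        # minimum-image wrapping
alt = np.mod(dx + box / 2, box) - box / 2     # same thing via mod
assert np.allclose(wrapped, alt)              # agree away from +/- box/2

The two forms differ only at separations of exactly half a box length,
where rint rounds half to even, so the rint version is probably the
safer habit.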

If I had to guess at the source of the relative slowness, I'd say it's
because you are creating a lot of temporary matrices. There are ways
to avoid this, but when taken to the extreme they make your code look
ugly. You might try the following, untested code, or some variation,
and see if it speeds things up. It makes extensive use of the
little-known optional destination argument to ufuncs. I only do this
sort of thing where it's very critical since, as you can see, it makes
things quite ugly.

from numpy import subtract, divide, rint

# preallocate work arrays once; slices of these are reused each pass
dx_space = x.copy()
dy_space = y.copy()
scratch_space = x.copy()
for i in range(n-1):
    dx = dx_space[i+1:]
    dy = dy_space[i+1:]
    scratch = scratch_space[i+1:]
    subtract(x[i+1:], x[i], dx)    # dx = x[i+1:] - x[i], no temporary
    subtract(y[i+1:], y[i], dy)
    # dx -= box*rint(dx/box)
    divide(dx, box, scratch)
    rint(scratch, scratch)
    scratch *= box
    dx -= scratch
    # dy -= box*rint(dy/box)
    divide(dy, box, scratch)
    rint(scratch, scratch)
    scratch *= box
    dy -= scratch
    r2 = dx**2 + dy**2      # square of dist. between points
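
If memory isn't a concern, a newer numpy can also get rid of the outer
Python loop entirely by broadcasting the pairwise differences into
n x n matrices. Treat this as an untested sketch (triu_indices is a
later addition to numpy, and the data here is made up):

import numpy as np

n = 5
box = 10.0
np.random.seed(0)                     # made-up test data
x = np.random.uniform(0, box, n)
y = np.random.uniform(0, box, n)

dx = x[:, None] - x[None, :]          # all pairwise differences at once
dy = y[:, None] - y[None, :]
dx -= box * np.rint(dx / box)         # minimum-image convention
dy -= box * np.rint(dy / box)
r2 = dx**2 + dy**2                    # n x n squared distances

iu = np.triu_indices(n, k=1)          # upper triangle: each pair once
r2_pairs = r2[iu]                     # shape (n*(n-1)//2,)

The price is O(n^2) memory for the intermediates, so for n = 2500 each
temporary matrix is about 50 MB of doubles.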

Hope that helps,

-tim

```