# [Numpy-discussion] Vectorizing code, for loops, and all that

Travis Oliphant oliphant at ee.byu.edu
Mon Oct 2 18:03:50 CDT 2006

```Albert Strasheim wrote:

>In [571]: x1 = N.random.randn(2000,39)
>
>In [572]: y1 = N.random.randn(64,39)
>
>In [574]: %timeit z1=x1[...,N.newaxis,...]-y1 10 loops, best of 3: 703 ms
>per loop
>
>In [575]: z1.shape
>Out[575]: (2000, 64, 39)
>
>As far as I can figure, this operation is doing 2000*64*39 subtractions.
>Doing this straight up yields the following:
>
>In [576]: x2 = N.random.randn(2000,64,39)
>
>In [577]: y2 = N.random.randn(2000,64,39)
>
>In [578]: %timeit z2 = x2-y2
>10 loops, best of 3: 108 ms per loop
>
>Does anybody have any ideas on why this is so much faster? Hopefully I
>didn't mess up somewhere...
>
>

I suspect I know why, although the difference seems rather large.  There
is code optimization that is being taken advantage of in the second
case.  If you have contiguous arrays (no broadcasting needed), then 1
C-loop is used for the subtraction (your second case).

In the first case you are using broadcasting to generate the larger
array.  This requires more complicated looping constructs under the
loops of 2000 elements each in the first example with a bit of
calculation over-head to reset the pointers before each loop.

In the ufunc code, compare the ONE_UFUNCLOOP case with the
NOBUFFER_UFUNCLOOP case.  If you want to be sure what is running
un-comment the fprintf statements so you can tell.

I'm surprised the overhead of adjusting pointers is so high, but then
again you are probably getting a lot of cache misses in the first case
so there is more to it than that, the loops may run more slowly too.

-Travis

```