# [SciPy-dev] Re: [SciPy-user] Low-level code integration discussion at scipy'03?

Wed Sep 3 11:42:28 CDT 2003

```Prabhu Ramachandran wrote:

>  1. I guess that weave code will usually be used to optimize the
>     slowest loops.  Typically eliminating 'for/while' loops.  The
>     trivial (and ancient) Laplacian examples on the SciPy pages
>     demonstrate one such case.  My guess is most other uses would be
>     similar.  In such cases merely having slices working is not good
>     enough.  You might also need to do something non-trivial to the
>     array.  At that point I'm not sure how much the niceties of blitz
>     would be used (efficiently)?  Could you provide a more
>     illustrative example of what blitz does that could not be done by
>     a simple C++ loop or something?

It's not a matter of being impossible in loops, just that the blitz syntax is
incredibly compact.  Consider computing Numeric's innerproduct() operation for
a rank-2 array M and a rank-6 one T, storing the result in U.  In blitz, the

// 6d version
void mat_ten_inner(Array<double,2>& M,
Array<double,6>& T, Array<double,6>& U ) {
firstIndex i1;
secondIndex i2;
thirdIndex i3;
fourthIndex i4;
fifthIndex i5;
sixthIndex i6;
seventhIndex j;

U = sum(M(i1,j)*T(i2,i3,i4,i5,i6,j),j);
}

That's it.  ONE line of actual code!  A version of this done with hand loops
requires obviously 7 nested for loops.  Since I needed similar code for
dimensions 1..6, I wrote preprocessor macros to generate all versions from  a
'template' (meaning, another macro, not a C++ template<>).  This is, of
course, ugly.

Having said that, the above one-liner performs about 3 times slower than the
loop macro version for d=6.  For d=1 they are about even, the blitz one-liner
no response so far.

So, in the end, I'm writing my own loops too :)  This is just to point that
I'm ok with writing loops too, even though I'd love blitz to handle things
like the above better.

What blitz DOES buy you is A(i,j,k,l...) type indexing of high-rank arrays,
which is not trivial with the C Numeric API.  And that is important.  Writing
loops isn't bad, but all the manual pointer arithmetic currently necessary
with the existing APIs is rather annoying.

It's a matter of balancing abstraction with performance.  Blitz provides a lot
of abstraction, but performance isn't very good.  The subset of blitz which
gives just indexing (you do the rest yourself) seems to maintain pretty good
performance in many cases (though I need to do more precise benchmarks against
fortran).

>  3. If plain old inlined C code does indeed perform better than the
>     blitz equivalents, it sure would be useful to have the cleaner API
>     anyway.  Afterall performance is a key issue here.  Having a
>     cleaner API (without fancy expression templates but without having
>     to convert between C++ and Python data) would really be useful in
>     any case.

This, we agree 100% on.  In fact, if the C API offered a way to index high
rank arrays without manually computing stride offsets on every access, that
would probably satisfy a significant fraction of usage cases.  I consider
one-liners like the above icing on the cake, but A(i,j,k) indexing is really a
significant change which makes development far more productive.

> Well, I'm sure I would not mind writing this kind of loop (taken from
> here http://www.scipy.org/site_content/weave/python_performance.html):
>
>         double tmp, err, diff;
>         err = 0.0;
>         for (int i=1; i<nx-1; ++i) {
>             for (int j=1; j<ny-1; ++j) {
>                 tmp = u(i,j);
>                 u(i,j) = ((u(i-1,j) + u(i+1,j))*dy2 +
>                           (u(i,j-1) + u(i,j+1))*dx2)*dnr_inv;
>                 diff = u(i,j) - tmp;
>                 err += diff*diff;
>             }
>         }

I don't either.  As I said, the main consideration for me is clean indexing of
rank2+ arrays.  Everything else I take as extra credit, and I can live
without.  For example, blitz has 'stencils' to do the above, and in 3d cases
they do simplify writing the code a lot.  But I haven't tested their
performance, and after seeing my innerproduct() example above, I'd be at least
a bit skeptical.  So loops are fine with me.

Best,

f.

```