[SciPy-dev] Re: [SciPy-user] Low-level code integration discussion at scipy'03?

Fernando Perez scipy-user@scipy.net
Wed, 03 Sep 2003 10:42:28 -0600

Prabhu Ramachandran wrote:

>  1. I guess that weave code will usually be used to optimize the
>     slowest loops.  Typically eliminating 'for/while' loops.  The
>     trivial (and ancient) Laplacian examples on the SciPy pages
>     demonstrate one such case.  My guess is most other uses would be
>     similar.  In such cases merely having slices working is not good
>     enough.  You might also need to do something non-trivial to the
>     array.  At that point I'm not sure how much the niceties of blitz
>     would be used (efficiently)?  Could you provide a more
>     illustrative example of what blitz does that could not be done by
>     a simple C++ loop or something?

It's not a matter of being impossible in loops, just that the blitz syntax is 
incredibly compact.  Consider computing Numeric's innerproduct() operation for 
a rank-2 array M and a rank-6 one T, storing the result in U.  In blitz, the 
code reads:

// 6d version
void mat_ten_inner(Array<double,2>& M,
		Array<double,6>& T, Array<double,6>& U ) {
	firstIndex i1;
	secondIndex i2;
	thirdIndex i3;
	fourthIndex i4;
	fifthIndex i5;
	sixthIndex i6;
	seventhIndex j;

	U = sum(M(i1,j)*T(i2,i3,i4,i5,i6,j),j);

That's it.  ONE line of actual code!  A version of this done with hand loops 
requires obviously 7 nested for loops.  Since I needed similar code for 
dimensions 1..6, I wrote preprocessor macros to generate all versions from  a 
'template' (meaning, another macro, not a C++ template<>).  This is, of 
course, ugly.

Having said that, the above one-liner performs about 3 times slower than the 
loop macro version for d=6.  For d=1 they are about even, the blitz one-liner 
gets worse as d increases.  I posted to the blitz list asking about this, but 
no response so far.

So, in the end, I'm writing my own loops too :)  This is just to point that 
I'm ok with writing loops too, even though I'd love blitz to handle things 
like the above better.

What blitz DOES buy you is A(i,j,k,l...) type indexing of high-rank arrays, 
which is not trivial with the C Numeric API.  And that is important.  Writing 
loops isn't bad, but all the manual pointer arithmetic currently necessary 
with the existing APIs is rather annoying.

It's a matter of balancing abstraction with performance.  Blitz provides a lot 
of abstraction, but performance isn't very good.  The subset of blitz which 
gives just indexing (you do the rest yourself) seems to maintain pretty good 
performance in many cases (though I need to do more precise benchmarks against 

>  3. If plain old inlined C code does indeed perform better than the
>     blitz equivalents, it sure would be useful to have the cleaner API
>     anyway.  Afterall performance is a key issue here.  Having a
>     cleaner API (without fancy expression templates but without having
>     to convert between C++ and Python data) would really be useful in
>     any case.

This, we agree 100% on.  In fact, if the C API offered a way to index high 
rank arrays without manually computing stride offsets on every access, that 
would probably satisfy a significant fraction of usage cases.  I consider 
one-liners like the above icing on the cake, but A(i,j,k) indexing is really a 
significant change which makes development far more productive.

> Well, I'm sure I would not mind writing this kind of loop (taken from
> here http://www.scipy.org/site_content/weave/python_performance.html):
>         double tmp, err, diff;
>         err = 0.0;
>         for (int i=1; i<nx-1; ++i) {
>             for (int j=1; j<ny-1; ++j) {
>                 tmp = u(i,j);
>                 u(i,j) = ((u(i-1,j) + u(i+1,j))*dy2 +
>                           (u(i,j-1) + u(i,j+1))*dx2)*dnr_inv;
>                 diff = u(i,j) - tmp;
>                 err += diff*diff;
>             }
>         }

I don't either.  As I said, the main consideration for me is clean indexing of 
rank2+ arrays.  Everything else I take as extra credit, and I can live 
without.  For example, blitz has 'stencils' to do the above, and in 3d cases 
they do simplify writing the code a lot.  But I haven't tested their 
performance, and after seeing my innerproduct() example above, I'd be at least 
a bit skeptical.  So loops are fine with me.