[Numpy-discussion] Efficient operator overloading

Sturla Molden sturla@molden...
Wed Apr 18 13:14:10 CDT 2007

On 4/18/2007 7:33 AM, Anne Archibald wrote:

 >copying. And the scope of improvement would be very limited; an
expression like A*B+C*D would be much more efficient, probably, if the
whole expression were evaluated at once for each element (due to
memory locality and temporary allocation). But it is impossible for
numpy, sitting inside python as it does, to do that.

Most numerical array/matrix libraries dependent on operator overloading 
generates temporaries. That is why Fortran is usually perceived as 
superior to C++ for scientific programming. The Fortran compiler knows 
about arrays and can avoid allocating three temporary arrays to evaluate 
and expression like

y = a * b + c * d

If this expression is evaluated bya Fortran 90/95 compiler, it will 
automatically generate code like

do i = 1,n
    y(i) = a(i) * b(i) + c(i) * d(i)

On the other hand, conventional use of overloaded operators would result 
in something like this:

do i = 1,n
    tmp1(i) = a(i) * b(i)
do i = 1,n
    tmp2(i) = c(i) * d(i)
do i = 1,n
    tmp3(i) = tmp1(i) + tmp2(i)
do i = 1,n
    y(i) = tmp3(i)

Traversing memory is one of the most expensive thing a CPU can do. This 
approach is therefore extremely inefficient compared with what a Fortran 
compiler can do.

This does not mean that all use of operator overloading is inherently 
bad. Notably, there is a C++ numerical library called Blitz++ which can 
avoid  these tempraries for small fixed-size arrays. As it depends on 
template metaprogramming, the size must be known at compile time. But if 
this is the case, it can be just as efficient as Fortran if the 
optimizer is smart enough to remove the redundant operations. Most 
modern C++ compilers is smart enough to do this. Note that it only works 
for fixed size arrays. Fortran compilers can do this on a more general 
basis. It is therefore advisable to have array syntax built into the 
language syntax it self, as in Fortran 90/95 and Ada 95.

But if we implement the operator overloading a bit more intelligent, it 
should be possible to get rid of most of the temporary arrays. We could 
replace temporary arrays with an "unevaluated expression class" and let 
  the library pretend it is a compiler.

Let us assume again we have an expression like

y = a * b + c * d

where a,b,c and d are all arrays or matrices. In this case, the 
overloaded * and + operators woud not return a temporary array but an 
unevaluated expression of class Expr. Thus we would get

tmp1 = Expr('__mul__',a,b) # symbolic representation of 'a * b'
tmp2 = Expr('__mul__',c,d) # symbolic representation of 'c * d'
tmp3 = Expr('__add__',tmp1,tmp1) # symbolic "a * b + c * d"
del tmp1
del tmp2
y = tmp3 # y becomes a reference to an unevaluated expression

Finally, we need a mechanism to 'flush' the unevaluated expression. 
Python does not allow the assignment operator to be evaluated, so one 
could not depend on that. But one could a 'flush on read/write' 
mechanism, and let an Expr object exist in different finite states (e.g. 
unevaluated and evaluated). If anyone tries to read an element from y or 
change any of the objects it involves, the expression gets evaluated 
without temporaries. Before that, there is no need to evalute the 
expression at all! We can just keep a symbolic representation of it. 
Procrastination is good!

Thus later on...

x = y[i] # The expression 'a * b + c * d' gets evaluated. The
          # object referred to by y now holds an actual array.


a[i] = 2.0 # The expression 'a * b + c * d' gets evaluated. The
            # object referred to by y now holds an actual array.
            # Finally, 2.0 is written to a[i].


y[i] = 2.0 # The expression 'a * b + c * d' gets evaluated. The
            # object referred to by y now holds an actual array.
            # Finally, 2.0 is written to y[i].

When the expression 'a * b + c * d' is finally evaluated, we should 
through symbolic manipulation get something (at least close to) a single 
efficient loop:

do i = 1,n
    y(i) = a(i) * b(i) + c(i) * d(i)

I'd really like to see a Python extension library do this one day. It 
would be very cool and (almost) as efficient as plain Fortran - though 
not quite, we would still get some small temporary objects created. But 
that is a sacrifice I am willing to pay to use Python. We would gain 
some efficacy over Fortran by postponing indefinitely evaluation of 
computations that are not needed, when this is not known at compile 

Any comments?

Sturla Molden

More information about the Numpy-discussion mailing list