[Numpy-discussion] optimizing power() for complex and real cases (was: indexing problem)

David M. Cooke cookedm at physics.mcmaster.ca
Tue Feb 14 15:13:04 CST 2006

[changed subject to reflect this better]

Tim Hochberg <tim.hochberg at cox.net> writes:
> David M. Cooke wrote:
>>Tim Hochberg <tim.hochberg at cox.net> writes:
>>>David M. Cooke wrote:
>>>  2, Don't distinguish between scalars and arrays -- that just makes
>>>things harder to explain.
>>Makes the optimizations better, though.
> Ah, Because you can hoist all the checks for what type of optimization
> to do, if any, out of the core loop, right? That's a good point. Still
> I'm not keen on a**b having different performance *and* different
> results depending on whether b is a scalar or matrix. The first thing
> to do is to measure how much overhead doing the optimization element
> by element is going to add. Assuming that it's signifigant that leaves
> us with the familiar dilema: fast, simple or general purpose; pick any
> two.
> 1. Do what I've proposed: optimize things at the c_pow level. This is
> general purpose and relatively simple to implement (since we can steal
> most of the code from complexobject.c). It may have a signifigant
> speed penalty versus 2 though:
> 2. Do what you've proposed: optimize things at the ufunc level. This
> fast and relatively simple to implement. It's more limited in scope
> and a bit harder to explain than 2.
> 3. Do both. This is straightforward, but adds a bunch of extra code
> paths with all the attendant required testing and possibility for
> bugs. So, fast, general purpose, but not simple.

Start with #1, then try #2. The problem with #2 is that you still have
to include #1: if you're doing x**y when y is an array, then you have
do if (y==2) etc. checks in your inner loop anyways. In that case, you
might as well do it in nc_pow. At that point, it may be better to move
the #1 optimization to the level of x.__pow__ (see below).

>>The speed comes from the inlining of how to do the power _outside_ the
>>inner loop. The reason x**2, etc. are slower currently is there is a
>>function call in the inner loop. Your's and mine C library's pow()
>>function mostly likely does something like I have above, for a single
>>case: pow(x, 2.0) is calculated as x*x. However, each time through it
>>has do decide _how_ to do it.
> Part of our difference in perspective comes from the fact that I've
> just been staring at the guts of complex power. In this case you
> always have function calls at present, even for s*s. (At least I'm
> fairly certain that doesn't get inlined although I haven't checked).
> Since much of the work I do is with complex matrices,  it's
> appropriate that I focus on this.

Ah, ok, now things are clicking. Complex power is going to be harder,
because making sure that going from x**2.001 to x**2 doesn't do some
funny complex branch cut stuff (I work in reals all the time :-)

For the real numbers, these type of optimizations *are* a big win, and
don't have the same type of continuity problems. I'll put them into numpy soon.

> Have you measured the effect of a function call on the speed here, or
> is that just an educated guess. If it's an educated guess, it's
> probably worth determining how much of speed hit the function call
> actually causes.  I was going to try to get a handle on this by
> comparing multiplication of Complex numbers (which requires a function
> call plus more math), with multiplication of Floats which does not.
> Perversly, the Complex multiplication came out marginally faster,
> which is hard to explain whichever way you look at it.
>>>> timeit.Timer("a*b", "from numpy import arange; a =
> arange(10000)+0j; b = arange(10000)+0j").time
> it(10000)
> 3.2974959107959876
>>>> timeit.Timer("a*b", "from numpy import arange; a = arange(10000);
>>>> b
> = arange(10000)").timeit(100
> 00)
> 3.4541194481425919

You're not multiplying floats in the last one: you're multiplying
integers. You either need to use a = arange(10000.0), or a =
arange(10000.0, dtype=float) (to be more specific).

Your integer numbers are about 3x better than mine, though (difference
in architecture, maybe? I'm on an Athlon64).

>>That's why I limited the optimization to scalar exponents: array
>>exponents would mean it's about as slow as the pow() call, even if the
>>checks were inlined into the loop. It would probably be even slower
>>for the non-optimized case, as you'd check for the special exponents,
>>then call pow() if it fails (which would likely recheck the exponents).
> Again, here I'm thinking of the complex case. In that case at least, I
> don't think that the non-optimized case would take a noticeable speed
> hit. I would put it into pow itself, which already special cases a==0
> and b==0. For float pow it might, but that's already slow, so I doubt
> that it would make much difference.

It does make a bit of difference with float pow: the general case
slows down a bit.

>>Maybe a simple way to add this is to rewrite x.__pow__() as something
>>like the C equivalent of
>>def __pow__(self, p):
>>    if p is not a scalar:
>>       return power(self, p)
>>    elif p == 1:
>>       return p
>>    elif p == 2:
>>       return square(self)
>>    elif p == 3:
>>       return cube(self)
>>    elif p == 4:
>>       return power_4(self)
>>    elif p == 0:
>>       return ones(self.shape, dtype=self.dtype)
>>    elif p == -1:
>>       return 1.0/self
>>    elif p == 0.5:
>>       return sqrt(self)
>>and add ufuncs square, cube, power_4 (etc.).
> It sounds like we need to benchmark some stuff and see what we come up
> with. One approach would be for each of us to implement this for one
> time (say float) and see how the approaches compare speed wise. That's
> not entirely fair as my approach will do much better at complex than
> float I believe, but it's certainly easier.

The way the ufuncs are templated, we can split out the complex
routines easily enough.

Here's what I propose:

- add a square() ufunc, where square(x) == x*x (but faster of course)
- I'll fiddle with the floats
- you fiddle with the complex numbers :-)

I've created a new branch in svn, at
to do this fiddling. The changes below I mention are all checked in as
revision 2104 (http://projects.scipy.org/scipy/numpy/changeset/2104).

I've added a square() ufunc to the power_optimization branch because
I'd argue that it's probably *the* most common use of **. I've
implemented it, and it's as fast as a*a for reals, and runs in 2/3 the
time as a*a for complex (which makes sense: squaring a complex numbers
has 3 real multiplications, while multiplying has 4 in the (simple)
scheme [1]).

At least with square(), there's no argument about continuity, as it
only squares :-).

The next step I'd suggest is special-casing x.__pow__, like I suggest
above. We could just test for integer scalar exponents (0, 1, 2), and
just special-case those (returning ones(), x.copy(), square(x)), and
leave all the rest to power().

I've also checked in code to the power_optimization branch that
special cases power(x, <scalar exponent>), or anytime the basic ufunc
gets called with a stride of 0 for the exponent. It doesn't do complex
x, so no problems on your side, but it's a good chunk faster for this
case than what we've got now. One reason I'm also looking at adding
square() is because my optimization of power() makes x**2 run (only)
1.5 slower than x*x (and I can't for the life of me see where that 0.5
is coming from! It should be 1.0 like square()!).

[1] which brings up another point. Would using the 3-multiplication
version for complex multiplication be good? There might be some
effects with cancellation errors due to the extra subtractions...

|David M. Cooke                      http://arbutus.physics.mcmaster.ca/dmc/
|cookedm at physics.mcmaster.ca

More information about the Numpy-discussion mailing list