[Numpy-discussion] Numpy speed ups to simple tasks - final findings and suggestions

Nathaniel Smith njs@pobox....
Fri Jan 4 18:44:28 CST 2013

On Fri, Jan 4, 2013 at 11:36 PM, Raul Cota <raul@virtualmaterials.com> wrote:
> On 04/01/2013 2:33 PM, Nathaniel Smith wrote:
>> On Fri, Jan 4, 2013 at 6:50 AM, Raul Cota <raul@virtualmaterials.com> wrote:
>>> On 02/01/2013 7:56 AM, Nathaniel Smith wrote:
>>>> But, it's almost certainly possible to optimize numpy's float64 (and
>>>> friends), so that they are themselves (almost) as fast as the native
>>>> python objects. And that would help all the code that uses them, not
>>>> just the ones where regular python floats could be substituted
>>>> instead. Have you tried profiling, say, float64 * float64 to figure
>>>> out where the bottlenecks are?
>>> Seems to be split between:
>>> - (primarily) the memory allocation/deallocation of the float64 that is
>>> created by the operation float64 * float64. This is why float64 * PyFloat
>>> got improved by one of my changes: PyFloat was being internally converted
>>> into a float64 before doing the multiplication.
>>> - the rest of the time is the actual multiplication pathway.
>> Running a quick profile on Linux x86-64 of
>>    x = np.float64(5.5)
>>    for i in xrange(n):
>>       x * x
>> I find that ~50% of the total CPU time is inside feclearexcept(), the
>> function which resets the floating point error checking registers --
>> and most of this is inside a single instruction, stmxcsr ("store sse
>> control register").
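
For context, feclearexcept() is part of the machinery behind numpy's
floating-point error handling, the feature that np.seterr/np.errstate
configure from Python. A minimal sketch of that Python-level interface
(not the patch under discussion) looks like this:

    import numpy as np

    # numpy checks the hardware floating-point status flags after scalar
    # and ufunc operations so that it can warn, raise, or stay silent
    # according to the current error state -- that per-operation check is
    # where feclearexcept()/stmxcsr shows up in the profile above.
    with np.errstate(over='ignore'):              # silence overflow here
        big = np.float64(1e308) * np.float64(10)  # overflows quietly to inf
    print(big)                                    # inf
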
> I find it strange that you don't see a bottleneck in the allocation of a
> float64. Is it easy for you to profile this?
> x = np.float64(5.5)
> y = 5.5
> for i in xrange(n):
>      x * y
> numpy internally translates y into a temporary float64 and then discards
> it, and I seem to remember this is a bit over two times slower than x * x.

Yeah, it seems to be dramatically slower. Using IPython's handy
interface to the timeit[1] library:

In [1]: x = np.float64(5.5)

In [2]: y = 5.5

In [3]: timeit x * y
1000000 loops, best of 3: 725 ns per loop

In [4]: timeit x * x
1000000 loops, best of 3: 283 ns per loop
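
For readers without IPython, roughly the same measurement can be made
with the timeit module from [1] directly; a sketch, with numbers that
will of course vary by machine:

    import timeit

    setup = "import numpy as np; x = np.float64(5.5); y = 5.5"

    # float64 * Python float: numpy converts y to a temporary float64
    # scalar on every multiply, then discards it.
    print(timeit.timeit("x * y", setup=setup, number=1000000))

    # float64 * float64: no conversion, but a new float64 result object
    # is still allocated each time.
    print(timeit.timeit("x * x", setup=setup, number=1000000))

Each call prints the total time in seconds for one million multiplies.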

But we already figured out how to (mostly) fix this part, right? I was
curious about the float64 * float64 case, because that's the one that
was still slow after those first two patches. (And, yes, as you say,
when I run x * y in the profiler there's a huge amount of overhead in
PyArray_GetPriority and in object allocation/deallocation.)
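
A Python-level profiler won't show C internals like PyArray_GetPriority;
one possible way to see them (a sketch, assuming Linux perf and the
Python 2 / xrange idiom used above, and not necessarily how the profile
in this thread was collected) is to run a small driver script under
perf:

    # bench_scalar_mul.py -- run under: perf record -g python bench_scalar_mul.py
    import numpy as np

    x = np.float64(5.5)
    y = 5.5                    # plain Python float, converted on each multiply

    for i in xrange(5000000):  # loop long enough to dominate interpreter startup
        x * y                  # the scalar operation whose C hot spots we want

"perf report" then breaks the time down by C function, provided numpy
was built with symbols available.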

> I will try your suggestions on PyUFunc_clearfperr/PyUFunc_getfperror
> and see what I get. I haven't gotten around to putting together a pull
> request for the previous stuff. If the changes are worthwhile, would it
> be OK if I also create one for this?

First, to be clear, it's always OK to open a pull request -- the worst
that can happen is that we all look it over carefully, decide that it's
the wrong approach, and don't merge it. In my earlier email I just
wanted to give you some clear suggestions on a good way to get started;
we wouldn't have kicked you out or anything if you did it
differently :-)

And, yes, assuming my analysis so far is correct, we would definitely
be interested in major speedups that have no other user-visible
effects... ;-)


[1] http://docs.python.org/2/library/timeit.html
