[Numpy-discussion] Numpy speed ups to simple tasks - final findings and suggestions
Fri Jan 4 18:44:28 CST 2013
On Fri, Jan 4, 2013 at 11:36 PM, Raul Cota <firstname.lastname@example.org> wrote:
> On 04/01/2013 2:33 PM, Nathaniel Smith wrote:
>> On Fri, Jan 4, 2013 at 6:50 AM, Raul Cota <email@example.com> wrote:
>>> On 02/01/2013 7:56 AM, Nathaniel Smith wrote:
>>>> But, it's almost certainly possible to optimize numpy's float64 (and
>>>> friends), so that they are themselves (almost) as fast as the native
>>>> python objects. And that would help all the code that uses them, not
>>>> just the ones where regular python floats could be substituted
>>>> instead. Have you tried profiling, say, float64 * float64 to figure
>>>> out where the bottlenecks are?
>>> It seems to be split between:
>>> - (primarily) the memory allocation/deallocation of the float64
>>> created by the operation float64 * float64. This is why float64 *
>>> PyFloat improved with one of my changes: the PyFloat was being
>>> internally converted into a float64 before the multiplication.
>>> - the rest of the time is in the actual multiplication pathway.
>> Running a quick profile on Linux x86-64 of
>> x = np.float64(5.5)
>> for i in xrange(n):
>>     x * x
>> I find that ~50% of the total CPU time is inside feclearexcept(), the
>> function which resets the floating point error checking registers --
>> and most of this is inside a single instruction, stmxcsr ("store sse
>> control register").
> I find it strange that you don't see a bottleneck in the allocation
> of a float64. Is it easy for you to profile this?
> x = np.float64(5.5)
> y = 5.5
> for i in xrange(n):
>     x * y
> numpy internally converts y into a temporary float64 and then
> discards it, and I seem to remember it is a bit over two times
> slower than x * x.
Yeah, seems to be dramatically slower. Using ipython's handy interface
to the timeit library:
In : x = np.float64(5.5)
In : y = 5.5
In : timeit x * y
1000000 loops, best of 3: 725 ns per loop
In : timeit x * x
1000000 loops, best of 3: 283 ns per loop
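For a non-interactive reproduction, the same comparison can be sketched with the `timeit` module directly (Python 3 syntax; the absolute numbers will differ by machine and numpy version, so no expected output is shown):

```python
import timeit
import numpy as np

x = np.float64(5.5)
y = 5.5
n = 100_000  # iterations per timing run; arbitrary choice

# Best-of-3 timing for float64 * PyFloat vs float64 * float64
t_mixed = min(timeit.repeat('x * y', globals={'x': x, 'y': y},
                            number=n, repeat=3))
t_pure = min(timeit.repeat('x * x', globals={'x': x},
                           number=n, repeat=3))
print(f"float64 * float:   {t_mixed / n * 1e9:.0f} ns per loop")
print(f"float64 * float64: {t_pure / n * 1e9:.0f} ns per loop")
```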
But we already figured out how to (mostly) fix this part, right? I was
curious about the float64 * float64 case, because that's the one that
was still slow after those first two patches. (And, yes, like you say,
when I run x * y in the profiler, there's a huge amount of overhead in
PyArray_GetPriority and object allocation/deallocation.)
> I will try your suggestions and see what I get. I haven't gotten
> around to doing a pull request for the previous stuff. If the
> changes are worthwhile, would it be OK if I also create one for
> this?
First, to be clear, it's always OK to do a pull request -- the worst
that can happen is that we all look it over carefully, decide it's
the wrong approach, and don't merge it. In my earlier email I just
wanted to give you some clear suggestions on a good way to get
started; we wouldn't have kicked you out or something if you did it
differently :-)
And, yes, assuming my analysis so far is correct, we would definitely
be interested in major speedups that have no other user-visible
effects.