[Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
Fri Nov 9 00:30:45 CST 2007
Anne Archibald wrote:
> On 08/11/2007, David Cournapeau <firstname.lastname@example.org> wrote:
>> For copy and array creation, I understand this, but for element-wise
>> operations (mean, min, and max), this is not enough to explain the
>> difference, no ? For example, I can understand a 50 % or 100 % time
>> increase for simple operations (by simple, I mean one element operation
> taking only a few CPU cycles), because of copies, but a 5-fold time
> increase seems too big, no (maybe a cache problem, though)? Also, the
>> fact that mean is slower than min/max for both cases (F vs C) seems a
>> bit counterintuitive (maybe cache effects are involved somehow ?).
> I have no doubt at all that cache effects are involved: for an int
> array, each data element is four bytes, but typical CPUs need to load
> 64 bytes at a time into cache. If you read (or write) the rest of
> those bytes in the next iterations through the loop, the (large) cost
> of a memory read is amortized. If you jump to the next row of the
> array, some large number of bytes away, those 64 bytes basically need
> to be purged to make room for another 64 bytes, of which you'll use 4.
> If you're reading from a FORTRAN-order array and writing to a C-order
> one, there's no way around doing this on one end or another: you're
> effectively doing a transpose, which is pretty much always expensive.
This is obviously part of the picture, and there is indeed no way around
the simple fact that non-sequential memory access is extremely slow on
modern hardware. The fact that the array iterator does not yet support
Fortran order would largely explain this (I wrongly assumed the
iterator already did).
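For what it's worth, the cache effect is easy to observe from Python; here is a rough benchmark sketch (array size and repeat count are arbitrary, and absolute timings depend entirely on the machine):

```python
import numpy as np
import timeit

n = 2000
c = np.random.rand(n, n)      # C-contiguous (row-major) layout
f = np.asfortranarray(c)      # same values, Fortran (column-major) layout

# Reducing over the last axis walks memory contiguously for the C array
# but strides across rows for the Fortran array, so on a cache-based CPU
# the latter is typically noticeably slower.
t_c = timeit.timeit(lambda: c.sum(axis=1), number=20)
t_f = timeit.timeit(lambda: f.sum(axis=1), number=20)
print(f"C order: {t_c:.3f}s  Fortran order: {t_f:.3f}s")

# The results themselves are identical; only the access pattern differs.
assert np.allclose(c.sum(axis=1), f.sum(axis=1))
```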
> Is there any reason not to let ufuncs pick whichever order for
> newly-allocated arrays they want? The natural choice would be the same
> as the bigger input array, if it's a contiguous block of memory
> (whether or not the contiguous flags are set). Failing that, the same
> as the other input array (if any); failing that, C order's as good a
> default as any. How difficult would this be to implement?
I think part of the problem is that it is difficult to know a priori
which is faster (is copying multi-segment parts into a single buffer
and then processing them faster than processing them directly?). As
mentioned previously, on the same platform (same OS/compiler
combination), there are already huge differences between two CPUs
(Pentium 4 vs Pentium M, the former being noticeably slow when the SSE
FPU is not used).
Do ufuncs use buffers for segmented memory, or only for misbehaved
arrays? Also, from my (very limited) experience, I found array
iterators to be significantly slower than simple array indexing on
contiguous, single-segment arrays. Do ufuncs always use the array
iterator, or do they fall back to simple indexing when possible? When I
implemented the fast clip function (which does not use ufuncs), I
noticed that the main sources of speed-up were:
    - avoiding unnecessary copies.
    - quickly determining which cases can be optimized with respect to
the operations in use (memmove vs memcpy, etc.).
    - fast memory copying (numpy does not have this yet: by using
optimized memory copy routines, you can often gain a 50 % speed
increase on recent architectures; having information about alignment
could help too, and we are back to the aligned allocators
discussion :) ).
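On the copy-avoidance point, one concrete illustration: clip already accepts an output argument, which lets the caller reuse an existing buffer instead of paying for a fresh allocation (array size here is arbitrary):

```python
import numpy as np

a = np.random.rand(1_000_000)

# Out-of-place: allocates a brand new result array.
b = np.clip(a, 0.2, 0.8)

# In-place: writes the result back into the input's own buffer,
# avoiding the extra allocation and the associated memory traffic.
np.clip(a, 0.2, 0.8, out=a)

assert np.array_equal(a, b)
```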
The problem is to find a good API to put those optimizations in one
place. I unfortunately do not know much about ufuncs; maybe they
already implement most of these.