[Numpy-discussion] Assigning complex value to real array
Andrew P. Mullhaupt
Thu Oct 7 17:47:03 CDT 2010
On 10/7/2010 3:48 PM, Anne Archibald wrote:
> Years ago MATLAB did just this - store real and complex parts of
> arrays separately (maybe it still does, I haven't used it in a long
> time). It caused us terrible performance headaches,
Years-ago performance headaches from MATLAB are not exactly relevant
today, for several reasons. There have been other languages that made
that choice - APL, for example. Your performance problems there could
have been caused (depending on the machine) by things like TLB congestion
or other address translation issues (segmented memory). Most machines now
and in the future are not going to choke on these issues (for a variety
of reasons).
Note that even as early as 1989, it was possible to use the memory
mapping technique I described (in SunOS 4.0) and on that machine, the
ability to fill a register with all zeros in a single instruction means
that there would have been almost no penalty for the address translation.
What matters now, and for the foreseeable future, is pretty much
memory bandwidth - bandwidth at some cache level. Memory capacity and
FPU bandwidth are not limiting any more.
Let's be clear about an apples-to-apples comparison. Consider a very
large array which sometimes acquires a nonzero imaginary part as a result
of a calculation (and which is therefore not predictable when the code is
written).
Strategy I is for the programmer to declare the array as complex from
the beginning - thereby doubling the required memory bandwidth (at every
level involved) for the entire computation. You double the cache
footprint at every level, etc. You will pretty much eat the full factor
of 2 memory bandwidth penalty.
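The factor of 2 is easy to see in NumPy itself - a complex128 element is two float64s, so declaring the array complex up front doubles the bytes that have to stream through every cache level (a minimal illustration, not part of the original post):

```python
import numpy as np

n = 1_000_000
real = np.zeros(n, dtype=np.float64)    # 8 bytes per element
cplx = np.zeros(n, dtype=np.complex128) # 16 bytes per element

# The complex declaration doubles the footprint - and hence the
# bandwidth needed to stream the array at every cache level.
assert cplx.nbytes == 2 * real.nbytes
```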
Strategy II is to have the imaginary part of the array as a sparse
(initially null) file. Well, if the array never has a nonzero imaginary
part, then the memory bandwidth is the same at all levels except perhaps
at the level of the register file. This could conceivably change if the
granularity of the file system changes to have pages much larger than
4kB, but unless someone decides to have a page size larger than the L1
cache on their processor, it's really just the register file where you
care. Well suppose you end up really caring. You actually can test, once
per page, if the imaginary page is the zero page, and call the real code
instead - giving you essentially full bandwidth on the computation down
to the granularity of a page. Since a page is 512 doubles, the overhead
of this test is less than one branch instruction per 512 flops - but
with modern CPU architecture that branch will get predicted, so normally
there will be zero overhead for this test. In other words, if you know
what you're doing, you should achieve full bandwidth.
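The per-page dispatch described above can be sketched in NumPy. Caveat: in a real implementation the "is this the zero page?" test would come for free from the VM system's knowledge of which pages were never written; here it is approximated with an explicit scan, and the function name and 512-double page size are assumptions taken from the text:

```python
import numpy as np

PAGE_DOUBLES = 512  # one 4 kB page holds 512 float64 values

def add_split_complex(re_a, im_a, re_b, im_b, re_out, im_out):
    """Add two split-storage complex arrays page by page.

    Pages whose imaginary parts are entirely zero take the real-only
    fast path - one test per 512 doubles, and no write to the
    (sparse) imaginary output page. Illustrative sketch only.
    """
    n = len(re_a)
    for start in range(0, n, PAGE_DOUBLES):
        sl = slice(start, min(start + PAGE_DOUBLES, n))
        re_out[sl] = re_a[sl] + re_b[sl]
        if im_a[sl].any() or im_b[sl].any():
            im_out[sl] = im_a[sl] + im_b[sl]
        # else: the imaginary output page stays a zero page - no
        # write, no extra memory bandwidth consumed
```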
Of course, you probably have the question of locality of reference in
mind. Here there are at least two answers to any possible objection
there - the first is that modern address translation is good enough to
interleave two streams anyway - you can actually check this right now on
your machine. Time how long it takes to add two very long vectors (say
of length 2N) to produce a third. Then compare this to the time it takes
to add two vectors of length N to produce a third, and two other vectors
of length N to produce a fourth. Unless those times are significantly
different as N becomes very large, you have confirmed that your machine
can translate addresses and shovel segregated real and complex parts
through the FPU about as fast as it can shovel them the corresponding
interleaved real and complex parts through the FPU. Well OK suppose you
worry about N so big that having real and complex parts in different
parts of the file system is an issue - you could always interleave the
PAGES of the real and imaginary parts to eliminate address differences
of more than one page length from consideration. That's answer number two.
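The timing experiment described above is easy to run yourself; here is one hypothetical harness for it (the function name and trial count are mine, not from the post):

```python
import numpy as np
from timeit import timeit

def compare_layouts(n, number=10):
    """Time one add of length-2n vectors vs. two adds of length-n vectors.

    If the two times converge as n grows, the machine interleaves two
    address streams (segregated real/imaginary parts) about as well as
    it handles one (interleaved storage).
    """
    a = np.random.rand(2 * n)
    b = np.random.rand(2 * n)
    a1, a2 = a[:n].copy(), a[n:].copy()
    b1, b2 = b[:n].copy(), b[n:].copy()

    t_one = timeit(lambda: a + b, number=number)
    t_two = timeit(lambda: (a1 + b1, a2 + b2), number=number)
    return t_one, t_two
```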
> since it meant
> that individual complex values were spread over two different memory
> areas (parts of the disk in fact, since we were using gigabyte arrays
> in 1996), so that writing operations in the natural way tended to
> cause disk thrashing.
Yeah you didn't understand memory mapping, which was available back in
1989. We were using 4GB arrays by 1990. I got really tired of
evangelizing memory mapping to various communities (which I did a lot
back then, including the early Numerical Python community). So these
days I just say "I told you so." Kevin Sheehan told you so too (His
classic paper "Why aren't you using mmap() yet?" is from February 1996).
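In today's NumPy, the sparse-file technique maps naturally onto numpy.memmap: the imaginary part lives in a file whose never-written pages are holes, so reads of untouched regions come straight from the kernel's zero page. A minimal sketch (the backing-file path is hypothetical, and whether unwritten pages actually stay sparse depends on the filesystem):

```python
import os
import tempfile
import numpy as np

# Hypothetical backing file for the imaginary part of a large array.
path = os.path.join(tempfile.mkdtemp(), "imag.dat")

# mode='w+' creates the file; on a sparse-file filesystem the pages
# never written remain holes, so the file is logically 8 MB of zeros
# while occupying (almost) no disk blocks.
imag = np.memmap(path, dtype=np.float64, mode="w+", shape=(1 << 20,))

# Reading an untouched element returns zero without any disk I/O.
assert imag[12345] == 0.0
```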
> As for what's right for numpy, well, I think it makes a lot more sense
> to simply raise an exception when assigning a complex value to a real
> array (or store a NaN).
That would at least allow a workaround. However, I want quiet NaNs to
stay quiet; only signalling NaNs are allowed to get up in my grille.