[Numpy-discussion] Assigning complex value to real array

Andrew P. Mullhaupt doc@zen-pharaohs....
Thu Oct 7 17:47:03 CDT 2010


  On 10/7/2010 3:48 PM, Anne Archibald wrote:
>
> Years ago MATLAB did just this - store real and complex parts of
> arrays separately (maybe it still does, I haven't used it in a long
> time). It caused us terrible performance headaches,

Performance headaches from Matlab years ago are not exactly relevant 
today, for several reasons. Other languages have made that choice as 
well - APL, for example. Your performance problems back then could have 
been caused (depending on the machine) by things like TLB congestion or 
other address-translation issues (segmented memory). Most machines now, 
and in the future, are not going to choke on these issues (for a variety 
of reasons).

Note that even as early as 1989 it was possible to use the memory 
mapping technique I described (in SunOS 4.0), and on that machine the 
ability to fill a register with all zeros in a single instruction meant 
that there was almost no penalty for the address translation.
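
For the numpy-minded, here is a rough sketch of the sparse-file idea 
(the file name is just for illustration, and this relies on the 
filesystem supporting sparse files, as SunOS did):

    import numpy as np

    # Back the imaginary part with a file-mapped array.  np.memmap in 'w+'
    # mode extends the file with a seek-and-write, so on a filesystem with
    # sparse-file support none of the data blocks exist yet; untouched
    # regions simply read back as zeros.
    n = 10**8
    imag = np.memmap('imag_part.dat', dtype=np.float64, mode='w+', shape=(n,))

    # Reading a hole costs no disk I/O and returns zeros:
    print(imag[:5])

    # Only pages that a calculation actually dirties get backed by storage:
    imag[12345] = 2.5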

What matters now, and for the foreseeable future, is pretty much memory 
bandwidth - bandwidth at some level of the cache hierarchy. Memory 
capacity and FPU bandwidth are not the limiting factors any more.

Let's be clear about an apples-to-apples comparison. Consider a very 
large array which sometimes ends up with a nonzero imaginary part as a 
result of a calculation (and which is therefore not predictable when the 
code is written).

Strategy I is for the programmer to declare the array as complex from 
the beginning - thereby doubling the required memory bandwidth (at every 
level involved) for the entire computation. You double the cache 
footprint at every level, etc. You will pretty much eat the full factor 
of 2 memory bandwidth penalty.
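
To put a trivial number on it (array size picked arbitrarily):

    import numpy as np

    n = 10**6
    a_real = np.zeros(n, dtype=np.float64)     # 8 bytes per element
    a_cplx = np.zeros(n, dtype=np.complex128)  # 16 bytes per element
    print(a_real.nbytes, a_cplx.nbytes)        # 8000000 vs 16000000
    # Every load and store, at every cache level, moves twice the data
    # whether or not the imaginary half ever becomes nonzero.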

Strategy II is to have the imaginary part of the array as a sparse 
(initially null) file. Well, if the array never has a nonzero imaginary 
part, then the memory bandwidth is the same at all levels except perhaps 
at the level of the register file. This could conceivably change if the 
granularity of the file system changes to have pages much larger than 
4kB, but unless someone decides to have a page size larger than the L1 
cache on their processor, it's really just the register file where you 
care. Well, suppose you end up really caring. You can actually test, once 
per page, whether the imaginary page is the zero page, and call the 
real-only code instead - giving you essentially full bandwidth on the computation down
to the granularity of a page. Since a page is 512 doubles, the overhead 
of this test is less than one branch instruction per 512 flops - but 
with modern CPU architecture that branch will get predicted, so normally 
there will be zero overhead for this test. In other words, if you know 
what you're doing, you should achieve full bandwidth.
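
In numpy terms the dispatch looks something like the sketch below. 
real_kernel and complex_kernel are hypothetical stand-ins for whatever 
operation is being applied, and the .any() scan stands in for the 
page-level test - a real implementation would ask the VM system whether 
the page is still a file hole rather than reading the data at all.

    import numpy as np

    PAGE = 512   # one 4 kB page of float64

    def apply_per_page(re, im, real_kernel, complex_kernel):
        out_re = np.empty_like(re)
        out_im = np.zeros_like(im)
        for start in range(0, re.size, PAGE):
            sl = slice(start, start + PAGE)
            if im[sl].any():                       # one test per 512 values
                out_re[sl], out_im[sl] = complex_kernel(re[sl], im[sl])
            else:
                out_re[sl] = real_kernel(re[sl])   # full-speed real path
        return out_re, out_im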

Of course, you probably have the question of locality of reference in 
mind. Here there are at least two answers to any possible objection 
there - the first is that modern address translation is good enough to 
interleave two streams anyway - you can actually check this right now on 
your machine. Time how long it takes to add two very long vectors (say 
of length 2N) to produce a third. Then compare this to the time it takes 
to add two vectors of length N to produce a third, and two other vectors 
of length N to produce a fourth. Unless those times are significantly 
different as N becomes very large, you have confirmed that your machine 
can translate addresses and shovel segregated real and imaginary parts 
through the FPU about as fast as it can shovel the corresponding 
interleaved real and imaginary parts through the FPU. Well, OK, suppose you
worry about N so big that having real and complex parts in different 
parts of the file system is an issue - you could always interleave the 
PAGES of the real and imaginary parts to eliminate address differences 
of more than one page length from consideration. That's answer number two.
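
If you want to actually run that check, something along these lines will 
do (the size is a guess - pick N big enough to blow out your caches):

    import time
    import numpy as np

    N = 2 * 10**7
    a2, b2 = np.random.rand(2 * N), np.random.rand(2 * N)
    a, b, c, d = (np.random.rand(N) for _ in range(4))

    t0 = time.time(); x = a2 + b2;          t_one_stream  = time.time() - t0
    t0 = time.time(); y = a + b; z = c + d; t_two_streams = time.time() - t0
    print(t_one_stream, t_two_streams)
    # If the two times stay comparable as N grows, the machine is happy
    # interleaving two separate streams through address translation.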

>   since it meant
> that individual complex values were spread over two different memory
> areas (parts of the disk in fact, since we were using gigabyte arrays
> in 1996), so that writing operations in the natural way tended to
> cause disk thrashing.

Yeah you didn't understand memory mapping, which was available back in 
1989. We were using 4GB arrays by 1990. I got really tired of 
evangelizing memory mapping to various communities (which I did a lot 
back then, including the early Numerical Python community). So these 
days I just say "I told you so." Kevin Sheehan told you so too (his 
classic paper "Why aren't you using mmap() yet?" is from February 1996).

> As for what's right for numpy, well, I think it makes a lot more sense
> to simply raise an exception when assigning a complex value to a real
> array (or store a NaN).

That would at least allow a workaround. However, I want quiet NaNs to 
stay quiet; only signalling NaNs are allowed to get up in my grille.
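
For what it's worth, you can approximate the exception behaviour today 
without waiting for numpy to change: recent versions emit a 
ComplexWarning when a complex value is cast into a real array, and the 
warnings machinery will happily promote that to an error. (A sketch; the 
warning class lives at np.ComplexWarning in the versions I have at hand.)

    import warnings
    import numpy as np

    # Promote the silent "discards the imaginary part" cast to an exception.
    warnings.simplefilter('error', np.ComplexWarning)

    a = np.zeros(4)
    a[0] = 1 + 2j    # raises ComplexWarning instead of quietly storing 1.0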

Best Regards,
Andrew Mullhaupt


