[Numpy-discussion] Histograms of extremely large data sets
rlw at stsci.edu
Thu Dec 14 07:30:05 CST 2006
On Dec 14, 2006, at 2:56 AM, Cameron Walsh wrote:
> At some point I might try testing different cache sizes for
> different data-set sizes and see what the effect is. For now, 65536
> seems a good number, and I would be happy to see this replace the
> current numpy.histogram.
I experimented a little on my machine and found that 64k was a good
size, but the timing is fairly insensitive to the block size over a
wide range (16000 to 1e6 elements). I'd be interested to hear how this
scales on other machines -- I'm pretty sure the ideal block size is
whatever keeps the piece of the array being sorted within the on-chip
cache.
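For concreteness, here is a minimal sketch of the blocked
sort/searchsorted approach under discussion (the function name and
the 65536 default are mine for illustration, not the actual patch):

    import numpy as np

    def blocked_histogram(data, edges, block=65536):
        # Sort one cache-sized block at a time; searchsorted on the
        # sorted chunk gives, for each bin edge, the number of chunk
        # elements below that edge.
        data = np.asarray(data).ravel()
        below = np.zeros(len(edges), dtype=np.intp)
        for i in range(0, data.size, block):
            chunk = np.sort(data[i:i + block])
            below += chunk.searchsorted(edges)
        # Differences of the cumulative counts are the per-bin totals
        # (half-open bins; values >= edges[-1] are dropped here).
        return np.diff(below)

Because each block contributes its cumulative counts independently,
the working set stays at `block` elements no matter how large the
total array is -- which is why the sweet spot should track the cache
size.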
Just so we don't get too smug about the speed: if I do the same thing
in IDL on the same machine it is more than 10 times faster (0.28
seconds instead of 4 seconds). I'm sure the IDL version uses the much
faster approach of sweeping through the array once and incrementing
counts in the appropriate bins. It only handles equal-sized bins, so
it is not as general as the numpy version -- but equal-sized bins are
a very common case. I'd still like to see a C version of histogram
(which I guess would need to be a ufunc) go into core numpy.
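For comparison, here is a rough sketch of that single-sweep strategy
expressed in numpy itself (my guess at the idea behind the IDL
routine, not its actual code): with equal-width bins each element's
bin index can be computed arithmetically, so one pass suffices.

    import numpy as np

    def fixed_width_histogram(data, nbins, lo, hi):
        # Single sweep: map each value to its bin index directly,
        # then tally all the indices at once with bincount.
        data = np.asarray(data).ravel()
        width = (hi - lo) / nbins
        idx = ((data - lo) / width).astype(np.intp)
        # One simple out-of-range policy: clip stray values into the
        # end bins (a real implementation might discard them instead).
        np.clip(idx, 0, nbins - 1, out=idx)
        return np.bincount(idx, minlength=nbins)

A C ufunc version would do essentially this minus the temporary index
array: the increment happens in the same loop that computes the
index, touching each element exactly once.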