[Numpy-discussion] Histograms of extremely large data sets

Rick White rlw at stsci.edu
Thu Dec 14 07:30:05 CST 2006

On Dec 14, 2006, at 2:56 AM, Cameron Walsh wrote:

> At some point I might try and test
> different cache sizes for different data-set sizes and see what the
> effect is.  For now, 65536 seems a good number and I would be happy to
> see this replace the current numpy.histogram.

I experimented a little on my machine and found that 64k was a good  
size, but performance is fairly insensitive to the chunk size over a  
wide range (16,000 to 1e6).  I'd be interested to hear how this scales  
on other machines -- I'm pretty sure that the ideal size will keep the  
piece of the array being sorted smaller than the on-chip cache.
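For reference, here is a minimal sketch of the chunked, sort-based scheme being discussed: sort each cache-sized piece, locate the bin edges with searchsorted, and accumulate the counts.  The function name, signature, and the 65536 default are my own choices, not anything from numpy proper.

```python
import numpy as np

def chunked_histogram(a, bins, chunk=65536):
    # "bins" is a sorted 1-D array of left bin edges (my convention here).
    # Process the data in cache-sized pieces so each sort stays on-chip.
    counts = np.zeros(len(bins), dtype=np.intp)
    for start in range(0, a.size, chunk):
        piece = np.sort(a[start:start + chunk])
        # Index of the first element at or above each edge; appending
        # piece.size lets np.diff yield one count per bin, with values
        # below bins[0] dropped and values >= bins[-1] in the last bin.
        idx = np.r_[piece.searchsorted(bins), piece.size]
        counts += np.diff(idx)
    return counts
```

Because the per-chunk counts simply accumulate, the result is independent of the chunk size; only the speed changes.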

Just so we don't get too smug about the speed, if I do this in IDL on  
the same machine it is 10 times faster (0.28 seconds instead of 4  
seconds).  I'm sure the IDL version uses the much faster approach of  
just sweeping through the array once, incrementing counts in the  
appropriate bins.  It only handles equal-sized bins, so it is not as  
general as the numpy version -- but equal-sized bins is a very common  
case.  I'd still like to see a C version of histogram (which I guess  
would need to be a ufunc) go into the core numpy.
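For equal-sized bins, the single-sweep idea can be sketched in pure numpy by computing each value's bin index directly and counting occurrences -- roughly what I assume the IDL (or a future C) version does internally.  The name and signature are mine, and dropping out-of-range values is one design choice among several.

```python
import numpy as np

def fixed_width_histogram(a, nbins, lo, hi):
    # Map each value straight to its bin index (equal-width bins only),
    # then count -- no sorting needed.
    scale = nbins / (hi - lo)
    keep = (a >= lo) & (a < hi)      # drop out-of-range values
    idx = ((a[keep] - lo) * scale).astype(np.intp)
    return np.bincount(idx, minlength=nbins)
```

A real C ufunc would fuse the index computation and the count increment into one pass over the array, which is where the factor-of-10 win presumably comes from.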
