[Numpy-discussion] Histograms of extremely large data sets
cameron.walsh at gmail.com
Thu Dec 14 01:56:09 CST 2006
Absolutely gorgeous. I can confirm the 1.6x speed-up over the weave
version, i.e. a 25x speed-up over the existing version.
It would be good if the shadowing of the built-in range function by
the 'range' parameter could be fixed in the numpy modules before this
goes into subversion, to avoid the need for Rick's "lrange = range"
line before the new histogram function. At some point I might try
testing different block sizes for different data-set sizes and see
what the effect is; a rough sketch of such a test follows below. For
now, 65536 seems a good number and I would be happy to see this
replace the current numpy.histogram.
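The blocked_bincount helper in the sketch below is a made-up stand-in
for the core loop of Rick's function, with the block size pulled out
as a parameter so it can be varied:

    import time
    import numpy as np

    def blocked_bincount(a, edges, block):
        # cumulative counts below each bin edge, one block at a time
        n = np.sort(a[:block]).searchsorted(edges)
        for i in range(block, len(a), block):
            n += np.sort(a[i:i + block]).searchsorted(edges)
        return n

    a = np.random.normal(size=10**7)
    edges = np.linspace(a.min(), a.max(), 100, endpoint=False)
    for block in (2**14, 2**16, 2**18, 2**20):
        t0 = time.time()
        blocked_bincount(a, edges, block)
        print(block, time.time() - t0)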
Thanks very much, Eric and Rick; you've both taught me a lot as well
as solving the original problem. I'm sure this will be of use to
others in the future, so if there's anything I can do to assist in
getting this into the next numpy release, please let me know.
On 14/12/06, eric jones <eric at enthought.com> wrote:
> Looks to me like Rick's version is simpler and faster. It offers a
> speed-up of about 1.6x on my machine over the weave version. I
> believe this is because the sorting approach results in quite a few
> fewer comparisons than the algorithm I used.
> Very cool. I vote that his version go into numpy.
> Rick White wrote:
> > On Dec 12, 2006, at 10:27 PM, Cameron Walsh wrote:
> >> I'm trying to generate histograms of extremely large datasets. I've
> >> tried a few methods, listed below, all with their own shortcomings.
> >> Mailing-list archive and google searches have not revealed any
> >> solutions.
> > The numpy.histogram function can be modified to use memory much more
> > efficiently when the input array is large, and the modification turns
> > out to be faster even for smallish arrays (in my tests, anyway).
> > Below is a modified version of the histogram function from
> > function_base.py. It is almost identical, but it avoids sorting
> > the entire input array simply by dividing it into blocks.
> > (It would be even better to avoid the call to ravel too.) The only
> > other messy detail is that the builtin range function is shadowed by
> > the 'range' parameter.
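Concretely, sorting a block and calling searchsorted on the bin edges
gives, for each edge, the number of elements in that block below it,
and those cumulative counts can simply be summed across blocks. A tiny
made-up example of the counting step:

    import numpy as np

    a = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
    edges = np.array([0.0, 0.25, 0.5, 0.75])
    c = np.sort(a).searchsorted(edges)   # elements below each edge: [0 1 2 4]
    c = np.concatenate([c, [len(a)]])    # append the total count
    print(c[1:] - c[:-1])                # per-bin counts: [1 1 2 1]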
> > In my timing tests this is about the same speed for arrays about the
> > same size as the block size and is faster than the current version by
> > 30-40% for large arrays. The speed difference increases as the array
> > size increases.
> > I haven't compared this to Eric's weave function, but this has the
> > advantages of being pure Python and of being much simpler. On my
> > machine (MacBook Pro) it takes about 4 seconds for an array with 100
> > million elements. The time increases perfectly linearly with array
> > size for arrays larger than a million elements.
> > Rick
> > from numpy import *
> >
> > lrange = range   # keep a handle on the builtin before the
> >                  # 'range' parameter below shadows it
> >
> > def histogram(a, bins=10, range=None, normed=False):
> >     a = asarray(a).ravel()
> >     if not iterable(bins):
> >         if range is None:
> >             range = (a.min(), a.max())
> >         mn, mx = [mi + 0.0 for mi in range]
> >         if mn == mx:
> >             mn -= 0.5
> >             mx += 0.5
> >         bins = linspace(mn, mx, bins, endpoint=False)
> >
> >     # best block size probably depends on processor cache size
> >     block = 65536
> >     # accumulate, for each bin edge, the number of elements below
> >     # it, sorting only one block of the input at a time
> >     n = sort(a[:block]).searchsorted(bins)
> >     for i in lrange(block, len(a), block):
> >         n += sort(a[i:i + block]).searchsorted(bins)
> >     # append the total count, then difference to get per-bin counts
> >     n = concatenate([n, [len(a)]])
> >     n = n[1:] - n[:-1]
> >
> >     if normed:
> >         db = bins[1] - bins[0]
> >         return 1.0 / (a.size * db) * n, bins
> >     else:
> >         return n, bins
> > _______________________________________________
> > Numpy-discussion mailing list
> > Numpy-discussion at scipy.org
> > http://projects.scipy.org/mailman/listinfo/numpy-discussion
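For anyone who wants to try this, a minimal usage sketch, assuming
Rick's function above has been pasted into a session (with the default
range every element lands in some bin, so the counts sum to the array
size):

    import numpy as np

    data = np.random.normal(size=10**6)
    counts, edges = histogram(data, bins=100)  # Rick's version above
    assert counts.sum() == data.size           # nothing falls outside
    print(counts.max(), edges[0], edges[-1])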