[Numpy-discussion] Histograms of extremely large data sets

Cameron Walsh cameron.walsh at gmail.com
Wed Dec 13 02:27:22 CST 2006


On 13/12/06, eric jones <eric at enthought.com> wrote 290 lines of
awesome code and a fantastic explanation:

> Hey Cameron,
>
> I wrote a simple weave based histogram function that should work for
> your problem.  It should work for any array input data type.  The needed
> files (and a few tests and examples) are attached.

Thank you very much, they seem to be exactly what I need.  I haven't
yet been able to test it all completely, as for some reason I'm
missing the zlib module.  That might have to wait till tomorrow
depending on how the next half hour goes.

>
> Below is the output from the histogram_speed.py file attached.  The test
> takes about 10 seconds to bin a uniformly distributed set of data from a
> 1000x1000x100 uint8 array into 256 bins.  It compares Travis' nifty new
> iterator based indexing in numpy to raw C indexing of a contiguous
> array.  The two algorithms give identical results, and the speed
> difference is negligible.  That's cool because the iterator-based
> approach makes this sort of algorithm easy to handle in N dimensions.

If that's the case, and assuming our machines are comparable, your new
code is around five times faster.  That brings it back into a
reasonable time frame.  I'll let you know how it all works as soon as
I can.
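
As a side note (this is separate from eric's weave code, and may or may
not be competitive with it): for uint8 data, numpy's built-in bincount
can also do the binning at C speed over the flattened array.  A minimal
sketch, with a small random array standing in for the real volume:

```python
import numpy

# Small random stand-in for the real 489x1000x1000 uint8 volume.
data = numpy.random.randint(0, 256, size=(10, 100, 100)).astype("uint8")

# bincount loops in C over the flattened array -- no Python-level loop.
counts = numpy.bincount(data.ravel())

# bincount only returns bins up to the largest value actually seen,
# so pad the result out to the full 256 bins by hand.
bins = numpy.zeros(256, dtype="int64")
bins[:len(counts)] = counts
```

Every element lands in exactly one bin, so bins.sum() should equal
data.size.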

>
> Hope that helps,
> eric

It certainly does!
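
For what it's worth, once the per-slice binning is fast, the full
volume never needs to sit in memory at once: the counts can be
accumulated slab by slab.  A rough sketch, where load_slab() is a
made-up placeholder for however the slices actually get read:

```python
import numpy

bins = numpy.zeros(256, dtype="int64")

def load_slab(i):
    # Made-up loader: in practice each 1000x1000 slab would be read
    # from disk; here every pixel is just set to i for illustration.
    return numpy.full((1000, 1000), i, dtype="uint8")

# e.g. range(489) for the full 489x1000x1000 volume
for i in range(8):
    counts = numpy.bincount(load_slab(i).ravel())
    bins[:len(counts)] += counts
```

The peak memory use is then one slab plus the 256-entry count array,
regardless of how many slabs there are.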

Cameron.


>
> ps.  For those who care, I had to make a minor change to the array type
> converters so that they can be used with the iterator interface more
> easily.  Later this will be folded into weave, but for now I sub-classed
> the standard array converter and made the modifications.
>
> # speed test output.
> c:\eric\code\histogram> histogram_speed.py
> type: uint8
> millions of elements: 100.0
> sec (C indexing based): 9.52776707654
> [390141 390352 390598 389706 390985 390856 389785 390262 389929 391024
>  391854 390243 391255 390723 390525 391751 389842 391612 389601 391210
>  390799 391674 390693 390381 390460 389839 390185 390909 390215 391271
>  390934 390818 390528 389990 389982 389667 391035 390317 390616 390916
>  390191 389771 391448 390325 390556 391333 390148 390894 389611 390511
>  390614 390999 389646 391255 391284 391214 392106 391067 391480 389991
>  391091 390271 389801 390044 391459 390644 391309 390450 390200 391537
>  390907 390160 391117 390738 391638 391200 390815 390611 390355 389925
>  390939 390932 391569 390287 389987 389545 391140 391280 389773 389794
>  389559 390085 389991 391372 390189 391010 390863 390432 390743 390959
>  389271 390210 390967 390999 391177 389777 391748 390623 391597 392009
>  389308 390557 390213 390930 390449 390327 390600 390626 389985 390816
>  389671 390187 390595 390973 390921 390599 390167 391196 390381 391345
>  392166 389709 390656 389886 390646 390355 391273 391342 390234 390751
>  390515 390048 390455 391122 391069 390968 390488 390708 391027 391179
>  391110 390453 390632 390825 391369 390844 390001 391487 390778 390788
>  390609 390254 389907 391803 391508 391414 391012 389987 389284 390699
>  391094 390658 390463 390291 390848 389616 390894 389561 390971 391165
>  391378 391698 389434 390591 390027 391088 390787 391165 390169 391212
>  389799 389829 389764 390435 391158 391834 391206 390041 391537 390237
>  390253 391025 392336 391081 390005 391057 390226 390240 390197 389906
>  391164 391157 390639 391501 389125 389922 390961 390012 389832 389650
>  390018 390461 390695 390140 390939 389089 391094 390076 391123 389518
>  391340 390039 390786 391751 391133 390675 392305 390667 391243 389889
>  390103 390438 389215 389805 392180 391351 389923 390932 390136 390556
>  389684 390324 390152 390982 391355]
> sec (numpy iteration based): 10.3055525213
> [256 bin counts, identical to the C-indexing output above]
> 0
>
>
> Cameron Walsh wrote:
> > Hi all,
> >
> > I'm trying to generate histograms of extremely large datasets.  I've
> > tried a few methods, listed below, all with their own shortcomings.
> > Mailing-list archive and google searches have not revealed any
> > solutions.
> >
> > Method 1:
> >
> > import numpy
> > import pylab
> >
> > data=numpy.empty((489,1000,1000),dtype="uint8")
> > # Replace this line with actual data samples, but the size and types
> > # are correct.
> >
> > histogram = pylab.hist(data, bins=range(0,256))
> > pylab.xlim(0,256)
> > pylab.show()
> >
> > The problem with this method is that it appears never to finish.  It
> > is, however, extremely fast for smaller data sets: a 5x1000x1000
> > array takes 1-2 seconds, whereas 500x1000x1000 does not complete.
> >
> >
> > Method 2:
> >
> > import numpy
> > import pylab
> >
> > data=numpy.empty((489,1000,1000),dtype="uint8")
> > # Replace this line with actual data samples, but the size and types
> > # are correct.
> >
> > bins=numpy.zeros((256,),dtype="uint32")
> > for val in data.flat:
> >     bins[val]+=1
> > barchart = pylab.bar(xrange(256),bins,align="center")
> > pylab.xlim(0,256)
> > pylab.show()
> >
> > The problem with this method is that it is incredibly slow, taking
> > up to 30 seconds for a 1x1000x1000 sample; I have neither the
> > patience nor the inclination to time a 500x1000x1000 sample.
> >
> >
> > Method 3:
> >
> > import numpy
> >
> > data=numpy.empty((489,1000,1000),dtype="uint8")
> > # Replace this line with actual data samples, but the size and types
> > # are correct.
> >
> > a=numpy.histogram(data,256)
> >
> >
> > The problem with this one is:
> >
> > Traceback (most recent call last):
> >  File "<stdin>", line 1, in <module>
> >  File "/usr/local/lib/python2.5/site-packages/numpy/lib/function_base.py",
> > line 96, in histogram
> >    n = sort(a).searchsorted(bins)
> > ValueError: dimensions too large.
> >
> >
> > It seems that iterating over the entire array and doing it manually is
> > the slowest possible method, but that the rest are not much better.
> > Is there a faster method available, or do I have to implement method 2
> > in C and submit the change as a patch?
> >
> > Thanks and best regards,
> >
> > Cameron.
> > _______________________________________________
> > Numpy-discussion mailing list
> > Numpy-discussion at scipy.org
> > http://projects.scipy.org/mailman/listinfo/numpy-discussion
> >
> >