[Numpy-discussion] Histograms of extremely large data sets

eric jones eric at enthought.com
Wed Dec 13 01:42:09 CST 2006


Hey Cameron,

I wrote a simple weave-based histogram function that should work for 
your problem.  It should accept input arrays of any data type.  The needed 
files (and a few tests and examples) are attached.

Below is the output from the attached histogram_speed.py file.  The test 
takes about 10 seconds to bin a uniformly distributed set of data from a 
1000x1000x100 uint8 array into 256 bins.  It compares Travis' nifty new 
iterator-based indexing in numpy to raw C indexing of a contiguous 
array.  The two algorithms give identical results, and the speed 
difference is small (about 9.5 vs. 10.3 seconds).  That's cool because 
the iterator-based stuff makes this sort of algorithm quite easy to 
handle for N-dimensional arrays.
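
For readers without the attachments, the binning idea can be roughed out in 
plain NumPy.  This is only a sketch and is not the attached weave code: the 
function name uint8_histogram and the chunk_elems parameter are made up for 
illustration, and it assumes uint8 input with one bin per possible value.

```python
import numpy as np

def uint8_histogram(data, chunk_elems=1 << 24):
    """Count occurrences of each value 0..255 in a uint8 array of any shape.

    Hypothetical helper for illustration; chunk_elems only bounds the size
    of the piece handed to numpy.bincount at one time.
    """
    if data.dtype != np.uint8:
        raise TypeError("expected a uint8 array")
    counts = np.zeros(256, dtype=np.intp)
    flat = data.reshape(-1)  # a view for contiguous input, a copy otherwise
    for start in range(0, flat.size, chunk_elems):
        chunk = flat[start:start + chunk_elems]
        # bincount does the per-element "bins[val] += 1" loop in C.
        counts += np.bincount(chunk, minlength=256)
    return counts

# Small demonstration array, far smaller than the 1000x1000x100 test case.
rng = np.random.default_rng(0)
demo = rng.integers(0, 256, size=(10, 100, 100), dtype=np.uint8)
demo_counts = uint8_histogram(demo, chunk_elems=30_000)
```

Processing in chunks keeps any temporary that bincount creates small, which 
matters more for arrays of several hundred megabytes than for this demo.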

Hope that helps,
eric

ps.  For those who care, I had to make a minor change to the array type 
converters so that they can be used with the iterator interface more 
easily.  Later this will be folded into weave, but for now I sub-classed 
the standard array converter and made the modifications.

# speed test output.
c:\eric\code\histogram> histogram_speed.py
type: uint8
millions of elements: 100.0
sec (C indexing based): 9.52776707654
[390141 390352 390598 389706 390985 390856 389785 390262 389929 391024
 391854 390243 391255 390723 390525 391751 389842 391612 389601 391210
 390799 391674 390693 390381 390460 389839 390185 390909 390215 391271
 390934 390818 390528 389990 389982 389667 391035 390317 390616 390916
 390191 389771 391448 390325 390556 391333 390148 390894 389611 390511
 390614 390999 389646 391255 391284 391214 392106 391067 391480 389991
 391091 390271 389801 390044 391459 390644 391309 390450 390200 391537
 390907 390160 391117 390738 391638 391200 390815 390611 390355 389925
 390939 390932 391569 390287 389987 389545 391140 391280 389773 389794
 389559 390085 389991 391372 390189 391010 390863 390432 390743 390959
 389271 390210 390967 390999 391177 389777 391748 390623 391597 392009
 389308 390557 390213 390930 390449 390327 390600 390626 389985 390816
 389671 390187 390595 390973 390921 390599 390167 391196 390381 391345
 392166 389709 390656 389886 390646 390355 391273 391342 390234 390751
 390515 390048 390455 391122 391069 390968 390488 390708 391027 391179
 391110 390453 390632 390825 391369 390844 390001 391487 390778 390788
 390609 390254 389907 391803 391508 391414 391012 389987 389284 390699
 391094 390658 390463 390291 390848 389616 390894 389561 390971 391165
 391378 391698 389434 390591 390027 391088 390787 391165 390169 391212
 389799 389829 389764 390435 391158 391834 391206 390041 391537 390237
 390253 391025 392336 391081 390005 391057 390226 390240 390197 389906
 391164 391157 390639 391501 389125 389922 390961 390012 389832 389650
 390018 390461 390695 390140 390939 389089 391094 390076 391123 389518
 391340 390039 390786 391751 391133 390675 392305 390667 391243 389889
 390103 390438 389215 389805 392180 391351 389923 390932 390136 390556
 389684 390324 390152 390982 391355]
sec (numpy iteration based): 10.3055525213
[390141 390352 390598 389706 390985 390856 389785 390262 389929 391024
 391854 390243 391255 390723 390525 391751 389842 391612 389601 391210
 390799 391674 390693 390381 390460 389839 390185 390909 390215 391271
 390934 390818 390528 389990 389982 389667 391035 390317 390616 390916
 390191 389771 391448 390325 390556 391333 390148 390894 389611 390511
 390614 390999 389646 391255 391284 391214 392106 391067 391480 389991
 391091 390271 389801 390044 391459 390644 391309 390450 390200 391537
 390907 390160 391117 390738 391638 391200 390815 390611 390355 389925
 390939 390932 391569 390287 389987 389545 391140 391280 389773 389794
 389559 390085 389991 391372 390189 391010 390863 390432 390743 390959
 389271 390210 390967 390999 391177 389777 391748 390623 391597 392009
 389308 390557 390213 390930 390449 390327 390600 390626 389985 390816
 389671 390187 390595 390973 390921 390599 390167 391196 390381 391345
 392166 389709 390656 389886 390646 390355 391273 391342 390234 390751
 390515 390048 390455 391122 391069 390968 390488 390708 391027 391179
 391110 390453 390632 390825 391369 390844 390001 391487 390778 390788
 390609 390254 389907 391803 391508 391414 391012 389987 389284 390699
 391094 390658 390463 390291 390848 389616 390894 389561 390971 391165
 391378 391698 389434 390591 390027 391088 390787 391165 390169 391212
 389799 389829 389764 390435 391158 391834 391206 390041 391537 390237
 390253 391025 392336 391081 390005 391057 390226 390240 390197 389906
 391164 391157 390639 391501 389125 389922 390961 390012 389832 389650
 390018 390461 390695 390140 390939 389089 391094 390076 391123 389518
 391340 390039 390786 391751 391133 390675 392305 390667 391243 389889
 390103 390438 389215 389805 392180 391351 389923 390932 390136 390556
 389684 390324 390152 390982 391355]
0
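
A speed test along the lines of the output above might be structured roughly 
as follows.  This is not the scrubbed histogram_speed.py; it is a sketch that 
swaps in two NumPy-only routines (a contiguous copy plus numpy.bincount, and a 
buffered numpy.nditer loop) for the two weave versions.  The names 
hist_contiguous, hist_iterator and time_call are hypothetical.

```python
import time
import numpy as np

def hist_contiguous(data):
    """Histogram via a contiguous copy and a single bincount call."""
    return np.bincount(np.ascontiguousarray(data).reshape(-1), minlength=256)

def hist_iterator(data, buffersize=1 << 20):
    """Histogram via nditer, which walks arrays of any strides in 1-D chunks."""
    counts = np.zeros(256, dtype=np.intp)
    it = np.nditer(data, flags=["external_loop", "buffered"],
                   buffersize=buffersize)
    for chunk in it:
        counts += np.bincount(chunk, minlength=256)
    return counts

def time_call(func, data):
    """Return (elapsed seconds, result) for one call of func(data)."""
    start = time.perf_counter()
    result = func(data)
    return time.perf_counter() - start, result

rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(50, 100, 100), dtype=np.uint8)
strided = data[:, ::2, :]  # deliberately non-contiguous view

sec_c, counts_c = time_call(hist_contiguous, strided)
sec_it, counts_it = time_call(hist_iterator, strided)

print("type:", strided.dtype)
print("millions of elements:", strided.size / 1e6)
print("sec (contiguous copy based):", sec_c)
print("sec (nditer based):", sec_it)
print("max abs difference:", int(np.abs(counts_c - counts_it).max()))
```

Timings will vary by machine and array size, so none are quoted here; the 
point is only that both routines are checked against each other.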


Cameron Walsh wrote:
> Hi all,
>
> I'm trying to generate histograms of extremely large datasets.  I've
> tried a few methods, listed below, all with their own shortcomings.
> Mailing-list archive and google searches have not revealed any
> solutions.
>
> Method 1:
>
> import numpy
> import pylab
>
> data=numpy.empty((489,1000,1000),dtype="uint8")
> # Replace this line with actual data samples, but the size and types
> # are correct.
>
> histogram = pylab.hist(data, bins=range(0,256))
> pylab.xlim(0,256)
> pylab.show()
>
> The problem with this method is that it appears never to finish.  It is,
> however, extremely fast for smaller data sets, like 5x1000x1000 (1-2
> seconds) instead of 500x1000x1000.
>
>
> Method 2:
>
> import numpy
> import pylab
>
> data=numpy.empty((489,1000,1000),dtype="uint8")
> # Replace this line with actual data samples, but the size and types
> # are correct.
>
> bins=numpy.zeros((256),dtype="uint32")
> for val in data.flat:
>     bins[val]+=1
> barchart = pylab.bar(xrange(256),bins,align="center")
> pylab.xlim(0,256)
> pylab.show()
>
> The problem with this method is that it is incredibly slow, taking up to
> 30 seconds for a 1x1000x1000 sample.  I have neither the patience nor the
> inclination to time a 500x1000x1000 sample.
>
>
> Method 3:
>
> import numpy
>
> data=numpy.empty((489,1000,1000),dtype="uint8")
> # Replace this line with actual data samples, but the size and types
> # are correct.
>
> a=numpy.histogram(data,256)
>
>
> The problem with this one is:
>
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "/usr/local/lib/python2.5/site-packages/numpy/lib/function_base.py",
> line 96, in histogram
>    n = sort(a).searchsorted(bins)
> ValueError: dimensions too large.
>
>
> It seems that iterating over the entire array and doing it manually is
> the slowest possible method, but that the rest are not much better.
> Is there a faster method available, or do I have to implement method 2
> in C and submit the change as a patch?
>
> Thanks and best regards,
>
> Cameron.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: weave_histogram.py
Type: text/x-python
Size: 2533 bytes
Desc: not available
Url : http://projects.scipy.org/pipermail/numpy-discussion/attachments/20061213/89e225e0/attachment-0005.py 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: histogram_speed.py
Type: text/x-python
Size: 702 bytes
Desc: not available
Url : http://projects.scipy.org/pipermail/numpy-discussion/attachments/20061213/89e225e0/attachment-0006.py 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_weave_histogram.py
Type: text/x-python
Size: 2170 bytes
Desc: not available
Url : http://projects.scipy.org/pipermail/numpy-discussion/attachments/20061213/89e225e0/attachment-0007.py 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: typed_array_converter.py
Type: text/x-python
Size: 1582 bytes
Desc: not available
Url : http://projects.scipy.org/pipermail/numpy-discussion/attachments/20061213/89e225e0/attachment-0008.py 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: weave_contiguous_histogram.py
Type: text/x-python
Size: 2388 bytes
Desc: not available
Url : http://projects.scipy.org/pipermail/numpy-discussion/attachments/20061213/89e225e0/attachment-0009.py 

