[Numpy-discussion] Histograms of extremely large data sets
Cameron Walsh
cameron.walsh at gmail.com
Thu Dec 14 21:43:46 CST 2006
Using Eric's latest speed-testing script, here are David's results:
cameron at cameron-laptop:~/code_snippets/histogram$ python histogram_speed.py
type: uint8
millions of elements: 100.0
sec (C indexing based): 8.44 100000000
sec (numpy iteration based): 8.91 100000000
sec (rick's pure python): 6.4 100000000
sec (nd evenly spaced): 2.1 100000000
sec (1d evenly spaced): 1.33 100000000
sec (david huard): 35.84 100000000
Summary:
case sec speed-up
weave_1d_arbitrary 8.440000 0.758294
weave_nd_arbitrary 8.910000 0.718294
ricks_arbitrary 6.400000 1.000000
weave_nd_even 2.100000 3.047619
weave_1d_even 1.330000 4.812030
david_huard 35.840000 0.178571
I also tried this on an equal-sized sample of my real-world data: 100
image slices, 8bits/sample, 1000x1000 pixels per image. The full data
set is 489 image slices, but I was unable to randomly generate 489
million data samples because I ran out of memory and started thrashing
the page file, ruining any results. So I've compared like with like
and got the following results with real-world data:
type: uint8
millions of elements: 100.0
sec (C indexing based): 6.1 100000000
sec (numpy iteration based): 7.07 100000000
sec (rick's pure python): 4.77 100000000
sec (nd evenly spaced): 2.12 100000000
sec (1d evenly spaced): 1.33 100000000
sec (david huard): 16.47 100000000
Summary:
case sec speed-up
weave_1d_arbitrary 6.100000 0.781967
weave_nd_arbitrary 7.070000 0.674682
ricks_arbitrary 4.770000 1.000000
weave_nd_even 2.120000 2.250000
weave_1d_even 1.330000 3.586466
david_huard 16.470000 0.289617
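The gap between the arbitrary-bin and even-bin cases comes down to the algorithm: with uint8 data and evenly spaced bins, you can count each of the 256 possible values once and then fold those counts into bins, with no per-element Python work. A minimal sketch of that idea (not the weave code benchmarked above) using np.bincount:

```python
import numpy as np

def hist_uint8_even(data, nbins):
    # Count occurrences of each of the 256 possible uint8 values,
    # then sum adjacent groups of values into the requested bins.
    # Assumes nbins divides 256 evenly.
    counts = np.bincount(data.ravel(), minlength=256)
    return counts.reshape(nbins, 256 // nbins).sum(axis=1)

data = np.array([0, 1, 2, 128, 255, 255], dtype=np.uint8)
h = hist_uint8_even(data, 4)  # 4 bins, each 64 values wide
```

Here `h` comes out as `[3, 0, 1, 2]`: three values below 64, none in 64-127, one in 128-191, two in 192-255.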
Note how much faster some of the algorithms run on the non-random,
real-world data. I assume this is because the cost of the quicksort
step varies with the initial ordering of the data.
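A common sort-based formulation (a sketch of the general technique, not necessarily Rick's exact code) makes that order dependence concrete: the only step whose cost depends on the input ordering is np.sort.

```python
import numpy as np

def histogram_by_sort(data, bin_edges):
    # Sort the data, then locate each bin edge with a binary search;
    # differences between adjacent edge positions are the bin counts.
    # np.sort's quicksort cost can vary with the initial ordering.
    sorted_data = np.sort(data, kind='quicksort')
    positions = np.searchsorted(sorted_data, bin_edges)
    return np.diff(positions)

data = np.array([3, 1, 4, 1, 5, 9, 2, 6], dtype=np.uint8)
edges = np.array([0, 2, 4, 6, 8, 10])  # 5 bins: [0,2), [2,4), ...
counts = histogram_by_sort(data, edges)
```

For this input, `counts` is `[2, 2, 2, 1, 1]`; no per-element Python loop is involved, which is why the sort step dominates the runtime.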
Scaling with the full data set was similar. Unfortunately, David's
code was unable to load the full set of 489 image slices, raising the
same error mentioned in the first email in this thread.
Later parts of the project I am working on will probably require
iteration over the entire data set, and per-element iteration appears
to be the bottleneck in several of these histogram algorithms, which
is what makes the sort() approach necessary. I'll have a look at the
iterator and see whether anything can be done there instead. I am
hoping it will be possible to use a C-based iterator for a numpy
multiarray, as that would speed up many data-processing algorithms,
not just the histogram.
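In the meantime, one way around the memory limit that stopped the 489-slice run is to accumulate per-slice histograms rather than histogramming the whole volume at once. A rough sketch, assuming the slices can be loaded one at a time as uint8 arrays:

```python
import numpy as np

def accumulate_histogram(slices, nbins=256):
    # Histogram each image slice separately and sum the partial
    # histograms, so the full volume never has to be in memory at
    # once.  `slices` is any iterable of uint8 arrays, e.g. images
    # loaded lazily from disk.
    total = np.zeros(nbins, dtype=np.int64)
    for s in slices:
        total += np.bincount(s.ravel(), minlength=nbins)
    return total

# Two toy 4x4 "slices": one all zeros, one all 255s.
slices = [np.zeros((4, 4), dtype=np.uint8),
          np.full((4, 4), 255, dtype=np.uint8)]
hist = accumulate_histogram(slices)
```

Here `hist[0]` and `hist[255]` are both 16, and the total count is 32, matching the 2 x 16 input pixels. Peak memory is one slice plus the 256-entry accumulator, regardless of how many slices there are.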
Once again, thanks to everyone for all your input. This seems to have
generated more discussion and action than I anticipated, for which I
am very grateful.
Best regards,
Cameron.