[Numpy-discussion] A better median function?
Fri Aug 21 10:47:24 CDT 2009
I presented this during a lightning talk at the scipy conference
yesterday, so again, at the risk of painting myself as a flaming
Wanted: A Better/Faster median() Function
numpy implementation uses simple sorting algorithm:
Sort all the data using the .sort() method
Return middle value (or mean of two middle values)
One doesn’t have to sort all data – need only the middle value
Nicolas Devillard discusses several algorithms at
Implemented Devillard’s version of the Numerical Recipes select()
function using ctypes: 2 to 20 times faster on the large (> 10^6
points) arrays I tested
--- Caveat: I don’t have all the bells and whistles of the built-in
median function (multiple dimensions, non-contiguous, etc.)
Any of the numpy developers interested in pursuing this further?
I got a fairly loud "yes" from the back of the room which a few of us
guessed was Robert Kern. I take that as generic interest at least in
checking this out.
The background on this is that I am doing some glitch finding
algorithms where I call median frequently. I think my ultimate problem
is not in median(), but how I loop through the data, but that is a
different discussion. What I noticed as I was investigating was what I
noted in the slide above. Returning the middle of a sorted vector is
not a bad thing to do (admit it, we've all done it at some point), but
it does too much work. Things that are lower or higher than the median
don't need to be in a perfectly sorted order if all we are after is
the median value.
I did some googling and came up with the web page noted above. I used
his modified NumRec select() function as an excuse to learn ctypes,
and my initial weak attempts were successful. The speed ups depend
highly on the length of the data and the randomness - things that are
correlated or partially sorted already go quickly. My caveat is that
my select-based median is too simple; it must have 1-d contiguous data
of a predefined type. It also moves the data in place, affecting the
original variable. I have no idea how this will blow up if implemented
in a general purpose way.
Anyway, I'm not enough of a C-coder to have any hope of improving this
to the point where it can be included in numpy itself. However, if
someone is willing to take up the torch, I will volunteer to assist
with discussion, prototyping a few routines, and testing (I have lots
of real-world data). One could argue that the current median
implementation is good enough (and it probably is for 99% of all
usage), but I view this as a chance to add an industrial strength
routine to the numpy base.
Thanks for listening.
More information about the NumPy-Discussion