[Numpy-discussion] A better median function?

Tue Aug 25 01:24:22 CDT 2009

```I've made some progress on this, building the tools for a faster
median().  I was able to fairly easily make both nth_element() and
partial_sort() types of functions by modifying numpy's quicksort,
however I wasn't that happy with their API from a python/numpy point
of view.

My current plan of attack is to deliver a partition() function that
basically returns an array such that elements less than the pivot(s)
come first, then the pivot(s), then the elements greater than the
pivot(s).  The nifty trick being that the pivot can be a range of
indices, so it can be used to easily implement C++'s nth_element(),
partial_sort() and more.  This should allow for faster medians, as
well as satisfy those wanting the two values of an even-length array's
median elements, rather than the average between them.

It will probably work something like this:

\$ python
>>> import numpy as np
>>> a=np.array(range(9,-1,-1))
>>> a
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
>>> np.partition(a, 3, 7)  # partition around the indices 3-7
>>> array([1, 0, 2, 4, 6, 5, 3, 8, 9, 7])  # The partition is not
necessarily sorted
>>> a[3:7].sort()    # But can be sub-sorted after the fact
>>> a
array([1, 0, 2, 3, 4, 5, 6, 8, 9, 7])

This partition operation can usually be expected be to faster than a
full sort, depending on the length of the pivot range.

An nth_element() operation then becomes:

>>> np.partition(a, k)  # a single arg means pivot around that index
>>> a[k]

partial_sort() is as above:

>>> np.partition(a, start, end)  # using Python range notation
>>> a[start:end].sort()

odd length median is:

>>> n = a.size//2
>>> np.partition(a, n)
>>> a[n]

and the pair of elements for an even length median would be:

>>> start = a.size//2 - 1
>>> end = a.size//2 + 1
>>> np.partition(a, start, end)
>>> a[start:end].sort()
>>> l, r = a[start:end]

I'll work towards getting some code to look at up later in the week.
In the meantime, anyone interested, who has feedback on the above

Note - I already am not sure of this proposed API using [start, end)
type ranges to define the pivot range.  What then would be the result
of:

>>> np.partition(a, start, start)    # an "empty" range?  Probably
just a no-op...

Also, it might just make sense to go with partial_sort() directly (ie.
does the partitioning and also sorts the pivot range), since that
avoids a separate python call to sort() just compute the even-median.
Maybe I'll just guarantee for partitioning "short" ranges, that the
pivot will be sorted after partitioning.  It avoids an additional
Python sort call, and is trivially fast for insertion sort to do...

Thanks,