[Numpy-discussion] String sort

Francesc Altet faltet@carabos....
Mon Feb 11 03:58:59 CST 2008


A Monday 11 February 2008, Charles R Harris escrigué:
> On Feb 8, 2008 5:29 AM, Francesc Altet <faltet@carabos.com> wrote:
> > Hi,
> >
> > I'm a bit confused that the sort method of a string character
> > doesn't
> >
> > allow a mergesort:
> > >>> s = numpy.empty(10, "S10")
> > >>> s.sort(kind="merge")
> >
> > TypeError: desired sort not supported for this type
> >
> > However, by looking at the numpy sources, it seems that the only
> > implemented method for sorting array strings is "merge" (I presume
> > because it is stable).  So, perhaps the message above should be
> > fixed.
>
> The only available method is the default quicksort. The mergesort is
> for argsort and was put in for lexsort to use.

That's good to know.  However, I'm curious about where it is the 
specific quicksort implementation for strings/unicode.  I've had a look 
at _sortmodule.c.src, but I can only find a quicksort implementation 
for:

/**begin repeat
#TYPE=BOOL,BYTE,UBYTE,SHORT,USHORT,INT,UINT,LONG,ULONG,LONGLONG,ULONGLONG,FLOAT,DOUBLE,LONGDOUBLE,CFLOAT,CDOUBLE,CLONGDOUBLE#
**/

Where are the STRING/UNICODE versions?

> > Also, in the context of my work in indexing, and because of the
> > slowness of the current implementation in NumPy, I've ended with an
> > implementation of the quicksort method for 1-D array strings.  For
> > moderately large arrays, it is about 2.5x-3x faster than the
> > (supposedly) mergesort version in NumPy, not only due to the
> > quicksort, but also because I've implemented a couple of macros for
> > efficient string swapping and copy.  If this is of interest for
> > NumPy developers, tell me and I will provide the code.
>
> I've now got a string/ucs4 specific argsort(kind='q'), the string
> version of which is about 40% faster than the old default and about
> 10% faster than the mergesort version, but the string/ucs4 specific
> versions of sort aren't yet fairing as well. I'm timing things with
>
> In [1]: import timeit
>
> In [2]: t =
> timeit.Timer("np.fromstring(np.empty(10000).tostring(),dtype='|S8').s
>ort(kind='q')","import numpy as np")
>
> In [3]: t.repeat(3,100)
> Out[3]: [0.22127485275268555, 0.21282196044921875,
> 0.21273088455200195]
>
> That's with the current sort(kind='q') in svn, which uses the new
> string compare function but is otherwise the old default quicksort.
> The new string specific version of quicksort I'm testing is actually
> a bit slower than that. Both versions correctly sort the array. So
> I'm going to continue to experiment a bit until I see what is going
> on.

The version you are testing is your own or the one that I provided?  
Here are the timings for my laptop:

In [32]: a = np.random.rand(10000).astype('S8')

In [33]: %timeit a.copy().sort()   # original sort in NumPy
10 loops, best of 3: 16.8 ms per loop

In [34]: %timeit newqsort(a.copy())  # My own qsort implementation
100 loops, best of 3: 4.29 ms per loop

(I'm using a random string array here mainly because I use the sort in 
my system NumPy, with the old string compare.  However, as all the 
contents in strings are not NULL chars, the performance should be 
comparable, bar a few percent of improvement).

So, my newqsort still seems to run almost 4x faster than the one in 
NumPy (you know, using the old string compare).

However, when using a server with an Opteron processor, the view is much 
different:

In [55]: a = np.random.rand(10000).astype('S8')

In [56]: %timeit a.copy().sort()
100 loops, best of 3: 3.82 ms per loop

In [57]: %timeit newqsort(a.copy())
100 loops, best of 3: 3.29 ms per loop

Here, the difference in performance has been reduced to a mere 15% 
(still favouring newqsort).  So, with this, it seems like the 
performance of the original sorting in NumPy only suffers a lot when 
running in old processors (eg. Pentium 4), while the performance is 
reasonable with newer ones (Opteron).  On its hand, newqsort seems to 
perform reasonably well in both.  I don't know what exactly is the 
reason for this (I don't know where it is the code for the original 
quicksort for strings, so I can't do a visual comparison), but it would 
be great if we can discover it!

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"


More information about the Numpy-discussion mailing list