[Numpy-discussion] String sort
Francesc Altet
faltet@carabos....
Mon Feb 11 03:58:59 CST 2008
A Monday 11 February 2008, Charles R Harris escrigué:
> On Feb 8, 2008 5:29 AM, Francesc Altet <faltet@carabos.com> wrote:
> > Hi,
> >
> > I'm a bit confused that the sort method of a string character
> > doesn't
> >
> > allow a mergesort:
> > >>> s = numpy.empty(10, "S10")
> > >>> s.sort(kind="merge")
> >
> > TypeError: desired sort not supported for this type
> >
> > However, by looking at the numpy sources, it seems that the only
> > implemented method for sorting array strings is "merge" (I presume
> > because it is stable). So, perhaps the message above should be
> > fixed.
>
> The only available method is the default quicksort. The mergesort is
> for argsort and was put in for lexsort to use.
That's good to know. However, I'm curious about where it is the
specific quicksort implementation for strings/unicode. I've had a look
at _sortmodule.c.src, but I can only find a quicksort implementation
for:
/**begin repeat
#TYPE=BOOL,BYTE,UBYTE,SHORT,USHORT,INT,UINT,LONG,ULONG,LONGLONG,ULONGLONG,FLOAT,DOUBLE,LONGDOUBLE,CFLOAT,CDOUBLE,CLONGDOUBLE#
**/
Where are the STRING/UNICODE versions?
> > Also, in the context of my work in indexing, and because of the
> > slowness of the current implementation in NumPy, I've ended with an
> > implementation of the quicksort method for 1-D array strings. For
> > moderately large arrays, it is about 2.5x-3x faster than the
> > (supposedly) mergesort version in NumPy, not only due to the
> > quicksort, but also because I've implemented a couple of macros for
> > efficient string swapping and copy. If this is of interest for
> > NumPy developers, tell me and I will provide the code.
>
> I've now got a string/ucs4 specific argsort(kind='q'), the string
> version of which is about 40% faster than the old default and about
> 10% faster than the mergesort version, but the string/ucs4 specific
> versions of sort aren't yet fairing as well. I'm timing things with
>
> In [1]: import timeit
>
> In [2]: t =
> timeit.Timer("np.fromstring(np.empty(10000).tostring(),dtype='|S8').s
>ort(kind='q')","import numpy as np")
>
> In [3]: t.repeat(3,100)
> Out[3]: [0.22127485275268555, 0.21282196044921875,
> 0.21273088455200195]
>
> That's with the current sort(kind='q') in svn, which uses the new
> string compare function but is otherwise the old default quicksort.
> The new string specific version of quicksort I'm testing is actually
> a bit slower than that. Both versions correctly sort the array. So
> I'm going to continue to experiment a bit until I see what is going
> on.
The version you are testing is your own or the one that I provided?
Here are the timings for my laptop:
In [32]: a = np.random.rand(10000).astype('S8')
In [33]: %timeit a.copy().sort() # original sort in NumPy
10 loops, best of 3: 16.8 ms per loop
In [34]: %timeit newqsort(a.copy()) # My own qsort implementation
100 loops, best of 3: 4.29 ms per loop
(I'm using a random string array here mainly because I use the sort in
my system NumPy, with the old string compare. However, as all the
contents in strings are not NULL chars, the performance should be
comparable, bar a few percent of improvement).
So, my newqsort still seems to run almost 4x faster than the one in
NumPy (you know, using the old string compare).
However, when using a server with an Opteron processor, the view is much
different:
In [55]: a = np.random.rand(10000).astype('S8')
In [56]: %timeit a.copy().sort()
100 loops, best of 3: 3.82 ms per loop
In [57]: %timeit newqsort(a.copy())
100 loops, best of 3: 3.29 ms per loop
Here, the difference in performance has been reduced to a mere 15%
(still favouring newqsort). So, with this, it seems like the
performance of the original sorting in NumPy only suffers a lot when
running in old processors (eg. Pentium 4), while the performance is
reasonable with newer ones (Opteron). On its hand, newqsort seems to
perform reasonably well in both. I don't know what exactly is the
reason for this (I don't know where it is the code for the original
quicksort for strings, so I can't do a visual comparison), but it would
be great if we can discover it!
Cheers,
--
>0,0< Francesc Altet http://www.carabos.com/
V V Cárabos Coop. V. Enjoy Data
"-"
More information about the Numpy-discussion
mailing list