[Numpy-discussion] String sort

Charles R Harris charlesr.harris@gmail....
Sat Feb 9 10:57:29 CST 2008


On Feb 9, 2008 6:47 AM, Francesc Altet <faltet@carabos.com> wrote:

> A Friday 08 February 2008, Charles R Harris escrigué:
> > On Feb 8, 2008 10:31 AM, Francesc Altet <faltet@carabos.com> wrote:
> > > A Friday 08 February 2008, Francesc Altet escrigué:
> > > > A Friday 08 February 2008, Charles R Harris escrigué:
> > > > > > Also, in the context of my work in indexing, and because of
> > > > > > the slowness of the current implementation in NumPy, I've
> > > > > > ended with an implementation of the quicksort method for 1-D
> > > > > > array strings. For moderately large arrays, it is about
> > > > > > 2.5x-3x faster than the (supposedly) mergesort version in
> > > > > > NumPy, not only due to the quicksort, but also because I've
> > > > > > implemented a couple of macros for efficient string swapping
> > > > > > and copy.  If this is of interest for NumPy developers, tell
> > > > > > me and I will provide the code.
> > > > >
> > > > > I have some code for this too and was going to merge it. Send
> > > > > yours along and I'll get to it this weekend.
> > > >
> > > > Ok, great.  I'm attaching it.  Tell me if you need some
> > > > clarification on the code.
> > >
> > > Ops.  I've introduced a last-minute problem in my code.  To fix
> > > this, just replace the flawed opt_strncmp() that I sent before by:
> > >
> > > static int inline opt_strncmp(char *a, char *b, int n) {
> > >  int i;
> > >  for (i=0; i<n; i++) {
> > >    if (a[i] > b[i]) return i+1;
> > >    if (a[i] < b[i]) return -(i+1);
> > >    /* Another way, but seems equivalent in speed, at least here */
> > > /*     if (a[i] != b[i]) */
> > > /*       return (((unsigned char *)a)[i] - ((unsigned char
> > > *)b)[i]); */ }
> > >  return 0;
> > > }
> > >
> > > Apparently, this version works just fine.
> >
> > Did you find this significantly faster than strncmp? There is also a
> > unicode compare, do you have thoughts about that?
>
> Well, for the unicode case it wouldn't be enough by replacing 'char'
> by 'Py_ArrayUCS4'?  Maybe this afternoon I can do some benchmarking too
> in this regard.
>

Looks like that for Numpy. The problem I was thinking about is that for wide
characters Windows C defaults to UTF16 while the Unixes default to UTF32.
The C99 standard didn't specify the exact length, but Numpy seems to use (or
assume) UTF32.

Anyway, after doing some work to fool the optimizer and subtracting loop
overhead, strncmp still comes out a bit faster for me, 11e-9 vs 16e-9
seconds to compare strings of length 10. I've attached the program. Note
that on my machine malloc appears to return zeroed memory, so the string
compares always go to the end.

Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://projects.scipy.org/pipermail/numpy-discussion/attachments/20080209/6d4b5c6d/attachment.html 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: compare.c
Type: text/x-csrc
Size: 629 bytes
Desc: not available
Url : http://projects.scipy.org/pipermail/numpy-discussion/attachments/20080209/6d4b5c6d/attachment.bin 


More information about the Numpy-discussion mailing list