[Numpy-discussion] String sort

Francesc Altet faltet@carabos....
Sat Feb 9 16:19:42 CST 2008


A Saturday 09 February 2008, Charles R Harris escrigué:
> I'm also thinking that at some point it becomes more efficient to do
> a indirect sort followed by take than to move all those big strings
> around. But I guess we won't know where that point is until we have
> both versions available.

I've done some experiments in that matter too.  They are saying that, 
with the current mergesort in NumPy, an indirect sort followed by take 
performs similarly to direct sort for small string lengths (<=16), but 
indirect sort starts to win afterwards.

The version with quicksort and optimized sSWAP should be between 2x and 
3x times faster than current mergesort implementation, so the advantage 
for direct sort could grow up to somewhere between 50 and 100.  A nice 
idea could be doing some more toughful experiments in order to find the 
point where an indirect sort followed by a take would be more efficient 
and automatically select this method beyond that point.

However, this has the drawback that you have to use additional memory 
for keeping the indices in the indirect method.  Of course, when 
strings are large, those indices should take a rather negligible space 
compared with strings itself.  In any case, in some situations where 
space is critical, this can still be important.  I don't know, but my 
opinion is that we shouldn't take too agressive optimizations for that 
matter.  My vote is to document this possibility in the docstrings, so 
that the user wanting for extreme performance can use this approach if 
he wants to.  Still, for string sizes greater than, say, 1000, well, an 
automatic selection of the indirect method is very tempting indeed.

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"


More information about the Numpy-discussion mailing list