[SciPy-dev] Lexsort

Travis Oliphant oliphant.travis at ieee.org
Sat Dec 31 21:07:53 CST 2005


Charles R Harris wrote:

> Travis,
>
> On 12/31/05, *Travis Oliphant* <oliphant.travis at ieee.org 
> <mailto:oliphant.travis at ieee.org>> wrote:
>
>
>     Chuck,
>
>     I've read up on lexigraphic sorting and I finally get it (It's the
>     kind
>     of sorting we do in spreadsheets all the time...First on  column
>     D, then
>     on columnB, right).  In the case of lexsort, it would appear that the
>     ordering of the keys determines which array gets sorted on first.
>
>
> Yep. The one thing to be careful about is the key ordering. For instance,
> to sort a bunch of fixed length words alphabetically the words need to be
> sorted on the last letter first, then second to last, and so on. I 
> agonized a
> bit about building this in, but decided to just sort on the keys in 
> the order
> they appeared in the list as that seemed most intuitive, and let the
> sorter beware.

Ahh.  Yeah, that is a bit confusing, a good example could help guide the 
user.   I just added a merge sort for STRING arrays, so that now the 
lexsort could be used to do sorting on a record array just like 
spreadsheet's do.  I thought about a general-purpose sorting function 
for void-type arrays for a while, then realized that memory-copies would 
be a big problem and decided against it.  I think the index-sort you 
have in lexsort followed by index selection, will provide the capability 
and will often be faster than trying to sort the actual record elements.

>
> Oh my, bit fields, too?

Well, no, bit-fields are below the resolution of the current type 
descriptors.   While, the general-purpose sort might be an interesting 
idea, I'm inclined to leave it alone at least until someone has an 
actual need for it (and not a hypothetical need). 

>
>     Right now, void types can't be sorted which means that records
>     can't be
>     sorted.  A lex sort would certainly fix that problem as the user
>     could
>     specify the desired keys themselves.  It just seems like it could be
>     done more elegantly. 
>
>
> Are you are thinking of an extended comparison function, perhaps? I 
> think that
> a lexsort followed by selection would be faster, as integers, not 
> large records,
> will be exchanged.

I came to this same conclusion and agree with you.  That's why I left 
the general-purpose VOID sorter be for now.  Right now, you can sort 
VOID arrays (using the builtin qsort) and the VOID-compare.  We could 
think about a better VOID compare function than just comparing the raw 
bytes string-like.  Perhaps using any defined fields.   But, for now, 
I'm just going to sit on it.

> Of course, there is always the memory problem. I am not
> sure of the best way to sort large memmapped data files, but I think 
> it is a
> different sort of sort. A merge sort with an extra memmapped file 
> might be the
> way to go. Or the merge sort with lots of swap. In either case, I 
> think it will be
> important to access the records sequentially, which I guess selection 
> doesn't
> do. A special comparison function could be useful in that case because 
> record
> access time might be the limiting factor. Hmm... or swapping records 
> instead
> of indices. In that case, you could use a callback function for 
> swapping. I suspect
> a little experimentation would be in order.

I'm going to leave this alone for a while.  If you'd like to continue to 
play with it, feel free :-)

>
> I'm happy to see someone else use the code. I suspect it has been kind 
> of hidden
> up to now. Maybe I should have finished documenting it ;)

Feel free to add to the docstring in multiarraymodule.c 
(doc_lexsort).    You know, I think it would be nice to have some means 
to edit docstrings without re-compiling.   I wonder if there is some-way 
to write a function that would dynamically add to the docstrings of 
builtins...

Thanks for all your sorting expertise....

-Travis




More information about the Scipy-dev mailing list