[SciPy-dev] Lexsort
Travis Oliphant
oliphant.travis at ieee.org
Sat Dec 31 21:07:53 CST 2005
Charles R Harris wrote:
> Travis,
>
> On 12/31/05, *Travis Oliphant* <oliphant.travis at ieee.org
> <mailto:oliphant.travis at ieee.org>> wrote:
>
>
> Chuck,
>
> I've read up on lexigraphic sorting and I finally get it (It's the
> kind
> of sorting we do in spreadsheets all the time...First on column
> D, then
> on columnB, right). In the case of lexsort, it would appear that the
> ordering of the keys determines which array gets sorted on first.
>
>
> Yep. The one thing to be careful about is the key ordering. For instance,
> to sort a bunch of fixed length words alphabetically the words need to be
> sorted on the last letter first, then second to last, and so on. I
> agonized a
> bit about building this in, but decided to just sort on the keys in
> the order
> they appeared in the list as that seemed most intuitive, and let the
> sorter beware.
Ahh. Yeah, that is a bit confusing, a good example could help guide the
user. I just added a merge sort for STRING arrays, so that now the
lexsort could be used to do sorting on a record array just like
spreadsheet's do. I thought about a general-purpose sorting function
for void-type arrays for a while, then realized that memory-copies would
be a big problem and decided against it. I think the index-sort you
have in lexsort followed by index selection, will provide the capability
and will often be faster than trying to sort the actual record elements.
>
> Oh my, bit fields, too?
Well, no, bit-fields are below the resolution of the current type
descriptors. While, the general-purpose sort might be an interesting
idea, I'm inclined to leave it alone at least until someone has an
actual need for it (and not a hypothetical need).
>
> Right now, void types can't be sorted which means that records
> can't be
> sorted. A lex sort would certainly fix that problem as the user
> could
> specify the desired keys themselves. It just seems like it could be
> done more elegantly.
>
>
> Are you are thinking of an extended comparison function, perhaps? I
> think that
> a lexsort followed by selection would be faster, as integers, not
> large records,
> will be exchanged.
I came to this same conclusion and agree with you. That's why I left
the general-purpose VOID sorter be for now. Right now, you can sort
VOID arrays (using the builtin qsort) and the VOID-compare. We could
think about a better VOID compare function than just comparing the raw
bytes string-like. Perhaps using any defined fields. But, for now,
I'm just going to sit on it.
> Of course, there is always the memory problem. I am not
> sure of the best way to sort large memmapped data files, but I think
> it is a
> different sort of sort. A merge sort with an extra memmapped file
> might be the
> way to go. Or the merge sort with lots of swap. In either case, I
> think it will be
> important to access the records sequentially, which I guess selection
> doesn't
> do. A special comparison function could be useful in that case because
> record
> access time might be the limiting factor. Hmm... or swapping records
> instead
> of indices. In that case, you could use a callback function for
> swapping. I suspect
> a little experimentation would be in order.
I'm going to leave this alone for a while. If you'd like to continue to
play with it, feel free :-)
>
> I'm happy to see someone else use the code. I suspect it has been kind
> of hidden
> up to now. Maybe I should have finished documenting it ;)
Feel free to add to the docstring in multiarraymodule.c
(doc_lexsort). You know, I think it would be nice to have some means
to edit docstrings without re-compiling. I wonder if there is some-way
to write a function that would dynamically add to the docstrings of
builtins...
Thanks for all your sorting expertise....
-Travis
More information about the Scipy-dev
mailing list