[Numpy-discussion] unique() should return a sorted array

Robert Kern robert.kern at gmail.com
Tue Jul 11 11:23:42 CDT 2006


Tim Hochberg wrote:
> Norbert Nemec wrote:
>> unique1d is based on ediff1d, so it really calculates many differences
>> and compares those to 0.0
>>
>> This is inefficient, even though this is hidden by the general
>> inefficiency of Python (It might be the reason for the two milliseconds,
>> though)
>>
>> What is more: subtraction works only for numbers, while the various
>> proposed versions use only comparison which works for any data type (as
>> long as it can be sorted)
>>   
> My first question is: why? What's the attraction in returning a sorted 
> answer here? Returning an unsorted array is potentially faster, 
> depending on the algorithm chosen,  and sorting after the fact is 
> trivial. If one was going to spend extra complexity on something, I'd 
> think it would be better spent on preserving the input order.

One issue is that uniquifying numpy arrays using Python dicts, while expected 
O(n) in terms of complexity, is really slow compared to sorting because of the 
overhead in getting the elements out of the numpy arrays and into Python 
objects. For the cases where sorting works (your caveat below is quite correct), 
it's really quite good for numpy arrays.

OTOH, if one were to implement a hash table in C, that might potentially be 
faster and more robust, but that is spending a *large* amount of extra complexity.

> Second, some objects can be compared for equality and hashed, but not 
> sorted (Python's complex number's come to mind). If one is going to 
> worry about subtraction so as to keep things general, it makes sense to 
> also avoid sorting as well  Sasha's slick algorithm not withstanding.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco





More information about the Numpy-discussion mailing list