[Numpy-discussion] Sort performance with structured array

Charles R Harris charlesr.harris@gmail....
Sun Apr 7 19:00:56 CDT 2013


On Sun, Apr 7, 2013 at 5:56 PM, Charles R Harris
<charlesr.harris@gmail.com> wrote:

>
>
> On Sun, Apr 7, 2013 at 5:23 PM, Tom Aldcroft <
> aldcroft@head.cfa.harvard.edu> wrote:
>
>> I'm seeing about a factor of 50 difference in performance between
>> sorting a random integer array versus sorting that same array viewed
>> as a structured array.  Am I doing anything wrong here?
>>
>> In [2]: x = np.random.randint(10000, size=10000)
>>
>> In [3]: xarr = x.view(dtype=[('a', np.int)])
>>
>> In [4]: timeit np.sort(x)
>> 1000 loops, best of 3: 588 us per loop
>>
>> In [5]: timeit np.sort(xarr)
>> 10 loops, best of 3: 29 ms per loop
>>
>> In [6]: timeit np.sort(xarr, order=('a',))
>> 10 loops, best of 3: 28.9 ms per loop
>>
>> I was wondering if this slowdown is expected (maybe the comparison is
>> dropping back to pure Python or ??).  I'm showing a simple example
>> here, but in reality I'm working with non-trivial structured arrays
>> where I might want to sort on multiple columns.
>>
>> Does anyone have suggestions for speeding things up, or have a sort
>> implementation (perhaps Cython) that has better performance for
>> structured arrays?
>>
>
> This is probably due to the comparison function used. For plain integers
> the C operator `<` is used directly; for structured dtypes, the dtype
> comparison function is passed to the sort routines as a pointer. I doubt
> Cython would make any difference in this case, but making the dtype
> comparison routine better would probably help a lot. For all I know, the
> dtype gets parsed on every call to the comparison function.
>
>
Note that even sorting as a byte string is notably faster:

In [13]: sarr = x.view(dtype='<S8')

In [14]: timeit sort(sarr)
1000 loops, best of 3: 1.31 ms per loop
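
The byte-string view is only a timing comparison: it orders the raw bytes, which on a little-endian machine does not match integer order. If the slow part really is the generic dtype comparison, one possible workaround is to compute the sort permutation from the plain per-field arrays and then index the structured array with it. The sketch below is untested for speed here and makes assumptions not in the thread: the two-field layout ('a', 'b') is hypothetical, and np.lexsort is swapped in for the multi-column structured sort.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-column structured array standing in for the
# "non-trivial structured arrays" mentioned in the question.
arr = np.zeros(10000, dtype=[('a', np.int64), ('b', np.float64)])
arr['a'] = rng.integers(0, 100, size=arr.size)
arr['b'] = rng.random(arr.size)

# Baseline: structured sort, which uses the generic dtype comparison.
expected = np.sort(arr, order=['a', 'b'])

# Workaround sketch: sort on plain field arrays instead.
# np.lexsort treats the LAST key as the primary key, so 'a' goes last.
idx = np.lexsort((arr['b'], arr['a']))
result = arr[idx]

# Single-column case: argsort the field, then reorder the whole array.
idx_a = np.argsort(arr['a'], kind='stable')
by_a = arr[idx_a]
```

Whether this is actually faster depends on the NumPy version and the dtypes involved, so it is worth timing against np.sort(arr, order=...) on the real data.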

Chuck