[SciPy-User] Accumulation sum using indirect indexes

Alexander Kalinin alec.kalinin@gmail....
Sat Feb 4 13:23:57 CST 2012


I have compared the performance of the "pure numpy" solution with the pandas
solution on my task. The "pure numpy" solution is about two times slower.

The data shape:
    (1062, 6348)
Pandas "group by sum" time:
    0.16588 seconds
Pure numpy "group by sum" time:
    0.38979 seconds

Interestingly, the main bottleneck in the numpy solution is the data
copying. I have divided the solution into three blocks:

# block (a):
    s = np.argsort(labels)
    keys, inv = np.unique(labels, return_inverse = True)
    i = inv[s]
    groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]

# block (b):
    ordered_data = data[:, s]

# block (c):
    group_sums = np.add.reduceat(ordered_data, groups_at, axis = 1)
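
For reference, the three blocks can be assembled into one self-contained
sketch. The function name group_sum_columns is made up for illustration and
is not part of Warren's script. The small test arrays reuse the labels and
values from the example output quoted further down, transposed so that the
labels run along the columns (axis = 1), as in the blocks above.

```python
import numpy as np


def group_sum_columns(data, labels):
    """Sum the columns of `data` that share a label (hypothetical helper)."""
    labels = np.asarray(labels)

    # block (a): sort order of the labels and start index of each group
    s = np.argsort(labels)
    keys, inv = np.unique(labels, return_inverse=True)
    i = inv[s]  # group index of every column, in sorted order
    groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]

    # block (b): reordered copy, columns of one group become adjacent
    ordered_data = data[:, s]

    # block (c): sum each run of adjacent columns
    group_sums = np.add.reduceat(ordered_data, groups_at, axis=1)
    return keys, group_sums


labels = np.array([20, 20, 10, 0, 20, 10, 20])
data = np.array([[1, 2, 3],
                 [1, 2, 4],
                 [3, 3, 1],
                 [5, 0, 0],
                 [1, 9, 0],
                 [2, 3, 4],
                 [9, 9, 1]]).T  # shape (3, 7): one column per label

keys, sums = group_sum_columns(data, labels)
print(keys)    # unique labels, sorted
print(sums.T)  # one row of sums per label
```

The order of columns inside a group does not matter for a sum, so the
default (not necessarily stable) argsort is sufficient here.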

The timing for the blocks is:
block (a):
    0.00138 seconds

block (b):
    0.29285 seconds

block (c):
    0.08868 seconds

The sorting and reduceat() steps are very fast. A single line,
"ordered_data = data[:, s]", takes most of the time.

This seems a bit strange to me. The reduceat() call, where the summation is
actually performed, is about 3 times faster than the data copying alone.
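
A minimal sketch for timing the copy and the reduction separately on random
data of the same shape is given here. It also includes one possible way to
skip the reordered copy: calling np.bincount row by row with the inverse
indices as bins. Everything in it is an assumption made for illustration: the
random data, the number of groups (500), and the helper names timed and
bincount_group_sum. The variant has not been benchmarked against the timings
above, and it always returns float64 sums.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
data = rng.random((1062, 6348))                    # same shape as in the post
labels = rng.integers(0, 500, size=data.shape[1])  # assumed number of groups


def timed(func):
    """Run func() once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# block (a)
s = np.argsort(labels)
keys, inv = np.unique(labels, return_inverse=True)
i = inv[s]
groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]

# blocks (b) and (c), timed separately
ordered_data, t_copy = timed(lambda: data[:, s])
group_sums, t_reduce = timed(
    lambda: np.add.reduceat(ordered_data, groups_at, axis=1))


def bincount_group_sum(data, inv, n_groups):
    """Group sums without a reordered copy: one np.bincount call per row."""
    out = np.empty((data.shape[0], n_groups))
    for r, row in enumerate(data):
        out[r] = np.bincount(inv, weights=row, minlength=n_groups)
    return out


alt_sums, t_alt = timed(lambda: bincount_group_sum(data, inv, len(keys)))

print("copy     : %.5f seconds" % t_copy)
print("reduceat : %.5f seconds" % t_reduce)
print("bincount : %.5f seconds" % t_alt)
```

Both approaches should produce the same sums up to floating-point rounding;
which one is faster will depend on the array layout, the number of rows, and
the NumPy version.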

Alexander

On Thu, Feb 2, 2012 at 10:16 PM, Warren Weckesser <
warren.weckesser@enthought.com> wrote:

>
>
> On Wed, Feb 1, 2012 at 10:34 AM, Alexander Kalinin <alec.kalinin@gmail.com
> > wrote:
>
>> Yes, but for large data sets a loop is quite slow. I have tried Pandas
>> groupby.sum() and it is faster.
>>
>>
>
> Pandas is probably the correct tool to use for this, but it will be nice
> when numpy has a native "group-by" capability.
>
> For what it's worth (had to scratch the itch, so to speak), the attached
> script provides a "pure numpy" implementation without a python loop.  The
> output of the script is
>
> In [53]: run pseudo_group_by.py
> Label   Data
>  20    [1 2 3]
>  20    [1 2 4]
>  10    [3 3 1]
>   0    [5 0 0]
>  20    [1 9 0]
>  10    [2 3 4]
>  20    [9 9 1]
>
> Label  Num.   Sum
>   0     1   [5 0 0]
>  10     2   [5 6 5]
>  20     4   [12 22  8]
>
>
> A drawback of the method is that it will make a reordered copy of the
> data.  I haven't compared the performance to pandas.
>
> Warren
>
>
>
>>
>> 2012/2/1 Frédéric Bastien <nouiz@nouiz.org>
>>
>>> It will be slow, but you can make a python loop.
>>>
>>> Fred
>>> On Jan 31, 2012 3:34 PM, "Alexander Kalinin" <alec.kalinin@gmail.com>
>>> wrote:
>>>
>>>> Hello!
>>>>
>>>> I use SciPy in computer graphics applications. My task is to calculate
>>>> vertex normals by averaging face normals. In other words, I want to
>>>> accumulate vectors with the same ids. For example,
>>>>
>>>> ids = numpy.array([0, 1, 1, 2])
>>>> n = numpy.array([ [0.1, 0.1, 0.1], [0.1, 0.1, 0.1], [0.1, 0.1, 0.1],
>>>> [0.1, 0.1, 0.1] ])
>>>>
>>>> I need the result:
>>>> nv = ([ [0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [0.1, 0.1, 0.1]])
>>>>
>>>> The simplest code:
>>>> nv[ids] += n
>>>> does not work; I know about this. For 1D arrays I use the
>>>> numpy.bincount(...) function, but this function does not work for 2D arrays.
>>>>
>>>> So, my question: what is the best way to calculate an accumulated sum for
>>>> 2D arrays using indirect indexes?
>>>>
>>>> Sincerely,
>>>> Alexander
>>>>
>>>> _______________________________________________
>>>> SciPy-User mailing list
>>>> SciPy-User@scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>>
>>>>