[SciPy-User] Accumulation sum using indirect indexes
Alexander Kalinin
alec.kalinin@gmail....
Sat Feb 4 13:23:57 CST 2012
I have checked the performance of the "pure numpy" solution with pandas
solution on my task. The "pure numpy" solution is about two times slower.
The data shape:
(1062, 6348)
Pandas "group by sum" time:
0.16588 seconds
Pure numpy "group by sum" time:
0.38979 seconds
But it is interesting, that the main bottleneck in numpy solution is the
data copying. I have divided solution on three blocks:
# block (a):
s = np.argsort(labels)
keys, inv = np.unique(labels, return_inverse = True)
i = inv[s]
groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
# block (b):
ordered_data = data[:, s]
# block (c):
group_sums = np.add.reduceat(ordered_data, groups_at, axis = 1)
The timing for the blocks is:
block (a):
0.00138 seconds
block (b):
0.29285 seconds
block (c):
0.08868 seconds
The sorting and reduce_at procedures are very fast. But only one line:
"ordered_data = data[:, s]" takes the most time.
For me it is a bit strange. The reduceat() procedure where summation is
executed is about 3 time faster than the only data copying.
Alexander
On Thu, Feb 2, 2012 at 10:16 PM, Warren Weckesser <
warren.weckesser@enthought.com> wrote:
>
>
> On Wed, Feb 1, 2012 at 10:34 AM, Alexander Kalinin <alec.kalinin@gmail.com
> > wrote:
>
>> Yes, but for large data sets loops is quite slow. I have tried Pandas
>> groupby.sum() and it works faster.
>>
>>
>
> Pandas is probably the correct tool to use for this, but it will be nice
> when numpy has a native "group-by" capability.
>
> For what its worth (had to scratch the itch, so to speak), the attached
> script provides a "pure numpy" implementation without a python loop. The
> output of the script is
>
> In [53]: run pseudo_group_by.py
> Label Data
> 20 [1 2 3]
> 20 [1 2 4]
> 10 [3 3 1]
> 0 [5 0 0]
> 20 [1 9 0]
> 10 [2 3 4]
> 20 [9 9 1]
>
> Label Num. Sum
> 0 1 [5 0 0]
> 10 2 [5 6 5]
> 20 4 [12 22 8]
>
>
> A drawback of the method is that it will make a reordered copy of the
> data. I haven't compared the performance to pandas.
>
> Warren
>
>
>
>>
>> 2012/2/1 Frédéric Bastien <nouiz@nouiz.org>
>>
>>> It will be slow, but you can make a python loop.
>>>
>>> Fred
>>> On Jan 31, 2012 3:34 PM, "Alexander Kalinin" <alec.kalinin@gmail.com>
>>> wrote:
>>>
>>>> Hello!
>>>>
>>>> I use SciPy in computer graphics applications. My task is to calculate
>>>> vertex normals by averaging faces normals. In other words I want to
>>>> accumulate vectors with the same ids. For example,
>>>>
>>>> ids = numpy.array([0, 1, 1, 2])
>>>> n = numpy.array([ [0.1, 0.1, 0.1], [0.1, 0.1, 0.1], [0.1, 0.1, 0.1],
>>>> [0.1, 0.1 0.1] ])
>>>>
>>>> I need result:
>>>> nv = ([ [0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [0.1, 0.1, 0.1]])
>>>>
>>>> The most simple code:
>>>> nv[ids] += n
>>>> does not work, I know about this. For 1D arrays I use
>>>> numpy.bincount(...) function. But this function does not work for 2D arrays.
>>>>
>>>> So, my question. What is the best way calculate accumulation sum for 2D
>>>> arrays using indirect indexes?
>>>>
>>>> Sincerely,
>>>> Alexander
>>>>
>>>> _______________________________________________
>>>> SciPy-User mailing list
>>>> SciPy-User@scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>>
>>>>
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>>>
>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
>>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/scipy-user/attachments/20120204/03b24b56/attachment.html
More information about the SciPy-User
mailing list