# [SciPy-User] Accumulation sum using indirect indexes

Wes McKinney wesmckinn@gmail....
Tue Feb 7 17:15:04 CST 2012

```On Tue, Feb 7, 2012 at 5:38 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
> On Sun, Feb 5, 2012 at 2:17 AM, Alexander Kalinin
> <alec.kalinin@gmail.com> wrote:
>> Yes, the numpy.take() is much faster than "fancy" indexing and now "pure
>> numpy" solution is two time faster than pandas. Below are timing results:
>>
>>
>> The data shape:
>>      (1062, 6348)
>>
>> Pandas solution:
>>     0.16610 seconds
>>
>> "Pure numpy" solution:
>>     0.08907 seconds
>>
>> Timing of the "pure numpy" by blocks:
>> block (a) (sorting and obtaining groups):
>>     0.00134 seconds
>> block (b) (copy data to the ordered_data):
>>     0.05517 seconds
>> block (c) (reduceat):
>>     0.02698
>>
>> Alexander.
>>
>>
>> On Sun, Feb 5, 2012 at 4:01 AM, <josef.pktd@gmail.com> wrote:
>>>
>>> On Sat, Feb 4, 2012 at 2:27 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>> > On Sat, Feb 4, 2012 at 2:23 PM, Alexander Kalinin
>>> > <alec.kalinin@gmail.com> wrote:
>>> >> I have checked the performance of the "pure numpy" solution with pandas
>>> >> solution on my task. The "pure numpy" solution is about two times
>>> >> slower.
>>> >>
>>> >> The data shape:
>>> >>     (1062, 6348)
>>> >> Pandas "group by sum" time:
>>> >>     0.16588 seconds
>>> >> Pure numpy "group by sum" time:
>>> >>     0.38979 seconds
>>> >>
>>> >> But it is interesting, that the main bottleneck in numpy solution is
>>> >> the
>>> >> data copying. I have divided solution on three blocks:
>>> >>
>>> >> # block (a):
>>> >>     s = np.argsort(labels)
>>> >>
>>> >> keys, inv = np.unique(labels, return_inverse = True)
>>> >>
>>> >> i = inv[s]
>>> >>
>>> >> groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
>>> >>
>>> >>
>>> >> # block (b):
>>> >>     ordered_data = data[:, s]
>>>
>>> can you try with numpy.take? Keith and Wes were showing that take is
>>> much faster than advanced indexing.
>>>
>>> Josef
>>>
>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
>
> FWIW I did some refactoring in pandas today and am getting the
> following timings now in this use case:
>
> In [12]: df = DataFrame(randn(1062, 6348))
>
> In [13]: labels = np.random.randint(0, 100, size=1062)
>
> In [14]: timeit df.groupby(labels).sum()
> 10 loops, best of 3: 38.7 ms per loop
>
> comparing with
>
> def numpy_groupby(data, labels, axis=0):
>    s = np.argsort(labels)
>    keys, inv = np.unique(labels, return_inverse = True)
>    i = inv.take(s)
>    groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
>    ordered_data = data.take(s, axis=axis)
>    group_sums = np.add.reduceat(ordered_data, groups_at, axis=axis)
>
>    return group_sums
>
> In [15]: timeit numpy_groupby(df.values, labels)
> 10 loops, best of 3: 95.4 ms per loop
>
> according to line_profiler, the runtime is being consumed by the reduceat now
>
> In [20]: lprun -f numpy_groupby numpy_groupby(df.values, labels)
> Timer unit: 1e-06 s
>
> File: pandas/core/groupby.py
> Function: numpy_groupby at line 1511
> Total time: 0.108126 s
>
> Line #      Hits         Time  Per Hit   % Time  Line Contents
> ==============================================================
>  1511                                           def
> numpy_groupby(data, labels):
>  1512         1          125    125.0      0.1      s = np.argsort(labels)
>  1513         1          720    720.0      0.7      keys, inv =
> np.unique(labels, return_inverse = True)
>  1514         1           13     13.0      0.0      i = inv.take(s)
>  1515         1           62     62.0      0.1      groups_at =
> np.where(i != np.concatenate(([-1], i[:-1])))[0]
>  1516         1        28684  28684.0     26.5      ordered_data =
> data.take(s, axis=0)
>  1517         1        78519  78519.0     72.6      group_sums =
>  1518
>  1519         1            3      3.0      0.0      return group_sums
>
> The performance of the pure numpy solution will degrade both with the
> length of the labels vector and the number of unique elements (because
> there are two O(N log N) steps there). In this case it matters less
> because there are so many rows / columns to aggregate
>
> The performance of the pure NumPy solution is unsurprisingly much
> better when the aggregation is across contiguous memory vs. strided
> memory access:
>
>
> In [41]: labels = np.random.randint(0, 100, size=6348)
>
> In [42]: timeit numpy_groupby(df.values, labels, axis=1)
> 10 loops, best of 3: 47.4 ms per loop
>
> pandas is slower in this case because it's not giving any
> consideration to cache locality:
>
> In [50]: timeit df.groupby(labels, axis=1).sum()
> 10 loops, best of 3: 79.9 ms per loop
>
> One can only complain so much about 30 lines of Cython code ;) Good
> enough for the time being
>
> - Wes

More on this for those interested. These methods start really becoming
different when you aggregate very large 1D arrays. Consider a million
float64s each with a label chosen from 1000 unique labels. You can see
where we start running into problems:

In [9]: data = np.random.randn(1000000, 1)

In [10]: labels = np.random.randint(0, 1000, size=1000000)

In [11]: lprun -f gp.numpy_groupby gp.numpy_groupby(data, labels)
Timer unit: 1e-06 s

File: pandas/core/groupby.py
Function: numpy_groupby at line 1512
Total time: 0.413775 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
1512                                           def
numpy_groupby(data, labels, axis=0):
1513         1        98867  98867.0     23.9      s = np.argsort(labels)
1514         1       286792 286792.0     69.3      keys, inv =
np.unique(labels, return_inverse = True)
1515         1         9617   9617.0      2.3      i = inv.take(s)
1516         1         8059   8059.0      1.9      groups_at =
np.where(i != np.concatenate(([-1], i[:-1])))[0]
1517         1         9365   9365.0      2.3      ordered_data =
data.take(s, axis=axis)
1518         1         1073   1073.0      0.3      group_sums =
1519
1520         1            2      2.0      0.0      return group_sums

In [12]: timeit gp.numpy_groupby(data, labels)
1 loops, best of 3: 410 ms per loop

whereas the hash-table based approach (a la pandas) looks like:

In [13]: df = DataFrame(data)

In [14]: timeit df.groupby(labels).sum()
10 loops, best of 3: 71.5 ms per loop

with

In [17]: %prun -s cumulative -l 15 for _ in xrange(10): df.groupby(labels).sum()
3002 function calls in 0.771 seconds

Ordered by: cumulative time
List reduced from 109 to 15 due to restriction <15>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1    0.025    0.025    0.771    0.771 <string>:1(<module>)
10    0.000    0.000    0.744    0.074 groupby.py:327(sum)
10    0.001    0.000    0.744    0.074
groupby.py:940(_cython_agg_general)
10    0.000    0.000    0.692    0.069 groupby.py:384(_group_info)
10    0.000    0.000    0.408    0.041 groupby.py:620(labels)
10    0.000    0.000    0.408    0.041 groupby.py:641(_make_labels)
10    0.015    0.002    0.408    0.041 algorithms.py:72(factorize)
10    0.314    0.031    0.314    0.031 {method 'get_labels' of
'pandas._tseries.Int64HashTable' objects}
10    0.012    0.001    0.284    0.028
groupby.py:1470(_compress_group_index)
10    0.178    0.018    0.178    0.018 {method
'get_labels_groupby' of 'pandas._tseries.Int64HashTable' objects}
90    0.132    0.001    0.132    0.001 {method 'take' of
'numpy.ndarray' objects}
10    0.000    0.000    0.045    0.004 groupby.py:1432(cython_aggregate)
10    0.044    0.004    0.044    0.004 {pandas._tseries.group_add}
30    0.019    0.001    0.019    0.001 {numpy.core.multiarray.putmask}
20    0.016    0.001    0.016    0.001 {method 'astype' of
'numpy.ndarray' objects}

I'm working on getting rid of the "compress" step which will save
another 30% in this single-key groupby case, unfortunately with the
way I have the code factored it's not completely trivial

(this is all very bleeding-edge pandas, forthcoming in 0.7.0 final)

- Wes
```