# [Numpy-svn] r8373 - trunk/doc/neps

numpy-svn@scip... numpy-svn@scip...
Thu Apr 29 18:57:39 CDT 2010

```
Author: oliphant
Date: 2010-04-29 18:57:39 -0500 (Thu, 29 Apr 2010)
New Revision: 8373

Log:
Add NEP for group-by additions to NumPy: reduceby, reducein, segment, and edges.

===================================================================
+++ trunk/doc/neps/groupby_additions.rst	2010-04-29 23:57:39 UTC (rev 8373)
@@ -0,0 +1,112 @@
+====================================================================
+ A proposal for adding groupby functionality to NumPy
+====================================================================
+
+:Author: Travis Oliphant
+:Contact: oliphant@enthought.com
+:Date: 2010-04-27
+
+
+Executive summary
+=================
+
+NumPy provides tools for handling data and doing calculations in much
+the same way as relational algebra allows.  However, the common group-by
+functionality is not easily handled.  The reduce methods of NumPy's
+ufuncs are a natural place to put this groupby behavior.  This NEP
+describes two additional methods for ufuncs (reduceby and reducein) and
+two additional functions (segment and edges) which can help add this
+functionality.
+
+Example Use Case
+================
+Suppose you have a NumPy structured array containing information about
+the number of purchases at several stores over multiple days.  To be clear, the
+structured array data-type is:
+
+dt = [('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
+      ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')]
+
+Suppose there is a 1-d NumPy array of this data-type and you would like
+to compute various statistics (max, min, mean, sum, etc.) on the number
+of products sold, by product, by month, by store, etc.
+
+Currently, this could be done by using reduce methods on the number
+field of the array, coupled with in-place sorting, unique with
+return_inverse=True and bincount, etc.  However, for such a common
+data-analysis need, it would be nice to have standard and more direct
+ways to get the results.
+
+
+Ufunc methods proposed
+======================
+
+It is proposed to add two new reduce-style methods to the ufuncs:
+reduceby and reducein.  The reducein method is intended to be a
+simpler-to-use version of reduceat, while the reduceby method is
+intended to provide group-by capability on reductions.
+
+reducein::
+
+        <ufunc>.reducein(arr, indices, axis=0, dtype=None, out=None)
+
+        Perform a local reduce with slices specified by pairs of indices.
+
+        The reduction occurs along the provided axis, using the provided
+        data-type to calculate intermediate results, storing the result into
+        the array out (if provided).
+
+        The indices array provides the start and end indices for the
+        reduction.  If the length of the indices array is odd, then the
+        final index provides the beginning point for the final reduction
+        and the ending point is the end of arr.
+
+        This generalizes, along the given axis, the behavior:
+
+        [<ufunc>.reduce(arr[indices[2*i]:indices[2*i+1]])
+                for i in range(len(indices)//2)]
+
+        This assumes indices is of even length.
+
+        Example:
+           >>> a = [0,1,2,4,5,6,9,10]
+           >>> add.reducein(a, [0,3,2,5,-2])
+           [3, 11, 19]
+
+           Notice that sum(a[0:3]) = 3; sum(a[2:5]) = 11; and sum(a[-2:]) = 19
+
+reduceby::
+
+        <ufunc>.reduceby(arr, by, dtype=None, out=None)
+
+        Perform a reduction in arr over unique non-negative integers in by.
+
+
+        Let N=arr.ndim and M=by.ndim.  The shapes must satisfy
+        by.shape[:N] == arr.shape.  In addition, let I be an N-length
+        index tuple; then by[I] contains the location in the output
+        array for the reduction to be stored.  Notice that if N == M,
+        then by[I] is a non-negative integer, while if N < M, then by[I]
+        is an array of indices into the output array.
+
+        The reduction is computed on groups specified by unique indices
+        into the output array.  The index is either the single
+        non-negative integer by[I] (if N == M) or, if N < M, the entire
+        (M-N+1)-length index by[I] considered as a whole.
+
+
+Functions proposed
+==================
+
+segment::
+
+
+edges::
+
+
+.. Local Variables:
+.. mode: rst
+.. coding: utf-8
+.. fill-column: 72
+.. End:
+

```
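
The "Example Use Case" section of the NEP says the grouping can currently be done with sorting, unique with return_inverse=True, and bincount. A minimal sketch of that existing workaround follows. The sample records are hypothetical and made up for illustration, and the type codes in the data-type are quoted so that the list is valid Python.

```python
import numpy as np

# Data-type from the NEP's use case, with the type codes written as strings.
dt = [('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
      ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')]

# Hypothetical sample records (not from the NEP).
sales = np.array([
    (2010, 4, 1, 9.5, 1, b'A00001', 3),
    (2010, 4, 1, 10.0, 2, b'A00001', 5),
    (2010, 4, 2, 11.25, 1, b'B00002', 2),
    (2010, 5, 3, 14.0, 1, b'A00001', 4),
    (2010, 5, 3, 15.5, 2, b'B00002', 7),
], dtype=dt)

# Label every record with a group number, one per distinct store.
stores, group = np.unique(sales['store'], return_inverse=True)

# Sum and mean of 'number' per store: bincount adds weights per label.
total = np.bincount(group, weights=sales['number'])
mean = total / np.bincount(group)

# Max per store: sort records by group, then reduce each contiguous run
# with reduceat, starting at the first record of each run.
order = np.argsort(group, kind='stable')
starts = np.flatnonzero(np.r_[True, np.diff(group[order]) != 0])
maxima = np.maximum.reduceat(sales['number'][order], starts)
```

Sums and means come almost for free from bincount, but any other reduction needs the sort-and-reduceat detour, which is the inconvenience the proposal aims to remove.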
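
The reducein description in the NEP can be approximated with the existing reduce method. The helper below, `reducein_sketch`, is a hypothetical name and not part of NumPy; it follows the documented pairing of start and end indices and the odd-length rule, and it leaves out the proposed dtype and out arguments.

```python
import numpy as np

def reducein_sketch(ufunc, arr, indices, axis=0):
    """Hypothetical emulation of the proposed <ufunc>.reducein."""
    arr = np.asarray(arr)
    indices = list(indices)
    # Odd length: the last index starts a reduction that runs to the end.
    if len(indices) % 2:
        indices.append(arr.shape[axis])
    pieces = []
    for start, stop in zip(indices[0::2], indices[1::2]):
        # Slice only along the requested axis, keep the others whole.
        sl = [slice(None)] * arr.ndim
        sl[axis] = slice(start, stop)
        pieces.append(ufunc.reduce(arr[tuple(sl)], axis=axis))
    # Put the per-slice results back along the reduced axis.
    return np.stack(pieces, axis=axis)

a = [0, 1, 2, 4, 5, 6, 9, 10]
result = reducein_sketch(np.add, a, [0, 3, 2, 5, -2])

# Same idea along axis 1 of a 2-d array.
b = np.arange(12).reshape(3, 4)
result_2d = reducein_sketch(np.add, b, [0, 2, 1, 4], axis=1)
```

Unlike reduceat, the slices here may overlap or appear in any order, since each pair of indices is treated independently.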
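
For the simplest reduceby case in the NEP, where by has the same shape as arr (N == M), the grouping can be sketched with the existing ufunc.at method. The name `reduceby_sketch` is hypothetical. The sketch assumes the ufunc has an identity (for example add or multiply), and it fills output slots that no element maps to with that identity; the NEP does not say what those slots should hold. The N < M case is not covered.

```python
import numpy as np

def reduceby_sketch(ufunc, arr, by):
    """Hypothetical emulation of the proposed reduceby for N == M."""
    arr = np.asarray(arr)
    by = np.asarray(by)
    if by.shape != arr.shape:
        raise ValueError("this sketch only handles by.shape == arr.shape")
    if by.size and by.min() < 0:
        raise ValueError("by must contain non-negative integers")
    if ufunc.identity is None:
        raise ValueError("this sketch needs a ufunc with an identity")
    n_out = int(by.max()) + 1 if by.size else 0
    # Start every output slot at the identity, then fold each element
    # of arr into the slot named by the matching entry of by.
    out = np.full(n_out, ufunc.identity, dtype=arr.dtype)
    ufunc.at(out, by.ravel(), arr.ravel())
    return out

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
labels = np.array([0, 1, 0, 2, 1])
sums = reduceby_sketch(np.add, values, labels)
products = reduceby_sketch(np.multiply, values, labels)
```

ufunc.at applies the operation unbuffered, so repeated labels accumulate instead of overwriting one another, which is what makes it usable for grouping.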