[Numpy-svn] r8373 - trunk/doc/neps

numpy-svn@scip... numpy-svn@scip...
Thu Apr 29 18:57:39 CDT 2010

Author: oliphant
Date: 2010-04-29 18:57:39 -0500 (Thu, 29 Apr 2010)
New Revision: 8373

Add NEP for group-by additions to NumPy: reduceby, reducein, segment, and edges.

Added: trunk/doc/neps/groupby_additions.rst
--- trunk/doc/neps/groupby_additions.rst	                        (rev 0)
+++ trunk/doc/neps/groupby_additions.rst	2010-04-29 23:57:39 UTC (rev 8373)
@@ -0,0 +1,112 @@
+ A proposal for adding groupby functionality to NumPy
+:Author: Travis Oliphant
+:Contact: oliphant@enthought.com
+:Date: 2010-04-27
+Executive summary
+NumPy provides tools for handling data and doing calculations in much
+the same way as relational algebra allows.  However, the common group-by
+functionality is not easily handled.  The reduce methods of NumPy's
+ufuncs are a natural place to put this groupby behavior.  This NEP
+describes two additional methods for ufuncs (reduceby and reducein) and
+two additional functions (segment and edges) which can help add this
+functionality.
+Example Use Case
+Suppose you have a NumPy structured array containing information about
+the number of purchases at several stores over multiple days.  To be clear, the
+structured array data-type is:
+dt = [('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
+      ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')]
+Suppose there is a 1-d NumPy array of this data-type and you would like
+to compute various statistics (max, min, mean, sum, etc.) on the number
+of products sold, by product, by month, by store, etc.
+Currently, this could be done by using reduce methods on the number
+field of the array, coupled with in-place sorting, unique with
+return_inverse=True and bincount, etc.  However, for such a common
+data-analysis need, it would be nice to have standard and more direct
+ways to get the results.
+Ufunc methods proposed
+It is proposed to add two new reduce-style methods to the ufuncs:
+reduceby and reducein.  The reducein method is intended to be a
+simpler-to-use version of reduceat, while the reduceby method is
+intended to provide group-by capability on reductions.
+        <ufunc>.reducein(arr, indices, axis=0, dtype=None, out=None)
+        Perform a local reduce with slices specified by pairs of indices.
+        The reduction occurs along the provided axis, using the provided
+        data-type to calculate intermediate results, storing the result into
+        the array out (if provided). 
+        The indices array provides the start and end indices for the
+        reduction.  If the length of the indices array is odd, then the
+        final index provides the beginning point for the final reduction
+        and the ending point is the end of arr.
+        This generalizes, along the given axis, the behavior:
+        [<ufunc>.reduce(arr[indices[2*i]:indices[2*i+1]])
+                for i in range(len(indices)//2)]
+        This assumes indices is of even length.
+        Example: 
+           >>> a = [0,1,2,4,5,6,9,10]
+           >>> add.reducein(a,[0,3,2,5,-2])                 
+           [3, 11, 19]  
+           Notice that sum(a[0:3]) = 3; sum(a[2:5]) = 11; and sum(a[-2:]) = 19
+        <ufunc>.reduceby(arr, by, dtype=None, out=None)
+        Perform a reduction in arr over unique non-negative integers in by. 
+        Let N=arr.ndim and M=by.ndim.  Then, by.shape[:N] == arr.shape.
+        In addition, let I be an N-length index tuple, then by[I]
+        contains the location in the output array for the reduction to
+        be stored.  Notice that if N == M, then by[I] is a non-negative
+        integer, while if N < M, then by[I] is an array of indices into
+        the output array.
+        The reduction is computed on groups specified by unique indices
+        into the output array.  The index is either a single
+        non-negative integer (if N == M) or, if N < M, the entire
+        (M-N+1)-length index by[I] considered as a whole.
+Functions proposed
+.. Local Variables:
+.. mode: rst
+.. coding: utf-8
+.. fill-column: 72
+.. End:
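
For readers who want to try the proposed reducein semantics today, here is a minimal sketch built only on the existing ufunc.reduce method. The helper name reducein_sketch is hypothetical and not part of NumPy or of the NEP; it handles only the 1-d case and ignores the axis, dtype, and out arguments described in the proposal.

```python
import numpy as np

def reducein_sketch(ufunc, arr, indices):
    """Hypothetical 1-d emulation of the proposed <ufunc>.reducein.

    Consecutive pairs in `indices` are (start, stop) slice bounds.
    If `indices` has odd length, the final slice runs to the end of arr.
    """
    arr = np.asarray(arr)
    idx = [int(i) for i in indices]
    if len(idx) % 2:
        # Odd length: the last index is a start with an implicit stop.
        idx.append(arr.shape[0])
    return np.array([ufunc.reduce(arr[idx[i]:idx[i + 1]])
                     for i in range(0, len(idx), 2)])

# The example from the NEP text.
a = [0, 1, 2, 4, 5, 6, 9, 10]
result = reducein_sketch(np.add, a, [0, 3, 2, 5, -2])
even_result = reducein_sketch(np.add, a, [0, 3, 2, 5])
```

With the NEP's example input, result holds the three sums 3, 11, and 19 quoted in the docstring above. The slices may overlap, which is the main difference from the adjacent-boundary convention of reduceat.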

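The simplest reduceby case (N == M, so by holds one non-negative integer per element of arr) can be approximated with the existing ufunc.at method. This is only a sketch under stated assumptions: the helper name reduceby_sketch and the sample store/number arrays are made up for illustration, the ufunc must have an identity (so add and multiply work, but maximum and minimum do not), and the N < M case is not covered.

```python
import numpy as np

def reduceby_sketch(ufunc, arr, by):
    """Hypothetical emulation of the proposed <ufunc>.reduceby, N == M only.

    by[I] gives the position in the output where arr[I] is accumulated.
    """
    arr = np.asarray(arr)
    by = np.asarray(by)
    if arr.shape != by.shape:
        raise ValueError("this sketch requires by.shape == arr.shape")
    if by.size == 0 or by.min() < 0:
        raise ValueError("by must hold non-negative integers")
    if ufunc.identity is None:
        raise ValueError("this sketch needs a ufunc with an identity")
    # Start every group at the identity, then accumulate unbuffered.
    out = np.full(int(by.max()) + 1, ufunc.identity, dtype=arr.dtype)
    ufunc.at(out, by.ravel(), arr.ravel())
    return out

# Made-up data in the spirit of the store/number fields of the use case.
store = np.array([10, 20, 10, 30, 20, 10])
number = np.array([1, 2, 3, 4, 5, 6])

# unique(..., return_inverse=True) turns arbitrary keys into group indices.
labels, group = np.unique(store, return_inverse=True)
totals = reduceby_sketch(np.add, number, group)

# The bincount route mentioned in the NEP gives the same sums (as floats).
totals_bincount = np.bincount(group, weights=number)
```

Here labels lists the distinct stores and totals[k] is the summed number for labels[k]. The unique-plus-bincount route only covers sums, which is part of the motivation for a general reduceby.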