[Numpy-discussion] fast method to to count a particular value in a large matrix
Sun Feb 5 19:17:38 CST 2012
On Sun, Feb 5, 2012 at 7:02 PM, David Verelst <firstname.lastname@example.org> wrote:
> Just out of curiosity, what speed-up factor did you achieve?
> On 04/02/12 22:20, Naresh wrote:
>> Warren Weckesser<warren.weckesser<at> enthought.com> writes:
>>> On Sat, Feb 4, 2012 at 2:35 PM, Benjamin Root<ben.root<at> ou.edu> wrote:
>>> On Saturday, February 4, 2012, Naresh Pai<npai<at> uark.edu> wrote:> I am
>> somewhat new to Python (been coding with Matlab mostly). I am trying to
>>>> simplify (and expedite) a piece of code that is currently a bottleneck in a
>>>> code.> I have a large array (7000 rows x 4500 columns) titled say, abc, and
>> I am trying> to find a fast method to count the number of instances of each
>> unique value within> it. All unique values are stored in a variable, say,
>> unique_elem. My current code
>>>> is as follows:> import numpy as np> #allocate space for storing element
>> count> elem_count = zeros((len(unique_elem),1))> #loop through and count number
>> of unique_elem> for i in range(len(unique_elem)):
>>>> elem_count[i]= np.sum(reduce(np.logical_or,(abc== x for x
>> in [unique_elem[i]])))> This loop is bottleneck because I have about 850 unique
>> elements and it takes> about 9-10 minutes. Can you suggest a faster way to do
>>>> Thank you,> Naresh>
>>> no.unique() can return indices and reverse indices. It would be trivial to
>> histogram the reverse indices using np.histogram().
>>> Instead of histogram(), you can use bincount() on the inverse indices:u, inv =
>> np.unique(abc, return_inverse=True)n = np.bincount(inv)u will be an array of the
>> unique elements, and n will be an array of the corresponding number of
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion<at> scipy.org
>> The histogram() solution works perfect since unique_elem is ordered. I
>> appreciate everyone's help.
>> NumPy-Discussion mailing list
> NumPy-Discussion mailing list
np.histogram works pretty well. I'm getting speeds something like 1300
ms on float64 data. A hash table-based solution is faster (no big
surprise here), about 800ms so in the ballpark of 40% faster.
Whenever I get motivated enough I'm going to make a pull request on
NumPy with something like khash.h and start fixing all the O(N log N)
algorithms floating around that ought to be O(N). NumPy should really
have a "match" function similar to R's and a lot of other things.
More information about the NumPy-Discussion