[Numpy-discussion] fast method to to count a particular value in a large matrix

Wes McKinney wesmckinn@gmail....
Sun Feb 5 19:17:38 CST 2012


On Sun, Feb 5, 2012 at 7:02 PM, David Verelst <david.verelst@gmail.com> wrote:
> Just out of curiosity, what speed-up factor did you achieve?
>
> Regards,
> David
>
> On 04/02/12 22:20, Naresh wrote:
>> Warren Weckesser<warren.weckesser<at>  enthought.com>  writes:
>>
>>>
>>> On Sat, Feb 4, 2012 at 2:35 PM, Benjamin Root<ben.root<at>  ou.edu>  wrote:
>>>
>>>
>>> On Saturday, February 4, 2012, Naresh Pai<npai<at>  uark.edu>  wrote:>  I am
>> somewhat new to Python (been coding with Matlab mostly). I am trying to
>>>> simplify (and expedite) a piece of code that is currently a bottleneck in a
>> larger
>>>> code.>  I have a large array (7000 rows x 4500 columns) titled say, abc, and
>> I am trying>  to find a fast method to count the number of instances of each
>> unique value within>  it. All unique values are stored in a variable, say,
>> unique_elem. My current code
>>>
>>>> is as follows:>  import numpy as np>  #allocate space for storing element
>> count>  elem_count = zeros((len(unique_elem),1))>  #loop through and count number
>> of unique_elem>  for i in range(len(unique_elem)):
>>>
>>>>     elem_count[i]= np.sum(reduce(np.logical_or,(abc== x for x
>> in [unique_elem[i]])))>  This loop is bottleneck because I have about 850 unique
>> elements and it takes>  about 9-10 minutes. Can you suggest a faster way to do
>> this?
>>>
>>>> Thank you,>  Naresh>
>>> no.unique() can return indices and reverse indices.  It would be trivial to
>> histogram the reverse indices using np.histogram().
>>>
>>> Instead of histogram(), you can use bincount() on the inverse indices:u, inv =
>> np.unique(abc, return_inverse=True)n = np.bincount(inv)u will be an array of the
>> unique elements, and n will be an array of the corresponding number of
>> occurrences.Warren
>>>
>>>
>>>
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion<at>  scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>> The histogram() solution works perfect since unique_elem is ordered. I
>> appreciate everyone's help.
>>
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

np.histogram works pretty well. I'm getting speeds something like 1300
ms on float64 data. A hash table-based solution is faster (no big
surprise here), about 800ms so in the ballpark of 40% faster.

Whenever I get motivated enough I'm going to make a pull request on
NumPy with something like khash.h and start fixing all the O(N log N)
algorithms floating around that ought to be O(N). NumPy should really
have a "match" function similar to R's and a lot of other things.

- Wes


More information about the NumPy-Discussion mailing list