[Numpy-discussion] Implementing a "find first" style function

Benjamin Root ben.root@ou....
Wed Mar 6 15:16:19 CST 2013


On Tue, Mar 5, 2013 at 9:15 AM, Phil Elson <pelson.pub@gmail.com> wrote:

> The ticket https://github.com/numpy/numpy/issues/2269 discusses the
> possibility of implementing a "find first" style function which can
> optimise the process of finding the first value(s) which match a predicate
> in a given 1D array. For example:
>
>
> >>> a = np.sin(np.linspace(0, np.pi, 200))
> >>> print find_first(a, lambda a: a > 0.9)
> ((71, ), 0.900479032457)
>
>
> This has been discussed in several locations:
>
> https://github.com/numpy/numpy/issues/2269
> https://github.com/numpy/numpy/issues/2333
>
> http://stackoverflow.com/questions/7632963/numpy-array-how-to-find-index-of-first-occurrence-of-item
>
>
> *Rationale*
>
> For small arrays there is no real reason to avoid doing:
>
> >>> a = np.sin(np.linspace(0, np.pi, 200))
> >>> ind = (a > 0.9).nonzero()[0][0]
> >>> print (ind, ), a[ind]
> (71,) 0.900479032457
>
>
> But for larger arrays, this can lead to massive amounts of work even if
> the result is one of the first to be computed. Example:
>
> >>> a = np.arange(1e8)
> >>> print (a == 5).nonzero()[0][0]
> 5
>
>
> So a function which terminates when the first matching value is found is
> desirable.
>
> As mentioned in #2269, it is possible to define a consistent ordering
> which allows this functionality for >1D arrays, but IMHO it overcomplicates
> the problem and was not a case that I personally needed, so I've limited
> the scope to 1D arrays only.
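For what it's worth, the >1D case could follow a flatten-then-unravel approach using C (row-major) order as the "consistent ordering" mentioned; this is only an illustrative sketch, and `find_first_nd` is a hypothetical name, not code from #2269:

```python
import numpy as np

def find_first_nd(a, predicate, chunk_size=2048):
    """Return (index_tuple, value) for the first element of `a` (in
    C order) satisfying the vectorized `predicate`, or None if no
    element matches."""
    flat = a.ravel()  # C-order view (copy only if non-contiguous)
    for start in range(0, flat.size, chunk_size):
        hits = predicate(flat[start:start + chunk_size]).nonzero()[0]
        if hits.size:
            idx = start + int(hits[0])
            # Convert the flat index back to an N-D index tuple
            return np.unravel_index(idx, a.shape), flat[idx]
    return None
```

Whether row-major order is the "right" ordering is exactly the design question that makes the N-D case contentious.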
>
>
> *Implementation*
>
> My initial assumption was that to get any kind of performance I would need
> to write the *find* function in C; however, after prototyping with some
> array chunking it became apparent that a trivial Python function would be
> quick enough for my needs.
>
> The approach I've implemented in the code found in #2269 simply breaks the
> array into sub-arrays of maximum length *chunk_size* (2048 by default,
> though there is no real science to this number), applies the given
> predicating function, and yields the results from *nonzero()*. The given
> function should be a python function which operates on the whole of the
> sub-array element-wise (i.e. the function should be vectorized). Returning
> a generator also has the benefit of allowing users to get the first *n*
> matching values/indices.
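The chunking approach described above can be sketched roughly as follows (a reconstruction for illustration; the actual implementation lives in #2269):

```python
import numpy as np

def find(a, predicate, chunk_size=2048):
    """Yield ((index,), value) pairs for elements of the 1D array `a`
    satisfying `predicate`, scanning chunk by chunk so that callers
    can stop after the first match."""
    for start in range(0, a.size, chunk_size):
        chunk = a[start:start + chunk_size]
        # `predicate` must be vectorized: array in, boolean array out
        for i in predicate(chunk).nonzero()[0]:
            yield (start + int(i),), chunk[i]
```

Because it is a generator, `next(find(a, pred))` does only as much work as needed to reach the first match, and `itertools.islice` gives the first *n* matches.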
>
>
> *Results*
>
>
> I timed the implementation of *find* found in my comment at
> https://github.com/numpy/numpy/issues/2269#issuecomment-14436725 with an
> obvious test:
>
>
> In [1]: from np_utils import find
>
> In [2]: import numpy as np
>
> In [3]: import numpy.random
>
> In [4]: np.random.seed(1)
>
> In [5]: a = np.random.randn(1e8)
>
> In [6]: a.min(), a.max()
> Out[6]: (-6.1194900990552776, 5.9632246301166321)
>
> In [7]: next(find(a, lambda a: np.abs(a) > 6))
> Out[7]: ((33105441,), -6.1194900990552776)
>
> In [8]: (np.abs(a) > 6).nonzero()
> Out[8]: (array([33105441]),)
>
> In [9]: %timeit (np.abs(a) > 6).nonzero()
> 1 loops, best of 3: 1.51 s per loop
>
> In [10]: %timeit next(find(a, lambda a: np.abs(a) > 6))
> 1 loops, best of 3: 912 ms per loop
>
> In [11]: %timeit next(find(a, lambda a: np.abs(a) > 6, chunk_size=100000))
> 1 loops, best of 3: 470 ms per loop
>
> In [12]: %timeit next(find(a, lambda a: np.abs(a) > 6, chunk_size=1000000))
> 1 loops, best of 3: 483 ms per loop
>
>
> This shows that picking a sensible *chunk_size* can yield massive
> speed-ups (*nonzero()* is ~3x slower in one case). A comparable test with
> a much smaller 1D array shows similar promise:
>
> In [41]: a = np.random.randn(1e4)
>
> In [42]: %timeit next(find(a, lambda a: np.abs(a) > 3))
> 10000 loops, best of 3: 35.8 us per loop
>
> In [43]: %timeit (np.abs(a) > 3).nonzero()
> 10000 loops, best of 3: 148 us per loop
>
>
> As I commented on the issue tracker, if you think this function is worth
> taking forward, I'd be happy to open up a pull request.
>
> Feedback gratefully received.
>
> Cheers,
>
> Phil
>
>
>
In the interest of generalizing code, could such approaches be used for
functions like np.any() and np.all(), short-circuiting as soon as a True
or False (respectively) is found?  I wonder what other sorts of functions
in NumPy might benefit from this?
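A short-circuiting any(), for instance, could follow the same chunking pattern; this is just a sketch, `chunked_any` is a hypothetical name, and NumPy's built-in any() did not short-circuit this way at the time:

```python
import numpy as np

def chunked_any(a, chunk_size=100000):
    """Like a.any(), but returns as soon as one chunk contains a True,
    avoiding a full scan when a match occurs early."""
    a = a.ravel()
    for start in range(0, a.size, chunk_size):
        if a[start:start + chunk_size].any():
            return True
    return False
```

all() would be the mirror image: return False as soon as a chunk contains a False.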

Ben Root

