[Numpy-discussion] Extract subset from an array

Francesc Alted faltet@pytables....
Tue Feb 16 12:26:15 CST 2010


On Tuesday 16 February 2010 13:42:37 Nicola Creati wrote:
> Hello,
> I need to extract a subset from a Nx3 array. Each row has x, y, and z
> coordinates.
> The subset is just the portion of the array for which the following
> condition holds:
> 
> x_min < x < x_max and y_min < y < y_max
> 
> The problem reduces to extracting the points inside a rectangular box
> defined by
> x_min, x_max, y_min, y_max.
> 
> I work with large arrays; the number of rows is always larger than 5x1e7.
> I'm looking for a fast way to extract the subset.
> 
> So far, the best solution I have found is the following. This is a small
> example:
> 
> import numpy as np
> 
> # Create a large 1e7x3 array of random numbers
> array = np.random.random((10000000, 3))
> 
> # Define rectangular box
> x_min = 0.3
> x_max = 0.5
> y_min = 0.4
> y_max = 0.7
> 
> # Create bool array that indicates the elements of array to extract
> condition = ((array[:,0]>x_min) & (array[:,0]<x_max) &
>              (array[:,1]>y_min) & (array[:,1]<y_max))
> 
> # Extract the subset
> subset = array[condition]
> 
> Is there any faster solution?

In the above condition you are walking strided arrays (each column slice
such as array[:,0] is a non-contiguous view of the row-major array), and
that hurts performance somewhat.  If you can afford to transpose your array
first, you can get a significant speedup.
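
To make "strided" concrete, here is a minimal sketch that compares the memory layout of a column slice with that of a row in a transposed copy. The names `points` and `points_t` are illustrative only, and the stride values assume NumPy's default float64 dtype and C (row-major) order:

```python
import numpy as np

# Small stand-in for the Nx3 array (C order, float64 by default).
points = np.random.random((1000, 3))

# Column slice: a view that has to skip over y and z for every x value.
col = points[:, 0]
print(col.strides)                  # (24,) i.e. 3 * 8 bytes between x values
print(col.flags['C_CONTIGUOUS'])    # False

# Transposed copy: each coordinate now sits in one contiguous block.
points_t = np.ascontiguousarray(points.T)
row = points_t[0]
print(row.strides)                  # (8,)
print(row.flags['C_CONTIGUOUS'])    # True
```

Note that `np.ascontiguousarray(points.T)` makes a full copy, so this only pays off if the copy can be made once (or the data can be created in that layout to begin with).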

For example, your original code takes:

In [6]: x_min, x_max, y_min, y_max = .3, .5, .4, .7

In [7]: array = np.random.random((10000000, 3))

In [8]: time (array[:,0]>x_min) & (array[:,0]<x_max) & (array[:,1]>y_min) & (array[:,1]<y_max)
CPU times: user 0.24 s, sys: 0.02 s, total: 0.26 s
Wall time: 0.27 s
Out[9]: array([False, False, False, ..., False, False, False], dtype=bool)

But if you create the array in transposed layout (3 rows, N columns), like:

In [10]: array = np.random.random((3, 10000000))

then the time drops significantly:

In [11]: time (array[0]>x_min) & (array[0]<x_max) & (array[1]>y_min) & (array[1]<y_max)
CPU times: user 0.15 s, sys: 0.01 s, total: 0.16 s
Wall time: 0.16 s
Out[12]: array([False, False, False, ..., False, False, False], dtype=bool)

i.e. walking your arrays row-wise, over contiguous memory, is around 1.7x
faster in this case (0.27 s vs 0.16 s wall time).
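
Putting it together, the extraction with a (3, N) layout might look like the sketch below. The helper name `box_mask` and the variable names are hypothetical, and a smaller array is used so it runs quickly. Note that with the transposed layout the mask selects columns, not rows:

```python
import numpy as np

def box_mask(x, y, x_min, x_max, y_min, y_max):
    """Boolean mask of points strictly inside the rectangular box."""
    return (x > x_min) & (x < x_max) & (y > y_min) & (y < y_max)

x_min, x_max, y_min, y_max = .3, .5, .4, .7

# Original row-major layout: one point per row, shape (N, 3).
points = np.random.random((100000, 3))

# Transposed, contiguous layout: one coordinate per row, shape (3, N).
coords = np.ascontiguousarray(points.T)

# Build the mask from the contiguous x and y rows.
mask = box_mask(coords[0], coords[1], x_min, x_max, y_min, y_max)

# The same mask works for either layout.
subset_t = coords[:, mask]   # shape (3, M)
subset = points[mask]        # shape (M, 3)
```

The timings above only cover building the boolean mask; the cost of the transposed copy and of the final fancy-indexing step is not measured here, so it is worth timing the whole pipeline on your own data.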

-- 
Francesc Alted

