[Numpy-discussion] Extract subset from an array
Francesc Alted
faltet@pytables....
Tue Feb 16 12:26:15 CST 2010
On Tuesday 16 February 2010 13:42:37 Nicola Creati wrote:
> Hello,
> I need to extract a subset from a Nx3 array. Each row has x, y, and z
> coordinates.
> The subset is just the portion of the array for which the following
> condition holds:
>
> x_min < x < x_max and y_min < y < y_max
>
> The problem reduces to the extraction of the points inside a rectangular
> box defined by x_min, x_max, y_min, y_max.
>
> I work with large arrays; the number of rows is always larger than 5x1e7.
> I'm looking for a fast way to extract the subset.
>
> At the moment I found a solution that seems the best. This is a small
> example:
>
> import numpy as np
>
> # Create a large 1e7x3 array of random numbers
> array = np.random.random((10000000, 3))
>
> # Define rectangular box
> x_min = 0.3
> x_max = 0.5
> y_min = 0.4
> y_max = 0.7
>
> # Create bool array that indicates the elements of array to extract
> condition = ((array[:,0]>x_min) & (array[:,0]<x_max) &
>              (array[:,1]>y_min) & (array[:,1]<y_max))
>
> # Extract the subset
> subset = array[condition]
>
> Is there any faster solution?
In the above condition you are walking strided arrays (each column of a
C-ordered Nx3 array is a non-contiguous view), and that hurts performance
somewhat. If you can afford to transpose your array first, you can get a
significant speed-up.
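To make "strided" concrete, here is a minimal sketch that only inspects the
memory layout of the two kinds of views. It assumes the default C-ordered
float64 arrays that np.random.random returns; the names points and points_t
are just illustrative, and the arrays are kept small on purpose.

```python
import numpy as np

# Small stand-ins for the two layouts discussed here (C-ordered float64).
points = np.random.random((1000, 3))     # N x 3, as in the question
points_t = np.random.random((3, 1000))   # 3 x N, the transposed layout

x = points[:, 0]     # column view: consecutive x values are 3 doubles apart
x_t = points_t[0]    # row view: consecutive x values are adjacent in memory

print(x.strides, x.flags['C_CONTIGUOUS'])      # (24,) False
print(x_t.strides, x_t.flags['C_CONTIGUOUS'])  # (8,) True
```

With the N x 3 layout every comparison has to step over the y and z values
it does not need, whereas the 3 x N layout reads each coordinate as one
contiguous block.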
For example, your original code takes:
In [6]: x_min, x_max, y_min, y_max = .3, .5, .4, .7
In [7]: array = np.random.random((10000000, 3))
In [8]: time (array[:,0]>x_min) & (array[:,0]<x_max) & (array[:,1]>y_min) & (array[:,1]<y_max)
CPU times: user 0.24 s, sys: 0.02 s, total: 0.26 s
Wall time: 0.27 s
Out[9]: array([False, False, False, ..., False, False, False], dtype=bool)
But, if you create a transposed array like:
In [10]: array = np.random.random((3, 10000000))
then the time drops significantly:
In [11]: time (array[0]>x_min) & (array[0]<x_max) & (array[1]>y_min) & (array[1]<y_max)
CPU times: user 0.15 s, sys: 0.01 s, total: 0.16 s
Wall time: 0.16 s
Out[12]: array([False, False, False, ..., False, False, False], dtype=bool)
i.e. walking your arrays row-wise is around 1.7x faster in this case.
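If the data already exists as an N x 3 array, the transposed layout has to
be built explicitly. The following is a minimal sketch under that
assumption, not something timed above: it uses np.ascontiguousarray to make
a real (3, N) copy, since array.T on its own is only a view and would still
be strided. The name array_t is illustrative, and the array is smaller than
in the timings so the example runs quickly.

```python
import numpy as np

x_min, x_max, y_min, y_max = .3, .5, .4, .7
array = np.random.random((100000, 3))

# One-off contiguous copy with shape (3, N); array.T alone is just a view.
array_t = np.ascontiguousarray(array.T)

# Same condition as before, but each operand is now a contiguous row.
condition = ((array_t[0] > x_min) & (array_t[0] < x_max) &
             (array_t[1] > y_min) & (array_t[1] < y_max))

subset = array[condition]           # selected points as rows, shape (M, 3)
subset_t = array_t[:, condition]    # same points as columns, shape (3, M)
```

Note that the copy itself costs time and memory, so this is mainly worth it
when the data can be kept in the (3, N) layout from the start or when many
different boxes are extracted from the same array.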
--
Francesc Alted