[Numpy-discussion] take from structured array is faster than boolean indexing, but reshapes columns to 2D

Christopher Mutel cmutel@gmail....
Wed Dec 22 01:51:46 CST 2010


Dear all-

Structured arrays are great, but I am having problems filtering them
efficiently. Reading through the mailing list, it seems like boolean
arrays are the recommended approach to filtering arrays for arbitrary
conditions, but my testing shows that a combination of take and where
can be much faster when dealing with structured arrays:

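import timeit

# Build a 1e6-row structured array; 'foo' holds random integers in [1, 1000].
# Sizes are written as ints because newer NumPy rejects float sizes such as 1e6,
# and randint(1, 1001) replaces the deprecated random_integers(1e3).
setup = (
    "from numpy import random, where, zeros; "
    "r = random.randint(1, 1001, size=10**6); "
    "q = zeros(10**6, dtype=[('foo', 'u4'), ('bar', 'u4'), ('baz', 'u4')]); "
    "q['foo'] = r"
)
statement1 = "s = q.take(where(q['foo'] < 500))"  # take + where
statement2 = "s = q[q['foo'] < 500]"              # boolean indexing

# Time 10 runs of each statement.
t = timeit.Timer(statement1, setup)
print(t.timeit(10))
t = timeit.Timer(statement2, setup)
print(t.timeit(10))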

In this test, boolean indexing is about 4 times slower than take
combined with where on large arrays. In my case, these operations are
supposed to happen on a web server with a large number of requests, so
the efficiency gain is important.

However, the combination of take and where reshapes the columns of
structured arrays to be 2-dimensional:

q['foo'].shape
>> (1000000,)
s = q[q['foo'] < 500]
s['foo'].shape
>> (499102,)
s = q.take(where(q['foo'] < 500))
s['foo'].shape
>> (1, 499102)
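
A minimal sketch of where the extra axis seems to come from, assuming the
cause is that where() called with only a condition returns a tuple of index
arrays rather than a single array. The 10-row array is a hypothetical
stand-in for the 1e6-row one above:

```python
import numpy as np

# Small stand-in for the 1e6-row array in this post (the size is an assumption).
q = np.zeros(10, dtype=[('foo', 'u4'), ('bar', 'u4'), ('baz', 'u4')])
q['foo'] = np.arange(10)

# With only a condition, where() returns a tuple holding one 1-D index array.
idx = np.where(q['foo'] < 5)
print(type(idx), len(idx))

# take() converts the tuple into a (1, N) index array, so the result is 2-D.
s2d = q.take(idx)
print(s2d['foo'].shape)

# Passing the 1-D array inside the tuple keeps the result 1-D.
s1d = q.take(idx[0])
print(s1d['foo'].shape)
```

If that reading is right, the reshaping is a consequence of the index shape
rather than anything specific to structured arrays.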

Is there a way to use this seemingly more efficient approach (take &
where) and not have to manually reshape the columns? This seems
ungainly for larger structured arrays. Or should I file this as a bug?
Perhaps there are even more efficient approaches that I haven't
thought of, but are obvious to others?
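
One possible way to keep take without reshaping by hand, sketched under the
assumption that a 1-D integer index is all take needs: flatnonzero() returns
the matching positions as a single 1-D array. The array size and the seed
below are illustrative choices, and no timing claim is implied:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for repeatability
q = np.zeros(1000, dtype=[('foo', 'u4'), ('bar', 'u4'), ('baz', 'u4')])
q['foo'] = rng.integers(1, 1001, size=q.size)  # integers in [1, 1000]

mask = q['foo'] < 500

by_mask = q[mask]                       # boolean indexing, as in statement2
by_take = q.take(np.flatnonzero(mask))  # 1-D integer indices, no extra axis

# Both selections should have the same 1-D shape and the same rows.
print(by_take['foo'].shape == by_mask['foo'].shape)
print(np.array_equal(by_take['foo'], by_mask['foo']))
```

Whether this keeps the speed advantage would need to be measured with the
timeit setup above.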

Thanks in advance,

Yours,
-Chris

-- 
############################
Chris Mutel
Ökologisches Systemdesign - Ecological Systems Design
Institut f.Umweltingenieurwissenschaften - Institute for Environmental
Engineering
ETH Zürich - HIF C 42 - Schafmattstr. 6
8093 Zürich

Telefon: +41 44 633 71 45 - Fax: +41 44 633 10 61
############################

