[Numpy-discussion] Is there a more efficient way to do this?
Laszlo Nagy
gandalf@shopzeus....
Wed Aug 8 09:19:04 CDT 2012
Is there a more efficient way to calculate the "slices" array below?
import numpy
import numpy.random
# In reality, this is between 1 and 50.
DIMENSIONS = 20
# In my real app, I have 100...1M data rows.
ROWS = 1000
# randint's upper bound is exclusive, so 101 yields values 0..100.
DATA = numpy.random.randint(0,101,(ROWS,DIMENSIONS))
# This is between 0..DIMENSIONS-1
DRILLBY = 3
# Array of row indices that orders the data by the given dimension.
o = numpy.argsort(DATA[:,DRILLBY])
# Input of my task: the data ordered by the given dimension.
print(DATA[o,DRILLBY])
#~ [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
#~ 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
#~ 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4
#~ 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6
#~ .... many more things here
#~ 96 96 96 97 97 97 97 97 97 97 97 97 98 98 98 98 98 98
#~ 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 100 100
#~ 100 100 100 100 100 100 100 100 100 100]
# Output of my task: determine slices for the same values on the DRILLBY
# dimension.
slices = []
prev_val = None
sidx = -1
# Dimension values for the given dimension.
fdv = DATA[:,DRILLBY]
# Go over the rows, sorted by values of the DRILLBY dimension.
for oidx,rowidx in enumerate(o):
    val = fdv[rowidx]
    if val!=prev_val:
        if prev_val is None:
            prev_val = val
            sidx = oidx
        else:
            slices.append((prev_val,sidx,oidx))
            sidx = oidx
            prev_val = val
if (sidx>=0) and (sidx<ROWS):
    slices.append((val,sidx,ROWS))
slices = numpy.array(slices,dtype=numpy.int64)
# This is what I want to have!
print(slices)
#~
#~ [[ 0 0 14]
#~ [ 1 14 26]
#~ [ 2 26 37]
#~ [ 3 37 44]
#~ [ 4 44 58]
#~ .... many more values here
#~ [ 96 952 957]
#~ [ 97 957 966]
#~ [ 98 966 972]
#~ [ 99 972 988]
#~ [ 100 988 1000]]
So for example, to get all row indices where the dimension value is zero:
o[0:14]
Or, to get all row indices where the dimension value is 99: o[972:988], etc.
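
To illustrate the lookup described above, here is a minimal sketch. It uses a small made-up column instead of the real DATA, the slices are written out by hand for that column, and the helper name rows_for_value is hypothetical.

```python
import numpy

# Hypothetical toy column standing in for DATA[:, DRILLBY].
column = numpy.array([2, 0, 1, 0, 2, 2, 1])
o = numpy.argsort(column)

# Slices in the (value, start, stop) layout used above, written by hand
# for this toy column, whose sorted values are [0 0 1 1 2 2 2].
slices = numpy.array([[0, 0, 2], [1, 2, 4], [2, 4, 7]], dtype=numpy.int64)

def rows_for_value(value):
    """Return the original row indices whose column value equals `value`."""
    for val, start, stop in slices:
        if val == value:
            # Basic slicing of o returns a view, so nothing is copied here.
            return o[start:stop]
    return o[0:0]  # empty result when the value does not occur

rows_zero = rows_for_value(0)
rows_two = rows_for_value(2)
```

Only the index array o is sliced; DATA itself is never copied.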
I do not want to make copies of DATA, because it can be huge. The
argsort is fast enough. I just need to create slices for different
dimensions. The above code works, but it does a linear time search,
implemented in pure Python code. For every iteration, Python code is
executed. For 1 million rows, this is very slow. Is there a way to
produce "slices" with numpy code? I could write C code for this, but I
would prefer to do it with mass numpy operations.
Thanks,
Laszlo
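
One possible way to build the same (value, start, stop) table without a Python-level loop is to compare each sorted value with its neighbour and take the positions where they differ. The sketch below is only an illustration under that assumption, not a reply from the list; the function name compute_slices and the toy column are made up for the example.

```python
import numpy

def compute_slices(column):
    """Return (order, slices); each row of slices is (value, start, stop)."""
    order = numpy.argsort(column)
    sorted_vals = column[order]  # copies a single column, not the whole DATA
    if sorted_vals.size == 0:
        return order, numpy.empty((0, 3), dtype=numpy.int64)
    # Positions in the sorted column where the value differs from the
    # previous one; each such position starts a new run of equal values.
    change = numpy.flatnonzero(sorted_vals[1:] != sorted_vals[:-1]) + 1
    starts = numpy.concatenate(([0], change))
    stops = numpy.concatenate((change, [sorted_vals.size]))
    slices = numpy.column_stack((sorted_vals[starts], starts, stops))
    return order, slices.astype(numpy.int64)

# Toy stand-in for DATA[:, DRILLBY].
column = numpy.array([2, 0, 1, 0, 2, 2, 1])
order, slices = compute_slices(column)
```

With the original variables this would be called as compute_slices(DATA[:,DRILLBY]). The comparison, flatnonzero and concatenate steps all run inside numpy, so no Python code is executed per row. Calling numpy.unique with return_index=True on the sorted column should give the same start positions, at the cost of a second sort.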