# [Numpy-discussion] Is there a more efficient way to do this?

Laszlo Nagy gandalf@shopzeus....
Wed Aug 8 09:19:04 CDT 2012

```
Is there a more efficient way to calculate the "slices" array below?

import numpy
import numpy.random

# In reality, this is between 1 and 50.
DIMENSIONS = 20

# In my real app, I have 100...1M data rows.
ROWS = 1000
# randint excludes the upper bound, so 101 gives values 0..100 inclusive.
DATA = numpy.random.randint(0,101,(ROWS,DIMENSIONS))

# This is between 0..DIMENSIONS-1
DRILLBY = 3

# Array of row indices that orders the data by the given dimension.
o = numpy.argsort(DATA[:,DRILLBY])

# Input of my task: the data ordered by the given dimension.
print(DATA[o,DRILLBY])

#~ [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   1 1   1   1
#~ 1   1   1   1   1   1   1   1   2   2   2   2   2   2   2 2   2   2
#~ 2   3   3   3   3   3   3   3   4   4   4   4   4   4   4 4   4   4
#~ 4   4   4   4   5   5   5   5   5   5   5   5   5   5   6 6   6   6
#~ .... many more things here
#~ 96  96  96  97  97  97  97  97  97  97  97  97  98  98  98  98 98  98
#~ 99  99  99  99  99  99  99  99  99  99  99  99  99  99  99  99 100 100
#~ 100 100 100 100 100 100 100 100 100 100]

# Output of my task: determine slices for the same values on the DRILLBY
# dimension.

slices = []
prev_val = None
sidx = -1
# Dimension values for the given dimension.
fdv = DATA[:,DRILLBY]

# Go over the rows, sorted by their values on the DRILLBY dimension.
for oidx,rowidx in enumerate(o):
    val = fdv[rowidx]
    if val!=prev_val:
        if prev_val is None:
            prev_val = val
            sidx = oidx
        else:
            slices.append((prev_val,sidx,oidx))
            sidx = oidx
            prev_val = val

if (sidx>=0) and (sidx<ROWS):
    slices.append((val,sidx,ROWS))
slices = numpy.array(slices,dtype=numpy.int64)

# This is what I want to have!
print(slices)

#~
#~ [[   0    0   14]
#~ [   1   14   26]
#~ [   2   26   37]
#~ [   3   37   44]
#~ [   4   44   58]
#~ .... many more values here
#~ [  96  952  957]
#~ [  97  957  966]
#~ [  98  966  972]
#~ [  99  972  988]
#~ [ 100  988 1000]]

So for example, to get all row indices where the dimension value is zero:
the zeros are at rows o[0:14]
Or, to get all row indices where the dimension value is 99: o[972:988] etc.

I do not want to make copies of DATA, because it can be huge. The
argsort is fast enough. I just need to create slices for different
dimensions. The above code works, but it is a linear scan implemented
in pure Python, so interpreted code runs on every iteration. For
1 million rows, this is very slow. Is there a way to produce "slices"
with numpy code? I could write C code for this, but I would prefer to
do it with vectorized numpy operations.

Thanks,

Laszlo

```
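One way the question above could be approached with array operations only, offered as a minimal sketch rather than a definitive answer: compare each sorted value with its predecessor and take the positions where they differ as run boundaries. The helper name `value_slices` and the lowercase variable names are hypothetical and do not come from the original post. Note that `data[order, drillby]` creates a temporary copy of one column (ROWS elements), not of the whole data array.

```python
import numpy as np

def value_slices(sorted_vals):
    """Return an (n_groups, 3) int64 array of (value, start, stop) rows.

    sorted_vals must be a 1-D integer array already sorted ascending,
    for example DATA[o, DRILLBY] from the post.
    """
    sorted_vals = np.asarray(sorted_vals)
    n = sorted_vals.shape[0]
    if n == 0:
        return np.empty((0, 3), dtype=np.int64)
    # Positions where a value differs from its predecessor start a new run.
    change = np.flatnonzero(sorted_vals[1:] != sorted_vals[:-1]) + 1
    starts = np.concatenate(([0], change))
    stops = np.concatenate((change, [n]))
    return np.column_stack((sorted_vals[starts], starts, stops)).astype(np.int64)

# Small demo with the same shapes as in the post (seeded for repeatability).
rng = np.random.RandomState(0)
data = rng.randint(0, 101, (1000, 20))
drillby = 3
order = np.argsort(data[:, drillby])
slices = value_slices(data[order, drillby])

# Row indices of all rows sharing the first (smallest) value:
val, start, stop = slices[0]
rows_with_val = order[start:stop]
```

Each row of `slices` has the same meaning as in the loop version: `order[start:stop]` gives the row indices whose value on the chosen dimension equals `val`. The whole computation stays linear in the number of rows, but the per-element work happens inside numpy instead of the interpreter.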