[Numpy-discussion] Advice please on efficient subtotal function
Stephen Simmons
mail at stevesimmons.com
Fri Dec 29 18:22:16 CST 2006
Thanks Francesc, but I am already planning to read the data in block-wise as
you suggest. My question is rather how best to update the subtotals for each
block in a parallel way using numpy efficiently, rather than a simplistic and
slow element-by-element loop.
I can't use a simple sum(), as in your benchmark example or Greg's reply,
because I need to do:
    for n in xrange(len(data)):
        totals[ i[n], j[n] ] += data[n]
and not
    for n in xrange(len(data)):
        totals[n] += data[n]
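For the 2-D scatter-add in the first loop, one pure-numpy route is to flatten the (i, j) pairs into 1-D bin numbers and use bincount's weights argument. This is only a hedged sketch; the array names and grid size below are illustrative, not from the actual data:

```python
import numpy as np

# Sketch only: flatten (i, j) pairs into 1-D bins so that
# totals[i[n], j[n]] += data[n] becomes a single bincount call.
ni, nj = 4, 5                       # illustrative size of the totals grid
i = np.array([0, 1, 1, 3])
j = np.array([2, 0, 0, 4])
data = np.array([1.0, 2.0, 3.0, 4.0])

flat = i * nj + j                   # bin number for each (i, j) pair
totals = np.bincount(flat, weights=data, minlength=ni * nj).reshape(ni, nj)
```

bincount accumulates repeated pairs correctly (here the two rows with (i, j) = (1, 0) sum to 5.0), which is exactly the behaviour the element-by-element loop provides.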
My best solution so far is roughly like this:
- read in next block of 100k or so rows (taking into account the PyTables
table's _v_maxTuples and _v_chunksize)
- calculate the subtotal index arrays i and j
- do a lexsort() on [i, j, n]
- partition the sorted [i, j, n] into subsets where the i and j arrays change
  values. The k-th such subset is thus
      s_k = [ i_k, j_k, [n_k0, ..., n_kN] ]
- update the subtotals for each subset in the block
      totals[i_k, j_k] += sum(data[n_k0, ..., n_kN])
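The sort-and-partition steps above can be sketched roughly as follows; this is a hedged illustration with made-up arrays, using lexsort to group equal (i, j) pairs and add.reduceat to subtotal each run:

```python
import numpy as np

# Sketch of the scheme described above: sort by (i, j), find the
# boundaries where the key changes, and reduce each run in one call.
i = np.array([1, 0, 1, 0, 1])
j = np.array([2, 3, 2, 3, 0])
data = np.array([10.0, 1.0, 20.0, 2.0, 5.0])

order = np.lexsort((j, i))          # last key (i) is the primary sort key
si, sj, sd = i[order], j[order], data[order]

# indices where the (i, j) key changes value
change = np.nonzero((np.diff(si) != 0) | (np.diff(sj) != 0))[0] + 1
starts = np.concatenate(([0], change))

subtotals = np.add.reduceat(sd, starts)   # one subtotal per (i, j) group
keys = list(zip(si[starts], sj[starts]))  # the (i_k, j_k) for each subtotal
```

The subtotals can then be scattered into the totals grid with `totals[si[starts], sj[starts]] += subtotals`, which is safe because each (i, j) key now appears only once.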
This should be reasonably efficient, but it's messy, and I'm not familiar
enough with numpy's indexing tricks to get this right first time.
Maybe instead I'll have a go at writing a Pyrex function that implements the
simple loop at C speed:
    subtotal2d(data_array, idx_array, out=None, dtype=None)
where data_array is Nx1, idx_array is NxM and out is M-dimensional.
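As a hedged sketch of what such a subtotal function could look like in pure numpy: later NumPy releases grew an unbuffered scatter-add, `np.add.at`, which handles repeated index pairs correctly. The argument layout below is adapted for illustration (separate i and j arrays plus an explicit shape, rather than the NxM idx_array in the signature above):

```python
import numpy as np

# Illustrative sketch only; not the proposed Pyrex implementation.
def subtotal2d(data_array, i, j, shape, dtype=None):
    """Accumulate data_array[n] into out[i[n], j[n]]."""
    out = np.zeros(shape, dtype=dtype or data_array.dtype)
    np.add.at(out, (i, j), data_array)   # unbuffered: repeated (i, j) pairs accumulate
    return out

totals = subtotal2d(np.array([1.0, 2.0, 3.0]),
                    np.array([0, 0, 1]),
                    np.array([1, 1, 0]),
                    shape=(2, 2))
```

Note that plain fancy-index assignment (`out[i, j] += data`) would silently drop repeated pairs, which is why the unbuffered variant matters here.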
Incidentally, there's one other function I'd find useful here in forming the
index arrays i[] and j[], a fast translate-from-dict function:
    arr2 = fromiter((d[a] for a in arr1), dtype=int)
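As a small illustration of that translate-from-dict idea (the dict and arrays here are made up), numpy's fromiter can consume the generator if given an explicit dtype, and an optional count lets it preallocate the result:

```python
import numpy as np

# Illustrative only: map a small string array through a Python dict.
d = {'a': 0, 'b': 1, 'c': 2}
arr1 = np.array(['b', 'a', 'c', 'b'])

# fromiter needs an explicit dtype; count lets it preallocate the output.
arr2 = np.fromiter((d[a] for a in arr1), dtype=np.int64, count=len(arr1))
```

This still runs the dict lookup per element in Python, so a C version could well be faster, as suggested above.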
My initial impression is that a C version would be substantially faster; maybe
I should do some benchmarking to see whether a pure Python/numpy approach is
actually faster than I expect.
Cheers, and thanks for any further suggestions,
Stephen
Francesc Altet <faltet at carabos.com> wrote:
> A Divendres 29 Desembre 2006 10:05, Stephen Simmons escrigué:
> > Hi,
> >
> > I'm looking for efficient ways to subtotal a 1-d array onto a 2-D grid.
> > This is more easily explained in code than in words, thus:
> >
> > for n in xrange(len(data)):
> >     totals[ i[n], j[n] ] += data[n]
> >
> > data comes from a series of PyTables files with ~200m rows. Each row has
> > ~20 cols, and I use the first three columns (which are 1-3 char strings) to
> > form the indexing functions i[] and j[], then want to calc averages of the
> > remaining 17 numerical cols.
> >
> > I have tried various indirect ways of doing this with searchsorted and
> > bincount, but intuitively they feel overly complex solutions to what is
> > essentially a very simple problem.
> >
> > My work involves comparing the subtotals for various different segmentation
> > strategies (the i[] and j[] indexing functions). Efficient solutions are
> > important because I need to make many passes through the 200m rows of data.
> > Memory usage is the easiest thing for me to adjust by changing how many
> > rows of data to read in for each pass and then reusing the same array data
> > buffers.
>
> Well, from your words I guess you should already have tested this, but just in
> case. As PyTables saves data in tables row-wise, it is always faster using
> the complete row for computations in each iteration than using just a single
> column. This is shown in the small benchmark that I'm attaching at the end of
> the message. Here is its output for a table with 1m rows:
>
> time for creating the file--> 12.044
> time for using column reads --> 46.407
> time for using the row wise iterator--> 73.036
> time for using block reads (row wise)--> 5.156
>
> So, using block reads (in case you can use them) is your best bet.
>
> HTH,
>
>
--------------------------------------------------------------------------------------
> import tables
> import numpy
> from time import time
>
> nrows = 1000*1000
>
> # Create a table definition with 17 double cols and 3 string cols
> coltypes = numpy.dtype("f8,"*17 + "S3,"*3)
>
> t1 = time()
> # Create a file with an empty table. Use compression to minimize file size.
> f = tables.openFile("/tmp/prova.h5", 'w')
> table = f.createTable(f.root, 'table', numpy.empty(0, coltypes),
>                       filters=tables.Filters(complevel=1, complib='lzo'))
> # Fill the table with default values (empty strings and zeros)
> row = table.row
> for nrow in xrange(nrows):
>     row.append()
> f.close()
> print "time for creating the file-->", round(time()-t1, 3)
>
> # *********** Start benchmarks **************************
> f = tables.openFile("/tmp/prova.h5", 'r')
> table = f.root.table
> colnames = table.colnames[:-3] # exclude the string cols
>
> # Loop over the table using column reads
> t1 = time(); cum = numpy.zeros(17)
> for ncol, colname in enumerate(colnames):
>     col = table.read(0, nrows, field=colname)
>     cum[ncol] += col.sum()
> print "time for using column reads -->", round(time()-t1, 3)
>
> # Loop over the table using its row iterator
> t1 = time(); cum = numpy.zeros(17)
> for row in table:
>     for ncol, colname in enumerate(colnames):
>         cum[ncol] += row[colname]
> print "time for using the row iterator-->", round(time()-t1, 3)
>
> # Loop over the table using block reads (row wise)
> t1 = time(); cum = numpy.zeros(17)
> step = 10000
> for nrow in xrange(0, nrows, step):
>     ra = table[nrow:nrow+step]
>     for ncol, colname in enumerate(colnames):
>         cum[ncol] += ra[colname].sum()
> print "time for using block reads (row wise)-->", round(time()-t1, 3)
>
> f.close()