[Numpy-discussion] expensive tensordot

Paul Northug pnorthug@gmail....
Wed Jun 16 03:38:18 CDT 2010

Dag Sverre Seljebotn <dagss <at> student.matnat.uio.no> writes:

>
> Paul Northug wrote:
> > I have a computation bounded by one step and I have always wondered
> > how to make it fast enough to be useable. I suspect that I have to use
> > an approximation, but I was hoping someone would spot a major
> > inefficiency in my implementation.
> >
> > The calculation is a kind of outer product of two sets of time series:
> >
> > for n,p in np.ndindex(self.P, self.P):
> >      self.A[:,n,:,p] += np.tensordot(a[:,:,n:n+self.T],
> >                                                  a[:,:,p:p+self.T],
> > ([0,2], [0,2]))
> >
> > The sizes for all indices are on the order of 100 with the exception
> > that a.size[0] == 10. One inefficiency is that A is symmetric in (n,p)
> > but this is only a factor of 2.
>
> The memory access pattern appears to be quite inefficient. It is better
> to vary the last indices fastest, i.e. you should avoid indexing or
> slicing in the last dimension and leave those to tensordot if at all
> possible. Use arr.T.copy() or arr.copy('F') prior/after the loop.
>

Thanks Dag for pointing this out. Is the main reason for the
inefficiency because it uses cache poorly? It would be instructive for
me to check this. Is there a way to uses PAPI from within python or
another form of instrumentation?

I tried a.copy('F') as well as a transpose that moved the last index
to the first in the above code and got a fractional speed-up. I will
work on this.

Another thing I forgot to mention is that the tensor a is sparse, with
approximately 10% of the elements non-zero with no special structure.
I am not sure how to take advantage of this. I can convert 'a' into a
list of csr_matrix's and do np.dot instead of np.tensordot with an
additional loop. All of 'a' would most likely fit into cache.

However, all my attempts made the calculation slower. I tried
csr_matrix of 'a' before the loop, as well as csr_matrix of the slices
themselves. Both slower. Is there a sparse matrix format that is most
slice friendly? Is it worthwhile to write my own sparse tensordot for
this particular application? These things are highly optimized. I
doubt I could do better.

Just checked and 100x100 sparse matrix multiply with 10% non-zero is
about 5 times faster than dense matrix multiply. For 1000x1000, it is
almost 100x faster. This includes time to convert to csr. It would be
nice to get this type of speedup.

Thanks,
Pål