[NumPy-Tickets] [NumPy] #2163: BLAS matrix product (dot) never used for ndim > 2 (tensordot does not use BLAS)

NumPy Trac numpy-tickets@scipy....
Fri Jun 15 11:09:52 CDT 2012

```#2163: BLAS matrix product (dot) never used for ndim > 2 (tensordot does not use
BLAS)
------------------------------------------------+---------------------------
Reporter:  thatistosay                         |       Owner:  somebody
Type:  enhancement                         |      Status:  new
Priority:  normal                              |   Milestone:  Unscheduled
Component:  numpy.core                          |     Version:  1.6.1
Keywords:  dot, blas, tensordot, optimization  |
------------------------------------------------+---------------------------
dotblas_matrixproduct() contains the comment "This function doesn't handle
dimensions greater than 2" and falls back to PyArray_MatrixProduct2() for
these cases. This means BLAS is never used for calls to dot() when either
argument has ndim > 2. In more detail...

If I want to contract a pair of tensor indices for ndim=3 such that

{{{
A = np.random.rand(d, D, D); B = np.random.rand(d, D, D)
AB[s,i,t,j] == np.sum(A[s,i,:] * B[t,:,j])
}}}

then currently, although it can be done in a single line

{{{
res = np.dot(A, B)
}}}
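To make the ndim=3 semantics concrete: for inputs with ndim > 2, dot()
contracts the last axis of the first argument with the second-to-last axis
of the second. A minimal sketch checking this against einsum (small sizes
chosen for illustration; the names d, D follow the ticket):

{{{
import numpy as np

d, D = 3, 4  # small illustrative sizes
A = np.random.rand(d, D, D)
B = np.random.rand(d, D, D)

# For ndim > 2, np.dot contracts the last axis of A with the
# second-to-last axis of B: AB[s,i,t,j] = sum_k A[s,i,k] * B[t,k,j]
AB = np.dot(A, B)
assert AB.shape == (d, D, d, D)
assert np.allclose(AB, np.einsum('sik,tkj->sitj', A, B))
}}}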

it can often be done much faster with explicit (python!) loops

{{{
res = np.zeros((d, d, D, D))
for s in xrange(d):
    for t in xrange(d):
        np.dot(A[s], B[t], out=res[s, t])
res = np.rollaxis(res, 2, 1)
}}}

...assuming dot() uses an optimized BLAS for ndim=2, and the dimensions
are large enough that the overhead of calling BLAS pays off.
In general, reproducing the behaviour of dot() for ndim>2 is just a matter
of calling GEMM in loops as above and then calling rollaxis() once.
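The equivalence of the two approaches is easy to verify: looping 2D dot()
(which dispatches to BLAS GEMM) over the leading axes and then restoring
the axis order reproduces the ndim=3 dot() result exactly (a small sketch,
using illustrative sizes):

{{{
import numpy as np

d, D = 3, 4
A = np.random.rand(d, D, D)
B = np.random.rand(d, D, D)

# Loop over the leading axes, calling 2D dot() (BLAS GEMM) per pair,
# then move the contraction output axes back into dot()'s order.
res = np.zeros((d, d, D, D))
for s in range(d):
    for t in range(d):
        np.dot(A[s], B[t], out=res[s, t])
res = np.rollaxis(res, 2, 1)  # (s,t,i,j) -> (s,i,t,j)

assert np.allclose(res, np.dot(A, B))
}}}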

I therefore propose doing this within _blasdot.c as far as possible (to
eliminate the use of python loops) so that ndim>2 dot(), and tensordot(),
can benefit from BLAS.

Some comparisons of the two methods above (attached script):
{{{
dtype=complex128
AB[s,i,t,j] = np.sum(A[s,i,:] * B[t,:,j])

A.shape = (16, 512, 512); B.shape = (16, 512, 512)
looping over 2D dot() vs. 3D dot(): 24% (about 4 times faster)

A.shape = (20, 64, 64); B.shape = (20, 64, 64)
looping over 2D dot() vs. 3D dot(): 35%

A.shape = (20, 48, 32); B.shape = (20, 32, 48)
looping over 2D dot() vs. 3D dot(): 45%

A.shape = (32, 32, 16); B.shape = (32, 16, 32)
looping over 2D dot() vs. 3D dot(): 82%

A.shape = (64, 10, 8); B.shape = (64, 8, 10)
looping over 2D dot() vs. 3D dot(): 158% (slow python loops..)
}}}
(this was on a 4-core i7 system using ATLAS under heavy load)
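The attached script is not reproduced here, but a minimal timing harness
along the same lines might look like the following (function names and
sizes are illustrative, not from the ticket; absolute timings will depend
heavily on the BLAS in use):

{{{
import numpy as np
from timeit import default_timer as timer

def dot3d(A, B):
    # ndim=3 dot(): currently does NOT go through BLAS
    return np.dot(A, B)

def dot_loop(A, B):
    # explicit loop over 2D dot() (BLAS GEMM), then restore axis order
    d, e = A.shape[0], B.shape[0]
    res = np.empty((d, e, A.shape[1], B.shape[2]))
    for s in range(d):
        for t in range(e):
            np.dot(A[s], B[t], out=res[s, t])
    return np.rollaxis(res, 2, 1)

A = np.random.rand(16, 128, 128)
B = np.random.rand(16, 128, 128)

for f in (dot3d, dot_loop):
    t0 = timer(); r = f(A, B); t1 = timer()
    print(f.__name__, t1 - t0)

assert np.allclose(dot3d(A, B), dot_loop(A, B))
}}}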

--
Ticket URL: <http://projects.scipy.org/numpy/ticket/2163>
NumPy <http://projects.scipy.org/numpy>
```