> If you need/want more speed than the solution Chuck proposed, you should check out Cython and Tokyo. Cython lets you write loops that execute at C speed, whereas Tokyo provides a Cython-level wrapper for BLAS (no need to go through Python code to call NumPy). Tokyo was designed for exactly your use case: lots of matrix multiplies with relatively small matrices, where you start noticing the Python overhead.
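
To make the "Python overhead" point concrete, here is a rough sketch in plain NumPy. It is not Cython or Tokyo code, and it does not use their APIs; it only shows the pattern they are meant to speed up. The function names, the 3x3 size, and the batch of 1000 are my own assumptions for illustration. The batched `np.matmul` call is there only as a reference to check the loop against.

```python
import numpy as np


def multiply_in_python_loop(a_stack, b_stack):
    """Multiply matching pairs of small matrices one pair at a time.

    Every iteration passes through the Python interpreter and NumPy's
    call machinery before any arithmetic happens. With tiny matrices
    that fixed per-call cost can outweigh the multiply itself.
    """
    n = a_stack.shape[0]
    out = np.empty((n, a_stack.shape[1], b_stack.shape[2]))
    for i in range(n):
        out[i] = np.dot(a_stack[i], b_stack[i])
    return out


def multiply_batched(a_stack, b_stack):
    """Same products in a single NumPy call (reference result only)."""
    return np.matmul(a_stack, b_stack)


# Hypothetical workload: 1000 pairs of 3x3 matrices.
rng = np.random.default_rng(0)
a_stack = rng.standard_normal((1000, 3, 3))
b_stack = rng.standard_normal((1000, 3, 3))

looped = multiply_in_python_loop(a_stack, b_stack)
batched = multiply_batched(a_stack, b_stack)
```

Timing `multiply_in_python_loop` (for example with `timeit`) on your own matrix sizes should show how much of the cost is per-call overhead rather than arithmetic.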

It occurred to me I neglected to provide a link (cursed iPhone):



