[SciPy-User] fast small matrix multiplication with cython?

Skipper Seabold jsseabold@gmail....
Mon Dec 6 18:31:26 CST 2010

On Mon, Dec 6, 2010 at 7:23 PM, Robert Kern <robert.kern@gmail.com> wrote:
> On Mon, Dec 6, 2010 at 18:11, Pauli Virtanen <pav@iki.fi> wrote:
>> On Mon, 06 Dec 2010 17:34:19 -0500, Skipper Seabold wrote:
>>> I'm wondering if anyone might have a look at my cython code that does
>>> matrix multiplication and see where I can speed it up or offer some
>>> pointers/reading.  I'm new to Cython and my knowledge of C is pretty
>>> basic based on trial and (mostly) error, so I am sure the code is still
>>> very naive.
>> You'll be hard pressed to do better than Numpy's dot. In the raw data
>> handling, BLAS is very likely faster than most things you can code
>> manually. Moreover, the Cython routine you write must have as much
>> overhead as dot() --- dealing with refcounting, allocating/deallocating
>> PyArrayObjects (which is expensive) etc.
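
The per-call overhead described above can be measured directly. The following is a minimal sketch, assuming a NumPy recent enough to have `np.random.default_rng`; the helper name `per_call_seconds` is hypothetical and not from this thread. It reports only what it measures on the machine it runs on, so no particular numbers are claimed here.

```python
import timeit

import numpy as np


def per_call_seconds(n, repeats=2000):
    """Average wall-clock time of one np.dot call on two n x n float64 matrices."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    # timeit returns the total time for `repeats` calls of the lambda.
    total = timeit.timeit(lambda: np.dot(a, b), number=repeats)
    return total / repeats


if __name__ == "__main__":
    for n in (2, 4, 10, 100):
        t = per_call_seconds(n)
        flops = 2 * n ** 3  # multiply-adds in a dense n x n product
        print(f"n={n:4d}  {t * 1e6:9.2f} us per call  {t / flops * 1e9:9.3f} ns per flop")
```

If the time per call barely changes between the smallest sizes while the time per flop drops sharply as n grows, the fixed call overhead is what dominates for small matrices.
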
> The main thing for his use case is reducing the overhead when called
> from Cython. This started in a Cython-user thread where he was
> directly calling the Python numpy.dot() from Cython. I suggested that,
> given the small dimensions (only up to 10x10), the multiplication
> might be better handled by writing the matmult directly in Cython.
> Unfortunately, the buffer syntax adds a bunch of overhead. Not the
> *same* overhead, mind, and I was hoping it would be less, but it
> turns out to be more.
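
To make "writing the matmult directly" concrete, here is a minimal sketch of the loop structure, written in plain Python rather than Cython so that it runs as-is. It is not the code under discussion in this thread. The function name `matmul_flat` and the row-major flat-buffer layout are assumptions chosen for illustration; in Cython the same three loops would operate on typed C doubles instead of Python objects.

```python
from array import array


def matmul_flat(a, b, out, n, k, m):
    """Write the (n x m) product of a (n x k) and b (k x m) into out.

    All three arguments are flat, row-major sequences of floats, so
    element (i, j) of an r x c matrix lives at index i * c + j.
    """
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i * k + p] * b[p * m + j]
            out[i * m + j] = acc
    return out


a = array("d", [1.0, 2.0, 3.0, 4.0])  # [[1, 2], [3, 4]]
b = array("d", [5.0, 6.0, 7.0, 8.0])  # [[5, 6], [7, 8]]
out = array("d", [0.0] * 4)           # preallocated result buffer
matmul_flat(a, b, out, 2, 2, 2)
print(list(out))  # [19.0, 22.0, 43.0, 50.0]
```

Passing a preallocated `out` buffer is a deliberate choice in this sketch: it avoids creating a new array object on every call, which is the kind of overhead the thread is trying to get rid of.
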

Sorry for the cross-post.  I figured this was better hashed out over here.

> Getting access to the C BLAS implementations would be best. I guess
> you could get descr.f.dotfunc and use that.

Thanks, I will see what I can come up with.  I know it can be sped up
since other software in C++ solves the whole optimization almost
instantaneously when mine takes ~5 seconds for the same case, and my
profiling says that most of the time is spent in the loglikelihood.

>> If you are willing to give up wrapping each small matrix in a separate
>> Numpy ndarray, then you can expect to get additional speed gains.
>> (Although even in that case it could make more sense to call BLAS
>> routines to do the multiplication instead, unless your matrices are small
>> and of fixed size in which case the C compiler may be able to produce
>> some tightly optimized code.)
>> However, in many cases the small matrices can be just stuffed into a
>> single Numpy array.
> His use case (Kalman filters) prevents this.
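
For cases where the small matrices are independent of each other, the "single Numpy array" idea can be sketched as below. This assumes a modern NumPy in which `np.matmul` broadcasts over leading dimensions, which postdates this thread; the names `batched_product` and `looped_product` are hypothetical. It does not help when each product depends on the result of the previous one, which is presumably why a Kalman filter recursion rules it out.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 independent 3x3 matrices stored in one array of shape (1000, 3, 3).
A = rng.standard_normal((1000, 3, 3))
B = rng.standard_normal((1000, 3, 3))


def batched_product(A, B):
    """Multiply A[i] by B[i] for every i in a single NumPy call."""
    return np.matmul(A, B)


def looped_product(A, B):
    """Reference version: one np.dot call per pair of small matrices."""
    return np.array([np.dot(a, b) for a, b in zip(A, B)])


C = batched_product(A, B)
print(C.shape)  # (1000, 3, 3)
print(np.allclose(C, looped_product(A, B)))  # True
```

The batched version pays the Python-level call overhead once for the whole stack instead of once per matrix pair.
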

For posterity's sake.  More akin to my actual problem.


> --
> Robert Kern
> "I have come to believe that the whole world is an enigma, a harmless
> enigma that is made terrible by our own mad attempt to interpret it as
> though it had an underlying truth."
>   -- Umberto Eco
