[SciPy-User] calculating covariances fast (and accurate)

Keith Goodman kwgoodman@gmail....
Thu Jul 8 19:54:16 CDT 2010


On Thu, Jul 8, 2010 at 4:30 PM, Sturla Molden <sturla@molden.no> wrote:
> I needed to calculate covariances with rounding error correction. As
> SciPy exposes BLAS, I could use dger and dgemm. Here is the code:
>
> import numpy as np
> import scipy as sp
> import scipy.linalg
> from scipy.linalg.fblas import dger, dgemm
>
> def mean_cov(X):
>    n,p = X.shape
>    m = X.mean(axis=0)
>    # covariance matrix with correction for rounding error
>    # S = (cx'*cx - (scx'*scx/n))/(n-1)
>    # Am Stat 1983, vol 37: 242-247.
>    cx = X - m
>    scx = cx.sum(axis=0)
>    scx_op = dger(-1.0/n,scx,scx)
>    S = dgemm(1.0, cx.T, cx.T, beta=1.0,
>            c=scx_op, trans_a=0, trans_b=1, overwrite_c=1)
>    S[:] *= 1.0/(n-1)
>    return m,S.T
>
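> For reference, the same corrected formula in plain NumPy (no direct
> BLAS calls) would look roughly like this; slower, but handy for
> checking what the dger/dgemm calls compute:
>
> def mean_cov_numpy(X):
>    # illustrative pure-NumPy version of the corrected formula
>    n,p = X.shape
>    m = X.mean(axis=0)
>    cx = X - m
>    scx = cx.sum(axis=0)
>    # S = (cx'*cx - (scx'*scx/n))/(n-1)
>    S = (np.dot(cx.T, cx) - np.outer(scx, scx)/n) / (n-1)
>    return m,S
>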
> Let's time this a couple of times against NumPy:
>
> if __name__ == '__main__':
>
>    from time import clock
>
>    n,p = 20000,1000
>    X = 2*np.random.randn(n,p) + 5
>
>    t0 = clock()
>    m,S = X.mean(axis=0), np.cov(X, rowvar=False, bias=0)
>    t1 = clock()
>    print t1-t0
>
>    t0 = clock()
>    m,S = mean_cov(X)
>    t1 = clock()
>    print t1-t0
>
> L:\>meancov.py
> 7.39771102515
> 2.24604790004
>
> L:\>meancov.py
> 16.1079984658
> 2.21100101726
>
> That speaks for itself :-D
>
> One important lesson from this: it does not help to write a function
> like cov(X) in C when we have access to optimized BLAS from Python. So
> let this serve as a warning against using C instead of Python.
>
> If we don't want to correct for rounding error, the dger call goes
> away and it just becomes:
>
> def mean_cov(X):
>    n,p = X.shape
>    m = X.mean(axis=0)
>    cx = X - m
>    S = dgemm(1./(n-1), cx.T, cx.T, trans_a=0, trans_b=1)
>    return m,S.T
>
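> A quick sanity check of the uncorrected version against np.cov (a
> rough sketch, reusing the X from the timing script above):
>
> m, S = mean_cov(X)   # uncorrected version
> m0, S0 = X.mean(axis=0), np.cov(X, rowvar=False, bias=0)
> print np.abs(S - S0).max()   # should be near machine precision
>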
>
> Sturla
>
>
> P.S. I should perhaps mention that I use SciPy and NumPy linked with
> MKL. Looking at Windows' task manager, it seems both np.cov and my code
> saturate all four cores.

I don't have MKL, so just using one core:

$ python mean_cov.py
5.09
4.73

If I switch from

n,p = 20000,1000

to

n,p = 2000,1000

I get:

0.51
0.48

A slightly faster version for small arrays (using dot instead of mean and sum):

def mean_cov2(X):
    n,p = X.shape
    one = np.empty(n)
    one.fill(1.0 / n)
    m = np.dot(one, X)  # column means via a single dot product
    cx = X - m
    scx = np.dot(one, cx)
    scx *= n  # recover the column sums of cx
    scx_op = dger(-1.0/n, scx, scx)
    S = dgemm(1.0, cx.T, cx.T, beta=1.0,
              c=scx_op, trans_a=0, trans_b=1, overwrite_c=1)
    S[:] *= 1.0/(n-1)
    return m,S.T

gives

$ python mean_cov.py
0.51
0.48
0.45

but does that get rid of the rounding error correction?

(Pdb) ((S0 - S)*(S0 - S)).sum()
2.6895813860036798e-28
(Pdb) ((S2 - S)*(S2 - S)).sum()
2.331973369765596e-27

where S0 is np.cov, S is mean_cov, and S2 is mean_cov2.
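
For reference, the comparison above was along these lines (a sketch;
it assumes the mean_cov and mean_cov2 functions defined earlier):

S0 = np.cov(X, rowvar=False, bias=0)   # NumPy reference
m, S = mean_cov(X)                     # BLAS, with correction
m2, S2 = mean_cov2(X)                  # dot-based variant
print ((S0 - S)*(S0 - S)).sum()
print ((S2 - S)*(S2 - S)).sum()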

