[SciPy-user] sparse version of stats.pearsonr ?

josef.pktd@gmai...
Mon Mar 9 19:39:36 CDT 2009


On Mon, Mar 9, 2009 at 5:53 PM, Peter Skomoroch
<peter.skomoroch@gmail.com> wrote:
> Before I re-invent the wheel, is there an existing version of
> stats.pearsonr(x,y) that will work with scipy.sparse vectors?
>
> -Pete

The Pearson correlation coefficient is just the regular correlation, as
computed by numpy.corrcoef; stats.pearsonr additionally returns the
p-value of the t-test that the correlation coefficient is zero.
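
The relation between the correlation coefficient and the test can be sketched as follows. This is only an illustration on dense arrays, assuming the usual two-sided t-test with n - 2 degrees of freedom; the helper name pearson_from_corrcoef and the sample data are made up for this example.

```python
import numpy as np
from scipy import stats


def pearson_from_corrcoef(x, y):
    """Return (r, two-sided p-value) built from np.corrcoef and a t-test.

    Hypothetical helper for illustration; assumes |r| < 1 and n > 2.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    r = np.corrcoef(x, y)[0, 1]
    df = n - 2
    # t-statistic for H0: correlation == 0
    t = r * np.sqrt(df / (1.0 - r ** 2))
    # two-sided p-value from the t distribution
    p = 2.0 * stats.t.sf(abs(t), df)
    return r, p


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 7.0, 5.0])
r, p = pearson_from_corrcoef(x, y)
print(r, p)
```

The pair (r, p) should agree with stats.pearsonr(x, y) up to rounding.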

I'm not familiar enough with the sparse package to know how the
details work, but on my first try `mean` looks strange to me: on this
integer matrix the sparse result comes back truncated to integers.

>>> B
<4x4 sparse matrix of type '<type 'numpy.int32'>'
	with 4 stored elements in Compressed Sparse Row format>
>>> B.todense()
matrix([[3, 0, 1, 0],
        [0, 2, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 1]])
>>> B.todense().mean(axis=0)
matrix([[ 0.75,  0.5 ,  0.25,  0.25]])
>>> B.mean(axis=0)
matrix([[0, 0, 0, 0]])
>>> B.todense().mean(axis=1)
matrix([[ 1.  ],
        [ 0.5 ],
        [ 0.  ],
        [ 0.25]])
>>> B.mean(axis=1)
matrix([[1],
        [0],
        [0],
        [0]])
>>> B.sum(axis=1)
matrix([[4],
        [2],
        [0],
        [1]])
>>> B.sum(axis=0)
matrix([[3, 2, 1, 1]])
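
The zeros in the sparse means above look like integer truncation (an assumption based on the int32 dtype, not something checked in the sparse source). A minimal sketch that sidesteps the question is to build the means from `sum` and an explicit float division, which only touches the stored elements:

```python
import numpy as np
from scipy import sparse

# same example matrix as in the session above
I = np.array([0, 0, 1, 3, 1, 0, 0])
J = np.array([0, 2, 1, 3, 1, 0, 0])
V = np.array([1, 1, 1, 1, 1, 1, 1])
B = sparse.coo_matrix((V, (I, J)), shape=(4, 4)).tocsr()

n_rows, n_cols = B.shape
# sum over the stored entries, then divide by the full axis length as a float
col_means = np.asarray(B.sum(axis=0)).ravel() / float(n_rows)
row_means = np.asarray(B.sum(axis=1)).ravel() / float(n_cols)

print(col_means)
print(row_means)
```

These should match the B.todense().mean(...) values in the session, since the implicit zeros contribute nothing to the sums but still count in the divisor.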


Here is my version of a sparse corrcoef and cov that also takes the zero
entries into account, i.e. it gives the same result as
np.cov(sparsematrix.todense()) with bias=1, but (I hope) it avoids any
dense calculation on the original matrix:


import numpy as np
from scipy import sparse, stats

# example from doc strings, help
I = np.array([0,0,1,3,1,0,0])
J = np.array([0,2,1,3,1,0,0])
V = np.array([1,1,1,1,1,1,1])
B = sparse.coo_matrix((V,(I,J)),shape=(4,4)).tocsr()


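def covcsr(x):
    '''return the biased (divide by n) covariance matrix

    assumes variables are in columns and observations in rows'''
    n = float(x.shape[0])
    # column means as a dense 1 x k matrix; only stored entries are summed
    meanx = x.sum(axis=0)/n
    # E[x'x] - mean'mean; x.T*x is a sparse matrix product
    return ((x.T*x)/n - meanx.T*meanx)

def corrcoefcsr(x):
    '''return the correlation matrix, assumes column variables'''
    covx = covcsr(x)
    # standard deviations as a 1 x k row for the outer product below
    stdx = np.sqrt(np.diag(covx))[np.newaxis,:]
    return covx/(stdx.T * stdx)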


B1 = B[:,:2]
B1d = B1.todense()

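print('sparse cov:\n', covcsr(B1))
print('np.cov:\n', np.cov(B1d, rowvar=False, bias=True))
print('sparse corrcoef:\n', corrcoefcsr(B1))
# the normalization cancels in the correlation, so no bias argument is needed
print('np.corrcoef:\n', np.corrcoef(B1d, rowvar=False))
# pearsonr expects 1-d arrays, so flatten the dense matrix columns
print('stats.pearsonr:', stats.pearsonr(np.asarray(B1d[:,0]).ravel(),
                                        np.asarray(B1d[:,1]).ravel()))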
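
For just two sparse vectors, the same moment identities give the correlation coefficient directly. The following is a minimal sketch, assuming x and y are n x 1 scipy.sparse matrices of equal length; pearsonr_sparse is a made-up name, not a SciPy function, and it returns only r, not the p-value.

```python
import numpy as np
from scipy import sparse


def pearsonr_sparse(x, y):
    """Correlation of two n x 1 sparse vectors, counting implicit zeros.

    Hypothetical helper; assumes neither vector is constant.
    """
    # work in float to avoid integer overflow and truncation
    x = x.astype(float)
    y = y.astype(float)
    n = float(x.shape[0])
    sx, sy = x.sum(), y.sum()
    # elementwise products stay sparse; only stored entries are touched
    sxy = x.multiply(y).sum()
    sxx = x.multiply(x).sum()
    syy = y.multiply(y).sum()
    cov = sxy / n - (sx / n) * (sy / n)
    varx = sxx / n - (sx / n) ** 2
    vary = syy / n - (sy / n) ** 2
    return cov / np.sqrt(varx * vary)


I = np.array([0, 0, 1, 3, 1, 0, 0])
J = np.array([0, 2, 1, 3, 1, 0, 0])
V = np.array([1, 1, 1, 1, 1, 1, 1])
B = sparse.coo_matrix((V, (I, J)), shape=(4, 4)).tocsr()

x = B[:, 0:1]
y = B[:, 1:2]
r = pearsonr_sparse(x, y)
print(r)
```

Because this uses E[xy] - E[x]E[y] rather than centered data, it can lose precision when the means are large relative to the spread; centering would destroy the sparsity, so that is the trade-off.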


Josef

