[Scipy-tickets] [SciPy] #866: cluster.kmeans2 has incorrect random init for 2 column data
SciPy
scipy-tickets@scipy....
Wed Feb 4 20:49:43 CST 2009
#866: cluster.kmeans2 has incorrect random init for 2 column data
---------------------------+------------------------------------------------
Reporter: josefpktd | Owner: somebody
Type: defect | Status: new
Priority: normal | Milestone: 0.8.0
Component: scipy.cluster | Version: devel
Severity: normal | Keywords:
---------------------------+------------------------------------------------
cluster.vq._krandinit does not create a bivariate random sample with the
desired covariance, because np.cov returns a scalar rather than a covariance
matrix for data with 2 columns (or rows). (I think this behavior changed
fairly recently.)
As a consequence, the bivariate normal sample has perfect correlation
instead of the correlation of the data.
Instead of using the Cholesky decomposition to generate the multivariate
normal sample, the numpy function np.random.multivariate_normal could be
used (but I think that wouldn't make a difference, and it does not address
the np.cov problem).
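A minimal sketch of the intended behavior, for illustration only: it draws k
points from a normal distribution with the mean and covariance of the data,
and forces the covariance to be a 2-D matrix before the Cholesky step. The
helper name sample_like and the use of np.random.default_rng are assumptions
made for this example; this is not the code in cluster.vq._krandinit.

```python
import numpy as np


def sample_like(data, k, rng=None):
    """Draw k points from a normal with the mean and covariance of `data`.

    Hypothetical helper for illustration; not SciPy's _krandinit.
    Rows of `data` are observations, columns are variables.
    """
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    mu = data.mean(axis=0)
    # rowvar=False: each column is a variable. atleast_2d guards against
    # a scalar (0-d) result, e.g. when the data has a single column.
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    if cov.shape != (mu.size, mu.size):
        raise ValueError("covariance has shape %s, expected %s"
                         % (cov.shape, (mu.size, mu.size)))
    # If cov = L L^T and z ~ N(0, I), then z L^T + mu ~ N(mu, cov).
    chol = np.linalg.cholesky(cov)
    z = rng.standard_normal((k, mu.size))
    return z @ chol.T + mu


rng = np.random.default_rng(0)
bvn = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], 500)
init = sample_like(bvn, 500, rng)

# Off-diagonal correlation of the generated sample and of the data;
# the two should be close if the covariance was carried over.
corr_init = np.corrcoef(init, rowvar=False)[0, 1]
corr_data = np.corrcoef(bvn, rowvar=False)[0, 1]
print(corr_init, corr_data)
```

The Cholesky step could be replaced by rng.multivariate_normal(mu, cov, k),
which should give a sample with the same distribution; either way the
covariance passed in has to be the full matrix.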
example:
{{{
>>> bvn = np.random.multivariate_normal([0,0],[[1,0.5],[0.5,1]],500)
>>> r2d=cluster.vq._krandinit(bvn,500).shape
>>> np.corrcoef(r2d)
1
>>> np.corrcoef(bvn, rowvar=0)
array([[ 1. , 0.5018332],
[ 0.5018332, 1. ]])
}}}
Other cases work correctly, e.g. for data with 3 columns (rn3d):
{{{
>>> r3d=cluster.vq._krandinit(rn3d,500)
>>> np.corrcoef(r3d, rowvar=0)
array([[ 1. , 0.56225876, 0.90405282],
[ 0.56225876, 1. , 0.52268196],
[ 0.90405282, 0.52268196, 1. ]])
>>> np.corrcoef(rn3d, rowvar=0)
array([[ 1. , 0.51687592, 0.90504051],
[ 0.51687592, 1. , 0.47141087],
[ 0.90504051, 0.47141087, 1. ]])
}}}
--
Ticket URL: <http://scipy.org/scipy/scipy/ticket/866>
SciPy <http://www.scipy.org/>
SciPy is open-source software for mathematics, science, and engineering.