[Scipy-tickets] [SciPy] #1760: Inconsistent definition of "distortion" in kmeans

SciPy Trac scipy-tickets@scipy....
Tue Oct 30 16:10:40 CDT 2012

#1760: Inconsistent definition of "distortion" in kmeans
 Reporter:  Maigo                     |       Owner:  somebody   
     Type:  defect                    |      Status:  new        
 Priority:  normal                    |   Milestone:  Unscheduled
Component:  scipy.cluster             |     Version:  0.11.0     
 Keywords:  kmeans distortion square  |  
 In the documentation of the function {{{scipy.cluster.vq.kmeans}}},
 "distortion" is defined as "the sum of the squared differences between the
 observations and the corresponding centroid".
 However, it is implemented as the sum of the (unsquared) differences.

 As a result, the kmeans iteration procedure optimizes the wrong objective
 function. For example, suppose we have five 1-D data points:
 -2, -1, 1, 2, 9. The following code clusters them into 2 classes with
 -1 and 1 as the starting centroids:

 import numpy
 import scipy.cluster.vq
 x = numpy.array([-2, -1, 1, 2, 9], dtype=float).reshape(-1,1)
 scipy.cluster.vq.kmeans(x, x[1:3])

 Currently the output centroids are -0.667 and 5.5. However, if the
 definition of distortion were corrected, the centroids would be 0 and 9,
 yielding a smaller distortion.
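
 The two outcomes can be reproduced with a small stand-alone sketch. It is
 not the code in vq.py: {{{lloyd_sketch}}}, its {{{squared}}} flag and its
 {{{thresh}}} default are hypothetical names, and the loop assumes that the
 iteration updates the centroids and then stops as soon as the total
 distortion no longer decreases by more than the threshold. It also assumes
 that no cluster ever becomes empty.

```python
import numpy as np

def lloyd_sketch(obs, guess, squared, thresh=1e-5):
    """Toy 1-D k-means loop (an assumption, not SciPy's implementation)."""
    code_book = np.array(guess, dtype=float)
    prev = None
    while True:
        # Distance from every observation to every centroid, shape (n, k).
        dist = np.abs(obs[:, None] - code_book[None, :])
        labels = dist.argmin(axis=1)
        distort = dist.min(axis=1)
        if squared:
            distort = distort ** 2
        total = distort.sum()
        # Move each centroid to the mean of the points assigned to it.
        code_book = np.array([obs[labels == j].mean()
                              for j in range(len(code_book))])
        # Stop once the distortion no longer decreases by more than thresh.
        if prev is not None and prev - total <= thresh:
            return code_book
        prev = total

obs = np.array([-2, -1, 1, 2, 9], dtype=float)
unsquared = lloyd_sketch(obs, [-1, 1], squared=False)  # about [-0.667, 5.5]
squared = lloyd_sketch(obs, [-1, 1], squared=True)     # [0., 9.]
```

 With unsquared distances the total goes from 10 to 10.5 after the first
 update, so the loop stops early. With squared distances it keeps
 decreasing (66, 35.75, about 24.03, 10) until the centroids settle at
 0 and 9.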

 This inconsistency may have stemmed from the definition of distortion as
 the distance in the function {{{scipy.cluster.vq.vq}}}.

 '''How to fix:''' In Line 388 of the source file vq.py, modify
 {{{distort}}} to {{{distort ** 2}}}.

 Additionally, I found that kmeans uses integer arithmetic if the input
 data array has an integer dtype. This has been reported before
 (http://projects.scipy.org/scipy/ticket/1246) but not fixed. Since the
 user would seldom intend to use integer arithmetic in the kmeans
 algorithm, I suggest adding a dtype cast at the beginning of the
 {{{kmeans}}} and {{{kmeans2}}} functions.
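
 A minimal illustration of why such a cast matters, using only NumPy. It
 shows the general hazard of storing a mean in an integer array and does
 not claim to reproduce the exact code path in vq.py; the helper name
 {{{as_float_obs}}} is hypothetical.

```python
import numpy as np

def as_float_obs(obs):
    """Hypothetical helper: return the observations as a float array."""
    return np.asarray(obs, dtype=float)

# Integer array: the mean 1.5 is truncated to 1 when stored.
int_book = np.array([1, 2])
int_book[0] = np.mean([1, 2])

# Float array obtained through the cast: the mean is stored unchanged.
float_book = as_float_obs([1, 2])
float_book[0] = np.mean([1, 2])
```

 Calling such a helper on the input at the top of {{{kmeans}}} and
 {{{kmeans2}}} would be one way to apply the suggested cast.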

Ticket URL: <http://projects.scipy.org/scipy/ticket/1760>
SciPy <http://www.scipy.org>
SciPy is open-source software for mathematics, science, and engineering.
