[Scipy-tickets] [SciPy] #1760: Inconsistent definition of "distortion" in kmeans
SciPy Trac
scipy-tickets@scipy....
Tue Oct 30 16:10:40 CDT 2012
#1760: Inconsistent definition of "distortion" in kmeans
--------------------------------------+-------------------------------------
Reporter: Maigo | Owner: somebody
Type: defect | Status: new
Priority: normal | Milestone: Unscheduled
Component: scipy.cluster | Version: 0.11.0
Keywords: kmeans distortion square |
--------------------------------------+-------------------------------------
In the documentation of the function {{{scipy.cluster.vq.kmeans}}},
"distortion" is defined as "the sum of the squared differences between the
observations and the corresponding centroid".
However, it is implemented as the sum of the (unsquared) differences.
This has the implication that the wrong objective function is being
optimized in the kmeans iteration procedure. For example, suppose we have
five 1-D data points: -2, -1, 1, 2, 9. The following code clusters them
into 2 classes with -1 and 1 as the starting centroids:
{{{
import numpy
import scipy.cluster.vq
x = numpy.array([-2, -1, 1, 2, 9], dtype=float).reshape(-1, 1)
scipy.cluster.vq.kmeans(x, x[1:3])
}}}
Currently the output centroids are -0.667 and 5.5. However, if the
definition of distortion were corrected, the centroids would be 0 and 9,
yielding a smaller distortion.
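
The two candidate solutions can be compared by hand. The following is only a sketch using plain NumPy, not the code in vq.py; the helper name `nearest_distances` is made up for this illustration. It assigns each point to its nearest centroid and reports the distortion under both definitions:

```python
import numpy as np

def nearest_distances(x, centroids):
    """Hypothetical helper: distance from each observation to its nearest centroid."""
    # x has shape (n, d); centroids has shape (k, d).
    diffs = x[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))  # shape (n, k)
    return dists.min(axis=1)

x = np.array([-2, -1, 1, 2, 9], dtype=float).reshape(-1, 1)
reported = np.array([[-2.0 / 3.0], [5.5]])  # centroids currently returned
expected = np.array([[0.0], [9.0]])         # centroids after one more update

d_reported = nearest_distances(x, reported)
d_expected = nearest_distances(x, expected)

for name, d in (("reported", d_reported), ("expected", d_expected)):
    print(name, "unsquared:", d.sum(), "squared:", (d ** 2).sum())
```

For the centroids -0.667 and 5.5 the sums come to about 9.5 (unsquared) and 24.03 (squared); for 0 and 9 they are 6 and 10.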
This inconsistency may have stemmed from the definition of distortion as
the distance in the function {{{scipy.cluster.vq.vq}}}.
'''How to fix:''' On line 388 of the source file vq.py, change
{{{distort}}} to {{{distort ** 2}}}.
----
Additionally, I found that kmeans uses integer arithmetic if
the input data array has an integer dtype. This has been reported before
(http://projects.scipy.org/scipy/ticket/1246) but not fixed. Since the
user would seldom intend to use integer arithmetic in the kmeans
algorithm, I suggest adding a dtype cast at the beginning of the
{{{kmeans}}} and {{{kmeans2}}} functions.
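
The suggested cast might look like the sketch below. The helper name `as_float_observations` is hypothetical and is not part of SciPy; the same conversion can also be applied by the caller before passing data to {{{kmeans}}} as a workaround.

```python
import numpy as np

def as_float_observations(obs):
    """Hypothetical helper: return obs as a floating-point array."""
    obs = np.asarray(obs)
    # Leave existing float arrays untouched; promote everything else to float64.
    if not np.issubdtype(obs.dtype, np.floating):
        obs = obs.astype(float)
    return obs

x_int = np.array([-2, -1, 1, 2, 9]).reshape(-1, 1)  # integer dtype
x_float = as_float_observations(x_int)               # float64 copy
print(x_int.dtype, "->", x_float.dtype)
```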
--
Ticket URL: <http://projects.scipy.org/scipy/ticket/1760>
SciPy <http://www.scipy.org>
SciPy is open-source software for mathematics, science, and engineering.