Charles R Harris
Thu Jul 22 12:07:53 CDT 2010
On Thu, Jul 22, 2010 at 10:24 AM, Benjamin Root <firstname.lastname@example.org> wrote:
> On Thu, Jul 22, 2010 at 10:42 AM, alex <email@example.com> wrote:
>> On Thu, Jul 22, 2010 at 10:48 AM, Benjamin Root <firstname.lastname@example.org> wrote:
>>> On Wed, Jul 21, 2010 at 3:25 PM, alex <email@example.com> wrote:
>>>> I want to nitpick about the scipy kmeans clustering implementation.
>>>> Throughout the documentation
>>>> http://docs.scipy.org/doc/scipy/reference/cluster.vq.html and code, the
>>>> 'distortion' of a clustering is defined as "the sum of the distances between
>>>> each observation vector and its dominating centroid." I think that the sum
>>>> of squares of distances should be used instead of the sum of distances, and
>>>> all of the miscellaneous kmeans descriptions I found with google would seem
>>>> to support this.
>>>> For example if one cluster contains the 1D points (1, 2, 3, 4, 10) and
>>>> the old center is 3, then the centroid updating step will move the centroid
>>>> to 4. This step reduces the sum of squares of distances from 55 to 50, but
>>>> it increases the distortion from 11 to 12.
>>> Every implementation of kmeans (except for SciPy's) that I have seen
>>> allowed for the user to specify which distance measure they want to use.
>>> There is no right answer for a distance measure except for "it depends".
>>> Maybe SciPy's implementation should be updated to allow for user-specified
>>> distance measures (e.g. - absolute, euclidian, city-block, etc.)?
>>> Ben Root
>> While the best distance might depend on the application, I think that of
>> all these distance measures only the sum of squares of Euclidean distances
>> is guaranteed to monotonically decrease at each step of the algorithm. If
>> the scipy kmeans implementation depends on this monotonicity, which I think
>> it currently does, then this assumption could be the source of subtle bugs
>> when error measures like distortion are used.
>> Maybe I can nitpick more effectively if I frame it as "scipy is doing
>> something strange and possibly buggy" instead of as "scipy is not using the
>> distance function I want".
> It has been a while since I played with kmeans, but I believe that the
> distance measure merely has to satisfy Minkowski's inequality which is that
> the norm of a + b <= norm of a + norm of b (or was it the Cauchy-Schwarz
That's the triangle inequality and is a required property of anything called
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the SciPy-User