[SciPy-User] kmeans

Benjamin Root ben.root@ou....
Thu Jul 22 14:25:56 CDT 2010


On Thu, Jul 22, 2010 at 12:07 PM, Charles R Harris <
charlesr.harris@gmail.com> wrote:

>
>
> On Thu, Jul 22, 2010 at 10:24 AM, Benjamin Root <ben.root@ou.edu> wrote:
>
>> On Thu, Jul 22, 2010 at 10:42 AM, alex <argriffi@ncsu.edu> wrote:
>>
>>> On Thu, Jul 22, 2010 at 10:48 AM, Benjamin Root <ben.root@ou.edu> wrote:
>>>
>>>> On Wed, Jul 21, 2010 at 3:25 PM, alex <argriffi@ncsu.edu> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I want to nitpick about the scipy kmeans clustering implementation.
>>>>> Throughout the documentation
>>>>> http://docs.scipy.org/doc/scipy/reference/cluster.vq.html and code,
>>>>> the 'distortion' of a clustering is defined as "the sum of the distances
>>>>> between each observation vector and its dominating centroid."  I think that
>>>>> the sum of squares of distances should be used instead of the sum of
>>>>> distances, and all of the miscellaneous kmeans descriptions I found with
>>>>> google would seem to support this.
>>>>>
>>>>> For example if one cluster contains the 1D points (1, 2, 3, 4, 10) and
>>>>> the old center is 3, then the centroid updating step will move the centroid
>>>>> to 4.  This step reduces the sum of squares of distances from 55 to 50, but
>>>>> it increases the distortion from 11 to 12.
>>>>>
>>>>> Alex
>>>>>
>>>>
>>>> Every implementation of kmeans (except for SciPy's) that I have seen
>>>> allowed for the user to specify which distance measure they want to use.
>>>> There is no right answer for a distance measure except for "it depends".
>>>> Maybe SciPy's implementation should be updated to allow for user-specified
>>>> distance measures (e.g. - absolute, euclidian, city-block, etc.)?
>>>>
>>>> Ben Root
>>>>
>>>
>>> While the best distance might depend on the application, I think that of
>>> all these distance measures only the sum of squares of Euclidean distances
>>> is guaranteed to monotonically decrease at each step of the algorithm.  If
>>> the scipy kmeans implementation depends on this monotonicity, which I think
>>> it currently does, then this assumption could be the source of subtle bugs
>>> when error measures like distortion are used.
>>>
>>> Maybe I can nitpick more effectively if I frame it as "scipy is doing
>>> something strange and possibly buggy" instead of as "scipy is not using the
>>> distance function I want".
>>>
>>>
>> It has been a while since I played with kmeans, but I believe that the
>> distance measure merely has to satisfy Minkowski's inequality which is that
>> the norm of a + b <= norm of a + norm of b (or was it the Cauchy-Schwarz
>> inequality?)
>>
>
> That's the triangle inequality and is a required property of anything
> called a norm.
>
> <snip>
>
> Chuck
>
>
>
Right... why is it that I first think of the most complex concepts first...?

Ben Root
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/scipy-user/attachments/20100722/377f618d/attachment-0001.html 


More information about the SciPy-User mailing list