[SciPy-User] kmeans

Benjamin Root ben.root@ou....
Fri Jul 23 13:12:21 CDT 2010


On Fri, Jul 23, 2010 at 12:48 PM, alex <argriffi@ncsu.edu> wrote:

> On Fri, Jul 23, 2010 at 1:36 PM, Benjamin Root <ben.root@ou.edu> wrote:
>
>> On Fri, Jul 23, 2010 at 12:27 PM, David Cournapeau <cournape@gmail.com> wrote:
>>
>>> On Sat, Jul 24, 2010 at 2:19 AM, Benjamin Root <ben.root@ou.edu> wrote:
>>>
>>> >
>>> > Examining further, I see that SciPy's implementation is fairly
>>> > simplistic and has some issues.  In the given example, the reason
>>> > why 3 is never returned is not the use of the distortion metric, but
>>> > rather that the kmeans function never sees the distance for using 3.
>>> > In fact, the code that actually does the convergence is in vq and
>>> > py_vq (vector quantization), and it tries to minimize the sum of
>>> > squared errors.  kmeans just keeps retrying the convergence with
>>> > random guesses to see if different convergences occur.
>>>
>>> As one of the maintainers of kmeans, I would be the first to admit
>>> the code is basic, for good and for bad.  Something more elaborate
>>> for clustering may indeed be useful, as long as the interface stays
>>> simple.
>>>
>>> More complex needs should turn to scikits.learn or more specialized
>>> packages.
>>>
>>> cheers,
>>>
>>> David
>>>
>>
>> I agree; kmeans does not need to get very complicated, because kmeans
>> (the general concept) is not well suited to complicated situations.
>>
>> As a thought, one way to help out the current implementation would be
>> to ensure that unique guesses are made.  Currently, several iterations
>> can be wasted on guesses that have already been tried.  Is there a way
>> to do sampling without replacement in numpy.random?
>>
>> Ben Root
>>
>> [clip]
>
> If scipy wants to use the most vanilla kmeans, then I suggest it should
> use the sum of squared errors everywhere it currently uses the sum of
> errors.  If you really want to optimize the sum of errors, then the
> median is probably a better cluster center than the mean, but adding
> more center definitions would start to get complicated.
>
> Alex
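
Alex's mean/median point is of course correct; here is a quick sanity
check on a toy 1-d example (names like sum_sq_err are just mine, nothing
from scipy):

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 10.0])   # a 1-d "cluster" with an outlier

    def sum_sq_err(c):
        return np.sum((x - c) ** 2)       # sum of squared errors about c

    def sum_abs_err(c):
        return np.sum(np.abs(x - c))      # sum of absolute errors about c

    # the mean (3.25) beats the median (1.5) on squared error...
    print("SSE: mean %.2f vs median %.2f"
          % (sum_sq_err(x.mean()), sum_sq_err(np.median(x))))   # 62.75 < 75.00
    # ...but the median beats the mean on absolute error:
    print("SAE: median %.2f vs mean %.2f"
          % (sum_abs_err(np.median(x)), sum_abs_err(x.mean())))  # 11.00 < 13.50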
That said, I think there is some confusion here about how the distortion is
actually computed.  At line 257 of vq.py, the sum of the squared differences
between each observation and its centroid is calculated; the square root is
then taken (line 261) and returned as the distortion at line 378.  The only
places I see a simple difference used are in _py_vq_1d(), which isn't being
called, and at line 391, where it is merely used for tolerance testing.

Even in the C code, the difference is taken and immediately squared.  Maybe
the documentation no longer matches the code?

Ben Root
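
P.S.  To partially answer my own question above: numpy.random.permutation
can be used for sampling without replacement.  A rough, untested sketch of
the restart idea, seeding each run with k distinct observations
(kmeans_restarts is just a hypothetical helper, not anything in scipy):

    import numpy as np
    from scipy.cluster.vq import kmeans

    def kmeans_restarts(obs, k, restarts=20):
        # obs should already be whitened (see scipy.cluster.vq.whiten)
        best_book, best_dist = None, np.inf
        for _ in range(restarts):
            # a truncated permutation picks k *distinct* observations
            # as the initial codebook (sampling without replacement)
            guess = obs[np.random.permutation(len(obs))[:k]]
            book, dist = kmeans(obs, guess)
            if dist < best_dist:
                best_book, best_dist = book, dist
        return best_book, best_dist

Note that this only guarantees distinct observations within a single
restart; different restarts could still happen to draw the same set of
initial guesses.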