[SciPy-User] kmeans

Keith Goodman kwgoodman@gmail....
Fri Jul 23 19:53:33 CDT 2010


On Fri, Jul 23, 2010 at 5:46 PM, Benjamin Root <ben.root@ou.edu> wrote:
> On Fri, Jul 23, 2010 at 6:48 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
>>
>> On Fri, Jul 23, 2010 at 4:00 PM, Benjamin Root <ben.root@ou.edu> wrote:
>>
>> > The stopping condition uses the change in the distortion, not a
>> > non-squared distance.  The distortion is already a sum of squares.
>> > The only place that a non-squared distance is used is in _py_vq_1d(),
>> > which appears to be very old code and raises an error in its very
>> > first statement.
>>
>> That's good news.
>>
>> Another place that a non-squared distance is used is the return value:
>>
>> >> import numpy as np
>> >> from scipy import cluster
>> >> v = np.array([1,2,3,4,10],dtype=float)
>> >> cluster.vq.kmeans(v, 1)
>>   (array([ 4.]), 2.3999999999999999)
>>
>> >> np.sqrt(np.dot(v-4, v-4) / 5.0)
>>   3.1622776601683795  # Nope, not returned
>> >> np.absolute(v - 4).mean()
>>   2.3999999999999999 # Yep, this one is returned
>>
>> Is that a code bug or a doc bug?
>
> Well, see, that's just the thing... the doc says that it returns the
> distortion, which is what it does, but obviously this distortion is an MAE
> and not an RMSE.  The problem is that I have gone back and forth over
> the code, including the Cython version, and I can't find any place where
> this is happening.
>
> Does anybody know of any good code tracing tools?  I used trace once, but it
> wasn't very user-friendly...

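I think I see it! Yes, the squared distance is calculated. But before
it is summed or averaged, the square root is taken. That turns the
squared distance back into a plain distance, so the value returned is
the mean distance to the centroid (2.4 in the example above), not an RMSE.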
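
Here is a minimal sketch of the order of operations I mean, using the
same data as above. This is not the actual scipy code; the variable names
are my own and the centroid is hard-coded to the value kmeans returned.

```python
import numpy as np

v = np.array([1, 2, 3, 4, 10], dtype=float)
centroid = 4.0  # value returned by cluster.vq.kmeans(v, 1) above

# Squared distance of each observation to the centroid
sq_dist = (v - centroid) ** 2

# Taking the square root per observation, *before* averaging,
# gives back the plain (absolute) distance in the 1-d case
dist = np.sqrt(sq_dist)
mean_dist = dist.mean()

# Averaging first and taking the square root afterwards gives the RMSE
rmse = np.sqrt(sq_dist.mean())

print(mean_dist)  # matches np.absolute(v - 4).mean()
print(rmse)       # matches np.sqrt(np.dot(v-4, v-4) / 5.0)
```

So the two candidate values differ only in whether the square root is
applied before or after the mean.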

