[SciPy-User] kmeans
Keith Goodman
kwgoodman@gmail....
Fri Jul 23 19:53:33 CDT 2010
On Fri, Jul 23, 2010 at 5:46 PM, Benjamin Root <ben.root@ou.edu> wrote:
> On Fri, Jul 23, 2010 at 6:48 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
>>
>> On Fri, Jul 23, 2010 at 4:00 PM, Benjamin Root <ben.root@ou.edu> wrote:
>>
>> > The stopping condition uses the change in the distortion, not a
>> > non-squared distance. The distortion is already a sum of squares. The
>> > only place that a non-squared distance is used is in _py_vq_1d(), which
>> > appears to be very old code and has a raise error at the very first
>> > statement.
>>
>> That's good news.
>>
>> Another place that a non-squared distance is used is the return value:
>>
>> >> import numpy as np
>> >> from scipy import cluster
>> >> v = np.array([1,2,3,4,10],dtype=float)
>> >> cluster.vq.kmeans(v, 1)
>> (array([ 4.]), 2.3999999999999999)
>>
>> >> np.sqrt(np.dot(v-4, v-4) / 5.0)
>> 3.1622776601683795 # Nope, not returned
>> >> np.absolute(v - 4).mean()
>> 2.3999999999999999 # Yep, this one is returned
>>
>> Is that a code bug or a doc bug?
>
> Well, see, that's just the thing... the doc says that it returns the
> distortion, which is what it does, but obviously, this distortion was an MAE
> and not an RMSE. The problem is that I have gone backwards and forwards over
> the code, including the Cython version, and I can't find any place where
> this is happening.
>
> Does anybody know of any good code tracing tools? I used trace once, but it
> wasn't very user-friendly...
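I think I see it! Yes, the squared distance is calculated. But before
it is summed or averaged, the square root is taken. That turns the
squared distance back into plain distance, which in 1d is the absolute
deviation. So the mean that gets returned matches
np.absolute(v - 4).mean() and not the RMSE.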
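
Here is a minimal sketch of that order of operations, written in plain
NumPy rather than copied from the scipy.cluster.vq source. The variable
names are illustrative assumptions, not names from the SciPy code. It
only shows that taking the root per observation before averaging gives
the 2.4 seen above, while averaging first and taking the root last
gives the RMSE.

```python
import numpy as np

# Same data as in the session above; with k=1 the centroid is the mean.
v = np.array([1, 2, 3, 4, 10], dtype=float)
centroid = v.mean()

# Squared distance of each observation to the centroid.
sq_dist = (v - centroid) ** 2

# Root taken per observation, then averaged: mean (non-squared) distance.
mean_dist = np.sqrt(sq_dist).mean()

# Averaged first, root taken last: root mean squared error.
rmse = np.sqrt(sq_dist.mean())

print(mean_dist, rmse)
```

The first number is the value kmeans returned and the second is the
value that was not returned.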
