# [SciPy-User] kmeans (Re: Mailing list?)

Robert Kern robert.kern@gmail....
Mon Nov 23 01:54:16 CST 2009

```
On Mon, Nov 23, 2009 at 01:18, Simon Friedberger
<simon+python@a-oben.org> wrote:
> Hi David,
>
> thanks for your explanation. I agree with your arguments but couldn't it
> have the opposite effect: Weighing features that should have less
> discriminative power more because they have a small variance?

If a variable has a small variance, a large deviation in that variable
is *very* informative and should have a larger impact on the
classification than a small deviation in a variable that has a large
variance.

Let's distinguish two cases: one in which each variable has its own
units (let's say degrees Celsius and meters) and one in which each
variable is commensurable and in the same units (let's say meters).

Now, in the first case, you need some way to put all of the variables
into the same units so you can sensibly compute a distance using all
of the variables. A reasonable choice of units is "one standard
deviation [of the marginal distribution for the variable]".

In the second case, there *may* be a case for not doing prewhitening.
If your points are actually 3D points in real space with a metric,
then you may want to use that space's metric as the distance. However,
if the process that created your data is creating "oblong"
distributions of points, that may indicate that it is using a
different notion of distance. In fact, you may want to do a PCA to
find the rotation that aligns your variables with the principal
directions of variation. And then prewhiten in those directions.

The key point is to find an appropriate definition of distance to use.
Prewhitening is a good default when you don't yet have a model of your
process. And you usually don't. :-)

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
```