[SciPy-user] Kmeans help and C source

eric eric at scipy.org
Thu Jan 10 11:26:46 CST 2002

----- Original Message -----
From: "Costas Malamas" <costasm at hotmail.com>
To: <eric at enthought.com>
Cc: <scipy-user at scipy.org>
Sent: Tuesday, January 08, 2002 5:12 PM
Subject: Re: [SciPy-user] Kmeans help and C source

> Eric,
> This is more lucid than anything I've read so far on the Web. Seriously,
> this should be in the docstring :-)...
> I still need to grok where and what to modify, but at least now I know
> I don't have to look outside of SciPy ;-)

Hey Costas,

I've attached a slightly modified version of scipy/cluster/vq.py that takes
out the shenanigans used to attempt using the C version of vq.  The result
is a pure Python kmeans algorithm in 80 lines of code (the rest is
doc-strings).  Assuming you have SciPy installed, just dump this file into
your PYTHONPATH and start hacking.  It won't be as fast as the version that
uses C, but this shouldn't matter during development.

If you're modifying your distance metric, then you'll need to modify py_vq().
If you're just modifying how kmeans uses the distortion information, etc.,
you'll only need to modify the kmeans() or kmeans_() helper functions.
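For instance, a metric swap might look like this in today's NumPy (a rough sketch, not part of the attached file; `manhattan_vq` is a hypothetical name, and modern `numpy` rather than the 2002-era Numeric API is assumed):

```python
import numpy as np

def manhattan_vq(obs, code_book):
    # Assign each observation to the code with the smallest
    # city-block (L1) distance instead of the Euclidean default.
    # obs: (N, F) array; code_book: (K, F) array.
    dist = np.abs(obs[:, None, :] - code_book[None, :, :]).sum(axis=-1)
    code = dist.argmin(axis=1)  # index of the nearest code per observation
    min_dist = dist[np.arange(len(obs)), code]
    return code, min_dist

obs = np.array([[1.9, 2.3, 1.7],
                [1.5, 2.5, 2.2],
                [0.8, 0.6, 1.7]])
book = np.array([[1., 1., 1.],
                 [2., 2., 2.]])
code, dist = manhattan_vq(obs, book)
```

Everything outside the distance computation (the argmin and the gather of
per-observation distances) stays the same, which is why only py_vq() needs
to change.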

hope that helps,

# pure_python_vq.py
""" Vector Quantization Module

    Provides several routines used in creating a code book from a set of
    observations and comparing a set of observations to a code book.

    All routines expect an "observation vector" to be stored in each
    row of the obs matrix.  Similarly the codes are stored row wise
    in the code book matrix.

    whiten(obs) --
        Normalize a group of observations on a per feature basis
    vq(obs,code_book) --
        Calculate code book membership of obs
    kmeans(obs,k_or_guess,iter=20,thresh=1e-5) --
        Train a codebook for minimum distortion using the kmeans algorithm
"""

from Numeric import *
from fastumath import *
from RandomArray import randint
import scipy

def whiten(obs):
    """ Normalize a group of observations on a per feature basis


            Before running a kmeans algorithm, it is beneficial to "whiten",
            or scale, the observation data on a per feature basis.  This is
            done by dividing each feature by its standard deviation across
            all observations.


            obs -- 2D array.
                    Each row of the array is an observation.  The
                    columns are the "features" seen during each observation

                              #   f0    f1    f2
                        obs = [[  1.,   1.,   1.],  #o0
                               [  2.,   2.,   2.],  #o1
                               [  3.,   3.,   3.],  #o2
                               [  4.,   4.,   4.]]) #o3

            XXX perhaps should have an axis variable here.


            result -- 2D array.
                    Contains the values in obs scaled by the standard
                    deviation of each column.


            >>> features  = array([[  1.9,2.3,1.7],
            ...                    [  1.5,2.5,2.2],
            ...                    [  0.8,0.6,1.7,]])
            >>> whiten(features)
            array([[ 3.41250074,  2.20300046,  5.88897275],
                   [ 2.69407953,  2.39456571,  7.62102355],
                   [ 1.43684242,  0.57469577,  5.88897275]])
    """
    std_dev = scipy.std(obs,axis=0)
    return obs / std_dev

def vq(obs,code_book):
    """ Vector Quantization: assign feature sets to codes in a code book.

            Vector quantization determines which code in the code book best
            represents an observation of a target.  The features of each
            observation are compared to each code in the book, and assigned
            the one closest to it.  The observations are contained in the
            obs array.  These features should be "whitened," or normalized
            by the standard deviation of all the features before being
            quantized.
            The code book can be created using the kmeans algorithm or
            something similar.

           This currently forces 32 bit math precision for speed.  Anyone
           know of a situation where this undermines the accuracy of the
           algorithm?
            obs -- 2D array.
                    Each row of the array is an observation.  The
                    columns are the "features" seen during each observation
                    The features must be whitened first using the
                    whiten function or something equivalent.
            code_book -- 2D array.
                    The code book is usually generated using the kmeans
                    algorithm.  Each row of the array holds a different
                    code, and the columns are the features of the code.
                                    #   f0    f1    f2    f3
                        code_book = [[  1.,   2.,   3.,   4.],  #c0
                                     [  1.,   2.,   3.,   4.],  #c1
                                     [  1.,   2.,   3.,   4.]]) #c2
            code -- 1D array.
                    If obs is a NxM array, then a length N array
                    is returned that holds the selected code book index for
                    each observation.
            dist -- 1D array.
              The distortion (distance) between the observation and
              its nearest code


            >>> code_book = array([[1.,1.,1.],
            ...                    [2.,2.,2.]])
            >>> features  = array([[  1.9,2.3,1.7],
            ...                    [  1.5,2.5,2.2],
            ...                    [  0.8,0.6,1.7]])
            >>> vq(features,code_book)
            (array([1, 1, 0],'i'), array([ 0.43588989,  0.73484692,  0.83066239]))
    """
    results = py_vq(obs,code_book)
    return results

def py_vq(obs,code_book):
    """ Python version of vq algorithm.

        This function is slower than the C versions, but it works for
        all input types.  If the inputs have the wrong types for the
        C versions of the function, this one is called as a last resort.

        It's about 20 times slower than the C versions.
    """
    No,Nf = shape(obs) #No = observation count, Nf = feature count
    # code books and observations should have same number of features
    assert(Nf == code_book.shape[1])
    code = [];min_dist = []
    #create a memory block to use for the difference calculations
    diff = zeros(shape(code_book),_common_type(obs,code_book))
    for o in obs:
        subtract(code_book,o,diff) # faster version of --> diff = code_book - o
        dist = sqrt(sum(diff*diff,-1))
        code.append(argmin(dist))
        #something weird here: dst does not work reliably because it sometimes
        #returns an array of goofy length. Try/except fixes it, but is ugly.
        dst = minimum.reduce(dist,0)
        try:    dst = dst[0]
        except: pass
        min_dist.append(dst)
    return array(code,typecode=Int), array(min_dist)

def kmeans_(obs,guess,thresh=1e-5):
    """ See kmeans


        code_book -- the lowest distortion codebook found.
        avg_dist -- the average distance an observation is
                    from a code in the book.  Lower means
                    the code_book matches the data better.

        Note: not whitened in this example.

        >>> features  = array([[ 1.9,2.3],
        ...                    [ 1.5,2.5],
        ...                    [ 0.8,0.6],
        ...                    [ 0.4,1.8],
        ...                    [ 1.0,1.0]])
        >>> book = array((features[0],features[2]))
        >>> kmeans_(features,book)
        (array([[ 1.7       ,  2.4       ],
               [ 0.73333333,  1.13333333]]), 0.40563916697728591)


    """
    code_book = array(guess,copy=1)
    Nc = code_book.shape[0]
    avg_dist = []
    diff = thresh+1.
    while diff > thresh:
        #compute membership and distances between obs and code_book
        obs_code, distort = vq(obs,code_book)
        avg_dist.append(scipy.mean(distort,-1))
        #recalc code_book as centroids of associated obs
        if(diff > thresh):
            has_members = []
            for i in arange(Nc):
                cell_members = compress(equal(obs_code,i),obs,0)
                if cell_members.shape[0] > 0:
                    code_book[i] = scipy.mean(cell_members,0)
                    has_members.append(i)
            #remove codes that didn't have any members
            code_book = take(code_book,has_members,0)
        if len(avg_dist) > 1:
            diff = avg_dist[-2] - avg_dist[-1]
    return code_book, avg_dist[-1]

def kmeans(obs,k_or_guess,iter=20,thresh=1e-5):
    """ Generate a code book with minimum distortion



        obs -- 2D array
                Each row of the array is an observation.  The
                columns are the "features" seen during each observation
                The features must be whitened first using the
                whiten function or something equivalent.
        k_or_guess -- integer or 2D array.
            If integer, it is the number of code book elements.
            If a 2D array, the array is used as the intial guess for
            the code book.  The array should have k rows, and the
            same number of columns (features) as the obs array.
        iter -- integer.
            The number of times to restart the kmeans algorithm with
            a new initial guess.  If k_or_guess is a 2D array (codebook),
            this argument is ignored and only 1 iteration is run.
        thresh -- float
            Terminate each kmeans run when the distortion change from
            one iteration to the next is less than this value.

        code_book -- 2D array.
            The codes that best fit the observation
        distortion -- float
            The distortion between the observations and the codes.



        ("Not checked carefully for accuracy..." he said sheepishly)

        >>> features  = array([[ 1.9,2.3],
        ...                    [ 1.5,2.5],
        ...                    [ 0.8,0.6],
        ...                    [ 0.4,1.8],
        ...                    [ 0.1,0.1],
        ...                    [ 0.2,1.8],
        ...                    [ 2.0,0.5],
        ...                    [ 0.3,1.5],
        ...                    [ 1.0,1.0]])
        >>> whitened = whiten(features)
        >>> book = array((whitened[0],whitened[2]))
        >>> kmeans(whitened,book)
        (array([[ 2.3110306 ,  2.86287398],
               [ 0.93218041,  1.24398691]]), 0.85684700941625547)

        >>> import RandomArray
        >>> RandomArray.seed(1000,2000)
        >>> codes = 3
        >>> kmeans(whitened,codes)
        (array([[ 2.3110306 ,  2.86287398],
               [ 1.32544402,  0.65607529],
               [ 0.40782893,  2.02786907]]), 0.5196582527686241)

    """
    if int(iter) < 1:
        raise ValueError, 'iter must be >= 1.'
    if type(k_or_guess) == type(array([])):
        guess = k_or_guess
        result = kmeans_(obs,guess,thresh=thresh)
    else:
        best_dist = 100000 #initialize best distance value to a large value
        No = obs.shape[0]
        k = k_or_guess
        for i in range(iter):
            #the initial code book is randomly selected from observations
            guess = take(obs,randint(0,No,k),0)
            book,dist = kmeans_(obs,guess,thresh=thresh)
            if dist < best_dist:
                best_book = book
                best_dist = dist
        result = best_book,best_dist
    return result
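
As an aside: the per-observation loop in py_vq can also be written without
a Python-level loop.  Here is a sketch against modern numpy (an assumption;
the attached file targets the old Numeric API, and `vq_vectorized` is a
hypothetical name, not a SciPy function):

```python
import numpy as np

def vq_vectorized(obs, code_book):
    # Euclidean vector quantization with broadcasting:
    # diff has shape (N, K, F), one row of differences per (obs, code) pair.
    diff = obs[:, None, :] - code_book[None, :, :]
    dist = np.sqrt((diff * diff).sum(axis=-1))   # (N, K) distance matrix
    code = dist.argmin(axis=1)                   # nearest code per observation
    return code, dist[np.arange(len(obs)), code]

features = np.array([[1.9, 2.3, 1.7],
                     [1.5, 2.5, 2.2],
                     [0.8, 0.6, 1.7]])
book = np.array([[1., 1., 1.],
                 [2., 2., 2.]])
code, dist = vq_vectorized(features, book)
# code -> [1, 1, 0], matching the vq doctest above
```

The trade-off is memory: broadcasting materializes an N x K x F temporary,
so the looped version can be preferable for very large observation sets.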
