[SciPy-user] Kmeans help and C source

eric eric at scipy.org
Thu Jan 10 11:26:46 CST 2002


----- Original Message -----
From: "Costas Malamas" <costasm at hotmail.com>
To: <eric at enthought.com>
Cc: <scipy-user at scipy.org>
Sent: Tuesday, January 08, 2002 5:12 PM
Subject: Re: [SciPy-user] Kmeans help and C source


> Eric,
>
> This is more lucid than anything I've read so far on the Web. Seriously,
> this should be in the docstring :-)...
>
> I still need to grok where and what to modify, but at least now I know
> that I don't have to look outside of SciPy ;-)

Hey Costas,

I've attached a slightly modified version of scipy/cluster/vq.py that takes
out the shenanigans used to attempt using the C version of vq.  The result
is a pure Python kmeans algorithm in 80 lines of code (the rest is
doc-strings).  Assuming you have SciPy installed, just dump this file into
your PYTHONPATH and start hacking.  It won't be as fast as the version that
uses C, but this shouldn't matter during development.

If you're modifying your distance metric, you'll need to modify py_vq().
If you're just modifying how kmeans uses the distortion information, etc.,
you'll need to modify only the kmeans() or kmeans_() helper functions.
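
For instance, a city-block (L1) variant would change only the distance
line in py_vq().  Here's a sketch (just an illustration, assuming the
same "from Numeric import *" namespace as the attached file):

    # sketch: py_vq with a city-block (L1) metric instead of Euclidean
    def py_vq_cityblock(obs, code_book):
        No, Nf = shape(obs)
        assert(Nf == code_book.shape[1])
        code = []; min_dist = []
        diff = zeros(shape(code_book), Float)
        for o in obs:
            subtract(code_book, o, diff)
            dist = sum(absolute(diff), -1)  # was: sqrt(sum(diff*diff,-1))
            best = argmin(dist)
            code.append(best)
            min_dist.append(dist[best])
        return array(code, typecode=Int), array(min_dist)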

hope that helps,
eric

-------------------------------------------------
# pure_python_vq.py
""" Vector Quantization Module

    Provides several routines used in creating a code book from a set of
    observations and comparing a set of observations to a code book.

    All routines expect an "observation vector" to be stored in each
    row of the obs matrix.  Similarly the codes are stored row wise
    in the code book matrix.

    whiten(obs) --
        Normalize a group of observations on a per feature basis
    vq(obs,code_book) --
        Calculate code book membership of obs
    kmeans(obs,k_or_guess,iter=20,thresh=1e-5) --
        Train a codebook for minimum distortion using the kmeans algorithm

"""
from Numeric import *
from fastumath import *
from RandomArray import randint
import scipy
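
# Added stand-in (assumption): py_vq() below calls _common_type(), but
# the helper didn't make it into this extract of vq.py.  This minimal
# version just picks a Numeric typecode wide enough for the difference
# calculations.
def _common_type(*arrays):
    for a in arrays:
        if a.typecode() in ('F', 'D'):   # single/double complex
            return Complex
    return Float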

def whiten(obs):
    """ Normalize a group of observations on a per feature basis

        Description

            Before running kmeans algorithms, it is beneficial to "whiten",
            or scale, the observation data on a per feature basis.  This is
            done by dividing each feature by its standard deviation across
            all observations.

        Arguments

            obs -- 2D array.
                    Each row of the array is an observation.  The
                    columns are the "features" seen during each observation

                              #   f0    f1    f2
                        obs = [[  1.,   1.,   1.],  #o0
                               [  2.,   2.,   2.],  #o1
                               [  3.,   3.,   3.],  #o2
                               [  4.,   4.,   4.]]  #o3

            XXX perhaps should have an axis variable here.

        Outputs

            result -- 2D array.
                    Contains the values in obs scaled by the standard
                    deviation of each column.

        Test

            >>> features  = array([[  1.9,2.3,1.7],
            ...                    [  1.5,2.5,2.2],
            ...                    [  0.8,0.6,1.7,]])
            >>> whiten(features)
            array([[ 3.41250074,  2.20300046,  5.88897275],
                   [ 2.69407953,  2.39456571,  7.62102355],
                   [ 1.43684242,  0.57469577,  5.88897275]])

    """
    std_dev = scipy.std(obs,axis=0)
    return obs / std_dev
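
# Added note (a sketch, not from the original module): whitening is just
# a per-column rescale, so a code book trained on whitened data can be
# mapped back to the original units by multiplying by the same
# per-column standard deviation, e.g.:
#
#     std_dev = scipy.std(obs, axis=0)
#     book_in_original_units = book * std_dev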

def vq(obs,code_book):
    """ Vector Quantization: assign features sets to codes in a code book.

        Description:
            Vector quantization determines which code in the code book best
            represents an observation of a target.  The features of each
            observation are compared to each code in the book, and the
            observation is assigned the code closest to it.  The
            observations are contained in the obs array.  These features
            should be "whitened," or normalized by the standard deviation
            of all the features, before being quantized.  The code book can
            be created using the kmeans algorithm or something similar.

        Note:
           This currently forces 32 bit math precision for speed.  Anyone
           know of a situation where this undermines the accuracy of the
           algorithm?


        Arguments:
            obs -- 2D array.
                    Each row of the array is an observation.  The
                    columns are the "features" seen during each observation
                    The features must be whitened first using the
                    whiten function or something equivalent.
            code_book -- 2D array.
                    The code book is usually generated using the kmeans
                    algorithm.  Each row of the array holds a different
                    code, and the columns are the features of the code.
                                    #   f0    f1    f2   f3
                        code_book = [[  1.,   2.,   3.,   4.],  #c0
                                     [  1.,   2.,   3.,   4.],  #c1
                                     [  1.,   2.,   3.,   4.]]  #c2
        Outputs:
            code -- 1D array.
                    If obs is an NxM array, then a length N array
                    is returned that holds the selected code book index for
                    each observation.
            dist -- 1D array.
                    The distortion (distance) between the observation and
                    its nearest code.
        Reference

        Test

            >>> code_book = array([[1.,1.,1.],
            ...                    [2.,2.,2.]])
            >>> features  = array([[  1.9,2.3,1.7],
            ...                    [  1.5,2.5,2.2],
            ...                    [  0.8,0.6,1.7]])
            >>> vq(features,code_book)
            (array([1, 1, 0],'i'), array([ 0.43588989,  0.73484692,  0.83066239]))

    """
    results = py_vq(obs,code_book)
    return results

def py_vq(obs,code_book):
    """ Python version of vq algorithm.

        This function is slower than the C versions, but it works for
        all input types.  If the inputs have the wrong types for the
        C versions of the function, this one is called as a last resort.

        It's about 20 times slower than the C versions.
    """
    No,Nf = shape(obs) #No = observation count, Nf = feature count
    # code books and observations should have same number of features
    assert(Nf == code_book.shape[1])
    code = [];min_dist = []
    #create a memory block to use for the difference calculations
    diff = zeros(shape(code_book),_common_type(obs,code_book))
    for o in obs:
        subtract(code_book,o,diff) # faster version of --> diff = code_book - o
        dist = sqrt(sum(diff*diff,-1))
        code.append(argmin(dist))
        #something weird here: dst does not work reliably because it sometimes
        #returns an array of goofy length.  The try/except fixes it, but is ugly.
        dst = minimum.reduce(dist,0)
        try:    dst = dst[0]
        except: pass
        min_dist.append(dst)
    return array(code,typecode=Int), array(min_dist)

def kmeans_(obs,guess,thresh=1e-5):
    """ See kmeans

    Outputs

        code_book -- the lowest distortion codebook found.
        avg_dist -- the average distance an observation is
                    from a code in the book.  Lower means
                    the code_book matches the data better.
    Test

        Note: not whitened in this example.

        >>> features  = array([[ 1.9,2.3],
        ...                    [ 1.5,2.5],
        ...                    [ 0.8,0.6],
        ...                    [ 0.4,1.8],
        ...                    [ 1.0,1.0]])
        >>> book = array((features[0],features[2]))
        >>> kmeans_(features,book)
        (array([[ 1.7       ,  2.4       ],
               [ 0.73333333,  1.13333333]]), 0.40563916697728591)

    """

    code_book = array(guess,copy=1)
    Nc = code_book.shape[0]
    avg_dist=[]
    diff = thresh+1.
    while diff>thresh:
        #compute membership and distances between obs and code_book
        obs_code, distort = vq(obs,code_book)
        avg_dist.append(scipy.mean(distort,axis=-1))
        #recalc code_book as centroids of associated obs
        if diff > thresh:
            has_members = []
            for i in arange(Nc):
                cell_members = compress(equal(obs_code,i),obs,0)
                if cell_members.shape[0] > 0:
                    code_book[i] = scipy.mean(cell_members,0)
                    has_members.append(i)
            #remove code_books that didn't have any members
            code_book = take(code_book,has_members,0)
        if len(avg_dist) > 1:
            diff = avg_dist[-2] - avg_dist[-1]
    #print avg_dist
    return code_book, avg_dist[-1]

def kmeans(obs,k_or_guess,iter=20,thresh=1e-5):
    """ Generate a code book with minimum distortion

    Description

    Arguments

        obs -- 2D array
                Each row of the array is an observation.  The
                columns are the "features" seen during each observation
                The features must be whitened first using the
                whiten function or something equivalent.
        k_or_guess -- integer or 2D array.
            If integer, it is the number of code book elements.
            If a 2D array, the array is used as the initial guess for
            the code book.  The array should have k rows, and the
            same number of columns (features) as the obs array.
        iter -- integer.
            The number of times to restart the kmeans algorithm with
            a new initial guess.  If k_or_guess is a 2D array (codebook),
            this argument is ignored and only 1 iteration is run.
        thresh -- float
            Terminate each kmeans run when the distortion change from
            one iteration to the next is less than this value.
    Outputs

        code_book -- 2D array.
            The codes that best fit the observations.
        distortion -- float
            The distortion between the observations and the codes.

    Reference

    Test

        ("Not checked carefully for accuracy..." he said sheepishly)

        >>> features  = array([[ 1.9,2.3],
        ...                    [ 1.5,2.5],
        ...                    [ 0.8,0.6],
        ...                    [ 0.4,1.8],
        ...                    [ 0.1,0.1],
        ...                    [ 0.2,1.8],
        ...                    [ 2.0,0.5],
        ...                    [ 0.3,1.5],
        ...                    [ 1.0,1.0]])
        >>> whitened = whiten(features)
        >>> book = array((whitened[0],whitened[2]))
        >>> kmeans(whitened,book)
        (array([[ 2.3110306 ,  2.86287398],
               [ 0.93218041,  1.24398691]]), 0.85684700941625547)

        >>> import RandomArray
        >>> RandomArray.seed(1000,2000)
        >>> codes = 3
        >>> kmeans(whitened,codes)
        (array([[ 2.3110306 ,  2.86287398],
               [ 1.32544402,  0.65607529],
               [ 0.40782893,  2.02786907]]), 0.5196582527686241)

    """
    if int(iter) < 1:
        raise ValueError, 'iter must be >= 1.'
    if type(k_or_guess) == type(array([])):
        guess = k_or_guess
        result = kmeans_(obs,guess,thresh=thresh)
    else:
        best_dist = 100000 #initialize best distance value to a large value
        No = obs.shape[0]
        k = k_or_guess
        #print 'kmeans iter: ',
        for i in range(iter):
            #print i,
            #the intial code book is randomly selected from observations
            guess = take(obs,randint(0,No,k),0)
            book,dist = kmeans_(obs,guess,thresh=thresh)
            if dist < best_dist:
                best_book = book
                best_dist = dist
        #print
        result = best_book,best_dist
    return result
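
# Example usage (added sketch, not part of the original module): the
# typical pipeline is whiten -> train with kmeans -> label with vq.
# New observations must be scaled by the std of the *training* data
# before calling vq on them.
if __name__ == '__main__':
    from RandomArray import standard_normal
    data = standard_normal((100, 3))    # 100 observations, 3 features
    wh = whiten(data)
    book, distortion = kmeans(wh, 4)    # train a 4-code book
    codes, dists = vq(wh, book)         # assign each obs to a code
    print 'distortion:', distortion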





