[SciPy-user] PyEM: custom (non-Euclidean) distance function?

David Warde-Farley dwf@cs.toronto....
Mon Mar 16 18:39:13 CDT 2009


On 16-Mar-09, at 12:46 PM, Emanuele Olivetti wrote:

> You are right. I'm coming from K-means (MacKay's book) and
> moving to GMM, that's why I had in mind custom distances.
> I can try to transform my data to be meaningful under Euclidean
> distance.


Better yet, figure out what distribution might be the right one to use  
in the one-component case. What sort of data are you working with?  
MoGs aren't a magic bullet, and you might be better off putting some  
careful consideration into the form your data takes and choosing an  
appropriate base distribution.

Pretty much any parametric distribution can be turned into a mixture  
distribution. The way a (finite) mixture works in the general case is  
that you have a discrete "hidden" random variable C that takes on  
values corresponding to one of the N clusters, and then N separate  
distributions from a parametric family (you can mix families too but  
that gets complicated and is rarely useful). Mixtures of Bernoulli,  
multinomial, Gamma, and Poisson distributions (for example) are all  
fairly common. EM will work for all of these cases, and many more; it  
relies on a fairly general set of assumptions, the details of which  
escape me at the moment.
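
To make the generative picture above concrete, here is a minimal sketch of
sampling from a finite mixture, assuming Poisson components. The names
sample_poisson and sample_mixture are made up for illustration (they are
not PyEM or SciPy functions), and only the Python standard library is used.

```python
import math
import random


def sample_poisson(rng, rate):
    """One Poisson draw via Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def sample_mixture(n, weights, rates, seed=0):
    """Draw n values from a mixture of Poissons; also return the hidden labels."""
    rng = random.Random(seed)
    data, labels = [], []
    for _ in range(n):
        # Hidden variable C: choose a component according to the mixing weights.
        c = rng.choices(range(len(weights)), weights=weights)[0]
        # Observed value: draw from the distribution that C selected.
        labels.append(c)
        data.append(sample_poisson(rng, rates[c]))
    return data, labels


if __name__ == "__main__":
    data, labels = sample_mixture(10, weights=[0.3, 0.7], rates=[2.0, 10.0])
    print(data)
    print(labels)
```

In a real clustering problem only data is observed; the labels are exactly
what the hidden variable C stands for. Swapping sample_poisson for another
sampler gives a mixture over a different parametric family.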

The machinery of the EM algorithm is much the same for any choice of  
parametric family; the difference is how you compute (or estimate) the  
posterior over C, and how you then solve for the maximum likelihood  
estimate given the (expected) "complete" data. MacKay's book should  
have a fairly general treatment of this, but if it doesn't, I know one  
is presented in Bishop (2006), http://tinyurl.com/dmkxe5 , and in  
various online course notes, for example see Andrew Ng's course at  
Stanford http://www.stanford.edu/class/cs229/materials.html or Max  
Welling's site: http://www.ics.uci.edu/~welling/classnotes/classnotes.html
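
The two steps described above can be sketched for one concrete family. This
is a rough illustration only, assuming Poisson components, a fixed number of
iterations, and a crude deterministic initialisation; em_poisson_mixture and
log_poisson_pmf are hypothetical names, not PyEM's API. For another family
you would replace the log-density in the E-step and the parameter update in
the M-step.

```python
import math


def log_poisson_pmf(x, rate):
    """Log of the Poisson probability mass function at a non-negative integer x."""
    return x * math.log(rate) - rate - math.lgamma(x + 1)


def em_poisson_mixture(data, k, n_iter=200):
    """Fit a k-component Poisson mixture by EM (assumes len(data) >= k)."""
    n = len(data)
    ordered = sorted(data)
    # Crude start: the means of k contiguous chunks of the sorted data.
    chunks = [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]
    rates = [max(sum(ch) / len(ch), 1e-3) for ch in chunks]
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior over the hidden component C for every data point,
        # computed in log space and normalised to sum to one.
        resp = []
        for x in data:
            logs = [math.log(w) + log_poisson_pmf(x, r)
                    for w, r in zip(weights, rates)]
            top = max(logs)
            unnorm = [math.exp(v - top) for v in logs]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: maximum likelihood estimates given the expected assignments.
        # For a Poisson this is a responsibility-weighted mean.
        for j in range(k):
            nj = sum(row[j] for row in resp)
            weighted_sum = sum(row[j] * x for row, x in zip(resp, data))
            weights[j] = max(nj / n, 1e-12)  # floor keeps log() finite
            rates[j] = max(weighted_sum / max(nj, 1e-12), 1e-3)
    return weights, rates


if __name__ == "__main__":
    counts = [1, 2, 3, 2, 1, 0, 2, 3, 1, 2, 10, 12, 9, 11, 13, 10, 8, 12, 11, 10]
    print(em_poisson_mixture(counts, k=2))
```

A practical implementation would also monitor the log-likelihood for
convergence and try several initialisations, since EM only finds a local
optimum.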

I haven't ever used PyEM so I don't know how general David's code is,  
but it might be a helpful guide.

David


