[SciPy-user] PyEM: custom (non-Euclidean) distance function?
David Warde-Farley
dwf@cs.toronto....
Mon Mar 16 18:39:13 CDT 2009
On 16-Mar-09, at 12:46 PM, Emanuele Olivetti wrote:
> You are right. I'm coming from K-means (MacKay's book) and
> moving to GMM, that's why I had in mind custom distances.
> I can try to transform my data to be meaningful under Euclidean
> distance.
Better yet, figure out what distribution might be the right one to use
in the one-component case. What sort of data are you working with?
MoGs aren't a magic bullet, and you might be better off putting some
careful consideration into the form your data takes and choosing an
appropriate base distribution.
Pretty much any parametric distribution can be turned into a mixture
distribution. The way a (finite) mixture works in the general case is
that you have a discrete "hidden" random variable C that takes on
values corresponding to one of the N clusters, and then N separate
distributions from a parametric family (you can mix families too but
that gets complicated and is rarely useful). Mixtures of Bernoulli,
multinomial, Gamma, and Poisson distributions (for example) are all
fairly common. EM will work for all of these cases, and many more; it
relies on a fairly general set of assumptions, the details of which
escape me at the moment.
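
To make the generative picture concrete, here is a minimal sketch of sampling from a finite mixture, assuming a Poisson base distribution and NumPy. The function name sample_poisson_mixture and its parameters are made up for illustration; none of this comes from PyEM.

```python
import numpy as np

def sample_poisson_mixture(n, weights, rates, seed=0):
    """Draw n points from a finite mixture of Poisson distributions.

    weights: mixing proportions p(C = j), must sum to 1 (illustrative name)
    rates:   Poisson rate for each component (illustrative name)
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    rates = np.asarray(rates, dtype=float)
    # Step 1: draw the hidden discrete variable C for every point.
    c = rng.choice(len(weights), size=n, p=weights)
    # Step 2: draw each observation from the component its C selected.
    x = rng.poisson(rates[c])
    return x, c

x, c = sample_poisson_mixture(1000, weights=[0.3, 0.7], rates=[2.0, 10.0])
```

In real data only x is observed; c is the "hidden" part that EM has to reason about. Swapping rng.poisson for another sampler gives a mixture over a different family.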
The machinery of the EM algorithm is much the same for any choice of
parametric family; the difference is how you compute (or estimate) the
posterior over C, and how you then solve for the maximum likelihood
estimate given the (expected) "complete" data. MacKay's book should
have a fairly general treatment of this, but if it doesn't, I know one
is presented in Bishop (2006), http://tinyurl.com/dmkxe5 , and in
various online course notes, for example see Andrew Ng's course at
Stanford http://www.stanford.edu/class/cs229/materials.html or Max
Welling's site: http://www.ics.uci.edu/~welling/classnotes/classnotes.html
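
As a rough illustration of the two steps described above, here is a sketch of EM for a mixture of Poissons in plain NumPy. This is an assumption-laden toy, not PyEM's code or API: the name em_poisson_mixture, the quantile-based initialisation, and the fixed iteration count are all my own choices for the example.

```python
import numpy as np

def em_poisson_mixture(x, k, n_iter=200):
    """Fit a k-component Poisson mixture by EM (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    # Assumed initialisation: equal weights, rates spread over data quantiles.
    weights = np.full(k, 1.0 / k)
    rates = np.quantile(x, (np.arange(k) + 0.5) / k) + 0.5
    for _ in range(n_iter):
        # E-step: posterior over C for each point, computed in log space.
        # The -log(x!) term of the Poisson log-pmf is the same for every
        # component, so it cancels in the posterior and is dropped here.
        log_r = np.log(weights) + x[:, None] * np.log(rates) - rates
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximum likelihood estimates given the expected
        # "complete" data, i.e. responsibility-weighted counts and means.
        nk = resp.sum(axis=0) + 1e-12  # guard against an empty component
        weights = nk / nk.sum()
        rates = np.maximum((resp * x[:, None]).sum(axis=0) / nk, 1e-12)
    return weights, rates, resp
```

For a different family, only the log-density line in the E-step and the parameter update in the M-step would change; the rest of the loop stays the same.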
I haven't ever used PyEM so I don't know how general David's code is,
but it might be a helpful guide.
David