[SciPy-user] PyEM: custom (non-Euclidean) distance function?

Emanuele Olivetti emanuele@relativita....
Mon Mar 16 11:46:30 CDT 2009

josef.pktd@gmail.com wrote:
> On Mon, Mar 16, 2009 at 12:05 PM, Emanuele Olivetti
> <emanuele@relativita.com> wrote:
>> Emanuele Olivetti wrote:
>>> Hi All,
>>> I'm playing with PyEM [0] in scikits and would like to feed
>>> a dataset for which Euclidean distance is not supposed to
>>> work. So I'm wondering how simple it is to modify the code with
>>> a custom distance (e.g., 1-norm).
>> Additional info. My final goal is to run the EM algorithm
>> and estimate the Gaussian mixture from data, but assuming
>> a different distance function. I had a look at densities.py
>> which seems to be the relevant file for this question. I
>> can see the computation of Euclidean distance in:
>> - _scalar_gauss_den()
>> - _diag_gauss_den()
>> - _full_gauss_den()
>> So the question is: if I change those functions according to a
>> new distance function, is the EM estimation em.train() expected
>> to work meaningfully? Are there other parts of PyEM that assume
>> a Euclidean distance function?
>> Emanuele
> I don't know the answer, but I'm curious about your data and the
> problem that you cannot calculate Euclidean distance.
> The Gaussian mixture is based on the normal distribution for
> continuous random variables and as such uses euclidean distance, or a
> variant based on the covariance matrix to define the density function.
> This seems to me a conflict between trying to fit the data to a
> gaussian mixture if it doesn't allow gaussian distance calculations.
> If the data is really different, then a gaussian mixture might not be
> appropriate.
> From a quick look, gmm_em.py and gauss_mix.py are specialized to the
> normal distribution and fully parametric, and I'm not sure what
> distribution you get if you just change the distance function. And
> correctly allowing for other distributions would require more
> far-reaching changes than just changing the distance function, at
> least that is my impression.

You are right. I'm coming from K-means (MacKay's book) and
moving to GMM, that's why I had in mind custom distances.
I can try to transform my data to be meaningful under Euclidean distance.

Thanks anyway.
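
For reference, here is a minimal sketch of the point made above: a
Gaussian log-density is a function of a (scaled) squared Euclidean
distance, so replacing that distance with the 1-norm and renormalizing
gives a different distribution (a product of Laplace densities), not a
Gaussian. This is not PyEM code; the function names diag_gauss_logpdf
and diag_laplace_logpdf are hypothetical and only NumPy is assumed.

```python
import numpy as np


def diag_gauss_logpdf(x, mu, var):
    """Log-density of a Gaussian with diagonal covariance.

    Hypothetical helper (not part of PyEM). The data enter only through
    the variance-scaled squared Euclidean distance d2.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    d2 = np.sum((x - mu) ** 2 / var, axis=-1)
    log_norm = -0.5 * (x.shape[-1] * np.log(2.0 * np.pi) + np.sum(np.log(var)))
    return log_norm - 0.5 * d2


def diag_laplace_logpdf(x, mu, b):
    """Log-density obtained by swapping in a scaled 1-norm distance.

    Hypothetical helper (not part of PyEM). After renormalizing, this is
    a product of independent Laplace densities with scales b.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    b = np.asarray(b, dtype=float)
    d1 = np.sum(np.abs(x - mu) / b, axis=-1)
    log_norm = -np.sum(np.log(2.0 * b))
    return log_norm - d1


if __name__ == "__main__":
    pts = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
    print(diag_gauss_logpdf(pts, [0.0, 0.0], [1.0, 1.0]))
    print(diag_laplace_logpdf(pts, [0.0, 0.0], [1.0, 1.0]))
```

Because the second density is Laplace-shaped, an EM fit built on it
would also need different update formulas (for instance, the maximum
likelihood location of a Laplace density is a median rather than a
mean), so changing only the distance inside the density functions
would not be enough.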