[SciPy-user] Fitting an arbitrary distribution

David Cournapeau david@ar.media.kyoto-u.ac...
Thu May 21 21:23:00 CDT 2009

josef.pktd@gmail.com wrote:
> On Thu, May 21, 2009 at 9:58 PM, David Cournapeau
> <david@ar.media.kyoto-u.ac.jp> wrote:
>> David Baddeley wrote:
>>> Hi all,
>>> I want to fit an arbitrary distribution (in this case the sum of multiple Gaussians) to some measured data and was wondering if anyone could give me any pointers as to the best way of doing this. I'd like to avoid fitting to a histogram if possible. How do the .fit() methods of the various distributions under scipy.stats do it? My first thought would be to compare the cumulative distribution of my data with that of the model distribution using something like the Kolmogorov-Smirnov metric (maximum absolute distance between the curves) and to minimize this using optimize.fmin. Is this the right way to do it? Or is there an easier way?
>> That's a complex topic in general. There is no best answer: it depends
>> on your case, and on what you intend to do with the estimated distribution.
>> In the case of a sum of multiple Gaussians, the more commonly used name
>> for this model is a mixture model, and there is a vast range of possible
>> techniques for fitting this model to a dataset. There is a package in
>> scikits.learn that uses the so-called Expectation Maximization algorithm to
>> compute maximum likelihood estimates for such models:
>> http://www.ar.media.kyoto-u.ac.jp/members/david/softwares/em/
>> You can have an overview on the wiki page:
>> http://en.wikipedia.org/wiki/Mixture_model
> Sums of random variables are convolutions, and are very different from
> mixtures of distributions. I just got confused in a discussion today
> when the other person talked about convolutions and I thought about
> mixtures and it didn't make a lot of sense.

It depends on what is meant by "sum of Gaussians": a sum of the random
variables or a sum of the distributions. In the case of a sum of random
variables, it is a convolution, as you mentioned (assuming
independence of the random variables). But I think some people think
mostly in terms of histograms/distributions, especially if they are not
statisticians. I don't understand "sum of Gaussians" as a
technical term.
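
To make the two readings concrete, here is a minimal sketch using only
NumPy. The component parameters and the mixture weight are made-up
illustrative values, not anything taken from the original question. It
draws samples for (a) the sum of two independent Gaussian random
variables and (b) a two-component Gaussian mixture, and compares their
sample moments with the textbook values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical component parameters, chosen only for illustration.
mu1, sigma1 = -2.0, 0.5
mu2, sigma2 = 3.0, 1.0
w = 0.3  # assumed mixture weight of the first component

x1 = rng.normal(mu1, sigma1, n)
x2 = rng.normal(mu2, sigma2, n)

# (a) Sum of independent random variables: the density of s is the
# convolution of the two component densities, which is again a single
# Gaussian with mean mu1 + mu2 and variance sigma1**2 + sigma2**2.
s = x1 + x2

# (b) Mixture: each sample comes from component 1 with probability w and
# from component 2 otherwise. The density is w*p1(x) + (1 - w)*p2(x),
# which is bimodal for these parameters.
pick = rng.random(n) < w
m = np.where(pick, x1, x2)

sum_mean, sum_var = s.mean(), s.var()
mix_mean, mix_var = m.mean(), m.var()

# Theoretical moments for comparison.
sum_mean_th = mu1 + mu2
sum_var_th = sigma1**2 + sigma2**2
mix_mean_th = w * mu1 + (1 - w) * mu2
mix_var_th = (w * (sigma1**2 + mu1**2)
              + (1 - w) * (sigma2**2 + mu2**2)
              - mix_mean_th**2)

print("sum:     mean %.3f (theory %.3f), var %.3f (theory %.3f)"
      % (sum_mean, sum_mean_th, sum_var, sum_var_th))
print("mixture: mean %.3f (theory %.3f), var %.3f (theory %.3f)"
      % (mix_mean, mix_mean_th, mix_var, mix_var_th))
```

The sum stays unimodal, while the mixture has two well-separated modes
and a much larger variance, so the two models are not interchangeable
when fitting data.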

