# [SciPy-user] Fitting an arbitrary distribution

Thu May 21 22:47:21 CDT 2009

```Thanks for the prompt replies!

I guess what I meant was that the PDF / histogram is the sum of multiple Gaussian/normal distributions. Sorry about the ambiguity. I've had a quick look at the Em package and mixture models, and while my problem is similar they might be a little more general.

I guess I should describe the problem in a bit more detail - I'm measuring the length of objects which can be built up from multiple unit cells. The measured size distribution is thus multimodal, and I want to extract both the unit size and the fraction of objects having each number of unit cells. This makes the problem much more constrained than what is dealt with in the Em package.

So far I've tried overriding rv_continuous to create a distribution which roughly matches - but haven't been able to fit this.

cheers,
David

----- Original Message ----
From: "scipy-user-request@scipy.org" <scipy-user-request@scipy.org>
To: scipy-user@scipy.org
Sent: Friday, 22 May, 2009 2:44:37 PM
Subject: SciPy-user Digest, Vol 69, Issue 43

Today's Topics:

1. Fitting an arbitrary distribution (David Baddeley)
2. Re: Fitting an arbitrary distribution (David Cournapeau)
3. Re: Inconsistent function calls? (Ivo Maljevic)
4. Re: Fitting an arbitrary distribution (josef.pktd@gmail.com)
5. Re: Fitting an arbitrary distribution (josef.pktd@gmail.com)
6. Re: Fitting an arbitrary distribution (David Cournapeau)
7. Re: Fitting an arbitrary distribution (josef.pktd@gmail.com)
8. Re: Fitting an arbitrary distribution (josef.pktd@gmail.com)

----------------------------------------------------------------------

Message: 1
Date: Thu, 21 May 2009 18:47:00 -0700 (PDT)
Subject: [SciPy-user] Fitting an arbitrary distribution
To: scipy-user@scipy.org
Message-ID: <36002.66689.qm@web33005.mail.mud.yahoo.com>
Content-Type: text/plain; charset=utf-8

Hi all,

I want to fit an arbitrary distribution (in this case the sum of multiple Gaussians) to some measured data and was wondering if anyone could give me any pointers as to the best way of doing this. I'd like to avoid fitting to a histogram if possible. How do the .fit() methods of the various distributions under scipy.stats do it? My first thought would be to compare the cumulative distribution of my data with that of the model distribution using something like the Kolmogorov-Smirnov metric (maximum absolute distance between the curves) and to minimize this using optimize.fmin. Is this the right way to do it? Or is there an easier way?

David

------------------------------

Message: 2
Date: Fri, 22 May 2009 10:58:06 +0900
From: David Cournapeau <david@ar.media.kyoto-u.ac.jp>
Subject: Re: [SciPy-user] Fitting an arbitrary distribution
<scipy-user@scipy.org>
Message-ID: <4A1606AE.1030008@ar.media.kyoto-u.ac.jp>
Content-Type: text/plain; charset=ISO-8859-1

> Hi all,
>
> I want to fit an arbitrary distribution (in this case the sum of multiple Gaussians) to some measured data and was wondering if anyone could give me any pointers as to the best way of doing this. I'd like to avoid fitting to a histogram if possible. How do the .fit() methods of the various distributions under scipy.stats do it? My first thought would be to compare the cumulative distribution of my data with that of the model distibution using something like the kolmogorov-smirnov metric (maximum absolute distance between the curves) and to minimize this using optimize.fmin. Is this the right way to do it? Or is there an easier way?

That's a complex topic in general, there is no best answer, it depends
on your case, and what you intend to do with the estimated distribution.

In the case of a sum of multiple Gaussians, the more commonly used name
for this model is a mixture model, and there is a vast range of possible
techniques for fitting a dataset to this model. There is a package in
scikits.learn that uses the so-called Expectation Maximization algorithm
to compute the maximum likelihood estimate of such models:

http://www.ar.media.kyoto-u.ac.jp/members/david/softwares/em/

You can have an overview on the wiki page:

http://en.wikipedia.org/wiki/Mixture_model

cheers,

David

------------------------------

Message: 3
Date: Thu, 21 May 2009 22:17:42 -0400
From: Ivo Maljevic <ivo.maljevic@gmail.com>
Subject: Re: [SciPy-user] Inconsistent function calls?
To: SciPy Users List <scipy-user@scipy.org>
Message-ID:
<826c64da0905211917u15ec1567g72547e6cff117535@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Sorry Christopher, I thought since they are used for the same purpose, and
have similar syntax (http://www.scipy.org/NumPy_for_Matlab_Users says
``MATLAB® and NumPy/SciPy have a lot in common``), that SciPy looks more
like Matlab than any other programming language (excluding Octave and other
Matlab clones).

As for everything else you wrote, I already said that I don't have any
problem with using SciPy the way it is.

Ivo

2009/5/21 Christopher Barker <Chris.Barker@noaa.gov>

> Ivo Maljevic wrote:
> > why bother to make something that looks like matlab,
>
> who ever said numpy "looks like matlab", any more than it looks like
> any number of other programming environments...
>
> > Matplotlib does a pretty good job at  replicating
> > matlab plot functions, at least at the level I need it to.
>
> Because it was designed exactly to do that -- but I think MPL's Matlab
> replicating has been a hindrance, rather than a help, to a good API.
> However, it has been a help to its adoption.
>
> You may have noticed that over the years MPL is moving away from matlab,
> toward a more pythonic API.
>
> Personally, I like python so much more than Matlab exactly for these
> differences (and so many more). I suppose it's tough if you switch back
> and forth, but I haven't touched Matlab in years.
>
> It is rand() that is inconsistent, and that is an accident of history.
>
> > what ones([3,3]) does, the same way random.rand(3,3) does,
>
> well, rand() is a convenience function, and doesn't take a bunch of
> other parameters.  In fact, it's listed under "Compatibility functions",
> and is really a wrapper for:
>
> numpy.random.uniform, which takes a shape argument.
>
> > the reason why I included that error message in my previous message
> > is because I think it is completely non-helpful.
>
> That's another issue -- non-helpful error messages do show up a lot --
> in that case, if the user had typed:
>
> np.zeros(3, dtype=3)
>
> the error message would make sense. If you can suggest a better message,
> patches are always welcome.
>
> -Chris
>
>
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker@noaa.gov
> _______________________________________________
> SciPy-user mailing list
> SciPy-user@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>

------------------------------

Message: 4
Date: Thu, 21 May 2009 22:27:20 -0400
From: josef.pktd@gmail.com
Subject: Re: [SciPy-user] Fitting an arbitrary distribution
<scipy-user@scipy.org>
Message-ID:
<1cd32cbb0905211927l2ec6e3fbs1a5922b21bc966bd@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Thu, May 21, 2009 at 9:47 PM, David Baddeley
>
> Hi all,
>
> I want to fit an arbitrary distribution (in this case the sum of multiple Gaussians) to some measured data and was wondering if anyone could give me any pointers as to the best way of doing this. I'd like to avoid fitting to a histogram if possible. How do the .fit() methods of the various distributions under scipy.stats do it? My first thought would be to compare the cumulative distribution of my data with that of the model distibution using something like the kolmogorov-smirnov metric (maximum absolute distance between the curves) and to minimize this using optimize.fmin. Is this the right way to do it? Or is there an easier way?
>

I have an example script that tries to fit a dataset to all
distributions in scipy.stats

I use ksstat as distance metric.

If you have data with full support on the real line and look only at
those distributions, then the current fit method works pretty well.
Problems exist for distributions with a finite support boundary point.
And stats.distributions only has univariate distributions, there is no
support for multivariate distributions.
I have also written several extension distributions (also univariate
only), that are however not yet in scipy.

What exactly do you mean by "sum of multiple Gaussians"? If I take
it literally as the sum of several normally distributed random variables,
then the distribution would be just normal again.

would be better able to see if scipy.stats can handle them.

Josef

------------------------------

Message: 5
Date: Thu, 21 May 2009 22:33:12 -0400
From: josef.pktd@gmail.com
Subject: Re: [SciPy-user] Fitting an arbitrary distribution
To: SciPy Users List <scipy-user@scipy.org>
Message-ID:
<1cd32cbb0905211933x53ea5b88na2f64934c5f121c@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Thu, May 21, 2009 at 9:58 PM, David Cournapeau
<david@ar.media.kyoto-u.ac.jp> wrote:
>> Hi all,
>>
>> I want to fit an arbitrary distribution (in this case the sum of multiple Gaussians) to some measured data and was wondering if anyone could give me any pointers as to the best way of doing this. I'd like to avoid fitting to a histogram if possible. How do the .fit() methods of the various distributions under scipy.stats do it? My first thought would be to compare the cumulative distribution of my data with that of the model distibution using something like the kolmogorov-smirnov metric (maximum absolute distance between the curves) and to minimize this using optimize.fmin. Is this the right way to do it? Or is there an easier way?
>
> That's a complex topic in general, there is no best answer, it depends
> on your case, and what you intend to do with the estimated distribution.
>
> In the case of a sum of mutiple Gaussians, the more commonly used name
> for this model is mixture models, and there is a vast range of possible
> techniques for fitting a dataset to this model. There is a package in
> scikits.learn to use the so-called Expectation Maximization algorithm to
> estimate the maximum likelihood of such models
>
> http://www.ar.media.kyoto-u.ac.jp/members/david/softwares/em/
>
> You can have an overview on the wiki page:
>
> http://en.wikipedia.org/wiki/Mixture_model
>

Sums of random variables are convolutions, and are very different from
mixtures of distributions. I just got confused in a discussion today
when the other person talked about convolutions and I thought about
mixtures and it didn't make a lot of sense.

so, which is it?

Josef

------------------------------

Message: 6
Date: Fri, 22 May 2009 11:23:00 +0900
From: David Cournapeau <david@ar.media.kyoto-u.ac.jp>
Subject: Re: [SciPy-user] Fitting an arbitrary distribution
To: SciPy Users List <scipy-user@scipy.org>
Message-ID: <4A160C84.9050500@ar.media.kyoto-u.ac.jp>
Content-Type: text/plain; charset=ISO-8859-1

josef.pktd@gmail.com wrote:
> On Thu, May 21, 2009 at 9:58 PM, David Cournapeau
> <david@ar.media.kyoto-u.ac.jp> wrote:
>
>>
>>> Hi all,
>>>
>>> I want to fit an arbitrary distribution (in this case the sum of multiple Gaussians) to some measured data and was wondering if anyone could give me any pointers as to the best way of doing this. I'd like to avoid fitting to a histogram if possible. How do the .fit() methods of the various distributions under scipy.stats do it? My first thought would be to compare the cumulative distribution of my data with that of the model distibution using something like the kolmogorov-smirnov metric (maximum absolute distance between the curves) and to minimize this using optimize.fmin. Is this the right way to do it? Or is there an easier way?
>>>
>> That's a complex topic in general, there is no best answer, it depends
>> on your case, and what you intend to do with the estimated distribution.
>>
>> In the case of a sum of mutiple Gaussians, the more commonly used name
>> for this model is mixture models, and there is a vast range of possible
>> techniques for fitting a dataset to this model. There is a package in
>> scikits.learn to use the so-called Expectation Maximization algorithm to
>> estimate the maximum likelihood of such models
>>
>> http://www.ar.media.kyoto-u.ac.jp/members/david/softwares/em/
>>
>> You can have an overview on the wiki page:
>>
>> http://en.wikipedia.org/wiki/Mixture_model
>>
>>
>
> Sum of random variables are convolutions, and are very different from
> mixtures of distributions. I just got confused in a discussion today
> when the other person talked about convolutions and I thought about
> mixtures and it didn't make a lot of sense.
>

It depends on what is meant by sum of Gaussians: sum of the random
variables or sum of the distributions. In the case of the sum of random
variables, it is a convolution as you mentioned (assuming
independence of the random variables). But I think some people think
mostly in terms of histograms/distributions, especially if they are not
statisticians. I don't understand the term "sum of gaussians" as a
technical term.

David

------------------------------

Message: 7
Date: Thu, 21 May 2009 22:41:37 -0400
From: josef.pktd@gmail.com
Subject: Re: [SciPy-user] Fitting an arbitrary distribution
To: SciPy Users List <scipy-user@scipy.org>
Message-ID:
<1cd32cbb0905211941l5b7f6611g84aaedcd57150b9e@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Thu, May 21, 2009 at 10:33 PM,  <josef.pktd@gmail.com> wrote:
> On Thu, May 21, 2009 at 9:58 PM, David Cournapeau
> <david@ar.media.kyoto-u.ac.jp> wrote:
>>> Hi all,
>>>
>>> I want to fit an arbitrary distribution (in this case the sum of multiple Gaussians) to some measured data and was wondering if anyone could give me any pointers as to the best way of doing this. I'd like to avoid fitting to a histogram if possible. How do the .fit() methods of the various distributions under scipy.stats do it? My first thought would be to compare the cumulative distribution of my data with that of the model distibution using something like the kolmogorov-smirnov metric (maximum absolute distance between the curves) and to minimize this using optimize.fmin. Is this the right way to do it? Or is there an easier way?
>>
>> That's a complex topic in general, there is no best answer, it depends
>> on your case, and what you intend to do with the estimated distribution.
>>
>> In the case of a sum of mutiple Gaussians, the more commonly used name
>> for this model is mixture models, and there is a vast range of possible
>> techniques for fitting a dataset to this model. There is a package in
>> scikits.learn to use the so-called Expectation Maximization algorithm to
>> estimate the maximum likelihood of such models
>>
>> http://www.ar.media.kyoto-u.ac.jp/members/david/softwares/em/
>>
>> You can have an overview on the wiki page:
>>
>> http://en.wikipedia.org/wiki/Mixture_model
>>
>
> Sum of random variables are convolutions, and are very different from
> mixtures of distributions. I just got confused in a discussion today
> when the other person talked about convolutions and I thought about
> mixtures and it didn't make a lot of sense.
>
> so, which is it?
>

Actually, "Gaussians" is ambiguous in this context: does it mean a
random variable or refer to the density/distribution function?
A sum of random variables is very different from a (weighted) sum of
distribution functions, and both are possible interpretations of "sum
of Gaussians".

Josef

------------------------------

Message: 8
Date: Thu, 21 May 2009 22:44:30 -0400
From: josef.pktd@gmail.com
Subject: Re: [SciPy-user] Fitting an arbitrary distribution
To: SciPy Users List <scipy-user@scipy.org>
Message-ID:
Content-Type: text/plain; charset=ISO-8859-1

On Thu, May 21, 2009 at 10:23 PM, David Cournapeau
<david@ar.media.kyoto-u.ac.jp> wrote:
> josef.pktd@gmail.com wrote:
>> On Thu, May 21, 2009 at 9:58 PM, David Cournapeau
>> <david@ar.media.kyoto-u.ac.jp> wrote:
>>
>>>
>>>> Hi all,
>>>>
>>>> I want to fit an arbitrary distribution (in this case the sum of multiple Gaussians) to some measured data and was wondering if anyone could give me any pointers as to the best way of doing this. I'd like to avoid fitting to a histogram if possible. How do the .fit() methods of the various distributions under scipy.stats do it? My first thought would be to compare the cumulative distribution of my data with that of the model distibution using something like the kolmogorov-smirnov metric (maximum absolute distance between the curves) and to minimize this using optimize.fmin. Is this the right way to do it? Or is there an easier way?
>>>>
>>> That's a complex topic in general, there is no best answer, it depends
>>> on your case, and what you intend to do with the estimated distribution.
>>>
>>> In the case of a sum of mutiple Gaussians, the more commonly used name
>>> for this model is mixture models, and there is a vast range of possible
>>> techniques for fitting a dataset to this model. There is a package in
>>> scikits.learn to use the so-called Expectation Maximization algorithm to
>>> estimate the maximum likelihood of such models
>>>
>>> http://www.ar.media.kyoto-u.ac.jp/members/david/softwares/em/
>>>
>>> You can have an overview on the wiki page:
>>>
>>> http://en.wikipedia.org/wiki/Mixture_model
>>>
>>>
>>
>> Sum of random variables are convolutions, and are very different from
>> mixtures of distributions. I just got confused in a discussion today
>> when the other person talked about convolutions and I thought about
>> mixtures and it didn't make a lot of sense.
>>
>
> It depends on what is meant by sum of Gaussians: sum of the random
> variables or sum of the distribution. In the case of the sum of random
> variables, then it is a convolution as you mentioned (assuming
> independence of the random variables). But I think some people think
> mostly in terms of histogram/distributions, specially if they are not
> statisticians. I don't understand the term "sum of gaussians" as a
> technical term.
>

Yes, I agree, you were ahead of me on realizing this.

Josef

------------------------------


```
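A note on the constrained problem described in the reply at the top: if every object is k unit cells long, the component means can be tied to k * unit, so the mixture has a single location parameter plus a width and the fractions. The following is a minimal sketch, not code from the thread. It fits that model by maximum likelihood directly on the raw measurements (no histogram) using scipy.optimize.minimize. The function names, the single measurement width sigma shared by all components, the softmax parametrisation of the fractions and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp
from scipy.stats import norm


def unpack(params):
    """Split the flat parameter vector (hypothetical layout, chosen for this sketch)."""
    unit = params[0]
    sigma = np.exp(params[1])                      # optimise log(sigma) so sigma stays positive
    logits = np.concatenate(([0.0], params[2:]))   # first logit pinned to 0 for identifiability
    log_weights = logits - logsumexp(logits)       # log-softmax: fractions are positive, sum to 1
    return unit, sigma, log_weights


def neg_log_likelihood(params, data):
    """Negative log-likelihood of a Gaussian mixture with means 1*unit, 2*unit, ..."""
    unit, sigma, log_weights = unpack(params)
    k = np.arange(1, log_weights.size + 1)
    # log(w_k * N(x_i | k*unit, sigma)) for every sample i and component k
    log_terms = log_weights + norm.logpdf(data[:, None], loc=k * unit, scale=sigma)
    return -logsumexp(log_terms, axis=1).sum()


def fit_unit_mixture(data, n_components, unit_guess):
    """Return (unit, sigma, weights) estimated by maximum likelihood."""
    data = np.asarray(data, dtype=float)
    # Assumed starting point: width of 10% of the unit size, equal fractions.
    x0 = np.concatenate(([unit_guess, np.log(0.1 * unit_guess)],
                         np.zeros(n_components - 1)))
    result = minimize(neg_log_likelihood, x0, args=(data,), method="L-BFGS-B")
    unit, sigma, log_weights = unpack(result.x)
    return unit, sigma, np.exp(log_weights)


# Synthetic data (assumed values): objects made of 1, 2 or 3 cells of size 2.0.
rng = np.random.default_rng(0)
true_weights = np.array([0.5, 0.3, 0.2])
cell_counts = rng.choice([1, 2, 3], size=2000, p=true_weights)
data = rng.normal(cell_counts * 2.0, 0.15)

unit, sigma, weights = fit_unit_mixture(data, n_components=3, unit_guess=1.9)
print("unit size:", unit)
print("sigma:", sigma)
print("fractions:", weights)
```

The likelihood can have several local optima in the unit size, so the starting guess matters; reading it off the spacing of the peaks is usually close enough. If the spread grows with the number of cells, sigma would have to depend on k, which this sketch does not do.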