[SciPy-user] Fitting an arbitrary distribution

David Warde-Farley dwf@cs.toronto....
Thu May 21 23:46:43 CDT 2009


On 21-May-09, at 11:47 PM, David Baddeley wrote:

> Thanks for the prompt replies!
>
> I guess what I was meaning was that the PDF / histogram was the sum  
> or multiple Gaussians/normal distibutions. Sorry about the  
> ambiguity. I've had a quick look at the Em package and mixture  
> models, and while my problem is similar they might be a little more  
> general.
>
> I guess I should describe the problem in a bit more detail - I'm  
> measuring the length of an objects which can be built up from  
> multiple unit cells. The measured size distribution is thus  
> multimodal, and I want to extract both the unit size and the  
> fraction of objects having each number of unit cells. This makes the  
> problem much more constrained than what is dealt with in the Em  
> package.

So there is exactly one kind of 'unit cell' and then different lengths  
of objects? Are your observations expected to be particularly noisy?

What you're staring down isn't quite a standard mixture model, your   
'hidden variable' is just this unit size.

Depending on the scale of your problem PyMC would work very well here.  
You'd have one Stochastic for the unit size, another for the integer  
multiple associated with each quantity, a Deterministic that  
multiplies the two and either make the Deterministic observed or add  
another Stochastic with the Deterministic as its mean if you believe  
your observations are noisy with a certain noise distribution.

Note that since these things you are measuring are supposedly discrete  
multiples of the unit size a Gaussian distribution isn't appropriate  
for the multiples. Something like a Poisson would make more sense.

Then you'd just fit the model and look at the posterior distributions  
over each quantity of interest, taking the maximum a posteriori  
(posterior mean) estimate where appropriate.  To determine the  
fraction of the population that have a given number of unit cells, you  
basically just count (you'd have an estimate for each observation of  
how many unit cells it has).

You could also do this by EM, but pyem would not be suitable, as it's  
built specifically for the case of vanilla Gaussian mixtures. PyMC  
would be a ready-made solution which would give you the additional  
flexibility of inferring a distribution over all estimated parameters  
rather than just a point-estimate.

David


More information about the SciPy-user mailing list