# [SciPy-user] Fitting an arbitrary distribution

David Warde-Farley dwf@cs.toronto....
Thu May 21 23:46:43 CDT 2009

```On 21-May-09, at 11:47 PM, David Baddeley wrote:

> Thanks for the prompt replies!
>
> I guess what I was meaning was that the PDF / histogram was the sum
> or multiple Gaussians/normal distibutions. Sorry about the
> ambiguity. I've had a quick look at the Em package and mixture
> models, and while my problem is similar they might be a little more
> general.
>
> I guess I should describe the problem in a bit more detail - I'm
> measuring the length of an objects which can be built up from
> multiple unit cells. The measured size distribution is thus
> multimodal, and I want to extract both the unit size and the
> fraction of objects having each number of unit cells. This makes the
> problem much more constrained than what is dealt with in the Em
> package.

So there is exactly one kind of 'unit cell' and then different lengths
of objects? Are your observations expected to be particularly noisy?

What you're staring down isn't quite a standard mixture model, your
'hidden variable' is just this unit size.

Depending on the scale of your problem PyMC would work very well here.
You'd have one Stochastic for the unit size, another for the integer
multiple associated with each quantity, a Deterministic that
multiplies the two and either make the Deterministic observed or add
another Stochastic with the Deterministic as its mean if you believe
your observations are noisy with a certain noise distribution.

Note that since these things you are measuring are supposedly discrete
multiples of the unit size a Gaussian distribution isn't appropriate
for the multiples. Something like a Poisson would make more sense.

Then you'd just fit the model and look at the posterior distributions
over each quantity of interest, taking the maximum a posteriori
(posterior mean) estimate where appropriate.  To determine the
fraction of the population that have a given number of unit cells, you
basically just count (you'd have an estimate for each observation of
how many unit cells it has).

You could also do this by EM, but pyem would not be suitable, as it's
built specifically for the case of vanilla Gaussian mixtures. PyMC