[SciPy-User] [Numpy-discussion] Fitting a curve on a log-normal distributed data
Tue Nov 17 15:11:59 CST 2009
I guess it depends on how accurately you want to estimate the missing bin, and whether you can get any information about the amount of error in the individual measurements. Just looking at the curve you posted, the variability at low particle sizes appears to be a lot higher than at larger particle sizes. Although you would expect a similar effect due to the Poisson nature of counting, I'd expect it to be smaller. This might suggest that there is additional structure in your size distribution at these sizes, and that the best you can hope for with a log-normal model is a fairly rough approximation.
If this is the case, I suspect you might be able to get away with just doing a least-squares fit of a log-normal model function to your measured values, potentially with weights that reflect the estimated error in each bin (obtained either by taking the std. deviation of repeated measurements, or by analysing the noise characteristics of the measurement instrument). Although it's not strictly optimal, and you ought to be aware of the potential hiccups, it's often good enough for the task at hand (I use it routinely to fit 2D Gaussians to point-like objects in image data).
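A minimal sketch of that weighted least-squares fit with scipy.optimize.curve_fit; the bin sizes, concentrations, and per-bin errors below are made-up placeholders, not anyone's real data:

```python
import numpy as np
from scipy.optimize import curve_fit

def lognormal_model(x, amp, mu, sigma):
    """Log-normal-shaped curve; amp, mu, sigma are the fit parameters."""
    return amp / (x * sigma * np.sqrt(2 * np.pi)) * np.exp(
        -(np.log(x) - mu) ** 2 / (2 * sigma ** 2))

# Hypothetical bin centres and "measured" concentrations (here generated
# from the model itself, purely for illustration).
sizes = np.array([0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0])
conc = lognormal_model(sizes, 100.0, 0.0, 1.0)

# Per-bin uncertainties, e.g. std. deviation of repeated measurements;
# curve_fit uses `sigma` to weight the least-squares objective.
errors = np.array([5.0, 4.0, 2.0, 1.0, 1.0, 1.0, 1.0])

popt, pcov = curve_fit(lognormal_model, sizes, conc,
                       p0=[80.0, 0.2, 0.8], sigma=errors)
```

The diagonal of `pcov` gives rough variance estimates for the fitted parameters, which is often all you need to judge how trustworthy the extrapolation into the missing bin is.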
----- Original Message ----
From: Robert Kern <firstname.lastname@example.org>
To: SciPy Users List <email@example.com>
Sent: Wed, 18 November, 2009 9:41:47 AM
Subject: Re: [SciPy-User] [Numpy-discussion] Fitting a curve on a log-normal distributed data
On Tue, Nov 17, 2009 at 14:04, <firstname.lastname@example.org> wrote:
> The way I see it, you have two variables, size and counts (or concentration).
> My initial interpretation was you want to model the relationship between
> these two variables.
> When the total number of particles is fixed, then the conditional size
> distribution is univariate, and could be modeled by a log-normal
> distribution. (This still leaves the total count unmodelled.)
> If you have the total particle count per bin, then it
> should be possible to write down the likelihood function that is
> discretized to the bins from the continuous distribution.
> Given a random particle, what's the probability of being in bin 1,
> bin 2 and so on. Then add the log-likelihood over all particles
> and maximize as a function of the log-normal parameters.
> (There might be a numerical trick using fraction instead of
> conditional count, but I'm not sure what the analogous discrete
> distribution would be. )
I usually use the multinomial as the likelihood for such
"histogram-fitting" exercises. The two problem points here are that we
have real-valued concentrations, not integer-valued counts, and that
we don't have a measurement for the censored region. For the former, I
would suggest simply multiplying the concentrations by a factor of
10 (equivalently, changing the units to particles/<10^n larger
volume>) such that the resolution of the measurements is 1
particle/<volume>. Then just apply the multinomial. It should be a
close enough approximation.
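A rough sketch of that multinomial approach, with hypothetical bin edges and concentrations; the bin probabilities come from differences of the log-normal CDF, and the probabilities are renormalized over the measured range (one common way to handle the truncation, though not necessarily the correction Robert has in mind):

```python
import numpy as np
from scipy import stats, optimize

# Hypothetical bin edges (particle sizes) and measured concentrations.
edges = np.array([0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0])
conc = np.array([3.2, 8.5, 12.1, 9.4, 4.0, 1.1])

# Rescale so the measurement resolution is ~1 particle per (larger)
# volume, then round to pseudo-counts for the multinomial approximation.
counts = np.round(conc * 10).astype(int)

def neg_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    # Probability mass in each bin under the log-normal, renormalized
    # to the observed range (a conditional/truncated likelihood).
    cdf = stats.lognorm.cdf(edges, s=sigma, scale=np.exp(mu))
    p = np.diff(cdf)
    p = p / p.sum()
    # Multinomial log-likelihood, up to a parameter-free constant.
    return -np.sum(counts * np.log(p))

res = optimize.minimize(neg_log_likelihood, x0=[0.0, 1.0],
                        method="Nelder-Mead")
mu_hat, sigma_hat = res.x
```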
I'm not entirely sure what to do about the censored probability mass.
I think there might be a simple correction factor that you can apply
to the multinomial likelihood, but I haven't worked it out.
> Once the parameters of the log-normal distribution are
> estimated, the distribution would be defined over all of
> the real line (where the out of sample pdf is determined
> by assumption not data).
Since we are extrapolating into the censored region, it would probably
be a good idea to estimate the uncertainty of the estimate. I would
suggest using PyMC to build a Bayesian model. A parametric
bootstrap might also serve.
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco