[SciPy-User] [Numpy-discussion] Fitting a curve on a log-normal distributed data

Robert Kern robert.kern@gmail....
Tue Nov 17 15:12:17 CST 2009


On Tue, Nov 17, 2009 at 15:01,  <josef.pktd@gmail.com> wrote:
> On Tue, Nov 17, 2009 at 3:41 PM, Robert Kern <robert.kern@gmail.com> wrote:
>> On Tue, Nov 17, 2009 at 14:04,  <josef.pktd@gmail.com> wrote:
>>
>>> The way I see it, you have two variables, size and counts (or concentration).
>>> My initial interpretation was you want to model the relationship between
>>> these two variables.
>>> When the total number of particles is fixed, then the conditional size
>>> distribution is univariate, and could be modeled by a log-normal
>>> distribution. (This still leaves the total count unmodelled.)
>>>
>>> If you have the total particle count per bin, then it
>>> should be possible to write down the likelihood function that is
>>> discretized to the bins from the continuous distribution.
>>> Given a random particle, what's the probability of being in bin 1,
>>> bin 2 and so on. Then add the log-likelihood over all particles
>>> and maximize as a function of the log-normal parameters.
>>> (There might be a numerical trick using fractions instead of
>>> conditional counts, but I'm not sure what the analogous discrete
>>> distribution would be.)
>>
>> I usually use the multinomial as the likelihood for such
>> "histogram-fitting" exercises. The two problem points here are that we
>> have real-valued concentrations, not integer-valued counts, and that
>> we don't have a measurement for the censored region. For the former, I
>> would suggest simply multiplying the concentrations by a factor of
>> 10^n (equivalently, changing the units to particles/<10^n larger
>> volume>) such that the resolution of the measurements is 1
>> particle/<volume>. Then just apply the multinomial. It should be a
>> close enough approximation.
>>
>> I'm not entirely sure what to do about the censored probability mass.
>> I think there might be a simple correction factor that you can apply
>> to the multinomial likelihood, but I haven't worked it out.
>
> I think, for the continuous distribution it would be just dividing by
> the probability of the not-censored region (which is also a function of
> the distribution parameters). This would then just be a truncated
> log-normal. The multinomial might work the same way, as long as the
> probabilities are defined by the discretization.
>
> Would you apply the multinomial directly? I don't see in that case
> how you would recover the parameters of the continuous distribution.

You would just be using the multinomial to build the likelihood. For
each iteration in the likelihood maximization, you are given the
parameters of the continuous distribution. Given the bin edges and
those parameters, you compute the probability mass within each bin for
that specific distribution (the difference of the CDF between bin
edges). That is the p-vector for the multinomial. The probability of
getting the observed counts is the likelihood for the given parameters
of the continuous distribution.
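
The construction above can be sketched in Python roughly as follows. This
is only a minimal illustration under stated assumptions, not code from the
thread: the function name binned_neg_loglike, the synthetic data, the bin
edges, and the starting guess are all made up for the example, and the
bins are assumed to cover the whole range from 0 to infinity, so there is
no censored region yet. The log-normal is parameterized through
scipy.stats.lognorm with s=sigma and scale=exp(mu).

```python
import numpy as np
from scipy import optimize, stats


def binned_neg_loglike(params, edges, counts):
    """Negative multinomial log-likelihood of binned counts, up to a constant."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # optimize log(sigma) so that sigma stays positive
    # Probability mass in each bin: difference of the CDF between bin edges.
    cdf = stats.lognorm.cdf(edges, s=sigma, scale=np.exp(mu))
    p = np.clip(np.diff(cdf), 1e-300, None)  # guard against log(0)
    # The multinomial coefficient does not depend on (mu, sigma) here,
    # so it is dropped from the objective.
    return -np.sum(counts * np.log(p))


# Synthetic demo data (an assumption for illustration): mu = 1.0, sigma = 0.5.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=1.0, sigma=0.5, size=20000)
edges = np.array([0.0, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0, 7.0, 10.0, np.inf])
counts, _ = np.histogram(sample, bins=edges)

result = optimize.minimize(
    binned_neg_loglike,
    x0=[0.5, 0.0],  # rough starting guess for (mu, log(sigma))
    args=(edges, counts),
    method="Nelder-Mead",
    options={"maxiter": 2000},
)
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print("mu_hat =", mu_hat, "sigma_hat =", sigma_hat)
```

Nelder-Mead is used here only because it needs no gradients; any
general-purpose optimizer should do for a two-parameter problem like this.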

And now that I think about it, you don't need to apply any correction
to the multinomial in the likelihood. The number of counts in the
censored region is just another unknown parameter to optimize along
with the continuous distribution's parameters.
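
One possible way to write this down in Python is sketched below. It is an
illustration under assumptions, not a worked-out method from the thread:
the name censored_neg_loglike, the synthetic data, and the bin edges are
hypothetical, the censored region is taken to be everything outside the
outermost bin edges, and the unknown censored count is relaxed to a
positive real number so that a continuous optimizer can be used. Because
the multinomial coefficient now depends on that unknown count, it is kept
in the objective, with gammaln(x + 1) standing in for log(x!).

```python
import numpy as np
from scipy import optimize, stats
from scipy.special import gammaln


def censored_neg_loglike(params, edges, counts):
    """Negative multinomial log-likelihood with an unknown censored count."""
    mu, log_sigma, log_m = params
    sigma = np.exp(log_sigma)
    m = np.exp(log_m)  # censored count, relaxed to a positive real number
    cdf = stats.lognorm.cdf(edges, s=sigma, scale=np.exp(mu))
    p_obs = np.clip(np.diff(cdf), 1e-300, None)
    # Everything outside [edges[0], edges[-1]] is treated as one censored cell.
    p_cens = np.clip(1.0 - (cdf[-1] - cdf[0]), 1e-300, None)
    # Log of the multinomial coefficient, including the censored cell.
    log_coef = (
        gammaln(counts.sum() + m + 1)
        - gammaln(counts + 1).sum()
        - gammaln(m + 1)
    )
    log_like = log_coef + np.sum(counts * np.log(p_obs)) + m * np.log(p_cens)
    return -log_like


# Synthetic demo data (an assumption for illustration): mu = 1.0, sigma = 0.5,
# with every particle smaller than 1.5 unobserved.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=1.0, sigma=0.5, size=20000)
edges = np.array([1.5, 2.0, 2.5, 3.0, 4.0, 5.0, 7.0, 10.0, np.inf])
counts, _ = np.histogram(sample, bins=edges)  # values below 1.5 are ignored
true_censored = int(np.sum(sample < edges[0]))

result = optimize.minimize(
    censored_neg_loglike,
    x0=[0.5, 0.0, np.log(0.1 * counts.sum())],  # rough starting guess
    args=(edges, counts),
    method="Nelder-Mead",
    options={"maxiter": 5000, "maxfev": 5000},
)
mu_hat, sigma_hat, m_hat = result.x[0], np.exp(result.x[1]), np.exp(result.x[2])
print("mu_hat =", mu_hat, "sigma_hat =", sigma_hat)
print("estimated censored count =", m_hat, "actual =", true_censored)
```

In the synthetic setup the actual censored count is known, so the last
line gives a quick sanity check on the estimate; with real measurements
there is nothing to compare against.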

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco

