[SciPy-user] curve_fit failure when supplying uncertainties

josef.pktd@gmai... josef.pktd@gmai...
Sat Mar 7 21:22:15 CST 2009

On Sat, Mar 7, 2009 at 8:52 PM, Anne Archibald
<peridot.faceted@gmail.com> wrote:
> 2009/3/7 ElMickerino <elmickerino@hotmail.com>:
>> Great, thanks so much for your help.  I seem to recall that empty bins
>> require special handling when fitting histograms, so this makes sense.  I
>> was using variance of the bin content, which are assumed to poisson
>> distributed (ie. if bin 'i' has content N_i, then var(bin_i) = sqrt(N_i) ).
>> This works fine for large N, but I believe that asymmetric errors must be
>> used for small bin content.
>> Is there a convenient way of supplying asymmetric errors to a least square
>> fit in scipy?
> If you get into small-number statistics like this, least-squares
> fitting will no longer give you the right answer. If the issue is just
> a few bins with low values you might be able to get away with
> discarding them from the fit, but that's very iffy statistically. You
> can do a maximum-likelihood fit directly, by constructing the
> likelihood (-log probability of your data) using scipy.stats.poisson.
> Moreover, if you're doing this to a histogram, you may want to think
> about avoiding the binning entirely and fitting the unbinned data
> directly, possibly going for a maximum probability on the K-S test
> (also in scipy, though you may want to look at how it works and write
> your own for fitting purposes).
> However you do it I strongly advise doing a round-trip: take the
> best-fit model, generate some fake data sets, and fit them using the
> same code. Not only does this check that your fitting code does what
> you think it does, it also gives you an estimate of the errors on the
> fitted parameters.
> Anne

Depends on what you want to estimate.
If you want a known parametric distribution, then you are better of
using maximum likelihood, or moment estimator or other estimators of
the parameters. For unbounded support, the fit method in stats
distribution works quite well (maximum likelihood estimator).

If you want a non-parametric distribution, e.g. by fitting some
arbitrary function or polynomial to the histogram, then the recent
discussion by mcsfl: "histogram q's" with reference to
Freedman-Diaconis might be helpful. It gives the optimal binsize for a
quadratic estimation loss, the normalization is sample size * binsize,
which is the same as your integral, Freedman-Diaconis give the
variance in terms of the true distribution. (This is based on quick
reading and no guarantee that I didn't misread things, and until
recently I didn't even know that there is a theory behind histograms.)

Hx = data/float(nGen)/widths[0]
 #in analogy to Freedman-Diaconis theoretical variance, but using
histogram values:
curve_fit(gaus, centers, data/float(nGen)/widths[0], sigma=1e-9+sig)

an idea is to drop empty bins, to avoid 0 variance problem

>>> idx = (data/float(nGen)/widths[0]>0)
>>> curve_fit(gaus, centers[idx], (data/float(nGen)/widths[0])[idx], sigma=sig[idx])
(array([ 0.99628941, -0.01978185,  1.00475741]), array([[
4.97224664e-12,   9.23881182e-15,   1.06746682e-13],
       [  9.23881182e-15,   5.26682373e-12,   1.33888239e-13],
       [  1.06746682e-13,   1.33888239e-13,   2.72832368e-12]]))

which is close to the moments of the sample
>>> y.mean()
>>> y.std()

An alternative non-parametric approach is to use kernel density
estimate in stats.kde


More information about the SciPy-user mailing list