[SciPy-User] How to estimate error in polynomial coefficients from scipy.polyfit?

Anne Archibald peridot.faceted@gmail....
Mon Mar 29 13:02:09 CDT 2010


On 29 March 2010 13:13, David Goldsmith <d.l.goldsmith@gmail.com> wrote:
> On Mon, Mar 29, 2010 at 8:08 AM, Jeremy Conlin <jlconlin@gmail.com> wrote:
>>
>> On Thu, Mar 25, 2010 at 9:40 PM, David Goldsmith
>> <d.l.goldsmith@gmail.com> wrote:
>> > On Thu, Mar 25, 2010 at 3:40 PM, Jeremy Conlin <jlconlin@gmail.com>
>> > wrote:
>> >>
>> >> Yikes!  This sounds like it may be more trouble than it's worth.  I
>> >> have a few sets of statistical data that each need to have curves fit
>> >> to them.
>> >
>> > That's an awfully generic need - it may be obvious from examination of
>> > the
>> > data that a line is inappropriate, but besides polynomials there are
>> > many
>> > other non-linear models (which can be linearly fit to data by means of
>> > data
>> > transformation) which possess fewer parameters (and thus are simpler
>> > from a
>> > parameter analysis perspective).  So, the question is: why are you
>> > fitting
>> > to polynomials?  If it's just to get a good fit to the data, you might
>> > be
>> > getting "more fit" than your data warrants (and even if that isn't a
>> > problem, you probably want to use a polynomial form different from
>> > "standard
>> > form," e.g., Lagrange interpolators).  Are you sure something like an
>> > exponential growth/decay or power law model (both of which are "more
>> > natural," linearizable, two-parameter models) wouldn't be more
>> > appropriate -
>> > it would almost certainly be simpler to analyze (and perhaps easier to
>> > justify to a referee).
>> >
>> > On this note, perhaps some of our experts might care to comment: what
>> > "physics" (in a generalized sense) gives rise to a polynomial dependency
>> > of
>> > degree higher than two?  The only generic thing I can think of is
>> > something
>> > where third or higher order derivatives proportional to the independent
>> > variable are important, and those are pretty uncommon.
>>
>> I will only be fitting data to a first or second degree polynomial.
>> Eventually I would like to fit my data to an exponential or a power
>> law, just to see how it compares to a low-order polynomial.  Choosing
>> these functions was based on qualitative analysis (i.e. "it looks
>> quadratic").
>
> This is a good approach if applied to the right data: the residuals, i.e.,
> the (signed, not absolute value) differences between the predicted and
> corresponding measured dependent variable values.  For example, if you fit a
> line to quadratic data and then plot the residuals v. the independent
> variable, you'll see systematic error: most of the residuals at the
> extremities will be on one side of zero, while most of those around the
> center will be on the other side of zero, i.e., the residuals will look like
> a parabola.  If you then fit the original data to a quadratic model (and a
> quadratic model is "physically" appropriate) and plot the residuals, these
> should appear to be randomly (and rather symmetrically) distributed around
> zero.  This generalizes: systematic behavior of residuals reveals
> deficiencies in the assumed form of your model - the "best" (most
> "physical") model produces residuals randomly (though not necessarily
> uniformly) distributed around zero (because the "best" model captures all
> non-randomness in your data).  Moral: when doing regression where you're not
> a priori certain of your model, always plot the residuals.

I would shorten that to "always plot the residuals". In fact, that's
one of the problems with unbinned maximum-likelihood methods: there
often isn't any feasible notion of "residual".

As a quality-of-fit measure, the standard tool is the chi-squared
distribution: if you (linearly) fit an n-parameter model to an m-point
data set, and the model captures the true behaviour of your system,
then the sum of the squared errors divided by the variances is
distributed according to a chi-squared distribution with m-n degrees
of freedom. So you can calculate this number and evaluate the
likelihood that the Gaussian noise would produce a quality-of-fit this
bad. If you get a tiny probability, you can be pretty sure your model
is not fitting the data adequately.

Also worth noting is the highly annoying situation where you have what
purports to be a good fit: the chi-squared gives you a decent
probability of noise producing a fit this bad, and the error bars on
the residuals all nicely include zero, but there's *still* a
discernible trend in the residuals. I interpret this to mean that not
only does my model not capture the system's behaviour, but I have also
overestimated my error bars: the fact that there's a good fit in spite
of the trend means that there is not as much scatter in the residuals
as there should be.

Anne

> DG
>
>
>
>>
>> The best case scenario would be that I take what I learn from this
>> "simple" example and apply it to more difficult problems as they come
>> along down the road.  It appears, however, that it's not so simple to
>> apply it to other problems.  I wish I had more time to learn about
>> fitting data to curves.  I'm sure there are a lot of powerful tools
>> that can help.
>>
>> Jeremy
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>


More information about the SciPy-User mailing list