[SciPy-user] Numerical Recipes robust fit implementation

Angus McMorland amcmorl@gmail....
Mon Jul 16 17:07:22 CDT 2007


Hi all,

On 16/07/07, Anne Archibald <peridot.faceted@gmail.com> wrote:
> On 16/07/07, Angus McMorland <amcmorl@gmail.com> wrote:
>
> > I'm going to have to think a bit more about what I want to achieve to see if
> > RANSAC is useful. Ultimately I hope to determine the probability of
> > a given data set being exponentially distributed, by comparing the raw
> > frequency distribution to an expected distribution based on a linear
> > fit to the log transform of the raw one. It seems a bit like basing my
> > 'expected' distribution on a subset of data from which outliers have
> > been completely excluded is self-fulfilling, and having some other
> > criterion for weighting of the error term (as medfit does) seems more
> > appropriate. This is however very much just a gut feeling rather than
> > an educated assessment, so any other comments are welcome.
>
> If what you're trying to do is test whether your data points are
> likely to have been drawn from a given distribution, you may be able
> to do much better by not putting them in a histogram first, and using
> something like the Kolmogorov-Smirnov test (scipy.stats.kstest). If
> you have outliers you may have a problem. (It's feasible to fit
> parameters so they maximize the kstest p value, although of course the
> p value you get at the end is no longer an actual probability.) I
> suspect if you look in the statistical literature you'll find other
> tricks for fitting distributions directly (rather than fitting
> histograms).

Kolmogorov-Smirnov is the way I had originally intended to go (there's
a variant that approximates the probability for grouped, i.e.
frequency, data). But, as Anne rightly points out, I can apply it to
the individual variates since I have them. My earlier approach arose
mainly from how I was looking at the data, and I need to get away from
that.
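
For reference, a minimal sketch of applying the test to the individual
variates rather than to a histogram. The data here are synthetic, and
the scale of 2.0 is an assumed value for illustration only, not
something taken from my data:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the raw variates (assumption: exponential, scale 2.0).
rng = np.random.default_rng(0)
samples = rng.exponential(scale=2.0, size=500)

# Compare the variates directly with an exponential CDF.
# args=(loc, scale) are passed through to scipy.stats.expon.cdf.
statistic, p_value = stats.kstest(samples, "expon", args=(0, 2.0))
print("KS statistic:", statistic)
print("p value:", p_value)
```

No binning is involved, so the result does not depend on a choice of
bin width.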

I still need an expected cdf based on my hypothesized distribution,
parameters for which I want to estimate from my data: that's where the
fitting comes in. My only remaining decision is whether a robust or
least-squares fitting approach is more appropriate for deriving the
expected distribution. The former is inherently self-fulfilling, in
that it excludes from the estimation of the expected distribution the
outliers that are likely the important deviations, and the latter will
include all the data, but fit none of it very well. Time to play
around and see how much difference there is, I think.
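
A rough sketch of the kind of comparison I have in mind, on synthetic
data with a few artificial outliers. The least-squares estimate is a
straight-line fit to the log of the histogram counts. The "robust"
estimate here is simply the median divided by ln 2, which is my own
stand-in for illustration and not the Numerical Recipes medfit
routine. All names and values are assumptions:

```python
import numpy as np
from scipy import stats

# Synthetic data: exponential variates (scale 2.0) plus ten large outliers.
rng = np.random.default_rng(1)
data = np.concatenate([rng.exponential(2.0, 300), rng.uniform(20, 30, 10)])

# Least-squares estimate: fit a line to log(counts) against bin centre.
# For an exponential distribution the slope is -1/scale.
counts, edges = np.histogram(data, bins=30)
centres = 0.5 * (edges[:-1] + edges[1:])
nonzero = counts > 0  # log(0) is undefined, so skip empty bins
slope, intercept = np.polyfit(centres[nonzero], np.log(counts[nonzero]), 1)
scale_ls = -1.0 / slope

# Simple robust estimate: the median of an exponential is scale * ln(2).
scale_med = np.median(data) / np.log(2)

# Kolmogorov-Smirnov test of the raw variates against each expected CDF.
for label, scale in [("least squares", scale_ls), ("median based", scale_med)]:
    statistic, p_value = stats.kstest(data, "expon", args=(0, scale))
    print(label, "scale:", scale, "D:", statistic, "p:", p_value)
```

Since the scale is estimated from the same data being tested, the p
values are only a relative guide, in line with Anne's caveat about
fitted parameters.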

Thanks for all your suggestions, they've been very useful.

Angus.
-- 
AJC McMorland, PhD Student
Physiology, University of Auckland

