[SciPy-user] Robust Statistics

Bruce Southey bsouthey@gmail....
Thu Jul 31 08:47:14 CDT 2008

Nuttall, Brandon C wrote:
> Folks,
> I have some rate/time data I'm in the process of analyzing. The data represent the natural gas produced from a well over time. The data are fit best by either an exponential model (least squares best fit of log of rate data, just qi and di see below) or a "hyperbolic" model:
> qt = qi *(1-b*di*t)**(-1.0/b)
>         qt = rate at time t
>         qi = initial rate (calculated)
>         b = decline exponent (rate of change of the decline rate, calculated)
>         di = initial decline rate (calculated)
>         t = time
> The calculated parameters are found using the optimization routine scipy.optimize.fmin_tnc(). I'm already rejecting some best fit results because either the correlation coefficient is not statistically significant or the di (slope) is not significantly different from 0.
> The image I've attached shows two things. An exponential best fit is shown by the solid line and a hyperbolic best fit by the dotted line. It is common practice in analyzing this type of data to begin the best fit analysis at an early time point where the data actually start to "decline"; thus, the hyperbolic best fit started at time 2 and the first rate point (rate = about 1350) is ignored. The other thing the graph shows is that there appear to be outliers.
> My question is, what Python resources are there for testing for outliers and robust statistics?
> Is RANSAC appropriate for this? Note that when I run ransack.py from the cookbook, I get:
> Traceback (most recent call last):
> --Snip other traceback info---
>   File "C:\Documents and Settings\bnuttall\Desktop\ransac.py", line 132, in fit
>     A = numpy.vstack([data[:,i] for i in self.input_columns]).T
> AttributeError: 'numpy.ndarray' object has no attribute 'T'
> I presume the T attribute is supposed to mean the transpose operator and have changed the offending statement(s, there are 4 total) to the form:
> A = numpy.vstack([data[:,i] for i in self.input_columns]).transpose()
> Thanks.
> Brandon Nuttall
> ------------------------------------------------------------------------
> ------------------------------------------------------------------------
> _______________________________________________
> SciPy-user mailing list
> SciPy-user@scipy.org
> http://projects.scipy.org/mailman/listinfo/scipy-user
First, what do you mean by 'outlier'? Please see, for example, 
http://en.wikipedia.org/wiki/Outlier because outliers fully depend on 
the (implicit or explicit) model fitted. Note this is really defined for 
linear models not nonlinear models that you are fitting.

Second, NumPy provides the tools to do fit various models but you need 
to implement your own statistics especially for nonlinear models. (In 
particular I would strongly suggest looking at likelihood ratio test and 
Bayesian information criterion for model selection.) Really your best 
bet is to use to R directly or interface to R with Rpy. That way you 
also have access to many nonlinear statistical routines that can fit 
that type of data.

If you do not need a parametric model, try fitting splines. These may be 
more suited to handle the variability near t=60 although that may be 
more related to data collection and time scale.


More information about the SciPy-user mailing list