[Numpy-discussion] recommendations on goodness of fit functions?

Bruce Southey bsouthey@gmail....
Tue Apr 14 09:13:51 CDT 2009

Brennan Williams wrote:
> Charles R Harris wrote:
>> On Mon, Apr 13, 2009 at 6:28 PM, Brennan Williams 
>> <brennan.williams@visualreservoir.com 
>> <mailto:brennan.williams@visualreservoir.com>> wrote:
>>     Hi numpy/scipy users,
>>     I'm looking to add some basic goodness-of-fit functions/plots to
>>     my app.
>>     I have a set of simulated y vs time data and a set of observed y
>>     vs time data.
>>     The time values aren't always the same, i.e. there are often fewer
>>     observed data points.
>>     Some variables will be in a 0.0...1.0 range, others in a
>>     0.0...1.0e+12 range.
Ignoring the time scale, if time is linear with respect to the
parameters then you should be okay. If these ranges cause numerical
issues, you probably want to scale or standardize the time scale first.
If your model has polynomials then you probably want to select an
orthogonal type that has good properties for what you want. Otherwise,
you may either transform the data or fit an appropriate non-linear model.
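
As a rough sketch of the scaling and orthogonal-polynomial idea (the data
are made up, and Legendre is only one possible choice of basis, assumed
here for illustration), numpy's polynomial classes rescale the time axis
to [-1, 1] for you when fitting:

```python
import numpy as np
from numpy.polynomial import Legendre

# Hypothetical data: time in days over about 30 years, y on a 1e12 scale.
t = np.linspace(0.0, 10950.0, 50)
u = t / t.max()
y = 1.0e12 * (0.5 + 0.3 * u - 0.1 * u ** 2)

# Scale y so the fit works with numbers of order 1.
y_scale = np.abs(y).max()
y_scaled = y / y_scale

# Legendre.fit maps t onto [-1, 1] internally (its domain defaults to the
# range of the data), which keeps the basis well conditioned.
poly = Legendre.fit(t, y_scaled, deg=2)

# Undo the scaling to get fitted values in the original units.
y_hat = poly(t) * y_scale
max_rel_err = np.max(np.abs(y_hat - y)) / y_scale
```

The same pattern works with Chebyshev from the same module if that basis
suits the problem better.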

>>     I'm also hoping to update the calculated goodness of fit value at
>>     each simulated timestep, the idea being to allow the user to set a
>>     tolerance level which if exceeded stops the simulation (which
>>     otherwise can keep running for many hours/days).
Exactly how does this work?

I would have thought that you have a simulated model that has set
parameters and simulates the data based on the inputs that include a 
time range. For example, if I have a simple linear model Y=Xb, I would 
estimate parameters b and then allow the user to provide their X so it 
should stop relatively quickly depending on the model complexity and 
dimensions of X.
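
To make the Y=Xb case concrete, here is a minimal sketch with invented
numbers (X, b_true and X_new are all hypothetical): b is estimated once,
and a user-supplied X then costs only a matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training data for the linear model Y = X b.
X = np.column_stack([np.ones(20), np.linspace(0.0, 1.0, 20)])
b_true = np.array([2.0, -1.5])
Y = X @ b_true + 0.01 * rng.standard_normal(20)

# Estimate b once by ordinary least squares.
b_hat, _, rank, _ = np.linalg.lstsq(X, Y, rcond=None)

# Later, a user supplies a new X; prediction is one matrix product.
X_new = np.array([[1.0, 0.25], [1.0, 0.75]])
Y_new = X_new @ b_hat
```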

It appears that you are either searching over some values to optimize
the parameters b or running some type of 'what if' or sensitivity scenarios. In
the first case you probably should use one of the optimization 
algorithms in scipy. In the second case, the simulation would stop when 
it exceeds the parameter space.
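
For the first case, a hedged sketch using scipy.optimize.least_squares.
The simulate() function below is a hypothetical stand-in (a simple
exponential decline), not the real reservoir simulator, and the data are
invented:

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical observed data following an exponential decline.
t_obs = np.linspace(0.0, 10.0, 25)
y_obs = 100.0 * np.exp(-0.3 * t_obs)

def simulate(params, t):
    """Stand-in for the simulator: y = q0 * exp(-d * t)."""
    q0, d = params
    return q0 * np.exp(-d * t)

def residuals(params):
    # Difference between simulated and observed values at the observed times.
    return simulate(params, t_obs) - y_obs

# Minimize the sum of squared residuals starting from a rough guess.
result = least_squares(residuals, x0=[50.0, 0.1])
q0_hat, d_hat = result.x
```

With an expensive simulator each residual evaluation is a full run, so
the choice of algorithm and starting values matters much more than it
does in this toy case.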

> Before I try and answer the following, attached is an example of a 
> suggested GOF function.
>> Some questions.
>> 1) What kind of fit are you doing?
>> 2) What is the measurement model?
>> 3) What do you know a priori about the measurement errors?
>> 4) How is the time series modeled?
> The simulated data are output by an oil reservoir simulator.
So much depends on the simulator model - for simple linear models 
assuming normality you could get away with R^2 or mean square error or
mean absolute deviation. But that really doesn't cut it with complex
models. If you are trying different models, then you should look at 
model comparison techniques.
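
For the simple case, these measures could be computed along the
following lines. This is only a sketch: gof_metrics is a hypothetical
helper, and it assumes that linearly interpolating the simulated curve
onto the observed times is acceptable, which may not hold for your data.

```python
import numpy as np

def gof_metrics(t_sim, y_sim, t_obs, y_obs):
    """Compare simulated and observed series at the observed times."""
    # Linearly interpolate the simulated curve onto the observed times.
    y_hat = np.interp(t_obs, t_sim, y_sim)
    resid = y_obs - y_hat
    mse = np.mean(resid ** 2)          # mean square error
    mae = np.mean(np.abs(resid))       # mean absolute deviation
    ss_tot = np.sum((y_obs - np.mean(y_obs)) ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / ss_tot
    return {"mse": float(mse), "mae": float(mae), "r2": float(r2)}

# Toy data: more simulated points than observed ones.
t_sim = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_sim = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t_obs = np.array([0.5, 2.5])
y_obs = np.array([0.5, 3.5])

metrics = gof_metrics(t_sim, y_sim, t_obs, y_obs)
```

For the toy data this gives an MSE of 0.5, a mean absolute deviation of
0.5 and an R^2 of about 0.78.
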
> Time series is typically monthly or annual timesteps over anything 
> from 5-30 years
> but it could also be in some cases 10 minute timesteps over 24 hours
> The timesteps output by the simulator are controlled by the user and 
> are not always even, e.g. for a simulation over 30 years you may
> have annual timesteps from year 0 to year 25 and then 3 monthly from 
> year 26-29 and then monthly for the most recent year.
If the data is not measured on the same regular interval across the 
complete period (say every month) you have a potential problem of 
selecting the correct time scale and making suitable assumptions for 
missing data points (like assuming that the value from one year is equal 
to the value at 1/4 of a year). If you cannot make suitable
assumptions, you probably cannot mix the different time scales so you
probably need a piece-wise solution to handle the different periods.
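
One way to sketch a piece-wise treatment is to compute the error
separately for each period. The residuals and the period boundaries
below are invented for illustration (annual up to year 25, quarterly up
to year 29, monthly after that):

```python
import numpy as np

# Hypothetical residuals (observed minus simulated), with time in years.
t = np.array([1.0, 10.0, 20.0, 26.25, 27.5, 29.1, 29.5])
resid = np.array([2.0, -2.0, 2.0, 1.0, -1.0, 0.5, -0.5])

# Assumed boundaries between the periods with different timestep sizes.
edges = np.array([25.0, 29.0])

# Label each point with the index of the period it falls in (0, 1 or 2).
segment = np.digitize(t, edges)

# Mean square error computed separately within each period.
mse_by_segment = {
    int(s): float(np.mean(resid[segment == s] ** 2))
    for s in np.unique(segment)
}
```

Here mse_by_segment comes out as {0: 4.0, 1: 1.0, 2: 0.25}, so each
period can be judged against its own tolerance.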

> Not sure about measurement errors - the older the data the higher the 
> errors due to changes in oil field measurement technology.
> And the error range varies depending on the data type as well, e.g. 
> error range for a water meter is likely to be higher than that for an 
> oil or gas meter.
This implies heterogeneity of variance over time which further 
complicates things. But again this is something that should be addressed 
when creating the simulator.
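
If rough error estimates per observation are available, one common way
to allow for unequal variances in a fit measure is to weight each
squared residual by 1/sigma^2. A minimal sketch, with sigma values that
are purely assumed:

```python
import numpy as np

def weighted_mse(resid, sigma):
    """Mean square error with inverse-variance weights 1 / sigma**2."""
    resid = np.asarray(resid, dtype=float)
    w = 1.0 / np.asarray(sigma, dtype=float) ** 2
    return float(np.sum(w * resid ** 2) / np.sum(w))

# Older, noisier measurement first; newer, more precise one second.
resid = np.array([4.0, 1.0])
sigma = np.array([2.0, 1.0])   # assumed measurement standard deviations

wmse = weighted_mse(resid, sigma)      # weights are 0.25 and 1.0
plain_mse = float(np.mean(resid ** 2))
```

The weighted value is 4.0 against an unweighted 8.5, since the older
point counts for less.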

