[Numpy-discussion] recommendations on goodness of fit functions?
Tue Apr 14 09:13:51 CDT 2009
Brennan Williams wrote:
> Charles R Harris wrote:
>> On Mon, Apr 13, 2009 at 6:28 PM, Brennan Williams
>> <email@example.com> wrote:
>> Hi numpy/scipy users,
>> I'm looking to add some basic goodness-of-fit functions/plots to
>> my app.
>> I have a set of simulated y vs time data and a set of observed y
>> vs time
>> The time values aren't always the same, i.e. there are often fewer
>> observed data points.
>> Some variables will be in a 0.0...1.0 range, others in a
Ignoring the time scale, if time is linear with respect to the
parameters then you should be okay. If these cause numerical issues, you
probably want to scale or standardize the time scale first. If your
model has polynomials then you probably want to select an orthogonal
type that has good properties for what you want. Otherwise, you may
either transform the data or fit an appropriate non-linear model.
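As a rough sketch of the scaling and orthogonal-polynomial suggestion above: the time axis is mapped onto [-1, 1] and a Legendre series is fitted with numpy.polynomial. The choice of Legendre polynomials, the degree, the helper name scale_time and the synthetic data are all assumptions made for illustration, not something taken from the simulator.

```python
import numpy as np
from numpy.polynomial import legendre


def scale_time(t):
    """Map time values linearly onto [-1, 1] (hypothetical helper)."""
    t = np.asarray(t, dtype=float)
    return 2.0 * (t - t.min()) / (t.max() - t.min()) - 1.0


# Synthetic example: 30 years of data with time measured in days.
t = np.linspace(0.0, 30.0 * 365.0, 50)
y = 3.0 + 0.002 * t - 1.0e-7 * t**2

# Fit on the scaled axis, where Legendre polynomials are orthogonal.
x = scale_time(t)
coef = legendre.legfit(x, y, deg=2)
y_fit = legendre.legval(x, coef)
max_abs_err = np.max(np.abs(y_fit - y))
```

Fitting on the scaled axis keeps the design matrix well conditioned even though the raw time values run into the thousands.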
>> I'm also hoping to update the calculated goodness of fit value at
>> simulated timestep, the idea being to allow the user to set a
>> level which if exceeded stops the simulation (which otherwise can
>> running for many hours/days).
Exactly how does this work?
I would have thought that you have a simulated model that has set
parameters and simulates the data based on the inputs that include a
time range. For example, if I have a simple linear model Y=Xb, I would
estimate parameters b and then allow the user to provide their X so it
should stop relatively quickly depending on the model complexity and
dimensions of X.
It appears that you are doing some search over some values to optimize
the parameters b or some type of 'what if' or sensitivity scenarios. In
the first case you probably should use one of the optimization
algorithms in scipy. In the second case, the simulation would stop when
it exceeds the parameter space.
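For the first case, a minimal sketch of parameter estimation with scipy.optimize.least_squares might look like the following. The function simulate is a hypothetical stand-in (a simple exponential decline), not the real reservoir simulator, and the observations are noise-free synthetic values.

```python
import numpy as np
from scipy.optimize import least_squares


def simulate(b, t):
    """Hypothetical stand-in for the simulator: exponential decline."""
    return b[0] * np.exp(-b[1] * t)


def residuals(b, t_obs, y_obs):
    """Simulated minus observed values at the observation times."""
    return simulate(b, t_obs) - y_obs


# Synthetic, noise-free observations generated from known parameters.
t_obs = np.linspace(0.0, 10.0, 25)
b_true = np.array([100.0, 0.3])
y_obs = simulate(b_true, t_obs)

# Start from a deliberately poor guess and let the optimizer refine it.
result = least_squares(residuals, x0=[50.0, 0.1], args=(t_obs, y_obs))
b_hat = result.x
```

With a real simulator each residual evaluation is a full run, so the number of function evaluations (result.nfev) matters far more than it does here.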
> Before I try and answer the following, attached is an example of a
> suggested GOF function.
>> Some questions.
>> 1) What kind of fit are you doing?
>> 2) What is the measurement model?
>> 3) What do you know a priori about the measurement errors?
>> 4) How is the time series modeled?
> The simulated data are output by an oil reservoir simulator.
So much depends on the simulator model - for simple linear models
assuming normality you could get away with R^2, mean squared error or
mean absolute deviation. But that really doesn't cut it with complex
models. If you are trying different models, then you should look at
model comparison techniques.
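A minimal sketch of those simple measures, assuming the simulated curve can be linearly interpolated onto the observed times with np.interp (which requires increasing simulated times). The function name gof_metrics and the sample numbers are made up for illustration.

```python
import numpy as np


def gof_metrics(t_sim, y_sim, t_obs, y_obs):
    """R^2, mean squared error and mean absolute error at observed times."""
    t_obs = np.asarray(t_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    # Assumption: linear interpolation of the simulated curve is acceptable.
    y_pred = np.interp(t_obs, t_sim, y_sim)
    resid = y_obs - y_pred
    ss_res = np.sum(resid**2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mse": np.mean(resid**2),
        "mae": np.mean(np.abs(resid)),
    }


# Five simulated points, three observed points at different times.
metrics = gof_metrics(
    t_sim=[0.0, 1.0, 2.0, 3.0, 4.0],
    y_sim=[0.0, 2.0, 4.0, 6.0, 8.0],
    t_obs=[0.5, 2.0, 3.5],
    y_obs=[1.0, 4.5, 6.5],
)
```

These numbers only summarize the residuals; they say nothing about whether the model structure itself is adequate.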
> Time series is typically monthly or annual timesteps over anything
> from 5-30 years
> but it could also be in some cases 10 minute timesteps over 24 hours
> The timesteps output by the simulator are controlled by the user and
> are not always even, e.g. for a simulation over 30 years you may
> have annual timesteps from year 0 to year 25 and then 3 monthly from
> year 26-29 and then monthly for the most recent year.
If the data is not measured on the same regular interval across the
complete period (say every month) you have a potential problem of
selecting the correct time scale and making suitable assumptions for
missing data points (like assuming that the value from one year is equal
to the value at 1/4 of a year). If you cannot make suitable
assumptions, you probably cannot mix the different time scales so you
probably need a piece-wise solution to handle the different periods.
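One way to sketch a piece-wise treatment is to compute the error separately for each reporting period. The period boundaries (years 26 and 29), the function name rmse_by_period and the residual values below are illustrative assumptions only.

```python
import numpy as np


def rmse_by_period(t, resid, edges):
    """Root mean squared residual within each period delimited by edges."""
    t = np.asarray(t, dtype=float)
    resid = np.asarray(resid, dtype=float)
    # Index 0 is before edges[0], index k is between edges[k-1] and edges[k].
    idx = np.digitize(t, edges)
    out = np.full(len(edges) + 1, np.nan)
    for k in range(len(edges) + 1):
        mask = idx == k
        if mask.any():
            out[k] = np.sqrt(np.mean(resid[mask] ** 2))
    return out


# Annual data up to year 26, quarterly to year 29, monthly afterwards.
t_years = [1.0, 10.0, 20.0, 26.5, 28.0, 29.5]
residuals = [1.0, -1.0, 1.0, 2.0, -2.0, 3.0]
per_period = rmse_by_period(t_years, residuals, edges=[26.0, 29.0])
```

Reporting one value per period avoids letting the densely sampled recent data dominate a single pooled statistic.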
> Not sure about measurement errors - the older the data the higher the
> errors due to changes in oil field measurement technology.
> And the error range varies depending on the data type as well, e.g.
> error range for a water meter is likely to be higher than that for an
> oil or gas meter.
This implies heterogeneity of variance over time which further
complicates things. But again this is something that should be addressed
when creating the simulator.
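If per-point error estimates were available, one common way to account for unequal variances is to weight each residual by its standard deviation. The sketch below assumes such sigma values are known, which the thread does not confirm; the function name and numbers are hypothetical.

```python
import numpy as np


def weighted_chi2(y_obs, y_pred, sigma):
    """Sum of squared residuals, each scaled by its assumed standard error."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    z = (y_obs - y_pred) / sigma
    return np.sum(z**2)


# Older measurements get a larger assumed sigma and therefore less weight.
chi2 = weighted_chi2(
    y_obs=[10.0, 20.0, 30.0],
    y_pred=[12.0, 19.0, 30.0],
    sigma=[2.0, 1.0, 0.5],
)
```

Dividing chi2 by the number of points minus the number of fitted parameters gives a reduced value that is easier to compare across data types.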