Model and experiment fitting

Tom Loredo loredo at astro.cornell.edu
Mon Oct 23 16:32:11 CDT 2006


Hi Sebastian,

I'm still unclear about the problem.  From your last description, it
sounds like the problem is *not* 2-D; you are trying to model
a 1-d function of a 1-D argument (unless B is not the scalar field
strength, but rather \vec B, the 3-D field vector).

Also, your problem appears to be a regression (i.e., curve/surface
fitting) problem, not a density estimation problem (comparing samples
drawn from two distributions), so the Numerical Recipes suggestion
(which addresses comparing densities) is probably not relevant.

It sounds like you have some observed data that perhaps can be modeled as

y_i = f(x_i; theta) + e_i

where e_i represent measurement error, f(x;theta) is the "true
model" that gives the intensity as a function of the field strength,
which depends on some parameters theta, and y_i are the measured
intensities.  I presume you know something about the y_i measurement
errors (like a std error for all or for each of them).

You also have a computational model that gives you simulation
data that can be modeled as

Y_i = g(X_i; theta)

with (presumably) no Y_i "measurement error" (though perhaps there is 
an equivalent if you use Monte Carlo in the calculation or have
other quantifiable sources of error).

As you phrased the problem, it appears you know theta exactly for
both the observational and simulation data (an unusual situation!).
You just want to ascertain whether g is a good approximation to
f.  Is this correct?

There is a large literature on problems like this where theta
is *unknown* and one wants to either calibrate the simulation or
infer theta for the observations from a sparse set of runs of
the simulation at various theta.  But no papers come immediately
to mind for the known theta case, though the "validation" stage of
some of the available papers address it.  For the unknown theta case,
the relevant literature is fairly new and goes under various
names (DACE: Design & Analysis of Computer Experiments;
BACCO: Bayesian Analysis of Computer Code Output; MUCM: Managing
Uncertainty in Complex Models).  The main tools are
interpolators that *quantify uncertainty* in the interpolation and 
machinery to propagate uncertainty through subsequent analyses.  Gaussian
processes (which include things like splines, kriging and random
walks as special cases, but with "error propagation") are
used to build an "emulator" for the simulation (an emulator is
an interpolator that also gives you measures of uncertainty
in the interpolation).  There are free codes available for
Matlab and R implementing this paradigm, but it's a tricky
business (as I am slowly discovering, having just started to
delve into it).

It is unclear from your description why the X_i cannot be made
equal to the x_i.  If either or both are uncontrollable---esp. if
you cannot set x_i but have to rely on "nature" providing you
x_i values that would differ from one set of observations to
the next---this adds a layer of complexity that is not trivial.
It becomes a "measurement error problem" (aka "errors-in-the-variables
problem"), with subtle aspects (you can easily go astray by 
simply ignoring measurement errors on the x's; they do *not*
"average out" as you get more sample points).  This can
(and has) been incorporated into the emulator framework, though
I don't know if any of the free code does this.

Part of the MUCM/BACCO approach is estimation of a "bias" or
"misfit" term between the simulation and calibration data (here
it would be delta(x) = g(x)-f(x)).  Perhaps
your problem can be phrased in terms of whether the bias function
is significantly nonzero anywhere, and if so, where.

There are various places to go to learn about this stuff if it
interests you; here are a few links.  The first two are to the
home pages of two working groups at SAMSI (an interdisciplinary
statistics/appl. math institute), which is devoting this year
to research on methods and applications in analysis of computer
models.

http://www.stat.duke.edu/~fei/samsi/index.html
http://www4.stat.ncsu.edu/~gawhite/SAMSIweb/
http://mucm.group.shef.ac.uk/
http://cran.r-project.org/src/contrib/Descriptions/BACCO.html
http://www2.imm.dtu.dk/~hbn/dace/

A particularly accessible paper is by the Los Alamos statistics
group:

http://portal.acm.org/citation.cfm?id=1039891.1039922

Again, I don't think this literature directly addresses your problem,
but it may provide the "right" way to approach it, if you are willing
to do some work to connect your problem to this framework.  The short
answer to your question is that I think you will be hard pressed to
get a good answer to your question using off-the-shelf SciPy tools.

Good luck,
Tom Loredo


-------------------------------------------------
This mail sent through IMP: http://horde.org/imp/

-------------------------------------------------------------------------
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642




More information about the Numpy-discussion mailing list