[SciPy-Dev] A few minor changes to distributions .fit and scipy.optimize
Mon Jun 28 19:53:39 CDT 2010
On Mon, Jun 28, 2010 at 7:12 PM, Travis Oliphant <email@example.com> wrote:
> There was some recent discussion about the fact that the .fit functions only use the Nelder Mead fmin algorithm, and that the .fit functions do not allow dependency on other variables.
> I would like to check in a quick change to allow an optimizer keyword to the function which can take any optimizer which accepts a similar interface (and I've changed all the fmin_ style optimizers to adjust the few that have trickled in over the past few years which did not have a disp= keyword).
> Comments on this are welcome. It is a simple enough change that it seems appropriate to just check-it in.
I agree, after a brief look at the change sets, that this is a simple change.
One thing I was trying to get away from in some cases is the
requirement that callables (or distributions) live in the scipy
namespace. The line

    optimizer = getattr(optimize, optimizer)

restricts the choice to optimizers that are defined in scipy.optimize;
instead, we could also allow user-defined optimizers, or optimizers
from other packages, assuming they have a compatible interface.
In a similar case, I have distributions that are not in scipy.stats,
and I like functions to be indifferent to the location of a callable.
A comment on tests: the test directory has a test_fit.py that tests
whether the estimated results are reasonably good. It's currently
disabled because there are too many problems with the generic fit
(the function is named est_ instead of test_ so that nose doesn't
pick it up). I left it in there for the time when fit works across
the board. I just ran it on an older trunk and got

Ran 83 tests in 495.703s
> Regarding the matter of parameterization of the shape and location or scale variables: I think it would be straightforward to use a similar mechanism that allows one to now fix the shape, location, or scale parameters during the optimization, to allow for these variables to be parameterized by additional independent variables via a function. In other words, instead of fixing the value of shape, location, or scale, you could specify that this parameter should be a specific function of some other variables.
> Then, the optimization would proceed using these underlying functions. This interface should also be flexible enough to allow you to specify a function that returns several of the shape, location, and/or scale values given the same set of underlying functions.
> I'm thinking of merging this with the fixing of the parameters approach using a couple of object factory functions that are passed in via a single keyword argument to the .fit function. This would be a simple yet sufficiently flexible approach. I don't have a great name for the keyword function, perhaps params
> # s0..sn fix any shape parameters
> .fit(data, params=fix(s1=3,loc=4))
> .fit(data, params=expand(s1=func, loc=(func3, start))) # passing a tuple in fixes starting guess for underlying function.
> TBD: how to specify a function that returns several of the parameters: perhaps a keyword with names strung together: s0_loc_scale = func4
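For reference, the fixing mechanism that already exists, and that the
proposal above would generalize, looks like this (a usage sketch with
made-up data; floc pins loc so only shape and scale are estimated):

```python
import numpy as np
from scipy import stats

# Sample from a gamma distribution with known parameters, then refit
# with loc fixed at 0 via the floc keyword.
rng = np.random.RandomState(0)
data = stats.gamma.rvs(2.5, loc=0.0, scale=1.3, size=1000, random_state=rng)
shape, loc, scale = stats.gamma.fit(data, floc=0)
```

The fix()/expand() factories would fold this kind of constraint and
the function-valued parameters into a single params keyword.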
I think doing this in a really useful way that works for a majority of
distributions requires quite a bit of experimentation and thinking
about the overall design.
A basic enhancement to fit would be easy enough, but the question is
how flexible we want the new options to be and what results we want to
return. For example, Per Brodtkorb's original enhancement returned a
Profile class to do some further post estimation analysis. (We never
reviewed his proposed changes and they have not been incorporated.)
(The next part is taken care of in your proposal:)
Take the simplest case: y distributed N(mu, sigma**2) with mu =
X*beta, which is just standard linear regression estimated by MLE. We
would like to get the estimates for beta and sigma**2, plus most
likely the associated standard errors. In the case of the
t-distribution, we would have to estimate the dof (shape parameter)
in addition to beta and the scale. So, I think, the function that
specifies any of the distribution parameters will have its own
parameters that need to be jointly optimized, unless we do two-stage
optimization, which would be a way around the problem. In the former
case, we need a way to specify additional starting parameters and to
pool all the parameters that are estimated in the optimization
problem.
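As a concrete sketch of that joint optimization for the normal case
(all names here are illustrative, not proposed API), pooling beta and
a log-parameterized sigma into one vector for the optimizer:

```python
import numpy as np
from scipy import optimize

# Illustrative only: jointly estimate beta and sigma for
# y ~ N(X @ beta, sigma**2) by minimizing the negative log-likelihood.
rng = np.random.RandomState(0)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

def negloglike(params):
    # Pool the regression coefficients and the (log) scale into one
    # parameter vector, as the joint optimization requires.
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)  # parameterize in logs to keep sigma > 0
    resid = y - X @ beta
    return (0.5 * n * np.log(2 * np.pi * sigma**2)
            + 0.5 * np.sum(resid**2) / sigma**2)

params_hat = optimize.fmin(negloglike, np.zeros(3), disp=False)
beta_hat, sigma_hat = params_hat[:-1], np.exp(params_hat[-1])
```

The starting-values question shows up here already: np.zeros(3) works
for this toy problem, but a generic interface needs a way for the user
to supply starts for the function's own parameters.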
In terms of results, I was also thinking of returning at least the
Hessian and maybe the Gradient, or a result class that allows the
calculation of additional statistics. Similar to what we are working
on for the generic MLE framework in statsmodels.
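A crude central-difference Hessian, of the kind such a result class
could use for standard errors (a throwaway sketch, not an existing
scipy function):

```python
import numpy as np

def approx_hessian(f, x, eps=1e-4):
    # Central finite-difference Hessian of a scalar function f at x.
    # At the MLE, the inverse Hessian of the negative log-likelihood
    # estimates the covariance of the parameter estimates, from which
    # standard errors follow.
    x = np.asarray(x, dtype=float)
    k = x.size
    hess = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            ei = np.zeros(k)
            ej = np.zeros(k)
            ei[i] = eps
            ej[j] = eps
            hess[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                          - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * eps**2)
    return hess
```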
If we are able to produce a good estimation procedure, then I wouldn't
hide it as a method of scipy.stats.distributions. I think the helper
functions are a possible approach, but I would write estimation model
classes independent of scipy.stats.distributions and delegate to it in
the method. I think it's more useful to directly attach
distribution-specific information to scipy.stats.distributions, like
the log-likelihood (and I started to work with some characteristic
functions), than to write generic estimation models as methods.
That said, and since my plans are always larger than the available
time, maybe this is a "simple" extension for just the basic fit
results. It might work for most of the distributions with support on
the entire real line.
One small comment: I was wondering for a while whether we can get the
names of the shape parameters as keyword arguments for the interface.
I think R and some others do it this way instead of the generic s1,
s2, .... By parsing the shape parameter string, it should be possible.
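A sketch of extracting those names (shape_names is a hypothetical
helper; rv_continuous keeps the names as a comma-separated string in
its .shapes attribute, None when there are no shape parameters):

```python
from scipy import stats

def shape_names(dist):
    # Split the distribution's shapes string into names that could be
    # used as keyword arguments, e.g. "a, b" -> ["a", "b"].
    if dist.shapes is None:
        return []
    return [name.strip() for name in dist.shapes.split(",")]
```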
I don't think "expand" is an informative name, but I don't know a
good alternative (explain? fromfunc?). In your proposal, how can you
fix and expand at the same time, e.g. fix loc to fix the support and
make a shape parameter dependent on some explanatory variables?