[SciPy-User] Vectorizing scipy.optimize.curve_fit
Sun Feb 3 12:47:00 CST 2013
On Sun, Feb 3, 2013 at 1:29 PM, Paul Hobson <email@example.com> wrote:
> On Sun, Feb 3, 2013 at 4:28 AM, <firstname.lastname@example.org> wrote:
>> On Sat, Feb 2, 2013 at 9:38 PM, Ryan Nelson <email@example.com> wrote:
>> > Hi Paul,
>> > I've played around with this for scipy.optimize.leastsq, and I've never
>> > been
>> > successful. I'm not an expert, but I don't think the original code was
>> > set
>> > up for that type of usage. (Just speculating here... An equally likely
>> > explanation is that I'm not smart enough to figure it out :)
>> > What I've found really useful for this type of problem is IPython's new
>> > parallel architecture (IPython version 0.13.1). It's easy to set up an
>> > IPython cluster and get things running in parallel through the notebook
>> > interface, so I've attached a simple notebook that does a trivial
>> > fitting
>> > (using leastsq) of some noisy data. Before you run the notebook, you'll
>> > need
>> > to set up a cluster through the Cluster tab in the IPython notebook
>> > dashboard.
>> > Unfortunately, in my experience I've found that the speed improvement is
>> > only noticeable until the number of IPython engines in your cluster
>> > equals
>> > the number of cores not the number of processor threads. (This might be
>> > obvious for those in the know, but it took me a while to figure out.)
>> > But
>> > every little bit of improvement helps, I suppose.
>> > Ryan
>> > On 2/1/2013 6:07 PM, Paul Hobson wrote:
>> > Hey folks,
>> > I've run into a bit of a roadblock. I've got some model runs (x) in an
>> > Nx2
>> > array where the first column is the input, and the second column is the
>> > output. So in a single case, I'd do:
>> > popt, pcov = scipy.optimize.curve_fit(myFit, x[:,0], x[:,1])
>> > But how should I handle, say, 5000 model runs such that x.shape = (500,
>> > N,
>> > 2) and I want the 5000 results for popt?
>> > This works:
>> > popt_array = np.empty(5000, 2)
>> > for r, layer in enumerate(model_runs):
>> > popt, pcov = scipy.optimize.curve_fit(myFit, layer[:,0] layer[:,1])
>> > popt_array[r] = popt
>> > But is there a better (faster way)? The number of model runs and data
>> > points
>> > may grow sustatially (~10^4 runs and 10^3 data points).
>> I think there is no efficient way to avoid the loop.
>> besides going parallel:
>> If you are only interested in popt, then using optimize.leastsq
>> directly will save some calculations.
>> The other major speedup for this kind of problems is to find good
>> starting values. For example, if the solutions in the sequence are
>> close to each other, then using previous solutions as starting values
>> for the next case will speed up convergence.
>> IIRC, in statsmodels we use an average of previous solutions, after
>> some initial warm-up, in a similar problem. A moving average could
>> also be useful.
>> > Thanks,
>> > -paul
> Thanks for the advice, Ryan and Josef. I've been meaning to get a feel of
> for IPython Parallel for some time now, so this will be a good impetus to do
If you just want to run it on multiple cores, then joblib is easier.
scikit-learn uses it extensively, and statsmodels uses it at the few
parts, especially for bootstrap.
> Also, Josef, the suggestion about using optimize.leastsq with an initial
> guess, will be a very good one, I think. A better description of what I'm
> doing is actually bootstrapping the regression parameters to find their 95%
> confidence intervals. So it make a lot sense, really, to get the initial
> guess at the parameters from the full data set for use in the bootstrap
The only problems I would worry a bit about in this case are the
possible presence of multiple local minima, if there are any, and
> SciPy-User mailing list
More information about the SciPy-User