[SciPy-User] Vectorizing scipy.optimize.curve_fit
josef.pktd@gmai...
josef.pktd@gmai...
Sun Feb 3 23:16:08 CST 2013
On Mon, Feb 4, 2013 at 12:13 AM, Paul Hobson <pmhobson@gmail.com> wrote:
> On Sun, Feb 3, 2013 at 10:47 AM, <josef.pktd@gmail.com> wrote:
>>
>> On Sun, Feb 3, 2013 at 1:29 PM, Paul Hobson <pmhobson@gmail.com> wrote:
>> >
>> > On Sun, Feb 3, 2013 at 4:28 AM, <josef.pktd@gmail.com> wrote:
>> >>
>> >> On Sat, Feb 2, 2013 at 9:38 PM, Ryan Nelson <rnelsonchem@gmail.com>
>> >> wrote:
>> >> > Hi Paul,
>> >> >
>> >> > I've played around with this for scipy.optimize.leastsq, and I've
>> >> > never
>> >> > been
>> >> > successful. I'm not an expert, but I don't think the original code
>> >> > was
>> >> > set
>> >> > up for that type of usage. (Just speculating here... An equally
>> >> > likely
>> >> > explanation is that I'm not smart enough to figure it out :)
>> >> >
>> >> > What I've found really useful for this type of problem is IPython's
>> >> > new
>> >> > parallel architecture (IPython version 0.13.1). It's easy to set up
>> >> > an
>> >> > IPython cluster and get things running in parallel through the
>> >> > notebook
>> >> > interface, so I've attached a simple notebook that does a trivial
>> >> > fitting
>> >> > (using leastsq) of some noisy data. Before you run the notebook,
>> >> > you'll
>> >> > need
>> >> > to set up a cluster through the Cluster tab in the IPython notebook
>> >> > dashboard.
>> >> >
>> >> > Unfortunately, in my experience I've found that the speed improvement
>> >> > is
>> >> > only noticeable until the number of IPython engines in your cluster
>> >> > equals
>> >> > the number of cores not the number of processor threads. (This might
>> >> > be
>> >> > obvious for those in the know, but it took me a while to figure out.)
>> >> > But
>> >> > every little bit of improvement helps, I suppose.
>> >> >
>> >> > Ryan
>> >> >
>> >> >
>> >> > On 2/1/2013 6:07 PM, Paul Hobson wrote:
>> >> >
>> >> > Hey folks,
>> >> >
>> >> > I've run into a bit of a roadblock. I've got some model runs (x) in
>> >> > an
>> >> > Nx2
>> >> > array where the first column is the input, and the second column is
>> >> > the
>> >> > output. So in a single case, I'd do:
>> >> >
>> >> > popt, pcov = scipy.optimize.curve_fit(myFit, x[:,0], x[:,1])
>> >> >
>> >> > But how should I handle, say, 5000 model runs such that x.shape =
>> >> > (500,
>> >> > N,
>> >> > 2) and I want the 5000 results for popt?
>> >> >
>> >> > This works:
>> >> > popt_array = np.empty(5000, 2)
>> >> > for r, layer in enumerate(model_runs):
>> >> > popt, pcov = scipy.optimize.curve_fit(myFit, layer[:,0]
>> >> > layer[:,1])
>> >> > popt_array[r] = popt
>> >> >
>> >> > But is there a better (faster way)? The number of model runs and data
>> >> > points
>> >> > may grow sustatially (~10^4 runs and 10^3 data points).
>> >>
>> >> I think there is no efficient way to avoid the loop.
>> >>
>> >> besides going parallel:
>> >>
>> >> If you are only interested in popt, then using optimize.leastsq
>> >> directly will save some calculations.
>> >>
>> >> The other major speedup for this kind of problems is to find good
>> >> starting values. For example, if the solutions in the sequence are
>> >> close to each other, then using previous solutions as starting values
>> >> for the next case will speed up convergence.
>> >> IIRC, in statsmodels we use an average of previous solutions, after
>> >> some initial warm-up, in a similar problem. A moving average could
>> >> also be useful.
>> >>
>> >> Josef
>> >>
>> >> >
>> >> > Thanks,
>> >> > -paul
>> >>
>> >
>> > Thanks for the advice, Ryan and Josef. I've been meaning to get a feel
>> > of
>> > for IPython Parallel for some time now, so this will be a good impetus
>> > to do
>> > so.
>>
>> If you just want to run it on multiple cores, then joblib is easier.
>> scikit-learn uses it extensively, and statsmodels uses it at the few
>> parts, especially for bootstrap.
>>
>> >
>> > Also, Josef, the suggestion about using optimize.leastsq with an initial
>> > guess, will be a very good one, I think. A better description of what
>> > I'm
>> > doing is actually bootstrapping the regression parameters to find their
>> > 95%
>> > confidence intervals. So it make a lot sense, really, to get the initial
>> > guess at the parameters from the full data set for use in the bootstrap
>> > iterations.
>>
>> The only problems I would worry a bit about in this case are the
>> possible presence of multiple local minima, if there are any, and
>> convergence failures.
>>
>> Josef
>>
>
> Thanks for the advice. Currently working with linear fits, but I'm hoping to
> expand this work to be more general. I'll be sure to keep these things in
> mind.
If you work with linear fits, why do you use curve_fit/leastsq instead
of linalg?
Josef
> -p
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
More information about the SciPy-User
mailing list