[SciPy-User] Vectorizing scipy.optimize.curve_fit

Ryan Nelson rnelsonchem@gmail....
Sat Feb 2 20:38:31 CST 2013


Hi Paul,

I've played around with this for scipy.optimize.leastsq, and I've never 
been successful. I'm not an expert, but I don't think the original code 
was set up for that type of usage. (Just speculating here... An equally 
likely explanation is that I'm not smart enough to figure it out :)

What I've found really useful for this type of problem is IPython's new 
parallel architecture (IPython version 0.13.1). It's easy to set up an 
IPython cluster and get things running in parallel through the notebook 
interface, so I've attached a simple notebook that does a trivial 
fitting (using leastsq) of some noisy data. Before you run the notebook, 
you'll need to set up a cluster through the Clusters tab in the IPython 
notebook dashboard.
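
In case you haven't used it before, the basic pattern is roughly this 
(just a minimal sketch, assuming the cluster is already running; the 
attached notebook does the real version):

    from IPython.parallel import Client

    c = Client()        # connect to the running cluster
    dview = c[:]        # a DirectView across all of the engines
    dview.block = True  # wait for the results before returning

    # Split the work across the engines; each engine runs the
    # function on a chunk of the input sequence.
    squares = dview.map(lambda n: n**2, range(16))

(You can also start a cluster from the command line with 
'ipcluster start -n 4' instead of using the dashboard.)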

Unfortunately, in my experience the speed improvement only scales until 
the number of IPython engines in your cluster equals the number of 
physical cores, not the number of processor threads. (This 
might be obvious for those in the know, but it took me a while to figure 
out.) But every little bit of improvement helps, I suppose.

Ryan

On 2/1/2013 6:07 PM, Paul Hobson wrote:
> Hey folks,
>
> I've run into a bit of a roadblock. I've got some model runs (x) in 
> an Nx2 array where the first column is the input, and the second 
> column is the output. So in a single case, I'd do:
>
> popt, pcov = scipy.optimize.curve_fit(myFit, x[:,0], x[:,1])
>
> But how should I handle, say, 5000 model runs such that x.shape = 
> (5000, N, 2) and I want the 5000 results for popt?
>
> This works:
> popt_array = np.empty((5000, 2))
> for r, layer in enumerate(model_runs):
>     popt, pcov = scipy.optimize.curve_fit(myFit, layer[:,0], layer[:,1])
>     popt_array[r] = popt
>
> But is there a better (faster) way? The number of model runs and data 
> points may grow substantially (~10^4 runs and 10^3 data points).
>
> Thanks,
> -paul
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user

-------------- next part --------------
{
 "metadata": {
  "name": "Parallel Opt"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Import all of the stuff we'll need into the current IPython kernel. The Client object will let us connect to our parallel IPython kernels, which were started from the Clusters tab on the IPython dashboard."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "import matplotlib.pyplot as plt\n",
      "import scipy.optimize as spo\n",
      "from IPython.parallel import Client"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Set up our parallel client object, and make a direct view to the cluster kernels. Block the execution of the code in the main loop while the parallel code is running on the direct view kernels."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "c = Client()\n",
      "dview = c[:]\n",
      "dview.block = True"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
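    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a quick sanity check, the client's 'ids' attribute lists the engines that are currently registered, so you can confirm that the cluster actually started."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# The ids of the engines registered with the controller\n",
      "c.ids"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },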
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The %%px magic sends all of the cell code to the parallel kernels. Here I've done a couple imports, defined a function for fitting (from a colleague), define a generic residual function, and set up a bunch of 'global' values that I'll need for each fit."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%px\n",
      "import numpy as np\n",
      "import scipy.optimize as spo\n",
      "\n",
      "def my_funct(x, h, csmax, Klf):\n",
      "    numerator = csmax*(Klf*x)**h\n",
      "    denominator = 1 + (Klf*x)**h\n",
      "    return numerator/denominator\n",
      "\n",
      "def resid(parameters, y, x, funct):\n",
      "    return y - funct(x, *parameters)\n",
      "\n",
      "true_values = np.array([3., 1., 5.]) # Actual values\n",
      "x_vals = np.array([0.01, 0.03, 0.07, 0.1, 0.2, 0.5, 0.8]) # x axis data\n",
      "cs_noise_global = my_funct(x_vals, *true_values) # Generate non-noisy data\n",
      "guesses = [0, 0, 0.1] # An initial guess for fitting"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here I'll define a function that does my fitting. In this case, I am simply adding random noise to my original data and then fitting it. I put a little test in there to make sure I don't get bad fit values.\n",
      "\n",
      "Using the my direct view 'map' function, I can run this function on each of the parallel kernels and collect the results as a Python list. The first argument is the function name, and the second is an array to use for each execution of the function. In my case, it's a simple array of ints, but I guess it could be a list of values for each fit. Added some timing info because I was testing things out. In my tests, I've noticed that the execution time decreases until you reach the number of cores on your machine, not the number of processor threads. I don't know why that is, and it might just be something I'm doing wrong."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def tester(n):\n",
      "    # Make an array of random numbers\n",
      "    random_norm = np.random.randn( len(cs_noise_global) )\n",
      "    # Use the random numbers to add noise to my clean data.\n",
      "    cs_noise = cs_noise_global + cs_noise_global*0.07*random_norm\n",
      "    # Fit the noisy data using the residual function defined above\n",
      "    fit, error = spo.leastsq(resid, guesses, args=(cs_noise, x_vals, my_funct) )\n",
      "    # If the fit was successful, return the fit results, or else return None. That\n",
      "    # way it is pretty easy to loop through the output and drop the duds.\n",
      "    if error in (1, 2, 3, 4):\n",
      "        return fit\n",
      "    else:\n",
      "        return None\n",
      "\n",
      "%time vals = dview.map(tester, np.arange(5000))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "len(vals)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
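    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Finally, one way to collect the successful fits into a single array for further analysis, dropping the None entries left by any fits that failed:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Keep only the fits that converged, then stack them into an (n_fits, 3) array\n",
      "good_fits = [v for v in vals if v is not None]\n",
      "popt_array = np.array(good_fits)\n",
      "popt_array.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },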
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}

