# [Numpy-discussion] Quick Question about Optimization

Eric Firing efiring@hawaii....
Mon May 19 20:14:58 CDT 2008

Robert Kern wrote:
> On Mon, May 19, 2008 at 6:55 PM, James Snyder <jbsnyder@gmail.com> wrote:
>> Also note, I'm not asking to match MATLAB performance.  It'd be nice,
>> but again I'm just trying to put together decent, fairly efficient
>> numpy code.
>
> I can cut the time by about a quarter by just using the boolean mask
> directly instead of using where().
>
>             for n in range(0,time_milliseconds):
>                 u  =  expfac_m  *  prev_u + (1-expfac_m) * aff_input[n,:]
>                 v = u + sigma * stdnormrvs[n, :]
>                 theta = expfac_theta * prev_theta - (1-expfac_theta)
>
>                 mask = (v >= theta)
>
>                 S[n,np.squeeze(mask)] = 1
>                 theta[mask] += b
>
>                 prev_u = u
>                 prev_theta = theta
>
>
> There aren't any good line-by-line profiling tools in Python, but you
> can fake it by making a local function for each line:
>
>             def f1():
>                 u  =  expfac_m  *  prev_u + (1-expfac_m) * aff_input[n,:]
>                 return u
>             def f2():
>                 v = u + sigma * stdnormrvs[n, :]
>                 return v
>             def f3():
>                 theta = expfac_theta * prev_theta - (1-expfac_theta)
>                 return theta
>             def f4():
>                 mask = (v >= theta)
>                 return mask
>             def f5():
>                 S[n,np.squeeze(mask)] = 1
>             def f6():
>                 theta[mask] += b
>
>             # Run Standard, Unoptimized Model
>             for n in range(0,time_milliseconds):
>                 u = f1()
>                 v = f2()
>                 theta = f3()
>                 mask = f4()
>                 f5()
>                 f6()
>
>                 prev_u = u
>                 prev_theta = theta
>
> I get f6() as being the biggest bottleneck, followed by the general
> time spent in the loop (about the same), followed by f5(), f1(), and
> f3() (each about half of f6()), followed by f2() (about half of f5()).
> f4() is negligible.
>
> Masked operations are inherently slow. They mess up the CPU's branch
> prediction. Worse, the use of iterators in that part of the code
> frustrates compilers' attempts to optimize that away in the case of
> contiguous arrays.
>
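
For reference, here is a minimal sketch of the substitution suggested above: indexing with the boolean mask directly rather than going through where(). The array names follow the quoted loop, but the sizes and values are made up for illustration, and it only checks that both forms give the same result. It makes no claim about timing.

```python
import numpy as np

# Illustrative data only; the sizes and values are assumptions, not from the thread.
rng = np.random.default_rng(0)
v = rng.standard_normal(1000)
theta = rng.standard_normal(1000)
b = 0.5

mask = (v >= theta)

# Version 1: convert the mask to integer indices with where(), then index.
theta_where = theta.copy()
theta_where[np.where(mask)] += b

# Version 2: index with the boolean mask directly, skipping the where() call.
theta_mask = theta.copy()
theta_mask[mask] += b

same = np.array_equal(theta_where, theta_mask)
```

Both versions update exactly the elements where the mask is True; the second just avoids building the intermediate index array.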

f6 can be sped up more than a factor of 2 by using putmask:

In [10]:xx = np.random.rand(100000)

In [11]:mask = xx > 0.5

In [12]:timeit xx[mask] += 2.34
100 loops, best of 3: 4.06 ms per loop
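
A minimal sketch of what a putmask version of that update could look like, assuming the same xx and mask as in the session above. np.putmask(a, mask, values) is the real NumPy call; the variable names and the equivalence check are added here for illustration, and no timing is claimed.

```python
import numpy as np

# Same setup as the session above, with a seeded generator for repeatability.
rng = np.random.default_rng(0)
xx = rng.random(100000)
mask = xx > 0.5

# Reference: in-place update through boolean indexing.
expected = xx.copy()
expected[mask] += 2.34

# putmask form: compute the updated values for the whole array,
# then copy them into place only where mask is True.
result = xx.copy()
np.putmask(result, mask, result + 2.34)

same = np.array_equal(expected, result)
```

Note the tradeoff: the putmask form does the addition on every element and throws away the unmasked results, in exchange for avoiding the boolean-indexed read and write.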