-1- The code style is good, and so is the performance compared to Matlab.
With 400x400:
Matlab = 1.56 sec  (nested "for" loops, so no optimization)
Numpy  = 0.99 sec  (with broadcasting)

-2- Now, with the code below, I get a strange result.
With w=h=400:
- Using "slice"        =>  0.99 sec
- Using "numpy.ogrid"  =>  0.01 sec

With w=400 and h=300:
- Using "slice"        =>  0.719 sec
- Using "numpy.ogrid"  =>  broadcast ERROR !

"ValueError: shape mismatch: objects cannot be broadcast to a single shape"

#######################################################
import numpy

def test():
    w = 400

    if 1:  #---Case with different w and h
        h = 300
    else:  #---Case with same w and h
        h = 400

    a = numpy.zeros((h,w))      #Normally loaded with real data
    b = numpy.zeros((h,w,3))

    if 1:   #---Case with SLICE
        w0 = slice(0,w-2)
        w1 = slice(1,w-1)
        w2 = slice(2,w)
        h0 = slice(0,h-2)
        h1 = slice(1,h-1)
        h2 = slice(2,h)
    else:   #---Case with OGRID
        w0 = numpy.ogrid[0:w-2]
        w1 = numpy.ogrid[1:w-1]
        w2 = numpy.ogrid[2:w]
        h0 = numpy.ogrid[0:h-2]
        h1 = numpy.ogrid[1:h-1]
        h2 = numpy.ogrid[2:h]

    # Indices are tuples: recent NumPy versions reject a list of slices as an index
    p00, p01, p02 = (h0,w0), (h0,w1), (h0,w2)
    p10, p11, p12 = (h1,w0), (h1,w1), (h1,w2)
    p20, p21, p22 = (h2,w0), (h2,w1), (h2,w2)

    # Note: without axis=, numpy.min/numpy.max reduce the 8 stacked arrays
    # to one scalar; pass axis=0 if a per-pixel result is intended.
    b[p11+(1,)] =  a[p11] - numpy.min([a[p11]-a[p00],
                                       a[p11]-a[p01],
                                       a[p11]-a[p02],
                                       a[p11]-a[p10],
                                       a[p11]-a[p12],
                                       a[p11]-a[p20],
                                       a[p11]-a[p21],
                                       a[p11]-a[p22]])  \
                   + 0.123*numpy.max([a[p11]-a[p00],
                                      a[p11]-a[p01],
                                      a[p11]-a[p02],
                                      a[p11]-a[p10],
                                      a[p11]-a[p12],
                                      a[p11]-a[p20],
                                      a[p11]-a[p21],
                                      a[p11]-a[p22]])
#######################################################

It seems "ogrid" gives better performance, but broadcasting no longer works.
Do you think it is normal that broadcasting is not possible when using ogrid
with different w & h?
Did I miss a row/column mismatch somewhere?
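
A minimal sketch of what may be going on, on a small array. It assumes that
numpy.ogrid with a single slice returns a plain 1-D index array, so that two
of them are paired element by element instead of forming a grid. The names
(by_slice, diagonal, by_mesh, ...) are only for illustration.

```python
import numpy

h, w = 4, 6
a = numpy.arange(h * w, dtype=float).reshape(h, w)

# Slices select the whole interior block, shape (h-2, w-2).
by_slice = a[1:h - 1, 1:w - 1]

# 1-D index arrays of lengths h-2 and w-2 are paired element by element,
# so they must broadcast against each other; (2,) and (4,) cannot.
h1 = numpy.ogrid[1:h - 1]
w1 = numpy.ogrid[1:w - 1]
try:
    a[h1, w1]
    paired_ok = True
except (IndexError, ValueError):  # exception type depends on the NumPy version
    paired_ok = False

# With equal lengths the pairing succeeds but yields only the diagonal
# a[1,1], a[2,2], ... which would also explain the much shorter run time.
sq = numpy.arange(25.0).reshape(5, 5)
d = numpy.ogrid[1:4]
diagonal = sq[d, d]

# An open mesh (shapes (h-2, 1) and (1, w-2)) broadcasts to the full block.
hh, ww = numpy.ogrid[1:h - 1, 1:w - 1]
by_mesh = a[hh, ww]
```

If this is indeed the cause, the 0.01 sec result for w=h=400 would be computed
on 398 diagonal pixels only, not on the full image. numpy.ix_(h1, w1) should
build the same open mesh from the two 1-D arrays.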

Thanks.

Cheers,
Nicolas

> I simplified the code to focus only on what I need, rather than bother you
> with the full code.

import numpy

def test():
    w = 3096
    h = 2048
    a = numpy.zeros((h,w), order='F')      #Normally loaded with real data
    b = numpy.zeros((h,w,3), order='F')

    w0 = slice(0,w-2)
    w1 = slice(1,w-1)
    w2 = slice(2,w)
    h0 = slice(0,h-2)
    h1 = slice(1,h-1)
    h2 = slice(2,h)

    # Indices are tuples: recent NumPy versions reject a list of slices as an index
    p00, p10, p20 = (h0,w0), (h1,w0), (h2,w0)
    p01, p11, p21 = (h0,w1), (h1,w1), (h2,w1)
    p02, p12, p22 = (h0,w2), (h1,w2), (h2,w2)

    # Note: without axis=, numpy.min/numpy.max reduce the 8 stacked arrays
    # to one scalar; pass axis=0 if a per-pixel result is intended.
    b[p11 + (1,)] =  a[p11] + 1.23*a[p22]  \
                     - numpy.min([a[p11]-a[p00],
                                  a[p11]-a[p01],
                                  a[p11]-a[p02],
                                  a[p11]-a[p10],
                                  a[p11]-a[p12],
                                  a[p11]-a[p20],
                                  a[p11]-a[p21],
                                  a[p11]-a[p22]])  \
                     + 0.123*numpy.max([a[p11]-a[p00],
                                        a[p11]-a[p01],
                                        a[p11]-a[p02],
                                        a[p11]-a[p10],
                                        a[p11]-a[p12],
                                        a[p11]-a[p20],
                                        a[p11]-a[p21],
                                        a[p11]-a[p22]])

Does this work for you?
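
One way to check the sliced expression is to compare it with a naive double
loop on a small, non-square array. The sketch below assumes the minimum and
maximum are meant per pixel over the 8 neighbours (hence axis=0, which the
code above does not pass), and it writes into a 2-D output instead of
b[..., 1]. The function names are made up for the example.

```python
import numpy

def neighbour_filter(a):
    """Sliced (vectorized) version; border pixels are left at zero."""
    h, w = a.shape
    out = numpy.zeros((h, w))
    centre = a[1:h-1, 1:w-1]
    # Differences between each interior pixel and its 8 neighbours,
    # stacked along a new first axis: shape (8, h-2, w-2).
    diffs = numpy.array([centre - a[dy:h-2+dy, dx:w-2+dx]
                         for dy in range(3) for dx in range(3)
                         if (dy, dx) != (1, 1)])
    out[1:h-1, 1:w-1] = (centre + 1.23 * a[2:h, 2:w]
                         - diffs.min(axis=0)          # assumed per-pixel
                         + 0.123 * diffs.max(axis=0))
    return out

def neighbour_filter_loop(a):
    """Reference version with nested loops, for checking only."""
    h, w = a.shape
    out = numpy.zeros((h, w))
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            d = [a[y, x] - a[y + dy, x + dx]
                 for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                 if (dy, dx) != (0, 0)]
            out[y, x] = a[y, x] + 1.23 * a[y + 1, x + 1] - min(d) + 0.123 * max(d)
    return out

rng = numpy.random.default_rng(0)
data = rng.random((7, 5))          # h != w on purpose
fast = neighbour_filter(data)
slow = neighbour_filter_loop(data)
print(numpy.allclose(fast, slow))
```

Because only slices are used, the shapes line up for any h and w, so the same
check can be repeated with the real image size.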

