[Numpy-discussion] f2py performance question from a rookie
Sturla Molden
sturla@molden...
Mon Aug 16 17:01:29 CDT 2010
> After using time.clock, running f2py with the REPORT_ON_ARRAY_COPY enabled
> and
> passing arrays as np.asfortranarray(array) to the fortran routines I still
> get a
> slow performance on f2py. No copied arrays are reported.
That is not any better as np.asfortranarray will make a copy instead. Pass
the transpose to Fortran, and transpose the return value. Make sure you
neither make a copy before or after calling Fortran. A transpose does not
make a copy in NumPy.
You want C ordering in NumPy and Fortran ordering in Fortran. Which means
you probably want to call Fortran like this:
foobar(x.T, y.T)
If you let f2py return y as a result, make sure it is C ordered after
transpose:
y = foobar(x.T).T
assert(y.flags['C_CONTIGUOUS'])
Even if you don't get a "copy and transpose" immediately (i.e. f2py does
not report a copy), it might happen silently later on if you get Fortran
ordered arrays returned into Python.
> Running on f2py with a 6499x6499 array takes about 1.2sec while the python
> for
> loop does the job slightly faster at 0.9 sec.
I suppose you are including the solve time here? In which case the
dominating factors are np.dot and np.linalg.solve. Fortran will not help a
lot then.
Well if you have access to performance libraries (ACML or MKL) you might
save a little by using Fortran DGEMM instead of np.dot and a carefully
selected Fortran LAPACK solver (that is, np.linalg.solve use Gaussian
substitution, i.e. LU, but there might be structure in your matrices to
exploit, I haven't looked too carefully.)
> subroutine fill_B(j, beta, lamda, mu, b)
> integer, intent(in) :: j
> integer :: row, col
> real(kind = 8), intent(in) :: beta(j), lamda(j), mu(j)
> real(kind = 8), intent(out) :: b(j,j)
>
> do row=2,j-1
> do col=1,j
> if (col == row-1) then
> b(row,col) = beta(row) - lamda(row) + mu(row)
> elseif (col == row) then
> b(row,col) = 1 - 2*beta(row)
> elseif (col == row+1) then
> b(row, col) = beta(row) + lamda(row) - mu(row)
> else
> b(row, col) = 0
> endif
> enddo
> enddo
> b(1,1) = 1
> b(j,j) = 1
>
> end
That iterates over strided memory in Fortran. Memory access is slow,
computation is fast.
So there are two problems with your Fortran code:
1. You want b to be transposed in Fortran compared to Python.
2. Iterate in column major order in Fortran, not row major order.
Sturla
More information about the NumPy-Discussion
mailing list