[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

Sebastian Haase seb.haase@gmail....
Sat Feb 19 11:13:44 CST 2011


Thanks a lot. Very informative. I guess what you say about the cache
line being "dirtied" is related to the info I got from valgrind (see my
email earlier in this thread: L1 Data Write Miss 3636).
Can one assume that a cache line is always a few megabytes?

Thanks,
Sebastian

On Sat, Feb 19, 2011 at 12:40 AM, Sturla Molden <sturla@molden.no> wrote:
> On 17.02.2011 16:31, Matthieu Brucher wrote:
>
> It may also be the size of the chunks OMP uses. You can/should specify
> them in the OMP pragma so that it is a multiple of the cache line size,
> or something close.
>
> Matthieu
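>
> As an illustration of Matthieu's point, a sketch (not Sebastian's code;
> CHUNK is a made-up tuning constant, assuming 64-byte cache lines and
> 8-byte doubles, i.e. 8 doubles per line):
>
>   #include <omp.h>
>
>   #define CHUNK 512   /* multiple of 8 doubles = whole 64-byte lines */
>
>   void scale(double *x, long n, double s)
>   {
>       long i;
>       /* static schedule with an explicit chunk size: each thread gets
>          contiguous blocks of CHUNK elements, so the blocks start and
>          end on cache-line boundaries (if x itself is 64-byte aligned) */
>       #pragma omp parallel for schedule(static, CHUNK)
>       for (i = 0; i < n; i++)
>           x[i] *= s;
>   }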
>
> Also beware of "false sharing" among the threads. When one processor updates
> the array "dist" in Sebastian's code, the cache line is dirtied for the
> other processors:
>
>   #pragma omp parallel for private(j, ax, ay, dif_x, dif_y)
>   for (i = 0; i < na; i++) {
>       ax = a_ps[i*nx1];        /* x coordinate of point i in a */
>       ay = a_ps[i*nx1 + 1];    /* y coordinate of point i in a */
>       for (j = 0; j < nb; j++) {
>           dif_x = ax - b_ps[j*nx2];
>           dif_y = ay - b_ps[j*nx2 + 1];
>
>           /* update shared memory (the index should be i*nb + j for a
>              row-major na-by-nb result; the original 2*i + j only
>              works when nb == 2) */
>
>           dist[i*nb + j] = sqrt(dif_x*dif_x + dif_y*dif_y);
>
>           /* ... and poof, the cache line is dirty */
>       }
>   }
>
> Whenever this happens, the processors must stop whatever they are doing
> to resynchronize their cache lines. "False sharing" can therefore act
> as an invisible GIL inside OpenMP code: the processors can appear to
> run in syrup, and there is excessive traffic on the memory bus.
>
> This is also why MPI programs often scale better than OpenMP programs,
> despite the IPC overhead.
>
> A good piece of advice when working with OpenMP is to let each thread
> write to private data arrays, and to share only read-only arrays.
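>
> A sketch of that pattern (a histogram rather than Sebastian's problem;
> NBINS and the names are made up): each thread accumulates into its own
> local array and touches the shared array only once, in a short merge:
>
>   #include <string.h>
>
>   #define NBINS 256
>
>   void histogram(const unsigned char *data, long n, long *hist)
>   {
>       memset(hist, 0, NBINS * sizeof(long));
>       #pragma omp parallel
>       {
>           long local[NBINS] = {0};   /* private: no false sharing */
>           long i;
>           #pragma omp for
>           for (i = 0; i < n; i++)
>               local[data[i]]++;
>           #pragma omp critical       /* one short merge per thread */
>           {
>               int b;
>               for (b = 0; b < NBINS; b++)
>                   hist[b] += local[b];
>           }
>       }
>   }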
>
> One can e.g. use OpenMP's "reduction" clause to achieve this:
> initialize the array dist with zeros, and put reduction(+:dist) in the
> OpenMP pragma line.
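>
> Note that reducing over a C pointer needs the array-section syntax
> introduced in OpenMP 4.5. A sketch (column_sums and the names are
> mine, not from the original code):
>
>   void column_sums(const double *a, long nrows, long ncols, double *acc)
>   {
>       long i, j;
>       for (j = 0; j < ncols; j++)
>           acc[j] = 0.0;
>       /* each thread gets a zero-initialized private copy of
>          acc[0:ncols]; the copies are summed back into acc when the
>          parallel loop ends */
>       #pragma omp parallel for reduction(+:acc[0:ncols]) private(j)
>       for (i = 0; i < nrows; i++)
>           for (j = 0; j < ncols; j++)
>               acc[j] += a[i*ncols + j];
>   }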
>
> Sturla
>

