[Numpy-discussion] multiprocessing (shared memory) with numpy array multiplication

Brandt Belson bbelson@princeton....
Mon Jun 13 22:18:27 CDT 2011


Hi all,
Thanks for your replies.

> Brandt Belson wrote:
> > Unfortunately I can't flatten the arrays. I'm writing a library where
> > the user supplies an inner product function for two generic objects, and
> > almost always the inner product function does large array
> > multiplications at some point. The library doesn't get to know about the
> > underlying arrays.
>
> Now I'm confused -- if the user is providing the inner product
> implementation, how can you optimize that? Or are you trying to provide
> said user with an optimized "large array multiplication" that he/she can
> use?

I'm sorry if I wasn't clear. I'm not providing a new array
multiplication function. I'm taking the inner product function (which
usually contains numpy array multiplication) from the user as a given.
I am parallelizing the computation of *many* inner products so that
each core can work on them independently. The parallelism is across
the many individual inner products, not within any single inner
product/array multiplication.

> If so, then I'd post your implementation here, and folks can suggest
> improvements.

I did attach some code showing what I'm doing, but that was a few days
ago, so I'll attach it again.
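
In outline, the scheme looks like the minimal sketch below (an
illustration, not the scrubbed attachment; inner_product stands in for
whatever the user supplies, and the sizes match the (300, 200) x 50
case discussed later):

import multiprocessing

import numpy as np

def inner_product(pair):
    # Stand-in for the user-supplied inner product: element-wise
    # multiply, then sum, i.e. numpy.sum(array1 * array2).
    array1, array2 = pair
    return np.sum(array1 * array2)

if __name__ == '__main__':
    # 50 independent inner products of (300, 200) arrays.
    pairs = [(np.random.random((300, 200)),
              np.random.random((300, 200))) for _ in range(50)]
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    # Each pair is pickled and shipped to a worker; the workers then
    # compute the inner products independently, one task per product.
    results = pool.map(inner_product, pairs)
    pool.close()
    pool.join()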

> If it's regular old element-wise multiplication:
>
> a*b
>
> (where a and b are numpy arrays)
>
> then you are right, numpy isn't using any fancy multi-core-aware
> optimized package, so you should be able to make a faster version.
>
> You might try numexpr also -- it's pretty cool, though may not help for
> a single operation. It might give you some ideas, though.
>
> http://www.scipy.org/SciPyPackages/NumExpr
>
>
> -Chris

NumExpr looks helpful and I'll definitely look into it, but the main
issue is parallelizing many element-wise array multiplications, not
speeding up each individual array multiplication. It might be that
parallelizing the individual inner products among cores isn't the
right approach, but I'm not sure it's wrong yet.
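
For reference, a single inner product in numexpr might look like this
sketch; numexpr's worker threads speed up the one element-wise
expression, rather than running many inner products at once:

import numexpr as ne
import numpy as np

a = np.random.random((300, 200))
b = np.random.random((300, 200))
# numexpr compiles the expression and evaluates it in chunks across
# its threads, without materializing the a * b temporary.
ip = ne.evaluate('sum(a * b)')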

> >     Message: 2
> >     Date: Fri, 10 Jun 2011 09:23:10 -0400
> >     From: Olivier Delalleau <shish@keba.be>
> >     Subject: Re: [Numpy-discussion] Using multiprocessing (shared memory)
> >            with numpy array multiplication
> >     To: Discussion of Numerical Python <numpy-discussion@scipy.org>
> >
> >     It may not work for you depending on your specific problem
> >     constraints, but if you could flatten the arrays, then it would be
> >     a dot, and you could maybe compute multiple such dot products by
> >     storing those flattened arrays into a matrix.
> >
> >     -=- Olivier
> >
> >     2011/6/10 Brandt Belson <bbelson@princeton.edu>
> >
> >      > Hi,
> >      > Thanks for getting back to me.
> >      > I'm doing element-wise multiplication, basically innerProduct =
> >      > numpy.sum(array1*array2) where array1 and array2 are, in general,
> >      > multidimensional. I need to do many of these operations, and I'd
> >      > like to split up the tasks between the different cores. I'm not
> >      > using numpy.dot; if I'm not mistaken, I don't think that would do
> >      > what I need.
> >      > Thanks again,
> >      > Brandt
> >      >
> >      >
> >      > Message: 1
> >      >> Date: Thu, 09 Jun 2011 13:11:40 -0700
> >      >> From: Christopher Barker <Chris.Barker@noaa.gov>
> >      >> Subject: Re: [Numpy-discussion] Using multiprocessing (shared memory)
> >      >>        with numpy array multiplication
> >      >> To: Discussion of Numerical Python <numpy-discussion@scipy.org>
> >      >>
> >      >> Not much time, here, but since you got no replies earlier:
> >      >>
> >      >>
> >      >> >      > I'm parallelizing some code I've written using the
> >      >> >      > built-in multiprocessing module. In my application, I
> >      >> >      > need to multiply many large arrays together
> >      >>
> >      >> Is the multiplication matrix multiplication, or element-wise? If
> >      >> matrix, then numpy should be using LAPACK, which, depending on
> >      >> how it's built, could be using all your cores already. This is
> >      >> heavily dependent on how your numpy (really, the LAPACK it uses)
> >      >> is built.
> >      >>
> >      >> >      > and
> >      >> >      > sum the resulting product arrays (inner products).
> >      >>
> >      >> are you using numpy.dot() for that? If so, then the above applies to
> >      >> that as well.
> >      >>
> >      >> I know I could look at your code to answer these questions, but I
> >      >> thought this might help.
> >      >>
> >      >> -Chris
> >      >>
> >      >> --
> >      >> Christopher Barker, Ph.D.
> >      >> Oceanographer
> >      >>
> >      >> Emergency Response Division
> >      >> NOAA/NOS/OR&R            (206) 526-6959   voice
> >      >> 7600 Sand Point Way NE   (206) 526-6329   fax
> >      >> Seattle, WA  98115       (206) 526-6317   main reception
> >      >>
> >      >> Chris.Barker@noaa.gov


> Message: 2
> Date: Mon, 13 Jun 2011 12:51:08 -0500
> From: srean <srean.list@gmail.com>
> Subject: Re: [Numpy-discussion] Using multiprocessing (shared memory)
>        with numpy array multiplication
> To: Discussion of Numerical Python <numpy-discussion@scipy.org>
>
> Looking at the code, the arrays that you are multiplying seem fairly
> small (300, 200) and you have 50 of them. So it might be the case that
> there is not enough computational work to compensate for the cost of
> forking new processes and communicating the results. Have you tried
> larger arrays and more of them?

I've tried varying the sizes and the trends are consistent - using
multiprocessing on numpy array multiplication is slower than not using
it. For reference, I'm on a mac with the following numpy
configuration:

>>> print numpy.show_config()
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-faltivec']
    define_macros = [('NO_ATLAS_INFO', 3)]
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-faltivec',
'-I/System/Library/Frameworks/vecLib.framework/Headers']
    define_macros = [('NO_ATLAS_INFO', 3)]
None

> If you are on an Intel machine and you have the MKL libraries around, I
> would strongly recommend that you use the matrix multiplication
> routine if possible. MKL will do the parallelization for you. Well,
> any good BLAS implementation would do the same; you don't really need
> MKL. ATLAS and ACML would work too, just that MKL has been set up for
> us and it works well.
>
> To give an idea, given the amount of tuning and optimization that
> these libraries have undergone, a numpy.sum would be slower than a
> multiplication with a vector of all ones. So in the interest of speed,
> the longer you stay in the BLAS context, the better.
>
> --srean

That seems like a good option. While I'd like the user to have minimal
restrictions and dependencies to consider when writing the inner
product function, maybe I should put the burden on them to parallelize
the inner products, which could be done simply by building numpy
against MKL, I guess (I haven't tried this yet).
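
For what it's worth, in cases where the arrays are accessible,
Olivier's flattening idea combined with srean's stay-in-BLAS advice
reduces all of the inner products to one matrix product. A minimal
sketch, using the same (300, 200) x 50 sizes:

import numpy as np

arrays1 = [np.random.random((300, 200)) for _ in range(50)]
arrays2 = [np.random.random((300, 200)) for _ in range(50)]

# Flatten each array into a row; every pairwise inner product then
# falls out of a single BLAS matrix product, since
# numpy.sum(a * b) == numpy.dot(a.ravel(), b.ravel()).
A = np.array([a.ravel() for a in arrays1])   # shape (50, 60000)
B = np.array([b.ravel() for b in arrays2])   # shape (50, 60000)
ip = np.dot(A, B.T)   # ip[i, j] is the (i, j) inner product

A multithreaded BLAS (MKL, or the Accelerate framework shown above)
then parallelizes that single dot call on its own.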

I'm still a bit curious what is causing my script to be slower when
the multiple inner products are parallelized.
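
Following srean's point about the cost of forking and communicating
results: with Pool, every (300, 200) array gets pickled and copied to
a worker, which can easily cost more than the multiply itself. One way
to test this is to put the arrays in shared memory before the pool is
created, so only an index is pickled per task. A rough sketch,
assuming the Unix fork start method (an illustration, not the scrubbed
shared_mem.py; shared_array and task are made-up names):

import ctypes
import multiprocessing

import numpy as np

shape, n = (300, 200), 50

def shared_array(shape):
    # Allocate the data in shared memory; forked workers then read it
    # without any pickling or copying.
    buf = multiprocessing.RawArray(ctypes.c_double, int(np.prod(shape)))
    return np.frombuffer(buf, dtype=np.float64).reshape(shape)

arrays1 = [shared_array(shape) for _ in range(n)]
arrays2 = [shared_array(shape) for _ in range(n)]
for a, b in zip(arrays1, arrays2):
    a[:] = np.random.random(shape)
    b[:] = np.random.random(shape)

def task(i):
    # Only the integer i crosses the process boundary; the arrays
    # themselves are inherited through fork.
    return np.sum(arrays1[i] * arrays2[i])

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    results = pool.map(task, range(n))
    pool.close()
    pool.join()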

Thanks,
Brandt
-------------- next part --------------
A non-text attachment was scrubbed...
Name: myutil.py
Type: application/octet-stream
Size: 383 bytes
Desc: not available
Url : http://mail.scipy.org/pipermail/numpy-discussion/attachments/20110613/367e75e5/attachment.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: shared_mem.py
Type: application/octet-stream
Size: 1521 bytes
Desc: not available
Url : http://mail.scipy.org/pipermail/numpy-discussion/attachments/20110613/367e75e5/attachment-0001.obj 

