[Numpy-discussion] Scipy dot

Dag Sverre Seljebotn d.s.seljebotn@astro.uio...
Fri Nov 9 18:38:31 CST 2012


On 11/09/2012 11:57 PM, Matthieu Brucher wrote:
> Hi,
>
> A.A being slower than A.A' is not a surprise to me. The latter is far more
> cache friendly than the former: everything follows cache lines, so it is
> faster than an access pattern that uses only one element from each cache
> line. In fact, this is exactly what "proves" that the new version is
> correct. Good job (if all the tests were run and still pass ;) )
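>
> To see the cache effect in isolation, here is a toy illustration
> (untested, and unrelated to the patch itself; run in IPython):
>
>     import numpy as np
>     a = np.random.randn(5000, 5000)  # C-ordered: rows are contiguous
>     %timeit a.flatten()    # streams through memory in order
>     %timeit a.T.flatten()  # strided reads touch one element per cache line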

Cache lines shouldn't matter much with a decent BLAS?

http://dl.acm.org/citation.cfm?id=1356053

(Googling for "Anatomy of High-Performance Matrix Multiplication" will 
turn up a preprint outside the paywall, but Google won't give me a clean 
URL for the search result, so I can't paste it.)
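
A decent GEMM takes transposition flags, so no physical transpose or
copy is ever needed; the kernel just walks the same buffer with
different strides. A minimal sketch through scipy's BLAS wrappers
(untested, only to show the flags):

    import numpy as np
    from scipy.linalg import get_blas_funcs

    a = np.random.randn(2000, 2000).astype(np.float32)
    gemm, = get_blas_funcs(('gemm',), (a, a))
    # a.T is a free F-ordered view of the C-ordered a; trans_a/trans_b
    # ask BLAS for op(X) = X.T, so this computes a.dot(a) with no copy.
    c = gemm(1.0, a.T, a.T, trans_a=True, trans_b=True)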

Dag Sverre

>
> Cheers,
>
> Matthieu
>
>
> 2012/11/9 Nicolas SCHEFFER <scheffer.nicolas@gmail.com>
>
>     Ok, comparing apples to apples now. I can't explain my observations
>     and would need input from you guys.
>
>     Using ATLAS 3.10, numpy with and without my changes, I'm getting these
>     timings and comparisons.
>
>     #
>     #I. Generate matrices using regular dot:
>     #
>     import numpy as np
>     big = np.array(np.random.randn(2000, 2000), 'f')
>     np.savez('out', big=big, none=big.dot(big), both=big.T.dot(big.T),
>              left=big.T.dot(big), right=big.dot(big.T))
>
>     #
>     #II. Timings with regular dot
>     #
>     In [3]: %timeit np.dot(big, big)
>     10 loops, best of 3: 138 ms per loop
>
>     In [4]: %timeit np.dot(big, big.T)
>     10 loops, best of 3: 166 ms per loop
>
>     In [5]: %timeit np.dot(big.T, big.T)
>     10 loops, best of 3: 193 ms per loop
>
>     In [6]: %timeit np.dot(big.T, big)
>     10 loops, best of 3: 165 ms per loop
>
>     #
>     #III. I load these arrays and time again with the "fast" dot
>     #
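>     # (The loading step isn't shown; presumably something like the
>     # following, with the four products recomputed by the "fast" dot:)
>     arr = np.load('out.npz')
>     big = arr['big']
>     none, both = big.dot(big), big.T.dot(big.T)
>     left, right = big.T.dot(big), big.dot(big.T)
>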
>     In [21]: %timeit np.dot(big, big)
>     10 loops, best of 3: 138 ms per loop
>
>     In [22]: %timeit np.dot(big.T, big)
>     10 loops, best of 3: 104 ms per loop
>
>     In [23]: %timeit np.dot(big.T, big.T)
>     10 loops, best of 3: 138 ms per loop
>
>     In [24]: %timeit np.dot(big, big.T)
>     10 loops, best of 3: 102 ms per loop
>
>     1. A'.A': great!
>     2. A.A' becomes faster than A.A !?!
>
>     #
>     #IV. Frobenius norm of the differences
>     #
>     In [25]: np.sqrt(((arr['none'] - none)**2).sum())
>     Out[25]: 0.0
>
>     In [26]: np.sqrt(((arr['both'] - both)**2).sum())
>     Out[26]: 0.0
>
>     In [27]: np.sqrt(((arr['left'] - left)**2).sum())
>     Out[27]: 0.015900515
>
>     In [28]: np.sqrt(((arr['right'] - right)**2).sum())
>     Out[28]: 0.015331409
>
>     #
>     #V. Conclusion
>     #
>     While the differences are small, I'm wondering whether:
>     - It's a bug: the results should be exactly the same
>     - It's a feature: BLAS is taking shortcuts when you have A.A'. The
>     difference is not significant. Quick: PR that asap! (The toy example
>     below shows that float32 accumulation order alone can produce
>     differences of this size.)
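>
>     # Toy illustration (not tied to any particular BLAS): in float32,
>     # accumulation order alone changes the result slightly.
>     x = np.random.randn(2000).astype(np.float32)
>     print(sum(x) - sum(x[::-1]))  # usually small but non-zero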
>
>     I don't have enough expertise to answer that...
>
>     Thanks much!
>
>     -nicolas
>     On Fri, Nov 9, 2012 at 2:13 PM, Nicolas SCHEFFER
>     <scheffer.nicolas@gmail.com> wrote:
>      > I too encourage users to use scipy.linalg for speed and robustness
>      > (hence calling this scipy.dot), but it just brings so much confusion!
>      > When using the scipy + numpy ecosystem, you'd almost want everything
>      > to be done with scipy so that you get the best implementation in all
>      > cases: scipy.zeros(), scipy.array(), scipy.dot(), scipy.linalg.inv().
>      >
>      > Anyway, this is indeed for another thread; the confusion we'd like to
>      > fix here is that users shouldn't have to understand the C/F-contiguous
>      > concepts to get maximum speed from np.dot().
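>      >
>      > As a quick illustration of the concept (toy example):
>      >
>      >     import numpy as np
>      >     a = np.ones((3, 4))
>      >     a.flags['C_CONTIGUOUS']    # True: rows are adjacent in memory
>      >     a.T.flags['F_CONTIGUOUS']  # True: the transpose is a free view;
>      >                                # only the strides change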
>      >
>      > To summarize:
>      > - The python snippet I posted is still valid and can speed up your
>      > code if you can change all your dot() calls (a sketch of its shape
>      > appears below, after this list).
>      > - The change in dotblas.c is a bit more problematic because it's very
>      > core code. I'm having trouble replicating the timings right now: I'm
>      > getting better timings for a.dot(a.T) than for a.dot(a). There might
>      > be a bug.
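>      >
>      > (The snippet itself is in an earlier message; the following is only
>      > a guess at its shape: a wrapper that hands BLAS transpose flags
>      > instead of letting numpy copy non-contiguous inputs.)
>      >
>      >     import numpy as np
>      >     from scipy.linalg import get_blas_funcs
>      >
>      >     def _as_fortran(x):
>      >         # A C-contiguous array seen through .T is F-contiguous, so
>      >         # pass the view plus a trans flag instead of copying.
>      >         if x.flags.c_contiguous:
>      >             return x.T, True
>      >         return np.asfortranarray(x), False
>      >
>      >     def fast_dot(a, b):
>      >         gemm, = get_blas_funcs(('gemm',), (a, b))
>      >         a_f, trans_a = _as_fortran(a)
>      >         b_f, trans_b = _as_fortran(b)
>      >         return gemm(1.0, a_f, b_f, trans_a=trans_a, trans_b=trans_b)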
>      >
>      > It's a pain to test because I cannot do the test in a single python
>      > session. I'm going to try to integrate most of your suggestions,
>      > though I can't guarantee I'll have time to do them all.
>      >
>      > -nicolas
>      > On Fri, Nov 9, 2012 at 8:56 AM, Nathaniel Smith <njs@pobox.com> wrote:
>      >> On Fri, Nov 9, 2012 at 4:25 PM, Gael Varoquaux
>      >> <gael.varoquaux@normalesup.org> wrote:
>      >>> On Fri, Nov 09, 2012 at 03:12:42PM +0000, Nathaniel Smith wrote:
>      >>>> But what if someone compiles numpy against an optimized blas (mkl,
>      >>>> say) and then compiles SciPy against the reference blas? What do
>      >>>> you do then!? ;-)
>      >>>
>      >>> This could happen. But the converse happens very often. What
>      >>> happens is that users (e.g. on a shared computing resource) ask
>      >>> for a scientific python environment. The administrator then
>      >>> installs the packages starting from the most basic one and moving
>      >>> to the most advanced one, thus starting with numpy, which can very
>      >>> well build without any external blas. When he gets to scipy he
>      >>> hits the problem that the build system does not properly detect
>      >>> the blas, and he solves that problem.
>      >>>
>      >>> Also, it used to be that on the major linux distributions numpy
>      >>> would not be built with an optimized lapack, because numpy was in
>      >>> the 'base' set of packages but lapack was not. Scipy, on the other
>      >>> hand, being in the 'contrib' set, could depend on lapack. I just
>      >>> checked, and this has been fixed in the major distributions
>      >>> (Fedora, Debian, Ubuntu).
>      >>>
>      >>> Now, we can argue that such problems should not happen and put
>      >>> the blame on the users/administrators, but the fact is that they
>      >>> happen often. I keep seeing environments in which np.linalg is
>      >>> unreasonably slow.
>      >>
>      >> If this is something that's been a problem for you, maybe we should
>      >> start another thread on things we could do to fix it directly?
>      >> Improve build instructions, advertise build systems that set up the
>      >> whole environment (and thus do the right thing), make numpy's
>      >> setup.py scream and yell if blas isn't available...?
>      >>
>      >> -n
>
> --
> Information System Engineer, Ph.D.
> Blog: http://matt.eifelle.com
> LinkedIn: http://www.linkedin.com/in/matthieubrucher
> Music band: http://liliejay.com/
>


