[Numpy-discussion] Slicing slower than matrix multiplication?

Francesc Alted faltet@pytables....
Fri Dec 11 11:28:44 CST 2009

```A Friday 11 December 2009 17:36:54 Bruce Southey escrigué:
> On 12/11/2009 10:03 AM, Francesc Alted wrote:
> > A Friday 11 December 2009 16:44:29 Dag Sverre Seljebotn escrigué:
> >> Jasper van de Gronde wrote:
> >>> Dag Sverre Seljebotn wrote:
> >>>> Jasper van de Gronde wrote:
> >>>>> I've attached a test file which shows the problem. It also tries
> >>>>> adding columns instead of rows (in case the memory layout is playing
> >>>>> tricks), but this seems to make no difference. This is the output I
> >>>>> got:
> >>>>>
> >>>>>      Dot product: 5.188786
> >>>>>      Add a row: 8.032767
> >>>>>      Add a column: 8.070953
> >>>>>
> >>>>> Any ideas on why adding a row (or column) of a matrix is slower than
> >>>>> computing a matrix product with a similarly sized matrix... (Xi has
> >>>>> less columns than Xi2, but just as many rows.)
> >>>>
> >>>> I think we need some numbers to put this into context -- how big are
> >>>> the vectors/matrices? How many iterations was the loop run? If the
> >>>> vectors are small and the loop is run many times, how fast the
> >>>> operation "ought" to be is irrelevant as it would drown in Python
> >>>
> >>> Originally I had attached a Python file demonstrating the problem, but
> >>> apparently this wasn't accepted by the list. In any case, the matrices
> >>> and vectors weren't too big (60x20), so I tried making them bigger and
> >>> indeed the "fast" version was now considerably faster.
> >>
> >> 60x20 is "nothing", so a full matrix multiplication or a single
> >> matrix-vector probably takes the same time (that is, the difference
> >> between them in itself is likely smaller than the error you make during
> >> measuring).
> >>
> >> In this context, the benchmarks will be completely dominated by the
> >> number of Python calls you make (each, especially taking the slice,
> >> means allocating Python objects, calling a bunch of functions in C, etc.
> >> etc). So it's not that strange, taking a slice isn't free, some Python
> >> objects must be created etc. etc.
> >
> > Yeah, I think taking slices here is taking quite a lot of time:
> >
> > In [58]: timeit E + Xi2[P/2,:]
> > 100000 loops, best of 3: 3.95 µs per loop
> >
> > In [59]: timeit E + Xi2[P/2]
> > 100000 loops, best of 3: 2.17 µs per loop
> >
> > don't know why the additional ',:' in the slice is taking so much time,
> > but my guess is that passing&  analyzing the second argument
> > (slice(None,None,None)) could be the responsible for the slowdown (but
> > that is taking too much time). Mmh, perhaps it would be worth to study
> > this more carefully so that an optimization could be done in NumPy.
> >
> >> I think the lesson mostly should be that with so little data,
> >> benchmarking becomes a very difficult art.
> >
> > Well, I think it is not difficult, it is just that you are perhaps
> > benchmarking Python/NumPy machinery instead ;-)  I'm curious whether
> > Matlab can do slicing much more faster than NumPy.  Jasper?
>
> What are using actually trying to test here?

Good question.  I don't know for sure :-)

> I do not see any equivalence in the operations or output here.
> -With your slices you need two dot products but ultimately you are only
> using one for your dot product.
> -There are addition operations on the slices that are not present in the
> dot product.
> -The final E arrays are not the same for all three operations.

I don't understand the ultimate goal of the OP either, but what caught my
attention was that:

In [74]: timeit Xi2[P/2]
1000000 loops, best of 3: 278 ns per loop

In [75]: timeit Xi2[P/2,:]
1000000 loops, best of 3: 1.04 µs per loop

i.e. adding an additional parameter (the ':') to the slice, drives the time to
run almost 4x slower.  And with this, it is *partially* explained the problem
exposed by OP, i.e.:

In [77]: timeit np.dot(Xi,w)
100000 loops, best of 3: 2.91 µs per loop

In [78]: timeit E + Xi2[P/2]
100000 loops, best of 3: 2.05 µs per loop

In [79]: timeit E + Xi2[P/2,:]
100000 loops, best of 3: 3.81 µs per loop

But again, don't ask me whether the results are okay or not.  I'm playing here
the role of a pure computational scientist on a very concrete problem ;-)

> Having said that, the more you can vectorize your function, the more
> efficient it will likely be especially with Atlas etc.

Except if your arrays are small enough, which is the underlying issue here
IMO.

--
Francesc Alted
```