[Numpy-discussion] aligned / unaligned structured dtype behavior

Kurt Smith kwmsmith@gmail....
Thu Mar 7 21:28:22 CST 2013


On Thu, Mar 7, 2013 at 12:26 PM, Frédéric Bastien <nouiz@nouiz.org> wrote:
> Hi,
>
> It is normal that unaligned accesses are slower. The hardware has been
> optimized for aligned access, so this is a user trade-off: space vs. speed.

The quantitative difference is still important, so I think this thread is
useful for future reference.  If reading data into a packed array is 3x
faster than reading into an aligned array, but the core computation is 4x
slower with the packed array... you get the idea.
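(With made-up numbers: a 3 s packed read vs. a 9 s aligned read saves 6 s,
but if the aligned computation takes 30 s, the packed one takes roughly
120 s, and the IO win is gone after the first pass over the data.)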

I would have benefited years ago from knowing that (1) numpy structured
dtypes are packed by default, and (2) computations on unaligned data can be
several times slower than on aligned data.  That's strong motivation to
always pass 'align=True' except when memory usage is an issue, or for file
IO with packed binary data, etc.
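
For anyone else hitting this later, here is a minimal sketch of the packed
vs. aligned layout (the itemsize/offset values in the comments are what I'd
expect on a typical platform, so treat them as indicative):

import numpy as np

# Packed (the default): fields are butted right up against each other.
packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)

# Aligned: padding is inserted so 'b' starts on its natural 8-byte boundary.
aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)

print(packed_dt.itemsize, packed_dt.fields['b'][1])    # typically 9, offset 1
print(aligned_dt.itemsize, aligned_dt.fields['b'][1])  # typically 16, offset 8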

> We can't get around that. We can only minimize the cost of unaligned
> access in some cases, but not all, and those optimizations depend on the
> CPU. Newer CPUs have lowered the cost of unaligned access, though.
>
> I'm surprised that Theano worked with the unaligned input. I added
> some checks to make this raise an error, as we do not support that!
> Francesc, can you check whether Theano gives the correct result? It is
> possible that someone (maybe me) just copies the input to an aligned
> ndarray when we receive an unaligned one. That could explain why it
> worked, but my memory tells me that we raise an error.
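
For what it's worth, this is easy to detect and fix on the numpy side
before handing data to Theano. A minimal sketch (np.require should copy
into a fresh, aligned buffer only when the input doesn't already satisfy
the requirement):

import numpy as np

# Same packed column as in Francesc's transcript below.
bpacked = np.ones(10**6, dtype=np.dtype([('a', 'i1'), ('b', 'i8')]))['b']

print(bpacked.flags.aligned)        # expect False: 'b' sits at an odd offset
aligned_copy = np.require(bpacked, requirements=['ALIGNED'])
print(aligned_copy.flags.aligned)   # True
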
>
> As you saw in the numbers, this is a bad example for Theano, as the
> compiled function is too fast. There is more Theano overhead than
> computation time in that example. We have recently reduced the
> overhead, but we can do more to lower it.
>
> Fred
>
> On Thu, Mar 7, 2013 at 1:06 PM, Francesc Alted <francesc@continuum.io> wrote:
>> On 3/7/13 6:47 PM, Francesc Alted wrote:
>>> On 3/6/13 7:42 PM, Kurt Smith wrote:
>>>> And regarding performance, doing simple timings shows a 30%-ish
>>>> slowdown for unaligned operations:
>>>>
>>>> In [36]: %timeit packed_arr['b']**2
>>>> 100 loops, best of 3: 2.48 ms per loop
>>>>
>>>> In [37]: %timeit aligned_arr['b']**2
>>>> 1000 loops, best of 3: 1.9 ms per loop
>>>
>>> Hmm, that clearly depends on the architecture.  On my machine:
>>>
>>> In [1]: import numpy as np
>>>
>>> In [2]: aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)
>>>
>>> In [3]: packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)
>>>
>>> In [4]: aligned_arr = np.ones((10**6,), dtype=aligned_dt)
>>>
>>> In [5]: packed_arr = np.ones((10**6,), dtype=packed_dt)
>>>
>>> In [6]: baligned = aligned_arr['b']
>>>
>>> In [7]: bpacked = packed_arr['b']
>>>
>>> In [8]: %timeit baligned**2
>>> 1000 loops, best of 3: 1.96 ms per loop
>>>
>>> In [9]: %timeit bpacked**2
>>> 100 loops, best of 3: 7.84 ms per loop
>>>
>>> That is, the unaligned column is 4x slower (!).  numexpr gives
>>> somewhat better results:
>>>
>>> In [11]: %timeit numexpr.evaluate('baligned**2')
>>> 1000 loops, best of 3: 1.13 ms per loop
>>>
>>> In [12]: %timeit numexpr.evaluate('bpacked**2')
>>> 1000 loops, best of 3: 865 us per loop
>>
>> Just for completeness, here it is what Theano gets:
>>
>> In [18]: import theano
>>
>> In [20]: a = theano.tensor.vector()
>>
>> In [22]: f = theano.function([a], a**2)
>>
>> In [23]: %timeit f(baligned)
>> 100 loops, best of 3: 7.74 ms per loop
>>
>> In [24]: %timeit f(bpacked)
>> 100 loops, best of 3: 12.6 ms per loop
>>
>> So yeah, Theano is also slower in the unaligned case (though by less
>> than 2x here).
>>
>>>
>>> Yes, in this case the unaligned array actually goes faster (by as much
>>> as 30%).  I think the reason is that numexpr handles the unaligned
>>> access by copying chunks into internal buffers that fit in the L1
>>> cache.  Apparently this is very beneficial in this case (not sure
>>> why, though).
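
If I'm reading that right, the effect can be mimicked by hand with blocked
evaluation. A rough sketch only (the 4096-element chunk is just a guess at
something that fits in L1, and numexpr's real blocking is more
sophisticated):

import numpy as np

def blocked_square(x, chunk=4096):
    # Square a possibly-unaligned 1-d array block by block: each block is
    # first copied into a fresh, aligned scratch buffer, and the
    # arithmetic runs on the aligned copy.
    out = np.empty(x.shape, dtype=x.dtype)
    for start in range(0, x.shape[0], chunk):
        block = np.array(x[start:start + chunk])   # aligned copy of the block
        out[start:start + chunk] = block * block
    return out
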
>>>
>>>>
>>>> Whereas summing shows just a 10%-ish slowdown:
>>>>
>>>> In [38]: %timeit packed_arr['b'].sum()
>>>> 1000 loops, best of 3: 1.29 ms per loop
>>>>
>>>> In [39]: %timeit aligned_arr['b'].sum()
>>>> 1000 loops, best of 3: 1.14 ms per loop
>>>
>>> On my machine:
>>>
>>> In [14]: %timeit baligned.sum()
>>> 1000 loops, best of 3: 1.03 ms per loop
>>>
>>> In [15]: %timeit bpacked.sum()
>>> 100 loops, best of 3: 3.79 ms per loop
>>>
>>> Again, the 4x slowdown is here.  Using numexpr:
>>>
>>> In [16]: %timeit numexpr.evaluate('sum(baligned)')
>>> 100 loops, best of 3: 2.16 ms per loop
>>>
>>> In [17]: %timeit numexpr.evaluate('sum(bpacked)')
>>> 100 loops, best of 3: 2.08 ms per loop
>>
>> And with Theano:
>>
>> In [26]: f2 = theano.function([a], a.sum())
>>
>> In [27]: %timeit f2(baligned)
>> 100 loops, best of 3: 2.52 ms per loop
>>
>> In [28]: %timeit f2(bpacked)
>> 100 loops, best of 3: 7.43 ms per loop
>>
>> Again, the unaligned case is significantly slower (as much as 3x here!).
>>
>> --
>> Francesc Alted
>>

