[Numpy-discussion] aligned / unaligned structured dtype behavior

Frédéric Bastien nouiz@nouiz....
Thu Mar 7 12:26:27 CST 2013


Hi,

It is normal that unaligned accesses are slower. The hardware has been
optimized for aligned access, so this is a user choice: space vs. speed.
We can't get around that. We can only minimize the cost of unaligned
access in some cases, but not all, and those optimizations depend on the
CPU. Newer CPUs have lowered the cost of unaligned access, though.
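
To make the space vs. speed trade-off concrete, here is a small sketch that
prints the layout of the two dtypes used in this thread. It only uses
standard NumPy attributes (itemsize, fields, flags.aligned); the array
length of 1000 is an arbitrary choice, and the exact padding of the aligned
dtype depends on the platform.

```python
import numpy as np

# Same two dtypes as in the quoted timings: one padded, one packed.
aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)
packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)

layout = {}
for name, dt in [('aligned', aligned_dt), ('packed', packed_dt)]:
    arr = np.ones(1000, dtype=dt)      # arbitrary length, for illustration
    offset_b = dt.fields['b'][1]       # byte offset of field 'b' in a record
    layout[name] = (dt.itemsize, offset_b, arr['b'].flags.aligned)
    print(name, 'itemsize =', dt.itemsize,
          '| offset of b =', offset_b,
          '| b aligned =', arr['b'].flags.aligned)
```

The packed dtype uses fewer bytes per record, but its 'b' column starts at
an odd offset, which is what NumPy reports through flags.aligned.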

I'm surprised that Theano worked with the unaligned input. I added
some checks to make this raise an error, as we do not support that!
Francesc, can you check whether Theano gives the correct result? It is
possible that someone (maybe me) just copies the input to an aligned
ndarray when we receive an unaligned one. That could explain why it
worked, but my memory tells me that we raise an error.
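
For illustration, copying an unaligned input to an aligned ndarray could
look roughly like the sketch below. This is not Theano's actual code;
ensure_aligned is a hypothetical helper name, and it relies only on the
flags.aligned attribute and on the fact that a fresh copy gets newly
allocated, aligned memory.

```python
import numpy as np

def ensure_aligned(arr):
    """Return arr unchanged if it is aligned, otherwise an aligned copy.

    Hypothetical helper for illustration; not part of NumPy or Theano.
    """
    arr = np.asarray(arr)
    if arr.flags.aligned:
        return arr
    # A fresh contiguous copy lives in newly allocated, aligned memory.
    return arr.copy()

packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)
bpacked = np.ones(1000, dtype=packed_dt)['b']   # unaligned column view

fixed = ensure_aligned(bpacked)
print('input aligned:', bpacked.flags.aligned)
print('result aligned:', fixed.flags.aligned)
```

NumPy's np.require(arr, requirements=['ALIGNED']) serves a similar purpose.
The copy costs time and memory, so it only pays off when the aligned data
is reused or the computation is expensive enough.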

As you saw in the numbers, this is a bad example for Theano, as the
compiled function is too fast. There is more Theano overhead than
computation time in that example. We have recently reduced the
overhead, but we can do more to lower it.
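
One rough way to see how much of a timing is fixed per-call overhead rather
than computation is to time the same function on a tiny and on a large
input. The sketch below uses plain NumPy and the standard timeit module;
square is a hypothetical stand-in for a compiled function such as the Theano
one quoted below, and per_call_seconds and the repeat counts are arbitrary
choices made for this example.

```python
import timeit
import numpy as np

def square(x):
    # Hypothetical stand-in for a compiled function (e.g. a Theano function).
    return x ** 2

def per_call_seconds(func, arg, number=100, repeat=3):
    """Best-of-`repeat` average time, in seconds, of one call to func(arg)."""
    timer = timeit.Timer(lambda: func(arg))
    return min(timer.repeat(repeat=repeat, number=number)) / number

tiny = np.ones(1)        # almost no computation: mostly call overhead
large = np.ones(10**6)   # same length as the arrays in this thread

overhead = per_call_seconds(square, tiny)
total = per_call_seconds(square, large)
print('approx. per-call overhead: %.2e s' % overhead)
print('time on 10**6 elements:    %.2e s' % total)
print('overhead share:            %.1f%%' % (100.0 * overhead / total))
```

When the overhead share is large, the benchmark says more about the calling
machinery than about aligned vs. unaligned access.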

Fred

On Thu, Mar 7, 2013 at 1:06 PM, Francesc Alted <francesc@continuum.io> wrote:
> On 3/7/13 6:47 PM, Francesc Alted wrote:
>> On 3/6/13 7:42 PM, Kurt Smith wrote:
>>> And regarding performance, doing simple timings shows a 30%-ish
>>> slowdown for unaligned operations:
>>>
>>> In [36]: %timeit packed_arr['b']**2
>>> 100 loops, best of 3: 2.48 ms per loop
>>>
>>> In [37]: %timeit aligned_arr['b']**2
>>> 1000 loops, best of 3: 1.9 ms per loop
>>
>> Hmm, that clearly depends on the architecture.  On my machine:
>>
>> In [1]: import numpy as np
>>
>> In [2]: aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)
>>
>> In [3]: packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)
>>
>> In [4]: aligned_arr = np.ones((10**6,), dtype=aligned_dt)
>>
>> In [5]: packed_arr = np.ones((10**6,), dtype=packed_dt)
>>
>> In [6]: baligned = aligned_arr['b']
>>
>> In [7]: bpacked = packed_arr['b']
>>
>> In [8]: %timeit baligned**2
>> 1000 loops, best of 3: 1.96 ms per loop
>>
>> In [9]: %timeit bpacked**2
>> 100 loops, best of 3: 7.84 ms per loop
>>
>> That is, the unaligned column is 4x slower (!).  numexpr allows
>> somewhat better results:
>>
>> In [11]: %timeit numexpr.evaluate('baligned**2')
>> 1000 loops, best of 3: 1.13 ms per loop
>>
>> In [12]: %timeit numexpr.evaluate('bpacked**2')
>> 1000 loops, best of 3: 865 us per loop
>
> Just for completeness, here is what Theano gets:
>
> In [18]: import theano
>
> In [20]: a = theano.tensor.vector()
>
> In [22]: f = theano.function([a], a**2)
>
> In [23]: %timeit f(baligned)
> 100 loops, best of 3: 7.74 ms per loop
>
> In [24]: %timeit f(bpacked)
> 100 loops, best of 3: 12.6 ms per loop
>
> So yeah, Theano is also slower for the unaligned case (but less than 2x
> in this case).
>
>>
>> Yes, in this case, the unaligned array goes faster (as much as 30%).
>> I think the reason is that numexpr optimizes the unaligned access by
>> doing a copy of the different chunks in internal buffers that fit in
>> L1 cache.  Apparently this is very beneficial in this case (not sure
>> why, though).
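
The chunked-copy idea described above can be sketched in plain NumPy as
follows. This is only an illustration of the general technique, not
numexpr's real implementation: chunked_square and the chunk size are
assumptions (4096 eight-byte items is 32 KiB, a common L1 data cache size).

```python
import numpy as np

def chunked_square(src, chunk=4096):
    """Square a 1-D array chunk by chunk through a small aligned buffer.

    Illustration only; the name and chunk size are assumptions, not
    numexpr's code or parameters.
    """
    out = np.empty(src.shape, dtype=src.dtype)
    buf = np.empty(chunk, dtype=src.dtype)    # freshly allocated, so aligned
    for start in range(0, src.shape[0], chunk):
        stop = min(start + chunk, src.shape[0])
        n = stop - start
        buf[:n] = src[start:stop]             # copy the unaligned chunk once
        # Compute on the aligned scratch buffer, writing into the output.
        np.multiply(buf[:n], buf[:n], out=out[start:stop])
    return out

packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)
packed_arr = np.zeros(10000, dtype=packed_dt)
packed_arr['b'] = np.arange(10000)

result = chunked_square(packed_arr['b'])
print(result[:5])
```

Whether this beats operating on the unaligned column directly depends on
the CPU and on the operation, so it should be timed rather than assumed.
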
>>
>>>
>>> Whereas summing shows just a 10%-ish slowdown:
>>>
>>> In [38]: %timeit packed_arr['b'].sum()
>>> 1000 loops, best of 3: 1.29 ms per loop
>>>
>>> In [39]: %timeit aligned_arr['b'].sum()
>>> 1000 loops, best of 3: 1.14 ms per loop
>>
>> On my machine:
>>
>> In [14]: %timeit baligned.sum()
>> 1000 loops, best of 3: 1.03 ms per loop
>>
>> In [15]: %timeit bpacked.sum()
>> 100 loops, best of 3: 3.79 ms per loop
>>
>> Again, the 4x slowdown is here.  Using numexpr:
>>
>> In [16]: %timeit numexpr.evaluate('sum(baligned)')
>> 100 loops, best of 3: 2.16 ms per loop
>>
>> In [17]: %timeit numexpr.evaluate('sum(bpacked)')
>> 100 loops, best of 3: 2.08 ms per loop
>
> And with Theano:
>
> In [26]: f2 = theano.function([a], a.sum())
>
> In [27]: %timeit f2(baligned)
> 100 loops, best of 3: 2.52 ms per loop
>
> In [28]: %timeit f2(bpacked)
> 100 loops, best of 3: 7.43 ms per loop
>
> Again, the unaligned case is significantly slower (as much as 3x here!).
>
> --
> Francesc Alted
>

