[Numpy-discussion] Looking for people interested in helping with Python compiler to LLVM
Dag Sverre Seljebotn
Tue Mar 20 23:22:07 CDT 2012
On 03/20/2012 09:20 PM, Dag Sverre Seljebotn wrote:
> On 03/20/2012 12:56 PM, Francesc Alted wrote:
>> On Mar 20, 2012, at 2:29 PM, Dag Sverre Seljebotn wrote:
>>> Francesc Alted <email@example.com> wrote:
>>>> On Mar 20, 2012, at 12:49 PM, mark florisson wrote:
>>>>>> Cython and Numba certainly overlap. However, Cython requires:
>>>>>> 1) learning another language
>>>>>> 2) creating an extension module --- loading bit-code files
>>>>>> and dynamically executing (even on a different machine from the one
>>>>>> that initially created them) can be a powerful alternative for run-time
>>>>>> compilation and distribution of code.
>>>>>> These aren't show-stoppers obviously. But, I think some users
>>>>>> would prefer an even simpler approach to getting fast code than Cython
>>>>>> (which currently doesn't do enough type inference and requires
>>>>>> building a dlopen extension module).
>>>>> Dag and I have been discussing this at PyCon, and here is my take on
>>>>> it (at this moment :).
>>>>> Definitely, if you can avoid Cython then that is easier and more
>>>>> desirable in many ways. So perhaps we can create a third project
>>>>> called X (I'm not very creative, maybe ArrayExprOpt), that takes an
>>>>> abstract syntax tree in a rather simple form, performs code
>>>>> optimizations such as rewriting loops with array accesses to vector
>>>>> expressions, fusing vector expressions and loops, etc, and spits out
>>>>> a transformed AST containing these optimizations. If runtime
>>>>> information is given, such as actual shape and stride information, the
>>>>> transformations could figure out there and then whether to do things
>>>>> like collapsing, axis swapping, blocking (as in, introducing more
>>>>> axes or loops to retain discontiguous blocks in the cache), blocked memory
>>>>> copies to contiguous chunks, etc. The AST could then also say whether
>>>>> the final expressions are vectorizable. Part of this functionality is
>>>>> already in numpy's nditer, except that this would be implicit and do
>>>>> more (and hopefully with minimal overhead).
>>>>> So numba, Cython and maybe numexpr could use this functionality
>>>>> by building the AST from Python and converting back (if necessary) to
>>>>> its own AST. As such, the AST optimizer would be only part of any
>>>>> (runtime) compiler's pipeline, and it should be very flexible to
>>>>> retain any information (metadata regarding actual types, control flow
>>>>> information, etc) provided by the original AST. It would not do
>>>>> control flow analysis, type inference or promotion, etc., but would only
>>>>> deal with abstract types like integers, reals and arrays (C, Fortran or
>>>>> partly contiguous or strided). It would not deal with objects, but
>>>>> would allow inserting nodes like UnreorderableNode and SideEffectNode
>>>>> wrapping parts of the original AST. In short, it should be as easy as
>>>>> possible to convert from an original AST to this project's AST and
>>>>> back again afterwards.
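
To make this concrete, here is a minimal sketch of what such an abstract
AST and optimizer entry point could look like. Every name below
(ArrayType, ElementwiseOp, SideEffectNode, optimize) is a hypothetical
illustration of the proposal, not an existing API:

    # Hypothetical node set for "project X"; all names are invented.
    class ArrayType:
        # contig is 'C', 'F' or 'strided', mirroring the abstract
        # array types described above.
        def __init__(self, dtype, ndim, contig='strided'):
            self.dtype, self.ndim, self.contig = dtype, ndim, contig

    class ElementwiseOp:
        # An elementwise expression such as a + b; a candidate for
        # fusion with neighbouring ops and loops.
        def __init__(self, op, operands):
            self.op, self.operands = op, operands

    class SideEffectNode:
        # Wraps a part of the original AST that must not be reordered
        # or fused; the optimizer treats it as an opaque barrier.
        def __init__(self, payload):
            self.payload = payload

    def optimize(tree, shape=None, strides=None):
        """Return a transformed tree with fused/blocked loops. If
        concrete shape/stride info is supplied at runtime, choose axis
        collapsing, axis swapping and block sizes here and now."""
        raise NotImplementedError  # sketch only

A frontend (numba, Cython, numexpr) would lower its own AST into these
nodes, call optimize(), and map the result back into its own pipeline.
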
>>>> I think this is a very interesting project, and certainly projects like
>>>> numba can benefit from it. So, in order for us to have an idea of what
>>>> you are after, can we assume that your project (call it X) would be a
>>>> kind of compiler optimizer, and that the produced, optimized code could
>>>> be fed into numba for optimized LLVM code generation (which, in turn,
>>>> can be run on top of CPUs or GPUs or a combination)? Is that correct?
>>> I think so. Another way of thinking about it is that it is a reimplementation of the logic in the (good and closed-source) Fortran 90 compilers, as a reusable component for inclusion in various compilers.
>>> Various C++ metaprogramming libraries (like Blitz++) are similar too.
>> Aha, thanks.
>>>> Given that my interpretation above is correct, it is a bit more
>>>> difficult for me to see how your X project could be of benefit to
>>>> numexpr. In fact, I actually see this the other way round: once the
>>>> optimizer has discovered the vectorizable parts, go one step
>>>> further and generate code that uses numexpr automatically (call this
>>>> vectorization through numexpr). Is this what you mean, or am I missing
>>>> something?
>>> No. I think in some ways this is a competitor to numexpr -- you would gut out the middle of numexpr and keep the frontend and backend, but use this to optimize iteration order and blocking strategies.
>> I see. Yes, I can easily see Mark's project X + numba more as a competitor (killer?) to numexpr too.
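
For readers unfamiliar with numexpr: it takes an expression as a string,
compiles it for its internal virtual machine, and evaluates it over the
operands one 1D block at a time so the working set stays in cache. A
standard usage example:

    import numpy as np
    import numexpr as ne

    a = np.arange(1e6)
    b = np.arange(1e6)
    # Evaluated chunk by chunk (1D blocking) rather than allocating a
    # full-size temporary for each intermediate, as plain NumPy would.
    c = ne.evaluate("2*a + b**2")
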
>>> I think the goal is higher performance than what I understand numexpr can provide (in some cases, not all!). For instance, can numexpr deal well with
>>> a + a.T
>>> where a is a C-contiguous array? Any NumPy-like iteration order will not work well; one needs higher-dimensional (e.g. 2D) blocking, not 1D blocking.
>> No. numexpr cannot deal with the above problem efficiently. numexpr is about 1D blocking, so its approach is pretty naive (but effective for these 1D blocking tasks).
>>> (if numexpr can do this then great, the task might then reduce to refactoring numexpr so that cython and numba can use the same logic)
>> Well, now that I think about it, the virtual machine in latest numexpr (2.0) is based on the new nditer iterator included in NumPy 1.6, so I'm wondering if with a little bit of more effort, the 2d blocking could be implemented. While I'm pretty sure this is the case, I don't know if it would be really worth the effort. Perhaps concentrating on things like numba + projectX, or just Theano would be better targets. Perhaps Mark Wiebe (who contributed the new nditer-aware VM) could say a bit more about this.
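
As an illustration of the 2D blocking being discussed, here is a sketch
of evaluating a + a.T tile by tile for a square array a (the function
name and the tile size of 64 are arbitrary choices for this example):

    import numpy as np

    def add_transpose_blocked(a, bs=64):
        # Compute a + a.T one bs-by-bs tile at a time, so that both the
        # row-wise reads of a and the column-wise (transposed) reads
        # stay in cache instead of striding across the whole array.
        n = a.shape[0]  # assumes a is square
        out = np.empty_like(a)
        for i in range(0, n, bs):
            for j in range(0, n, bs):
                out[i:i+bs, j:j+bs] = (a[i:i+bs, j:j+bs]
                                       + a[j:j+bs, i:i+bs].T)
        return out

Any 1D iteration order reads one operand contiguously and the other with
a large stride; the tiles keep both access patterns cache-friendly.
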
> My guess is that the answer is that nditer works well in many
> situations, but can be sub-optimal for some array shapes, particularly
> if the arrays fit in cache.
> Here's a contrived bad case (not sure if it is the worst): Consider
> arr[::2, :]
> where arr is C-contiguous with shape (n, 2), and let n be such that the
> array fits in L2 cache :-)
> The ::2 ensures that the array can't be flattened. Any nditer approach,
> as I understand it, would need to execute the inner subroutine for each
> row of 2 elements, and then invoke the iteration machinery, which I
> imagine is quite slow compared to LLVM code generated specifically for
> this array, which could even unroll the inner 2-element loop.
> If one really wants to compete with Fortran, one must take into account
> that the scientific end-user may already be structuring the computation
> in a cache-efficient way. It doesn't always help to have good
> performance "as long as arrays are >10 MB".
> If the array is large, the memory bottleneck should make a lot of the
> CPU overhead of nditer irrelevant even for worst-case arrays; though I'd
> be curious to see benchmarks of how much overhead is left, and my
> (rather unfounded) suspicion is that hard-wired compiled LLVM code for a
> specific array should play better with the CPU's cache predictor to
> better mask memory access latencies (?)
> (Mark F., apropos benchmarks, let me know what kind of different
> hardware you have access to; I could see if I can give you access to a
> couple of boxes with different memory bus characteristics, i.e., Intel
> vs. AMD and so on.)