[Numpy-discussion] Looking for people interested in helping with Python compiler to LLVM
Dag Sverre Seljebotn
Tue Mar 20 23:20:52 CDT 2012
On 03/20/2012 12:56 PM, Francesc Alted wrote:
> On Mar 20, 2012, at 2:29 PM, Dag Sverre Seljebotn wrote:
>> Francesc Alted<firstname.lastname@example.org> wrote:
>>> On Mar 20, 2012, at 12:49 PM, mark florisson wrote:
>>>>> Cython and Numba certainly overlap. However, Cython requires:
>>>>> 1) learning another language
>>>>> 2) creating an extension module --- loading bit-code files
>>> and dynamically executing (even on a different machine from the one
>>> that initially created them) can be a powerful alternative for run-time
>>> compilation and distribution of code.
>>>>> These aren't show-stoppers obviously. But, I think some users
>>> would prefer an even simpler approach to getting fast-code than Cython
>>> (which currently doesn't do enought type-inference and requires
>>> building a dlopen extension module).
>>>> Dag and I have been discussing this at PyCon, and here is my take on
>>>> it (at this moment :).
>>>> Definitely, if you can avoid Cython then that is easier and more
>>>> desirable in many ways. So perhaps we can create a third project
>>>> called X (I'm not very creative, maybe ArrayExprOpt), that takes an
>>>> abstract syntax tree in a rather simple form, performs code
>>>> optimizations such as rewriting loops with array accesses to vector
>>>> expressions, fusing vector expressions and loops, etc, and spits out
>>>> transformed AST containing these optimizations. If runtime
>>>> is given such as actual shape and stride information the
>>>> transformations could figure out there and then whether to do things
>>>> like collapsing, axes swapping, blocking (as in, introducing more
>>>> or loops to retain discontiguous blocks in the cache), blocked memory
>>>> copies to contiguous chunks, etc. The AST could then also say whether
>>>> the final expressions are vectorizable. Part of this functionality is
>>>> already in numpy's nditer, except that this would be implicit and do
>>>> more (and hopefully with minimal overhead).
>>>> So numba, Cython and maybe numexpr could use the functionality,
>>>> by building the AST from Python and converting back (if necessary) to
>>>> its own AST. As such, the AST optimizer would be only part of any
>>>> (runtime) compiler's pipeline, and it should be very flexible to
>>>> retain any information (metadata regarding actual types, control flow
>>>> information, etc) provided by the original AST. It would not do
>>>> control flow analysis, type inference or promotion, etc, but only
>>>> with abstract types like integers, reals and arrays (C, Fortran or
>>>> partly contiguous or strided). It would not deal with objects, but
>>>> would allow to insert nodes like UnreorderableNode and SideEffectNode
>>>> wrapping parts of the original AST. In short, it should be as easy as
>>>> possible to convert from an original AST to this project's AST and
>>>> back again afterwards.
>>> I think this is a very interesting project, and certainly projects like
>>> numba can benefit of it. So, in order to us have an idea on what you
>>> are after, can we assume that your project (call it X) would be kind of
>>> an compiler optimizer, and then the produced, optimized code could be
>>> feed into numba for optimized LLVM code generation (that on its turn,
>>> can be run on top of CPUs or GPUs or a combination)? Is that correct?
>> I think so. Another way of thinking about it is that it is a reimplementation of the logic in the (good and closed source) Fortran 90 compilers, in a reusable component for inclusion in various compilers.
>> Various c++ metaprogramming libraries (like Blitz++) are similar too.
> Aha, thanks.
>>> Giving that my interpretation above is correct, it is bit more
>>> difficult to me to see how your X project could be of benefit for
>>> numexpr. In fact, I actually see this the other way round: once the
>>> optimizer has discovered the vectorization parts, then go one step
>>> further and generate code that uses numexpr automatically (call this,
>>> vectorization through numexpr). This is what you mean, or I'm missing
>> No. I think in some ways this is a competitor to numexpr -- you would gut out the middle of numexpr and keep the frontend and backend, but use this to optimize iteration order and blocking strategies.
> I see. Yes, I can easily see Mark's project X + numba more as a competitor (killer?) to numexpr too.
>> I think the goal is for higher performance than what I understand numexpr can provide (in some cases, not all!). For instance, can numexpr deal well with
>> a + a.T
>> where a is a c-contiguous array? Any numpy-like iteration order will not work well, one needs to use higher-dimensional (eg 2D) blocking, not 1D blocking.
> No. numexpr cannot deal with the above problem efficiently. numexpr is about 1d blocking, so its approach is pretty naive (but effective for these 1d blocking tasks).
>> (if numexpr can do this then great, the task might then reduce to refactoring numexpr so that cython and numba can use the same logic)
> Well, now that I think about it, the virtual machine in latest numexpr (2.0) is based on the new nditer iterator included in NumPy 1.6, so I'm wondering if with a little bit of more effort, the 2d blocking could be implemented. While I'm pretty sure this is the case, I don't know if it would be really worth the effort. Perhaps concentrating on things like numba + projectX, or just Theano would be better targets. Perhaps Mark Wiebe (who contributed the new nditer-aware VM) could say a bit more about this.
My guess is that the answer is that nditer works well in many
situations, but can be sub-optimal for some array shapes, particularly
if the arrays fit in cache.
Here's a contrived bad case (not sure if it is the worst): Consider
where arr is C-contiguous with shape (n, 2), and let n be such that the
array fits in L2 cache :-)
The ::2 ensures that the array can't be flattened. Any nditer approach,
as I understand it, would need to execute the inner subroutine for each
row of 4 elements, and then invoke the iteration machinery, which I
imagine is quite slow compared to LLVM code generated specifically for
this array, which could even unroll the inner 4-element loop.
If one really wants to compete with Fortran, one must take into account
that the scientific end-user may already be structuring the computation
in a cache-efficient way. It doesn't always help with good performance
"as long as arrays are >10 MB".
If the array is large, the memory bottleneck should makes a lot of the
CPU overhead of nditer irrelevant even for worst-case arrays; though I'd
be curious to see benchmarks of how much overhead is left, and my
(rather unfounded) suspicion is that hard-wired compiled LLVM code for a
specific array should play better with the CPUs cache predictor to
better mask memory access latencies (?)
(Mark F., apropos benchmarks, let me know what kind of different
hardware you have access to; I could see if I can give you access to a
couple of boxes with different memory bus characteristics, i.e., Intel
vs. AMD and so on.)
More information about the NumPy-Discussion