The PyOP2 Intermediate Representation
The parallel loop is the main construct of PyOP2. It applies a specific
Kernel to all elements in the iteration set of the parallel loop. Here, we
describe how to use the PyOP2 API to build a kernel and provide simple
guidelines on how to write efficient kernels.
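Conceptually, a parallel loop can be pictured with the following pure-Python sketch. This is an illustration only, not the actual PyOP2 API: the `par_loop` and `init` functions here are hypothetical stand-ins.

```python
# Conceptual sketch of a parallel loop (not the real PyOP2 API): the kernel
# is applied independently to every element of the iteration set, so the
# runtime is free to execute the iterations in any order, or in parallel.
def par_loop(kernel, iteration_set, data):
    for element in iteration_set:
        kernel(element, data)

# A kernel analogous to the init example that follows: zero the entry
# associated with each element of the set.
def init(element, weights):
    weights[element] = 0.0

edge_weights = [1.5, 2.5, 3.5]
par_loop(init, range(3), edge_weights)
# edge_weights is now [0.0, 0.0, 0.0]
```

Because each iteration is independent, PyOP2 can target different backends (sequential, OpenMP, CUDA, ...) with the same kernel.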

Using the Intermediate Representation
In the previous section, we described the API for PyOP2 kernels in terms of
the C code that gets executed. Passing in a string of C code is the simplest
way of creating a Kernel. Another possibility is to use PyOP2 Intermediate
Representation (IR) objects to express the Kernel semantics.
An Abstract Syntax Tree (AST) of the kernel code can be built manually from
IR objects. Since PyOP2 is primarily designed to be driven by higher layers
of abstraction, rather than directly by users, no C-to-AST parser is
currently provided. The advantage of providing an AST instead of C code is
that it enables PyOP2 to inspect and transform the kernel, with the aim of
achieving performance portability across different architectures and, more
generally, better execution times.
For the purposes of exposition, let us consider a simple kernel init which
initialises the entries of a Dat to zero.
from op2 import Kernel

code = """void init(double* edge_weight) {
  for (int i = 0; i < 3; i++)
    edge_weight[i] = 0.0;
}"""
kernel = Kernel(code, "init")

Here, we describe how PyOP2 IR objects can be used to build an AST for this
kernel. For example, the most basic AST one can come up with is
from op2 import Kernel
from ir.ast_base import *

ast = FlatBlock("""void init(double* edge_weight) {
  for (int i = 0; i < 3; i++)
    edge_weight[i] = 0.0;
}""")
kernel = Kernel(ast, "init")

The FlatBlock object encapsulates a “flat” block of code, which is not
modified by the IR engine. A FlatBlock is used to represent (possibly large)
fragments of code for which we are not interested in any kind of
transformation, so it may be particularly useful to speed up code development
when writing, for example, test cases or inexpensive kernels. On the other
hand, time-demanding kernels should be properly represented using a “real”
AST. For example, a useful AST for init could be the following
from op2 import Kernel
from ir.ast_base import *

ast_body = [FlatBlock("...some code can go here..."),
            c_for("i", 3, Assign(Symbol("edge_weight", ("i",)), c_sym("0.0")))]
ast = FunDecl("void", "init",
              [Decl("double*", c_sym("edge_weight"))],
              ast_body)
kernel = Kernel(ast, "init")

In this example, we first construct the body of the kernel function. We have
an initial FlatBlock that contains, for instance, some sort of initialisation
code. c_for() is a shortcut for building a for loop. It takes an iteration
variable (i), the extent of the loop and its body; multiple statements in the
body can be passed in as a list. c_sym() is a shortcut for building symbols.
You may want to use c_sym() when the symbol makes no explicit use of
iteration variables. We use Symbol instead of c_sym() when edge_weight
accesses a specific element through the iteration variable i. This is
fundamental to allow the IR engine to perform many kinds of transformations
involving the kernel’s iteration space(s). Finally, the signature of the
function is constructed using FunDecl.
Other examples of how to build ASTs can be found in the tests folder, in
particular in test_matrices.py and test_iteration_space_dats.py.


Optimizing kernels on CPUs
So far, some effort has been spent on optimizations for CPU platforms. Being
a DSL, PyOP2 provides specific support for those (linear algebra) operations
that are common among unstructured-mesh-based numerical methods. For example,
PyOP2 is capable of aggressively optimizing local assembly codes for
applications based on the Finite Element Method. We therefore distinguish two
categories of optimizations:

Generic optimizations, such as data alignment and support for autovectorization.
Domain-specific optimizations (DSO).

To trigger DSOs, statements must be decorated using the kernel DSL. For
example, if the kernel computes the local assembly of an element in an
unstructured mesh, then a pragma pyop2 assembly(itvar1, itvar2) should be
added on top of the corresponding statement. When constructing the AST of a
kernel, this can simply be achieved by
from ir.ast_base import *

s1 = Symbol("X", ("i",))
s2 = Symbol("Y", ("j",))
tensor = Symbol("A", ("i", "j"))
pragma = "#pragma pyop2 assembly(i,j)"
code = c_for("i", 3, c_for("j", 3, Incr(tensor, Prod(s1, s2), pragma)))

This, conceptually, corresponds to
#pragma pyop2 itspace
for (int i = 0; i < 3; i++)
  #pragma pyop2 itspace
  for (int j = 0; j < 3; j++)
    #pragma pyop2 assembly(i, j)
    A[i][j] += X[i]*Y[j]

Visiting the AST, PyOP2 finds a 2-dimensional iteration space and an assembly
statement. Currently, #pragma pyop2 itspace is ignored when the backend is a
CPU, while #pragma pyop2 assembly(i, j) can trigger multiple DSOs.
PyOP2 currently lacks an autotuning system that automatically finds the best
possible kernel implementation, that is, the set of optimizations that
minimizes the kernel run-time. To drive the optimization process, the user
(or the higher layer) can specify which optimizations should be applied.
Currently, PyOP2 can automate:

Alignment and padding of data structures: for issuing aligned loads and stores.
Loop trip count adjustment according to padding: useful for autovectorization
when the trip count is not a multiple of the vector length.
Loop-invariant code motion and autovectorization of invariant code: this is
particularly useful since trip counts are typically small, and hoisted code
can still represent a significant proportion of the execution time.
Register tiling for rectangular iteration spaces.
(DSO for pragma assembly) Outer-product vectorization + unroll-and-jam of
outer loops to improve register re-use or to mitigate register pressure.

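Loop-invariant code motion, for example, can be pictured with a pure-Python sketch. The function names are hypothetical and for exposition only; the actual transformation is applied to the generated C code.

```python
# Illustration of loop-invariant code motion (licm); names are hypothetical.
# The sub-expression alpha * X[i] does not depend on the inner iteration
# variable j, so it can be hoisted out of the inner loop and computed once
# per outer iteration instead of once per (i, j) pair.
def assemble_naive(A, X, Y, alpha):
    for i in range(3):
        for j in range(3):
            A[i][j] += (alpha * X[i]) * Y[j]   # invariant product recomputed

def assemble_licm(A, X, Y, alpha):
    for i in range(3):
        t = alpha * X[i]                       # hoisted invariant code
        for j in range(3):
            A[i][j] += t * Y[j]
```

Both versions compute the same local tensor; the second performs fewer multiplications, a saving that matters precisely because trip counts are small and the hoisted code is a large fraction of the work.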
How to select specific kernel optimizations
When constructing a Kernel, it is possible to specify the set of
optimizations we want PyOP2 to apply. The IR engine will analyse the kernel
AST and try to apply, incrementally, such optimizations. PyOP2’s FFC
interface, which builds a Kernel object from an AST provided by FFC, already
makes use of the available optimizations. Here, we take the emblematic case
of the FFC interface and describe how to play with the various optimizations
through a series of examples.
ast = ...
opts = {'licm': False,
        'tile': None,
        'ap': False,
        'vect': None}
kernel = Kernel(ast, 'my_kernel', opts)

In this example, we have an AST ast and we specify optimizations through the
dictionary opts; then, we build the Kernel, passing in the optional argument
opts. No optimizations are enabled here. The possible options are:

licm: Loop-Invariant Code Motion.
tile: Register Tiling (of rectangular iteration spaces).
ap: Data alignment and padding, with trip count adjustment.
vect: SIMD intra-kernel vectorization.

If we wanted to apply both loop-invariant code motion and data alignment, we
would simply write
ast = ...
opts = {'licm': True,
        'ap': True}
kernel = Kernel(ast, 'my_kernel', opts)

Now, let’s assume we know the kernel has a rectangular iteration space, and
we want to try register tiling with a particular tile size. This can be
achieved as follows
ast = ...
opts = {'tile': (True, 8)}
kernel = Kernel(ast, 'my_kernel', opts)


In this case, the iteration space is sliced into tiles of size 8x8. If the
iteration space is smaller than a tile, the transformation is not applied. By
specifying -1 instead of 8, we leave PyOP2 free to choose a suitable tile
size automatically.
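The slicing can be pictured with a small pure-Python sketch (illustrative only; the real transformation happens on the generated C code, and the function name is hypothetical):

```python
# Illustrative sketch of register tiling: the rectangular i-j iteration
# space is traversed tile by tile. If either extent is smaller than the
# tile size T, the transformation is not applied and the original traversal
# order is kept.
def tiled_order(ni, nj, T):
    if ni < T or nj < T:
        return [(i, j) for i in range(ni) for j in range(nj)]
    order = []
    for ii in range(0, ni, T):
        for jj in range(0, nj, T):
            for i in range(ii, min(ii + T, ni)):
                for j in range(jj, min(jj + T, nj)):
                    order.append((i, j))
    return order
```

Every point of the iteration space is still visited exactly once; only the order changes, so that the operands of each TxT tile can be kept in registers while the tile is processed.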
A fundamental optimization for any PyOP2 kernel is SIMD vectorization, since
kernels almost always fit in the L1 cache and are likely to be compute-bound.
The backend compilers’ autovectorization (AV) is therefore an opportunity. By
enforcing data alignment and padding, we can increase the chance that AV is
successful. To try AV, one should write
import ir.ast_plan as ap

ast = ...
opts = {'ap': True,
        'vect': (ap.AUTOVECT, -1)}
kernel = Kernel(ast, 'my_kernel', opts)

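The effect of padding on a loop trip count can be summarised in one line. The helper below is hypothetical, purely to show the arithmetic:

```python
# Hypothetical helper showing how padding adjusts a loop trip count: the
# extent is rounded up to the next multiple of the vector length, so the
# compiler can vectorize the whole loop without a scalar remainder loop.
def padded_extent(trip_count, vector_length):
    return -(-trip_count // vector_length) * vector_length
```

For instance, a trip count of 3 with 4-wide SIMD (e.g. AVX on doubles) is padded to 4, while a trip count of 8 is left unchanged. The padded entries must of course be harmless to compute, which is why padding is paired with alignment of the underlying data structures.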
The second parameter of vect (-1) is ignored when AV is requested.
If our kernel computes an assembly-like operation, then we can ask PyOP2 to
optimize for register locality and register pressure by resorting to a
different vectorization technique. Early experiments show that this approach
can be particularly useful when the amount of data movement in the assembly
loops is “significant”. Of course, this depends on kernel parameters (e.g.
size of the assembly loop, number and size of the arrays involved in the
assembly) as well as on architecture parameters (e.g. size of the L1 cache,
number of available registers). This strategy is called Outer-Product
Vectorization (OP), and can be activated in the following way (again, we
suggest using it along with data alignment and padding).
import ir.ast_plan as ap

ast = ...
opts = {'ap': True,
        'vect': (ap.V_OP_UAJ, 1)}
kernel = Kernel(ast, 'my_kernel', opts)

UAJ in V_OP_UAJ stands for Unroll-and-Jam. OP has been shown to perform much
better when combined with unrolling the outer assembly loop and incorporating
(jamming) the unrolled iterations within the inner loop. The second
parameter therefore specifies the unroll-and-jam factor: the higher it is,
the larger the number of iterations unrolled. A factor of 1 means that no
unroll-and-jam is performed. The optimal factor depends strongly on the
computational characteristics of the kernel.
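Unroll-and-jam itself is easy to picture with a pure-Python sketch (illustrative only; the actual transformation is applied to the generated C code, and the function names are hypothetical):

```python
# Illustration of unroll-and-jam with factor 2 on an outer-product update:
# the outer i loop is unrolled by 2 and the two copies of the body are
# jammed into a single inner j loop, so X[i], X[i+1] and Y[j] can be reused
# from registers across the two updates.
def outer_product(A, X, Y):
    for i in range(4):
        for j in range(4):
            A[i][j] += X[i] * Y[j]

def outer_product_uaj2(A, X, Y):
    for i in range(0, 4, 2):          # outer loop unrolled by factor 2
        for j in range(4):            # jammed inner loop
            A[i][j] += X[i] * Y[j]
            A[i + 1][j] += X[i + 1] * Y[j]
```

Each load of Y[j] now serves two updates, halving the traffic on Y; larger factors reuse more values per load but consume more registers, which is why the optimal factor is kernel-dependent.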