[Doc] Deep Dive TensorIR
This PR adds a new section in the documentation to introduce the
TensorIR abstraction, its learning resources, and tutorials.
Hzfengsy committed Sep 3, 2024
1 parent cd34486 commit 50999fa
Showing 8 changed files with 828 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/conf.py
@@ -424,6 +424,7 @@ def jupyter_notebook(script_blocks, gallery_conf, target_dir, real_func):
# New tutorial structure under docs folder
tvm_path.joinpath("docs", "get_started", "tutorials"),
tvm_path.joinpath("docs", "how_to", "tutorials"),
tvm_path.joinpath("docs", "deep_dive", "tensor_ir", "tutorials"),
]

gallery_dirs = [
@@ -442,6 +443,7 @@ def jupyter_notebook(script_blocks, gallery_conf, target_dir, real_func):
# New tutorial structure under docs folder
"get_started/tutorials/",
"how_to/tutorials/",
"deep_dive/tensor_ir/tutorials/",
]


73 changes: 73 additions & 0 deletions docs/deep_dive/tensor_ir/abstraction.rst
@@ -0,0 +1,73 @@
.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
.. _tir-abstraction:

Tensor Program Abstraction
--------------------------
Before we dive into the details of TensorIR, let's first introduce what a primitive tensor
function is. Primitive tensor functions are functions that correspond to a single "unit" of
computational operation. For example, a convolution operation can be a primitive tensor function,
and a fused convolution + relu operation can also be one.
Usually, a typical abstraction for implementing primitive tensor functions contains the following
elements: multi-dimensional buffers, loop nests that drive the tensor computations, and finally,
the compute statements themselves.

.. code:: python

    from tvm.script import tir as T

    @T.prim_func
    def main(
        A: T.Buffer((128,), "float32"),
        B: T.Buffer((128,), "float32"),
        C: T.Buffer((128,), "float32"),
    ) -> None:
        for i in range(128):
            with T.block("C"):
                vi = T.axis.spatial(128, i)
                C[vi] = A[vi] + B[vi]

Key Elements of Tensor Programs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The demonstrated primitive tensor function calculates the element-wise sum of two vectors.
The function:

- Accepts three **multi-dimensional buffers** as parameters, and generates one **multi-dimensional
buffer** as output.
- Incorporates a solitary **loop nest** ``i`` that facilitates the computation.
- Features a singular **compute statement** that calculates the element-wise sum of the two
vectors.

Extra Structure in TensorIR
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Crucially, we are unable to execute arbitrary transformations on the program, as certain
computations rely on the loop's sequence. Fortunately, the majority of primitive tensor
functions we focus on possess favorable properties, such as independence among loop iterations.
For instance, the aforementioned program includes block and iteration annotations:

- The **block annotation** ``with T.block("C")`` signifies that the block is the fundamental
computation unit designated for scheduling. A block may encompass a single computation
statement, multiple computation statements with loops, or opaque intrinsics such as Tensor
Core instructions.
- The **iteration annotation** ``T.axis.spatial`` indicates that variable ``vi`` is mapped
  to ``i`` and that all of its iterations are independent.

While this information isn't crucial for *executing* the specific program, it proves useful when
transforming the program. Consequently, we can confidently parallelize or reorder loops associated
with ``vi``, provided we traverse all 128 index elements.
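
As an illustration of how these annotations license transformations, the following minimal sketch
(assuming a standard TVM installation) uses the stock ``tvm.tir.Schedule`` API to parallelize the
loop bound to the spatial axis ``vi`` of the ``main`` function above:

.. code:: python

    import tvm

    # Because `vi` is declared spatial, parallelizing the loop bound to it
    # is known to be safe.
    sch = tvm.tir.Schedule(main)
    (i,) = sch.get_loops(sch.get_block("C"))
    sch.parallel(i)
    print(sch.mod.script())  # inspect the transformed program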
31 changes: 31 additions & 0 deletions docs/deep_dive/tensor_ir/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
.. _tensor-ir:

TensorIR
========
TensorIR is one of the core abstractions in the Apache TVM Unity stack. It is used to
represent and optimize primitive tensor functions.

.. toctree::
:maxdepth: 2

abstraction
learning
tutorials/creation
tutorials/transformation
253 changes: 253 additions & 0 deletions docs/deep_dive/tensor_ir/learning.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,253 @@
.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
.. _tir-learning:

Understand TensorIR Abstraction
===============================
TensorIR is the tensor program abstraction in Apache TVM, which is one of the standard
machine learning compilation frameworks. The principal objective of tensor program abstraction
is to depict loops and associated hardware acceleration options, including threading, the
application of specialized hardware instructions, and memory access.

To aid our explanation, let us use the following sequence of tensor computations as
a motivating example. Specifically, for two :math:`128 \times 128` matrices ``A`` and ``B``, we
perform the following two steps of tensor computations.

.. math::

    Y_{i, j} &= \sum_k A_{i, k} \times B_{k, j} \\
    C_{i, j} &= \mathrm{relu}(Y_{i, j}) = \max(Y_{i, j}, 0)

The above computations resemble a typical primitive tensor function commonly seen in neural
networks: a linear layer with a relu activation. Before we depict it in TensorIR, let's first
express the computation in native Python with NumPy:

.. code:: python

    import numpy as np

    def lnumpy_mm_relu(A: np.ndarray, B: np.ndarray, C: np.ndarray):
        Y = np.empty((128, 128), dtype="float32")
        for i in range(128):
            for j in range(128):
                for k in range(128):
                    if k == 0:
                        Y[i, j] = 0
                    Y[i, j] = Y[i, j] + A[i, k] * B[k, j]
        for i in range(128):
            for j in range(128):
                C[i, j] = max(Y[i, j], 0)

With the low-level NumPy example in mind, we are now ready to introduce TensorIR. The code block
below shows a TensorIR implementation of ``mm_relu``. The code is written in a language called
TVMScript, a domain-specific dialect embedded in the Python AST.

.. code:: python

    import tvm
    from tvm.script import tir as T

    @tvm.script.ir_module
    class MyModule:
        @T.prim_func
        def mm_relu(A: T.Buffer((128, 128), "float32"),
                    B: T.Buffer((128, 128), "float32"),
                    C: T.Buffer((128, 128), "float32")):
            Y = T.alloc_buffer((128, 128), dtype="float32")
            for i, j, k in T.grid(128, 128, 128):
                with T.block("Y"):
                    vi = T.axis.spatial(128, i)
                    vj = T.axis.spatial(128, j)
                    vk = T.axis.reduce(128, k)
                    with T.init():
                        Y[vi, vj] = T.float32(0)
                    Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
            for i, j in T.grid(128, 128):
                with T.block("C"):
                    vi = T.axis.spatial(128, i)
                    vj = T.axis.spatial(128, j)
                    C[vi, vj] = T.max(Y[vi, vj], T.float32(0))

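Before examining the listing piece by piece, we can sanity-check it end to end. The following is
a minimal sketch, assuming a local ``llvm`` target is available; it builds the module, runs the
compiled function, and compares the result against the NumPy reference defined earlier:

.. code:: python

    import numpy as np
    import tvm

    a = tvm.nd.array(np.random.rand(128, 128).astype("float32"))
    b = tvm.nd.array(np.random.rand(128, 128).astype("float32"))
    c = tvm.nd.array(np.zeros((128, 128), dtype="float32"))

    # Build the TensorIR module and run the compiled `mm_relu`.
    rt_lib = tvm.build(MyModule, target="llvm")
    rt_lib["mm_relu"](a, b, c)

    # Check against the NumPy reference implementation from above.
    c_ref = np.zeros((128, 128), dtype="float32")
    lnumpy_mm_relu(a.numpy(), b.numpy(), c_ref)
    np.testing.assert_allclose(c.numpy(), c_ref, rtol=1e-5)
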
Next, let's investigate the elements in the above TensorIR program.

Function Parameters and Buffers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**The function parameters correspond to the same set of parameters on the numpy function.**

.. code:: python

    # TensorIR
    def mm_relu(A: T.Buffer((128, 128), "float32"),
                B: T.Buffer((128, 128), "float32"),
                C: T.Buffer((128, 128), "float32")):
        ...

    # NumPy
    def lnumpy_mm_relu(A: np.ndarray, B: np.ndarray, C: np.ndarray):
        ...

Here ``A``, ``B``, and ``C`` take the type ``T.Buffer``, with shape argument ``(128, 128)`` and
data type ``float32``. This additional information helps the machine learning compilation (MLC)
process generate code that specializes in the shape and data type.
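
As a quick illustration of this specialization, the following sketch (assuming a local ``llvm``
target and the argument-shape checks that TVM's argument binder inserts into compiled functions)
passes a wrongly shaped array to the compiled ``mm_relu``:

.. code:: python

    import numpy as np
    import tvm

    rt_lib = tvm.build(MyModule, target="llvm")
    wrong = tvm.nd.array(np.zeros((64, 64), dtype="float32"))
    ok = tvm.nd.array(np.zeros((128, 128), dtype="float32"))

    try:
        # The function is specialized to (128, 128) float32 buffers.
        rt_lib["mm_relu"](wrong, ok, ok)
    except tvm.TVMError:
        print("shape mismatch detected at runtime")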

**Similarly, TensorIR also uses a buffer type in intermediate result allocation.**

.. code:: python

    # TensorIR
    Y = T.alloc_buffer((128, 128), dtype="float32")

    # NumPy
    Y = np.empty((128, 128), dtype="float32")

Loop Iterations
~~~~~~~~~~~~~~~
**There is also a direct correspondence between loop iterations.**

``T.grid`` is syntactic sugar in TensorIR that lets us write multiple nested iterators in one line.

.. code:: python

    # TensorIR with `T.grid`
    for i, j, k in T.grid(128, 128, 128):
        ...

    # TensorIR with `range`
    for i in range(128):
        for j in range(128):
            for k in range(128):
                ...

    # NumPy
    for i in range(128):
        for j in range(128):
            for k in range(128):
                ...

Computational Block
~~~~~~~~~~~~~~~~~~~
A significant distinction lies in computational statements:
**TensorIR incorporates an additional construct termed** ``T.block``.

.. code:: python

    # TensorIR
    with T.block("Y"):
        vi = T.axis.spatial(128, i)
        vj = T.axis.spatial(128, j)
        vk = T.axis.reduce(128, k)
        with T.init():
            Y[vi, vj] = T.float32(0)
        Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]

    # NumPy
    vi, vj, vk = i, j, k
    if vk == 0:
        Y[vi, vj] = 0
    Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]

A **block** represents a fundamental computation unit within TensorIR. Importantly,
a block encompasses more information than standard NumPy code. It comprises a set of block axes
``(vi, vj, vk)`` and the computations delineated around them.

.. code:: python

    vi = T.axis.spatial(128, i)
    vj = T.axis.spatial(128, j)
    vk = T.axis.reduce(128, k)

The above three lines declare the **key properties** of block axes using the following syntax.

.. code:: python

    [block_axis] = T.axis.[axis_type]([axis_range], [mapped_value])

These three lines convey the following details:

- They specify the binding of ``vi``, ``vj``, ``vk`` (in this instance, to ``i``, ``j``, ``k``).
- They declare the original range intended for ``vi``, ``vj``, ``vk``
(the 128 in ``T.axis.spatial(128, i)``).
- They announce the properties of the iterators (spatial, reduce).

Block Axis Properties
~~~~~~~~~~~~~~~~~~~~~
Let's delve deeper into the properties of the block axes. These properties signify an axis's
relationship to the computation in progress. The block comprises three axes ``vi``, ``vj``, and
``vk``; it reads the buffers ``A[vi, vk]`` and ``B[vk, vj]`` and writes the buffer
``Y[vi, vj]``. Strictly speaking, the block performs (reduction) updates to ``Y``, which we label
as a write for the time being, since we don't require the value of ``Y`` from another block.

Significantly, for a fixed value of ``vi`` and ``vj``, the computation block yields a point
value at a spatial location of ``Y`` (``Y[vi, vj]``) that is independent of other locations in ``Y``
(with different ``vi``, ``vj`` values). We can refer to ``vi``, ``vj`` as **spatial axes** since
they directly correspond to the start of a spatial region of buffers that the block writes to.
The axes involved in reduction (``vk``) are designated as **reduce axes**.
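
The scheduling layer consumes exactly this distinction. As a minimal sketch, assuming the stock
``tvm.tir.Schedule`` API (covered in detail in the transformation tutorial): parallelizing a loop
bound to a spatial axis is accepted, while parallelizing the loop bound to the reduce axis is
rejected:

.. code:: python

    import tvm

    sch = tvm.tir.Schedule(MyModule)
    i, j, k = sch.get_loops(sch.get_block("Y"))

    sch.parallel(i)  # fine: `i` is bound to the spatial axis `vi`

    try:
        sch.parallel(k)  # `k` is bound to the reduce axis `vk`
    except tvm.tir.schedule.ScheduleError:
        # Rejected: iterations over `vk` all update the same Y[vi, vj].
        print("cannot parallelize a reduction loop directly")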

Why Extra Information in Block
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
One crucial observation is that the additional information (block axis ranges and their properties)
makes the block **self-contained** with respect to the iterations it is supposed to carry out,
independent of the external loop nest ``i, j, k``.

The block axis information also provides additional properties that help us validate the
correctness of the external loops used to carry out the computation. For example, the code block
below results in an error because the loop expects an iterator of size 128, but we only bind it to
a for loop of size 127.

.. code:: python

    # wrong program due to loop and block iteration mismatch
    for i in range(127):
        with T.block("C"):
            vi = T.axis.spatial(128, i)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
            # error here due to iterator size mismatch
            ...

Sugars for Block Axes Binding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In situations where each of the block axes is directly mapped to an outer loop iterator,
we can use ``T.axis.remap`` to declare all the block axes in a single line.

.. code:: python

    # SSR means the properties of each axis are "spatial", "spatial", "reduce"
    vi, vj, vk = T.axis.remap("SSR", [i, j, k])

which is equivalent to

.. code:: python

    vi = T.axis.spatial(range_of_i, i)
    vj = T.axis.spatial(range_of_j, j)
    vk = T.axis.reduce(range_of_k, k)

So we can also write the program as follows.

.. code:: python

    @tvm.script.ir_module
    class MyModuleWithAxisRemapSugar:
        @T.prim_func
        def mm_relu(A: T.Buffer((128, 128), "float32"),
                    B: T.Buffer((128, 128), "float32"),
                    C: T.Buffer((128, 128), "float32")):
            Y = T.alloc_buffer((128, 128), dtype="float32")
            for i, j, k in T.grid(128, 128, 128):
                with T.block("Y"):
                    vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                    with T.init():
                        Y[vi, vj] = T.float32(0)
                    Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
            for i, j in T.grid(128, 128):
                with T.block("C"):
                    vi, vj = T.axis.remap("SS", [i, j])
                    C[vi, vj] = T.max(Y[vi, vj], T.float32(0))
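
Since the sugared form changes only the surface syntax, a small sketch using TVM's
``tvm.ir.structural_equal`` utility can confirm that both modules describe the same program:

.. code:: python

    import tvm

    # The two modules differ only in how the block axes are declared.
    assert tvm.ir.structural_equal(MyModule, MyModuleWithAxisRemapSugar)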
2 changes: 2 additions & 0 deletions docs/deep_dive/tensor_ir/tutorials/README.txt
@@ -0,0 +1,2 @@
Deep Dive: TensorIR
-------------------
