[Doc] Deep Dive TensorIR
This PR adds a new section in the documentation to introduce the
TensorIR abstraction, its learning resources, and tutorials.
Hzfengsy committed Sep 3, 2024
1 parent cd34486 commit 50999fa
Showing 8 changed files with 828 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/conf.py
@@ -424,6 +424,7 @@ def jupyter_notebook(script_blocks, gallery_conf, target_dir, real_func):
# New tutorial structure under docs folder
tvm_path.joinpath("docs", "get_started", "tutorials"),
tvm_path.joinpath("docs", "how_to", "tutorials"),
tvm_path.joinpath("docs", "deep_dive", "tensor_ir", "tutorials"),
]

gallery_dirs = [
@@ -442,6 +443,7 @@ def jupyter_notebook(script_blocks, gallery_conf, target_dir, real_func):
# New tutorial structure under docs folder
"get_started/tutorials/",
"how_to/tutorials/",
"deep_dive/tensor_ir/tutorials/",
]


73 changes: 73 additions & 0 deletions docs/deep_dive/tensor_ir/abstraction.rst
@@ -0,0 +1,73 @@
.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
.. _tir-abstraction:

Tensor Program Abstraction
--------------------------
Before we dive into the details of TensorIR, let's first introduce what a primitive tensor
function is. Primitive tensor functions are functions that correspond to a single "unit" of
computational operation. For example, a convolution operation can be a primitive tensor function,
and a fused convolution + relu operation can also be one.
Usually, a typical abstraction for implementing primitive tensor functions contains the following
elements: multi-dimensional buffers, loop nests that drive the tensor computations, and finally,
the compute statements themselves.

.. code:: python

    from tvm.script import tir as T

    @T.prim_func
    def main(
        A: T.Buffer((128,), "float32"),
        B: T.Buffer((128,), "float32"),
        C: T.Buffer((128,), "float32"),
    ) -> None:
        for i in range(128):
            with T.block("C"):
                vi = T.axis.spatial(128, i)
                C[vi] = A[vi] + B[vi]

Key Elements of Tensor Programs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The demonstrated primitive tensor function calculates the element-wise sum of two vectors.
The function:

- Accepts three **multi-dimensional buffers** as parameters, and generates one **multi-dimensional
buffer** as output.
- Incorporates a solitary **loop nest** ``i`` that facilitates the computation.
- Features a singular **compute statement** that calculates the element-wise sum of the two
vectors.

Extra Structure in TensorIR
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Crucially, we are unable to execute arbitrary transformations on the program, as certain
computations rely on the loop's sequence. Fortunately, the majority of primitive tensor
functions we focus on possess favorable properties, such as independence among loop iterations.
For instance, the aforementioned program includes block and iteration annotations:

- The **block annotation** ``with T.block("C")`` signifies that the block is the fundamental
computation unit designated for scheduling. A block may encompass a single computation
statement, multiple computation statements with loops, or opaque intrinsics such as Tensor
Core instructions.
- The **iteration annotation** ``T.axis.spatial`` indicates that variable ``vi`` is mapped
  to ``i`` and that all of its iterations are independent.

While this information isn't crucial for *executing* the specific program, it proves useful when
transforming the program. Consequently, we can confidently parallelize or reorder loops associated
with ``vi``, provided we traverse all 128 index elements.
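
As an illustration of how these annotations license transformations, the following minimal sketch
(assuming a standard TVM installation) uses the stock ``tvm.tir.Schedule`` API to parallelize the
loop bound to the spatial axis ``vi`` of the ``main`` function above:

.. code:: python

    import tvm

    # Because `vi` is declared spatial, parallelizing the loop bound to it
    # is known to be safe.
    sch = tvm.tir.Schedule(main)
    (i,) = sch.get_loops(sch.get_block("C"))
    sch.parallel(i)
    print(sch.mod.script())  # inspect the transformed program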
31 changes: 31 additions & 0 deletions docs/deep_dive/tensor_ir/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
.. _tensor-ir:

TensorIR
========
TensorIR is one of the core abstractions in the Apache TVM Unity stack. It is used to
represent and optimize primitive tensor functions.

.. toctree::
:maxdepth: 2

abstraction
learning
tutorials/creation
tutorials/transformation
253 changes: 253 additions & 0 deletions docs/deep_dive/tensor_ir/learning.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,253 @@
.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
.. _tir-learning:

Understand TensorIR Abstraction
===============================
TensorIR is the tensor program abstraction in Apache TVM, which is one of the standard
machine learning compilation frameworks. The principal objective of tensor program abstraction
is to depict loops and associated hardware acceleration options, including threading, the
application of specialized hardware instructions, and memory access.

To aid our explanation, let us use the following sequence of tensor computations as
a motivating example. Specifically, for two :math:`128 \times 128` matrices ``A`` and ``B``, we
perform the following two steps of tensor computations.

.. math::

    Y_{i, j} &= \sum_k A_{i, k} \times B_{k, j} \\
    C_{i, j} &= \mathrm{relu}(Y_{i, j}) = \max(Y_{i, j}, 0)

The above computations resemble a typical primitive tensor function commonly seen in neural
networks: a linear layer with a relu activation. Before we depict it in TensorIR, let's first
express the computation in native Python with NumPy:

.. code:: python

    import numpy as np

    def lnumpy_mm_relu(A: np.ndarray, B: np.ndarray, C: np.ndarray):
        Y = np.empty((128, 128), dtype="float32")
        for i in range(128):
            for j in range(128):
                for k in range(128):
                    if k == 0:
                        Y[i, j] = 0
                    Y[i, j] = Y[i, j] + A[i, k] * B[k, j]
        for i in range(128):
            for j in range(128):
                C[i, j] = max(Y[i, j], 0)

With the low-level NumPy example in mind, we are now ready to introduce TensorIR. The code block
below shows a TensorIR implementation of ``mm_relu``. The code is written in a language called
TVMScript, a domain-specific dialect embedded in the Python AST.

.. code:: python

    import tvm
    from tvm.script import tir as T

    @tvm.script.ir_module
    class MyModule:
        @T.prim_func
        def mm_relu(A: T.Buffer((128, 128), "float32"),
                    B: T.Buffer((128, 128), "float32"),
                    C: T.Buffer((128, 128), "float32")):
            Y = T.alloc_buffer((128, 128), dtype="float32")
            for i, j, k in T.grid(128, 128, 128):
                with T.block("Y"):
                    vi = T.axis.spatial(128, i)
                    vj = T.axis.spatial(128, j)
                    vk = T.axis.reduce(128, k)
                    with T.init():
                        Y[vi, vj] = T.float32(0)
                    Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
            for i, j in T.grid(128, 128):
                with T.block("C"):
                    vi = T.axis.spatial(128, i)
                    vj = T.axis.spatial(128, j)
                    C[vi, vj] = T.max(Y[vi, vj], T.float32(0))

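Before examining the listing piece by piece, we can sanity-check it end to end. The following is
a minimal sketch, assuming a local ``llvm`` target is available; it builds the module, runs the
compiled function, and compares the result against the NumPy reference defined earlier:

.. code:: python

    import numpy as np
    import tvm

    a = tvm.nd.array(np.random.rand(128, 128).astype("float32"))
    b = tvm.nd.array(np.random.rand(128, 128).astype("float32"))
    c = tvm.nd.array(np.zeros((128, 128), dtype="float32"))

    # Build the TensorIR module and run the compiled `mm_relu`.
    rt_lib = tvm.build(MyModule, target="llvm")
    rt_lib["mm_relu"](a, b, c)

    # Check against the NumPy reference implementation from above.
    c_ref = np.zeros((128, 128), dtype="float32")
    lnumpy_mm_relu(a.numpy(), b.numpy(), c_ref)
    np.testing.assert_allclose(c.numpy(), c_ref, rtol=1e-5)
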
Next, let's investigate the elements in the above TensorIR program.

Function Parameters and Buffers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**The function parameters correspond to the same set of parameters on the numpy function.**

.. code:: python

    # TensorIR
    def mm_relu(A: T.Buffer((128, 128), "float32"),
                B: T.Buffer((128, 128), "float32"),
                C: T.Buffer((128, 128), "float32")):
        ...

    # NumPy
    def lnumpy_mm_relu(A: np.ndarray, B: np.ndarray, C: np.ndarray):
        ...

Here ``A``, ``B``, and ``C`` take the type ``T.Buffer``, with shape argument ``(128, 128)`` and
data type ``float32``. This additional information helps the machine learning compilation (MLC)
process generate code that specializes in the shape and data type.
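
As a quick illustration of this specialization, the following sketch (assuming a local ``llvm``
target and the argument-shape checks that TVM's argument binder inserts into compiled functions)
passes a wrongly shaped array to the compiled ``mm_relu``:

.. code:: python

    import numpy as np
    import tvm

    rt_lib = tvm.build(MyModule, target="llvm")
    wrong = tvm.nd.array(np.zeros((64, 64), dtype="float32"))
    ok = tvm.nd.array(np.zeros((128, 128), dtype="float32"))

    try:
        # The function is specialized to (128, 128) float32 buffers.
        rt_lib["mm_relu"](wrong, ok, ok)
    except tvm.TVMError:
        print("shape mismatch detected at runtime")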

**Similarly, TensorIR also uses a buffer type in intermediate result allocation.**

.. code:: python

    # TensorIR
    Y = T.alloc_buffer((128, 128), dtype="float32")

    # NumPy
    Y = np.empty((128, 128), dtype="float32")

Loop Iterations
~~~~~~~~~~~~~~~
**There is also a direct correspondence between loop iterations.**

``T.grid`` is syntactic sugar in TensorIR that lets us write multiple nested iterators in one line.

.. code:: python

    # TensorIR with `T.grid`
    for i, j, k in T.grid(128, 128, 128):
        ...

    # TensorIR with `range`
    for i in range(128):
        for j in range(128):
            for k in range(128):
                ...

    # NumPy
    for i in range(128):
        for j in range(128):
            for k in range(128):
                ...

Computational Block
~~~~~~~~~~~~~~~~~~~
A significant distinction lies in computational statements:
**TensorIR incorporates an additional construct termed** ``T.block``.

.. code:: python

    # TensorIR
    with T.block("Y"):
        vi = T.axis.spatial(128, i)
        vj = T.axis.spatial(128, j)
        vk = T.axis.reduce(128, k)
        with T.init():
            Y[vi, vj] = T.float32(0)
        Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]

    # NumPy
    vi, vj, vk = i, j, k
    if vk == 0:
        Y[vi, vj] = 0
    Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]

A **block** represents a fundamental computation unit within TensorIR. Importantly,
a block encompasses more information than standard NumPy code. It comprises a set of block axes
``(vi, vj, vk)`` and the computations delineated around them.

.. code:: python

    vi = T.axis.spatial(128, i)
    vj = T.axis.spatial(128, j)
    vk = T.axis.reduce(128, k)

The above three lines declare the **key properties** of block axes using the following syntax.

.. code:: python

    [block_axis] = T.axis.[axis_type]([axis_range], [mapped_value])

These three lines convey the following details:

- They specify the binding of ``vi``, ``vj``, ``vk`` (in this instance, to ``i``, ``j``, ``k``).
- They declare the original range intended for ``vi``, ``vj``, ``vk``
(the 128 in ``T.axis.spatial(128, i)``).
- They announce the properties of the iterators (spatial, reduce).

Block Axis Properties
~~~~~~~~~~~~~~~~~~~~~
Let's delve deeper into the properties of the block axes. These properties signify an axis's
relationship to the computation in progress. The block comprises three axes ``vi``, ``vj``, and
``vk``; it reads the buffers ``A[vi, vk]`` and ``B[vk, vj]`` and writes the buffer
``Y[vi, vj]``. Strictly speaking, the block performs (reduction) updates to ``Y``, which we label
as a write for the time being, since we don't require the value of ``Y`` from another block.

Significantly, for a fixed value of ``vi`` and ``vj``, the computation block yields a point
value at a spatial location of ``Y`` (``Y[vi, vj]``) that is independent of other locations in ``Y``
(with different ``vi``, ``vj`` values). We can refer to ``vi``, ``vj`` as **spatial axes** since
they directly correspond to the start of a spatial region of buffers that the block writes to.
The axes involved in reduction (``vk``) are designated as **reduce axes**.
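
The scheduling layer consumes exactly this distinction. As a minimal sketch, assuming the stock
``tvm.tir.Schedule`` API (covered in detail in the transformation tutorial): parallelizing a loop
bound to a spatial axis is accepted, while parallelizing the loop bound to the reduce axis is
rejected:

.. code:: python

    import tvm

    sch = tvm.tir.Schedule(MyModule)
    i, j, k = sch.get_loops(sch.get_block("Y"))

    sch.parallel(i)  # fine: `i` is bound to the spatial axis `vi`

    try:
        sch.parallel(k)  # `k` is bound to the reduce axis `vk`
    except tvm.tir.schedule.ScheduleError:
        # Rejected: iterations over `vk` all update the same Y[vi, vj].
        print("cannot parallelize a reduction loop directly")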

Why Extra Information in Block
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
One crucial observation is that the additional information (block axis ranges and their properties)
makes the block **self-contained** with respect to the iterations it is supposed to carry out,
independent of the external loop nest ``i, j, k``.

The block axis information also provides additional properties that help us validate the
correctness of the external loops used to carry out the computation. For example, the code block
below results in an error because the loop expects an iterator of size 128, but we only bind it to
a for loop of size 127.

.. code:: python

    # wrong program due to loop and block iteration mismatch
    for i in range(127):
        with T.block("C"):
            vi = T.axis.spatial(128, i)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
            # error here due to iterator size mismatch
            ...

Sugars for Block Axes Binding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In situations where each of the block axes is directly mapped to an outer loop iterator,
we can use ``T.axis.remap`` to declare all the block axes in a single line.

.. code:: python

    # SSR means the properties of each axis are "spatial", "spatial", "reduce"
    vi, vj, vk = T.axis.remap("SSR", [i, j, k])

which is equivalent to

.. code:: python

    vi = T.axis.spatial(range_of_i, i)
    vj = T.axis.spatial(range_of_j, j)
    vk = T.axis.reduce(range_of_k, k)

So we can also write the program as follows.

.. code:: python

    @tvm.script.ir_module
    class MyModuleWithAxisRemapSugar:
        @T.prim_func
        def mm_relu(A: T.Buffer((128, 128), "float32"),
                    B: T.Buffer((128, 128), "float32"),
                    C: T.Buffer((128, 128), "float32")):
            Y = T.alloc_buffer((128, 128), dtype="float32")
            for i, j, k in T.grid(128, 128, 128):
                with T.block("Y"):
                    vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                    with T.init():
                        Y[vi, vj] = T.float32(0)
                    Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
            for i, j in T.grid(128, 128):
                with T.block("C"):
                    vi, vj = T.axis.remap("SS", [i, j])
                    C[vi, vj] = T.max(Y[vi, vj], T.float32(0))
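
Since the sugared form changes only the surface syntax, a small sketch using TVM's
``tvm.ir.structural_equal`` utility can confirm that both modules describe the same program:

.. code:: python

    import tvm

    # The two modules differ only in how the block axes are declared.
    assert tvm.ir.structural_equal(MyModule, MyModuleWithAxisRemapSugar)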
2 changes: 2 additions & 0 deletions docs/deep_dive/tensor_ir/tutorials/README.txt
@@ -0,0 +1,2 @@
Deep Dive: TensorIR
-------------------
