[Doc][Relay] Add VM doc (apache#3188)
* [Doc][Relay] Add VM doc
* Add Apache header
* Apply suggestions from code review
  Co-Authored-By: Steven S. Lyubomirsky
  Co-Authored-By: 雾雨魔理沙
  Co-Authored-By: Logan Weber
  Co-Authored-By: Zhi
* Junru's comment
* More fix
* More fix
* More fix
* last fix
* Apply suggestions from code review
  Co-Authored-By: 雾雨魔理沙
* Apply suggestions from code review
  Co-Authored-By: Logan Weber
* Add code links
* Remove unused bp
* Update docs/dev/virtual_machine.rst
  Co-Authored-By: Logan Weber
* Explain TODO
* Yong's comment
  Co-Authored-By: Yong Wu
* Comment
Showing 1 changed file with 314 additions and 0 deletions.
@@ -0,0 +1,314 @@ | ||
.. Licensed to the Apache Software Foundation (ASF) under one | ||
or more contributor license agreements. See the NOTICE file | ||
distributed with this work for additional information | ||
regarding copyright ownership. The ASF licenses this file | ||
to you under the Apache License, Version 2.0 (the | ||
"License"); you may not use this file except in compliance | ||
with the License. You may obtain a copy of the License at | ||
.. http://www.apache.org/licenses/LICENSE-2.0 | ||
.. Unless required by applicable law or agreed to in writing, | ||
software distributed under the License is distributed on an | ||
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
KIND, either express or implied. See the License for the | ||
specific language governing permissions and limitations | ||
under the License. | ||
Putting the VM in TVM: The Relay Virtual Machine | ||
================================================ | ||
|
||
Relay, a new program representation, has enabled the representation and optimization of | ||
a great breadth of machine learning programs. | ||
Unfortunately, by supporting a more expressive set of programs, we have | ||
introduced several new execution challenges. | ||
|
||
Relay's interpreter can execute the full language but has notable limitations
that make it unsuited for production deployments. It executes a program by
traversing its AST, an approach that is conceptually simple but inefficient
because the traversal relies heavily on indirection.
|
||
There are further challenges in compiling dynamic code, such as dynamic scheduling and allocation, | ||
fully dynamic tensor shapes, and control flow. The interpreter offers simple solutions | ||
for these, but none is sufficiently compelling or optimized. | ||
|
||
The second execution mechanism is the existing graph runtime. In order to target Relay | ||
programs to this, we compile a small subset of them to the old graph format and execute | ||
them on the runtime. Graph runtime provides a fast execution experience but only for a very limited | ||
subset of Relay programs. | ||
|
||
An alternative but non-standard approach is Relay's ahead-of-time compiler,
which compiles a Relay program into a shared library containing an
ahead-of-time implementation. The ahead-of-time compiler provides compelling performance
but is difficult to extend and instrument, since doing so requires modifying the
code generation and optimization mechanisms.
|
||
The Relay virtual machine is intended to be a framework that balances these competing | ||
approaches, providing a dynamic execution environment which can be extended, instrumented, | ||
and integrated with other approaches like ahead-of-time compilation via a flexible extension | ||
mechanism. | ||
|
||
The virtual machine is designed to strike a balance between performance and flexibility | ||
when deploying and executing Relay programs, without giving up the benefits of TVM. | ||
|
||
Virtual machine (VM) design is a well-studied area in programming languages and systems, | ||
and there have been various virtual machine designs for both full-fledged | ||
and embedded programming languages.
Previous language VM designs have been heavily tailored to the execution profile of traditional programs. | ||
Traditional programs manipulate small scalar values and consist of a large number of low-level instructions. | ||
The sheer quantity of instructions requires instruction execution and dispatch to be extremely efficient. | ||
In the context of machine learning we manipulate primarily tensor values, using a (relatively)
small number of high-level instructions. ML programs' cost centers are expensive operator invocations,
such as GEMM or convolution, over a large input. Due to the execution profile exhibited by ML programs, | ||
micro-optimizations present in scalar VMs are dramatically less important. | ||
|
||
TVM has provided strong support for vision models, | ||
but we want to grow to support a wider variety of models. | ||
The graph runtime is able to utilize the fully static nature of the input graphs to perform | ||
aggressive optimizations such as fully static allocation and optimal memory reuse.
When we introduce models which make use of control flow, recursion, dynamic shapes, and dynamic | ||
allocation, we must change how execution works. A virtual machine for Relay is a natural choice. | ||
|
||
The rest of this document provides a high-level overview of the Relay | ||
virtual machine design and its instruction set. | ||
|
||
Design | ||
------ | ||
|
||
The VM's design is focused on simplicity without sacrificing performance. | ||
In order to accomplish this we have focused on designing a tensor VM rather than a scalar VM. | ||
|
||
In the tensor VM setting, we optimize for cheap "allocation" of objects (by trying to avoid real allocation),
reuse of static fragments, and the ability to handle dynamic shapes (i.e., jagged tensors).
|
||
Instruction Set | ||
~~~~~~~~~~~~~~~ | ||
|
||
The choices of an instruction set and instruction representation are the most critical design decisions for a VM. | ||
The current representation of the instructions is a tagged union containing the op-code and the data payload. An important design decision is the level of abstraction of the instructions (RISC vs. CISC) and how they take their data (fixed-width instruction encoding vs. variable-length encoding). The current version is closer to CISC, with complex instructions like AllocTensor, and is variable-length due to the inclusion of the shape as part of the instruction. The current instruction set is very high-level and corresponds roughly to high-level operations in Relay. | ||
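
As a rough illustration of the tagged-union representation described above, the sketch below pairs an opcode tag with a per-instruction payload. Every name in it (`Opcode`, `Instruction`, the `*Args` structs, `MakeLoadConst`, `ConstIndexOf`) is hypothetical and chosen for this example; the real definitions live in `vm.h` and may differ. Only three of the instructions are modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names throughout; this is not the layout used in vm.h.
using RegName = std::int64_t;

enum class Opcode : std::uint8_t { Ret, Goto, LoadConst };

// One payload struct per instruction kind.
struct RetArgs { RegName dst; RegName result; };
struct GotoArgs { std::size_t pc_offset; };
struct LoadConstArgs { RegName dst; std::size_t const_index; };

struct Instruction {
  Opcode op;  // the tag: tells the VM which union member is valid
  union {
    RetArgs ret;
    GotoArgs go_to;
    LoadConstArgs load_const;
  };
};

// Builds a LoadConst instruction by setting the tag and the matching payload.
inline Instruction MakeLoadConst(RegName dst, std::size_t const_index) {
  Instruction instr;
  instr.op = Opcode::LoadConst;
  instr.load_const.dst = dst;
  instr.load_const.const_index = const_index;
  return instr;
}

// Reads the payload only after checking the tag.
inline std::size_t ConstIndexOf(const Instruction& instr) {
  assert(instr.op == Opcode::LoadConst);
  return instr.load_const.const_index;
}
```

Checking the tag before touching the payload is the discipline a tagged union relies on; a variable-length instruction such as AllocTensor would additionally need storage for its shape, which this fixed-size sketch leaves out.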
|
||
Ret | ||
^^^ | ||
**Arguments**: | ||
:: | ||
RegName dst | ||
RegName result | ||
|
||
Returns the object in register `result` to the caller's register `dst`.
|
||
InvokePacked | ||
^^^^^^^^^^^^ | ||
**Arguments**: | ||
:: | ||
size_t packed_index | ||
size_t arity | ||
size_t output_size | ||
RegName* packed_args | ||
|
||
Invoke the packed function denoted by `packed_index`. The `arity` | ||
and `output_size` are used to inform the VM how many inputs and | ||
outputs to expect. `packed_args` stores the list of argument registers. | ||
|
||
AllocTensor | ||
^^^^^^^^^^^ | ||
**Arguments**: | ||
:: | ||
RegName dst | ||
RegName shape_register | ||
size_t ndim | ||
DLDataType dtype | ||
|
||
Allocate a tensor value of the appropriate shape (stored in `shape_register`) and `dtype`. The result | ||
is saved to register `dst`. | ||
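
To make the inputs of AllocTensor concrete, here is a minimal sketch of how a buffer size could be derived from a shape and a data type. `DataType` mirrors the three fields of DLPack's `DLDataType` but is redeclared so the example is self-contained, and `TensorBytes` is a hypothetical helper, not a function in TVM.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Same fields as DLPack's DLDataType, redeclared here for self-containment.
struct DataType {
  std::uint8_t code;    // type code (2 is the float code in DLPack)
  std::uint8_t bits;    // bits per lane
  std::uint16_t lanes;  // number of lanes (1 for scalars)
};

// Hypothetical helper: bytes needed for a dense tensor of the given shape.
inline std::size_t TensorBytes(const std::vector<std::int64_t>& shape, DataType dtype) {
  std::size_t num_elems = 1;  // an empty shape denotes a scalar
  for (std::int64_t dim : shape) {
    num_elems *= static_cast<std::size_t>(dim);
  }
  // Round each element up to a whole number of bytes.
  const std::size_t elem_bytes =
      (static_cast<std::size_t>(dtype.bits) * dtype.lanes + 7) / 8;
  return num_elems * elem_bytes;
}
```

For example, `TensorBytes({2, 3}, DataType{2, 32, 1})` yields 24, since six 32-bit elements occupy four bytes each.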
|
||
AllocDatatype | ||
^^^^^^^^^^^^^ | ||
**Arguments**: | ||
:: | ||
RegName dst | ||
size_t tag | ||
size_t num_fields | ||
RegName* datatype_fields | ||
|
||
Allocate a data type with the tag `tag` using the `num_fields` entries | ||
from registers `datatype_fields`. The result is saved to register `dst`. | ||
|
||
AllocClosure | ||
^^^^^^^^^^^^ | ||
**Arguments**: | ||
:: | ||
RegName dst | ||
size_t clo_index | ||
size_t num_freevar | ||
RegName* free_vars
|
||
Allocate a closure with the VMFunction at `clo_index` as | ||
its code, and the `num_freevar` entries from registers in | ||
`free_vars`. The result is saved to register `dst`. | ||
|
||
GetField | ||
^^^^^^^^ | ||
**Arguments**: | ||
:: | ||
RegName dst | ||
RegName object | ||
size_t field_index | ||
|
||
Get the field value with index `field_index` from `object` and save the result to register `dst`.
|
||
If | ||
^^ | ||
**Arguments**: | ||
:: | ||
RegName if_cond | ||
size_t true_offset | ||
size_t false_offset | ||
|
||
Check if the object at register `if_cond` is `true` or `false`. | ||
If `true`, relative jump by `true_offset`, else relative | ||
jump by `false_offset`. | ||
|
||
Goto | ||
^^^^ | ||
**Arguments**: | ||
:: | ||
size_t pc_offset | ||
|
||
Relative unconditional jump by `pc_offset`. | ||
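
The relative jumps used by `If` and `Goto` can be sketched as simple program-counter arithmetic. This assumes offsets are measured from the branching instruction's own position, which the text does not state explicitly; the helper names are hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: program counter after executing `If` at position `pc`.
// Assumes offsets are relative to the position of the `If` itself.
inline std::size_t NextPcAfterIf(std::size_t pc, bool cond,
                                 std::size_t true_offset, std::size_t false_offset) {
  return pc + (cond ? true_offset : false_offset);
}

// Hypothetical helper: program counter after executing `Goto` at position `pc`.
inline std::size_t NextPcAfterGoto(std::size_t pc, std::size_t pc_offset) {
  return pc + pc_offset;
}
```

Under this assumption, a `true_offset` of 1 simply falls through to the next instruction, while `false_offset` skips over the true branch.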
|
||
Invoke | ||
^^^^^^ | ||
**Arguments**: | ||
:: | ||
size_t func_index | ||
|
||
Invoke the function at `func_index`, consuming the number of arguments given by the VMFunction's
arity field.
|
||
InvokeClosure | ||
^^^^^^^^^^^^^ | ||
**Arguments**: | ||
:: | ||
RegName closure | ||
size_t closure_args_num | ||
RegName* closure_args | ||
|
||
Invokes `closure`, consuming the number of arguments declared in the closure's VMFunction. | ||
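
The relationship between AllocClosure and InvokeClosure can be sketched as follows. Registers hold plain `int` values here as a stand-in for VM objects, and the calling convention (explicit arguments first, captured free variables appended after them) is an assumption made for this example rather than something the text specifies. `Closure`, `AllocClosure`, and `ClosureCallArgs` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical closure value: the index of its code plus captured values.
struct Closure {
  std::size_t func_index;
  std::vector<int> free_vars;  // int stands in for a VM object
};

// Captures the current contents of the registers listed in `free_var_regs`.
inline Closure AllocClosure(const std::vector<int>& regs, std::size_t clo_index,
                            const std::vector<std::size_t>& free_var_regs) {
  Closure closure{clo_index, {}};
  for (std::size_t reg : free_var_regs) {
    closure.free_vars.push_back(regs.at(reg));
  }
  return closure;
}

// Builds the argument list for a call: explicit arguments, then captured
// values (this ordering is an assumption of the sketch).
inline std::vector<int> ClosureCallArgs(const Closure& closure, const std::vector<int>& regs,
                                        const std::vector<std::size_t>& arg_regs) {
  std::vector<int> args;
  for (std::size_t reg : arg_regs) {
    args.push_back(regs.at(reg));
  }
  args.insert(args.end(), closure.free_vars.begin(), closure.free_vars.end());
  return args;
}
```

Because the captured values are copied at allocation time, later writes to the source registers do not affect an existing closure in this sketch.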
|
||
LoadConst | ||
^^^^^^^^^ | ||
**Arguments**: | ||
:: | ||
RegName dst | ||
size_t const_index | ||
|
||
Load the constant at `const_index` from the constant pool. The result is saved to register `dst`. | ||
|
||
Object Representation | ||
~~~~~~~~~~~~~~~~~~~~~ | ||
We use a simple object representation that uses shared pointers and tagging. | ||
There is a huge space of possible object representation trade-offs, but we
believe micro-optimizing this code has little to no effect on the end-to-end performance. | ||
|
||
:: | ||
|
||
struct ObjectCell { | ||
ObjectTag tag; | ||
... | ||
}; | ||
|
||
struct Object { | ||
std::shared_ptr<ObjectCell> ptr; | ||
... | ||
};
|
||
See `include/tvm/runtime/vm.h` for more details. | ||
|
||
Currently, we support 3 types of objects: tensors, data types, and closures. | ||
|
||
:: | ||
|
||
VMObject VMTensor(const tvm::runtime::NDArray& data); | ||
VMObject VMDatatype(size_t tag, const std::vector<VMObject>& fields); | ||
VMObject VMClosure(size_t func_index, std::vector<VMObject> free_vars); | ||
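
A minimal sketch of the shared-pointer-plus-tag idea follows, restricted to the data type case. It is an illustration under assumptions: `DatatypeCell`, `MakeDatatype`, and `GetField` are hypothetical names for this example, and the real cell layouts are in `include/tvm/runtime/vm.h`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

enum class ObjectTag { kTensor, kDatatype, kClosure };

// Base cell: every heap object carries a tag identifying its kind.
struct ObjectCell {
  ObjectTag tag;
  explicit ObjectCell(ObjectTag t) : tag(t) {}
  virtual ~ObjectCell() = default;
};

// A VM value is a reference-counted handle to a cell.
struct Object {
  std::shared_ptr<ObjectCell> ptr;
};

// Hypothetical cell for an ADT value: a constructor tag plus its fields.
struct DatatypeCell : ObjectCell {
  std::size_t constructor_tag;
  std::vector<Object> fields;
  DatatypeCell(std::size_t ctor_tag, std::vector<Object> field_values)
      : ObjectCell(ObjectTag::kDatatype),
        constructor_tag(ctor_tag),
        fields(std::move(field_values)) {}
};

inline Object MakeDatatype(std::size_t tag, std::vector<Object> fields) {
  return Object{std::make_shared<DatatypeCell>(tag, std::move(fields))};
}

// Checks the tag before downcasting, then returns the requested field.
inline const Object& GetField(const Object& obj, std::size_t field_index) {
  assert(obj.ptr && obj.ptr->tag == ObjectTag::kDatatype);
  const auto& cell = static_cast<const DatatypeCell&>(*obj.ptr);
  assert(field_index < cell.fields.size());
  return cell.fields[field_index];
}
```

Copying an `Object` only bumps a reference count, so storing the same value in several fields or registers shares one underlying cell.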
|
||
|
||
Stack and State | ||
~~~~~~~~~~~~~~~ | ||
|
||
The Relay VM maintains a stack of frames, each of which contains information about how to resume the
previous call. Registers are allocated in a contiguous space (virtual register file) for each function.
|
||
We keep track of the set of Relay functions we have called, a pointer into the current function's bytecode, and an offset into that bytecode (known as the program counter).
|
||
:: | ||
|
||
struct VirtualMachine { | ||
... | ||
std::vector<VMFrame> frames; | ||
... | ||
// Current function. | ||
size_t func_index; | ||
// Pointer into the current function's instructions. | ||
const Instruction* code; | ||
// Current program counter relative to the code pointer. | ||
size_t pc; | ||
... | ||
}; | ||
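
The interplay between frames, the program counter, and `Ret` can be sketched as below. This is a simplified model under assumptions: registers hold `int` values instead of VM objects, each frame owns its function's register file, and the field and method names (`caller_func_index`, `return_pc`, `MiniVM`) are hypothetical rather than taken from `vm.h`.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using RegName = std::size_t;

// Hypothetical frame: how to resume the caller, plus the callee's registers.
struct VMFrame {
  std::size_t caller_func_index;
  std::size_t return_pc;
  std::vector<int> registers;  // int stands in for a VM object
};

struct MiniVM {
  std::vector<VMFrame> frames;
  std::size_t func_index = 0;
  std::size_t pc = 0;

  // Saves the caller's position, then starts the callee at its first instruction.
  void Invoke(std::size_t callee_index, std::size_t num_registers) {
    frames.push_back(VMFrame{func_index, pc + 1, std::vector<int>(num_registers, 0)});
    func_index = callee_index;
    pc = 0;
  }

  // Copies the callee's `result` register into the caller's `dst` register
  // and resumes the caller where it left off.
  void Ret(RegName dst, RegName result) {
    assert(frames.size() >= 2);
    VMFrame callee = std::move(frames.back());
    frames.pop_back();
    frames.back().registers.at(dst) = callee.registers.at(result);
    func_index = callee.caller_func_index;
    pc = callee.return_pc;
  }
};
```

Giving each call its own register file keeps `Ret` simple: the only cross-frame traffic is the single copy from `result` into the caller's `dst`.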
|
||
|
||
Dispatch Loop | ||
~~~~~~~~~~~~~ | ||
A critical piece of a VM is the dispatch loop. The dispatch loop usually dominates the execution time of a | ||
virtual machine, but we have experimentally found this not to be the case for Relay. We have implemented
a simple `switch`/`goto` dispatch loop which dispatches based on the instruction opcode.
|
||
This loop is implemented by `VirtualMachine::Run()`. | ||
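
A `switch`-based dispatch loop of the kind described here can be sketched in a few lines. The sketch covers only four instructions, uses `int` registers in place of VM objects, and packs operands into three generic fields; `Instr` and `Run` are hypothetical and far simpler than `VirtualMachine::Run()`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Opcode { LoadConst, If, Goto, Ret };

// Hypothetical fixed-size instruction; the meaning of a, b, c depends on op.
struct Instr {
  Opcode op;
  std::size_t a;
  std::size_t b;
  std::size_t c;
};

// Executes `code` until a Ret and returns the value of the returned register.
inline int Run(const std::vector<Instr>& code, const std::vector<int>& constants,
               std::size_t num_registers) {
  std::vector<int> regs(num_registers, 0);
  std::size_t pc = 0;
  while (true) {
    assert(pc < code.size());
    const Instr& instr = code[pc];
    switch (instr.op) {
      case Opcode::LoadConst:  // a = dst, b = const_index
        regs.at(instr.a) = constants.at(instr.b);
        ++pc;
        break;
      case Opcode::If:  // a = if_cond, b = true_offset, c = false_offset
        pc += (regs.at(instr.a) != 0) ? instr.b : instr.c;
        break;
      case Opcode::Goto:  // a = pc_offset
        pc += instr.a;
        break;
      case Opcode::Ret:  // a = result
        return regs.at(instr.a);
    }
  }
}
```

Each iteration does one tag check and one branch, which is cheap next to a single operator invocation such as GEMM; that is why the loop itself is not the bottleneck for tensor workloads.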
|
||
VM Compiler | ||
~~~~~~~~~~~ | ||
|
||
An important part of this infrastructure is a compiler from Relay's full IR into a sequence of bytecode. | ||
The VM compiler transforms a `tvm::relay::Module` into a `tvm::relay::vm::VirtualMachine`. The virtual
machine contains a set of compiled functions, each represented as a `tvm::relay::vm::Function`. Each function contains metadata about the function as well as its compiled bytecode. For full definitions of the data structures, see `vm.h`.
|
||
Optimizations | ||
~~~~~~~~~~~~~ | ||
|
||
There are quite a few optimizations required by the VM compiler. | ||
|
||
We have implemented them in the old pass style, but plan to port them to | ||
the new pass manager (#2546) before merging. | ||
|
||
Optimizations marked with `TODO` are not implemented yet. | ||
|
||
- A-Normal Form | ||
- Lambda Lift (see `src/relay/vm/lambda_lift.cc`) | ||
- Inline Primitives (see `src/relay/vm/inline_primitives.cc`) | ||
- Inliner (see `src/relay/pass/inliner.cc`) | ||
- Constant Pool Layout (see `src/relay/backend/vm/compiler.cc`) | ||
- ADT Tag Allocation (see `src/relay/backend/vm/compiler.cc`) | ||
- Tail Call Optimization (TODO) | ||
- Liveness Analysis (TODO) | ||
|
||
Serialization | ||
~~~~~~~~~~~~~ | ||
|
||
A final and yet-to-be-implemented part of the VM design is serialization. The accompanying PR will introduce both the bytecode and its serialization, as well as VM-level serialization. The design premise is that a VM can be efficiently stored to disk and resumed at a later time. This would also allow us to efficiently schedule many models onto a single machine in order to obtain good utilization.
|
||
Unresolved Questions | ||
~~~~~~~~~~~~~~~~~~~~ | ||
|
||
How do we handle dynamic shapes? | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
TODO | ||
|
||
How can we modify the VM to support JIT compilation of certain code paths? | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
In the code generation space there are still many trade-offs to be analyzed, and the VM is designed
to be very flexible so we can modify it for future experiments.
|
||
How do we support heterogeneous execution?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Heterogeneous execution should work out of the box assuming we have annotated the appropriate device copies.
In order to do this properly we need to run the device annotation and copying passes. |