diff --git a/docs/dev/relay_bring_your_own_codegen.rst b/docs/dev/relay_bring_your_own_codegen.rst index fa696446294c7..91cf60839f587 100644 --- a/docs/dev/relay_bring_your_own_codegen.rst +++ b/docs/dev/relay_bring_your_own_codegen.rst @@ -18,7 +18,6 @@ ============================= Bring Your Own Codegen To TVM ============================= -**Author**: `Zhi Chen `_, `Cody Hao Yu `_ As the number of hardware devices targeted by deep learning workloads keeps increasing, the required knowledge for users to achieve high performance on various devices keeps increasing as well. To free data scientists from worrying about the performance when developing a new model, hardware backend providers either provide libraries such as MKLDNN or cuDNN with many commonly used deep learning operators, or provide frameworks such as TensorRT to let users describe their models in a certain way to achieve high performance. However, users have to learn a new programming interface when they attempt to work on a new library or device. As a result, the demand for a unified programming interface becomes more and more important to 1) let all users and hardware backend providers stand on the same page, and 2) provide a feasible solution to allow specialized hardware or library to only support widely used operators with extremely high performance, but fallback unsupported operators to general devices like CPU/GPU. @@ -136,7 +135,7 @@ Here we highlight the notes marked in the above code: * **Note 3** is a TVM runtime compatible wrapper function. It accepts a list of input tensors and one output tensor (the last argument), casts them to the right data type, and invokes the subgraph function described in Note 2. In addition, ``TVM_DLL_EXPORT_TYPED_FUNC`` is a TVM macro that generates another function ``gcc_0`` with unified the function arguments by packing all tensors to ``TVMArgs``. As a result, the TVM runtime can directly invoke ``gcc_0`` to execute the subgraph without additional efforts. 
With the above code generated, TVM is able to compile it along with the rest parts of the graph and export a single library for deployment. -In the rest of this section, we will implement a codegen step-by-step to generate the above code. Your own codegen has to be located at ``src/relay/backend/contrib//``. In our example, we name our codegen "codegen_c" and put it under ``src/relay/backend/contrib/codegen_c/codegen.cc``. Feel free to check this file for a complete implementation. +In the rest of this section, we will implement a codegen step-by-step to generate the above code. Your own codegen has to be located at ``src/relay/backend/contrib//``. In our example, we name our codegen "codegen_c" and put it under `here`_. Feel free to check this file for a complete implementation. Specifically, we are going to implement two classes in this file and here is their relationship: @@ -149,7 +148,7 @@ Specifically, we are going to implement two classes in this file and here is the ---------------------------------------- ------------------------ generated C source runtime module generated C code -When TVM backend finds a function (subgraph) in a Relay graph is annotated with the registered compiler tag (``ccompiler`` in this example), TVM backend invokes ``CSourceCodegen`` and passes the subgraph. ``CSourceCodegen``'s member function ``CreateCSourceModule`` will 1) generate C code for the subgraph, and 2) wrap the generated C code to a C source runtime module for TVM backend to compile and deploy. In particular, the C code generation is transparent to ``CodegenC`` class because it provides many useful utilities to ease the code generation implementation. The following sections will implement these two classes in bottom-up order. +When TVM backend finds a function (subgraph) in a Relay graph is annotated with the registered compiler tag (``ccompiler`` in this example), TVM backend invokes ``CSourceCodegen`` and passes the subgraph. 
``CSourceCodegen``'s member function ``CreateCSourceModule`` will 1) generate C code for the subgraph, and 2) wrap the generated C code to a C source runtime module for TVM backend to compile and deploy. In particular, the C code generation is transparent to the ``CodegenC`` class because it provides many useful utilities to ease the code generation implementation. The following sections will implement these two classes in the bottom-up order. Implement CodegenC ================== @@ -362,7 +361,7 @@ The final part in this codegen class is a ``JIT`` function that emits a C functi The above call will generate three functions (one from the TVM wrapper macro): -1. The subgraph function ``gcc_0_`` (with one more underline at the end of the function name) with all C code we generated to execute a subgaph. +1. The subgraph function ``gcc_0_`` (with one more underline at the end of the function name) with all C code we generated to execute a subgraph. 2. The wrapper function ``gcc_0__wrapper_`` with a list of ``DLTensor`` arguments that casts data to the right type and invokes ``gcc_0_``. @@ -385,7 +384,7 @@ All variables (``ext_func_id``, etc) we passed are class variables and were fill Implement CSourceCodegen ======================== -Again, let's create a class skeleton and implement required functions. Note that it inherits ``CSourceModuleCodegenBase`` +Again, let's create a class skeleton and implement the required functions. Note that it inherits ``CSourceModuleCodegenBase`` .. code-block:: c++ @@ -493,11 +492,11 @@ Finally, we register this function to TVM backend: .. code-block:: c++ - TVM_REGISTER_API("relay.ext.ccompiler").set_body_typed(CCompiler); + TVM_REGISTER_GLOBAL("relay.ext.ccompiler").set_body_typed(CCompiler); where ``ccompiler`` is a customized tag to let TVM know this is the codegen it should use to generate and offload subgraphs when the subgraph is annotated with ``ccompiler``. 
-Finally, a good practice is to set up a CMake configuration flag to include your compiler only for your customers. We first create a cmake file: `cmake/modules/contrib/CODEGENC.cmake`: +Finally, a good practice is to set up a CMake configuration flag to include your compiler only for your customers. We first create a cmake file: ``cmake/modules/contrib/CODEGENC.cmake``: .. code-block:: cmake @@ -506,7 +505,7 @@ Finally, a good practice is to set up a CMake configuration flag to include your list(APPEND COMPILER_SRCS ${CSOURCE_RELAY_CONTRIB_SRC}) endif(USE_CODEGENC) -So that users can configure whether to include your compiler when configuring TVM using `config.cmake`: +So that users can configure whether to include your compiler when configuring TVM using ``config.cmake``: .. code-block:: cmake @@ -516,10 +515,444 @@ So that users can configure whether to include your compiler when configuring TV Implement a Codegen for Your Representation ******************************************* -Although we have demonstrated how to implement a C codegen, your hardware may require other forms of graph representation, such as JSON. In this case, you can slightly modify ``CodegenC`` class we have implemented to generate your own graph representation, and implement a customized runtime module to let TVM runtime know how this graph representation should be executed. **(TBA)** +Although we have demonstrated how to implement a C codegen, your hardware may require other forms of graph representation, such as JSON. In this case, you could modify ``CodegenC`` class we have implemented to generate your own graph representation and implement a customized runtime module to let TVM runtime know how this graph representation should be executed. -Implement CodegenJSON -===================== +To simplify, we define a graph representation named "ExampleJSON" in this guide. ExampleJSON does not mean the real JSON but just a simple representation for graphs without a control flow. 
For example, assuming we have the following subgraph named ``subgraph_0``: + +:: + + input0 + | + add <-- input1 + | + subtract <-- input2 + | + multiply <-- input3 + | + out + +Then the ExampleJSON of this subgraph looks like: + +.. code-block:: json + + subgraph_0 + input 0 10 10 + input 1 10 10 + input 2 10 10 + input 3 10 10 + add 4 inputs: 0 1 shape: 10 10 + sub 5 inputs: 4 2 shape: 10 10 + mul 6 inputs: 5 3 shape: 10 10 + +The ``input`` keyword declares an input tensor with its ID and shape, while the other statements describe computations in ``[op] [output ID] inputs: [input ID] shape: [shape]`` syntax. + +In this section, our goal is to implement the following customized TVM runtime module to execute ExampleJSON graphs. + +.. code-block:: c++ + + runtime::Module ExampleJsonCompiler(const NodeRef& ref) { + ExampleJsonCodeGen codegen; + std::string code = codegen.gen(ref); // Note 1 + const auto* pf = runtime::Registry::Get("module.examplejson_module_create"); // Note 2 + CHECK(pf != nullptr) << "Cannot find ExampleJson module to create the external runtime module"; + return (*pf)(code); + } + TVM_REGISTER_GLOBAL("relay.ext.examplejsoncompiler").set_body_typed(ExampleJsonCompiler); + +**Note 1**: We will implement a customized codegen later that takes a subgraph and generates an ExampleJSON code string. + +**Note 2**: This line obtains a pointer to a function for creating the customized runtime module. You can see that it takes the subgraph code in ExampleJSON format we just generated and initializes a runtime module. + +In the following sections, we are going to introduce 1) how to implement ``ExampleJsonCodeGen`` and 2) how to implement and register ``examplejson_module_create``. + +Implement ExampleJsonCodeGen +============================ + +Similar to the C codegen, we also derive ``ExampleJsonCodeGen`` from ``ExprVisitor`` to make use of the visitor pattern for subgraph traversal.
On the other hand, we do not have to inherit ``CodegenCBase`` because we do not need TVM C++ wrappers. The codegen class is implemented as follows: + +.. code-block:: c++ + + #include + #include + #include + #include + #include + + #include + #include + + namespace tvm { + namespace relay { + namespace contrib { + + class ExampleJsonCodeGen : public ExprVisitor { + public: + explicit ExampleJsonCodeGen(); + + // Note 1 + void VisitExpr_(const VarNode* node) { /* Skip in this example. */ } + void VisitExpr_(const CallNode* call) final { /* Skip in this example. */ } + + // Note 2 + std::string gen(const NodeRef& ref) { + this->code = ""; + if (ref->IsInstance<FunctionNode>()) { + this->VisitExpr(Downcast<Function>(ref)); + } else if (ref->IsInstance<relay::ModuleNode>()) { + relay::Module mod = Downcast<relay::Module>(ref); + for (const auto& it : mod->functions) { + this->VisitExpr(Downcast<Function>(it.second)); + } + } else { + LOG(FATAL) << "The input ref is expected to be a Relay function or module"; + } + return this->code; + } + + private: + /*! \brief The generated ExampleJSON code. */ + std::string code; + }; + +**Note 1**: We again implement the corresponding visitor functions to generate ExampleJSON code and store it in a class variable ``code`` (we skip the visitor function implementation in this example because the concepts are basically the same as in the C codegen). After the graph traversal finishes, we should have an ExampleJSON graph in ``code``. + +**Note 2**: We define an internal API ``gen`` that takes a subgraph and generates ExampleJSON code. You can give this API any name you prefer. + +The next step is to implement a customized runtime to make use of the output of ``ExampleJsonCodeGen``. + +Implement a Customized Runtime +============================== + +In this section, we will implement a customized TVM runtime step-by-step and register it to TVM runtime modules. The customized runtime should be located at ``src/runtime/contrib//``.
In our example, we name our runtime "example_ext_runtime" and put it under `here`_. Feel free to check this file for a complete implementation. + +Again, we first define a customized runtime class as follows. The class has to be derived from TVM ``ModuleNode`` in order to be compatible with other TVM runtime modules. + +.. code-block:: c++ + + #include + #include + #include + #include + #include + #include + #include + #include + + #include + #include + #include + #include + #include + #include + + namespace tvm { + namespace runtime { + class ExampleJsonModule : public ModuleNode { + public: + explicit ExampleJsonModule(std::string graph_json); + + PackedFunc GetFunction(const std::string& name, + const ObjectPtr<Object>& sptr_to_self) final; + + const char* type_key() const { return "examplejson"; } + + void SaveToBinary(dmlc::Stream* stream) final; + + static Module LoadFromBinary(void* strm); + + static Module Create(const std::string& path); + + std::string GetSource(const std::string& format = ""); + + void Run(int id, const std::vector<int>& inputs, int output); + + void ParseJson(const std::string& json); + + private: + /*! \brief The json string that represents a computational graph. */ + std::string graph_json_; + /*! \brief The subgraph that is being processed. */ + std::string curr_subgraph_; + /*! \brief A simple graph from subgraph id to node entries. */ + std::map<std::string, std::vector<NodeEntry> > graph_; + /*! \brief A simple pool to contain the tensor for each node in the graph. */ + std::vector<NDArray> data_entry_; + /*! \brief A mapping from node id to op name. */ + std::vector<std::string> op_id_; + }; + +In particular, there are some functions derived from ``ModuleNode`` that we must implement in ``ExampleJsonModule``: + +* Constructor: The constructor of this class should accept a subgraph (in your representation), then process and store it in any format you like. The saved subgraph can be used by the following two functions. + +* ``GetFunction``: This is the most important function in this class.
When TVM runtime wants to execute a subgraph with your compiler tag, TVM runtime invokes this function from your customized runtime module. It provides the function name as well as runtime arguments, and ``GetFunction`` should return a packed function implementation for TVM runtime to execute. + +* ``SaveToBinary`` and ``LoadFromBinary``: ``SaveToBinary`` serializes the runtime module to a binary format for later deployment. This function will be called by TVM when users use the ``export_library`` API. On the other hand, since we are now using our own graph representation, we have to make sure that ``LoadFromBinary`` is able to construct the same runtime module by taking the serialized binary generated by ``SaveToBinary``. + +* ``GetSource`` (optional): If you would like to see the generated ExampleJSON code, you can implement this function to dump it; otherwise you can skip the implementation. + +Other functions and class variables will be introduced along with the implementation of the above must-have functions. + +Implement Constructor +--------------------- + +.. code-block:: c++ + + explicit ExampleJsonModule(std::string graph_json) { + this->graph_json_ = graph_json; + ParseJson(this->graph_json_); + } + +Then, we implement ``ParseJson`` to parse a subgraph in ExampleJSON format and construct a graph in memory for later use. Since we do not support subgraphs with branches in this example, we simply use an array to store every node of a subgraph in order. + +.. 
code-block:: c++ + + void ParseJson(const std::string& json) { + std::string line; + std::string curr_subgraph; + std::stringstream ss(json); + + while (std::getline(ss, line, '\n')) { + std::stringstream ss2(line); + std::string token; + int id = 0; + + ss2 >> token; + if (token.find("subgraph_") != std::string::npos) { + curr_subgraph = token; + continue; + } + + ss2 >> id; + if (op_id_.size() <= static_cast<size_t>(id)) { + op_id_.resize(id + 1); + data_entry_.resize(id + 1); + } + + int64_t total_elements = 1; + std::vector<int64_t> shape; + if (token == "input") { + int64_t size = 0; + while (ss2 >> size) { + total_elements *= size; + shape.push_back(size); + } + } else { + op_id_[id] = token; // Note 1 + bool shape_data = false; + NodeEntry entry; + while (ss2 >> token) { + if (token == "shape:") { + shape_data = true; + } else if (shape_data) { + total_elements *= std::stoll(token); + shape.push_back(std::stoll(token)); + } else if (token != "inputs:") { + entry.inputs.push_back(std::stoi(token)); + } + } + entry.id = id; + entry.output = id; + graph_[curr_subgraph].push_back(entry); // Note 2 + } + DLContext ctx; + ctx.device_type = static_cast<DLDeviceType>(1); + ctx.device_id = 0; + data_entry_[id] = NDArray::Empty(shape, DLDataType{kDLFloat, 32, 1}, ctx); // Note 3 + } + } + +**Note 1**: We use a class variable ``op_id_`` to map from subgraph node ID to the operator name (e.g., ``add``) so that we can invoke the corresponding operator function at runtime. + +**Note 2**: We use a class variable ``graph_`` to map from subgraph name to an array of nodes. ``GetFunction`` will query graph nodes by subgraph name at runtime. + +**Note 3**: We use a class variable ``data_entry_`` to map from a subgraph node ID to a tensor data placeholder. We will put inputs and outputs into the corresponding data entries at runtime. + +Implement GetFunction +--------------------- + +After the construction, we should have the above class variables ready.
We then implement ``GetFunction`` to provide executable subgraph functions to TVM runtime: + +.. code-block:: c++ + + PackedFunc GetFunction(const std::string& name, + const ObjectPtr<Object>& sptr_to_self) final { + if (this->graph_.find(name) != this->graph_.end()) { + this->curr_subgraph_ = name; + return PackedFunc([sptr_to_self, this](TVMArgs args, TVMRetValue* rv) { + + // Copy input tensors to corresponding data entries. + for (auto i = 0; i < args.size(); ++i) { + CHECK(args[i].type_code() == kNDArrayContainer || args[i].type_code() == kArrayHandle) + << "Expect NDArray or DLTensor as inputs\n"; + if (args[i].type_code() == kArrayHandle) { + DLTensor* arg = args[i]; + this->data_entry_[i].CopyFrom(arg); + } else { + NDArray arg = args[i]; + this->data_entry_[i].CopyFrom(arg); + } + } + + // Execute the subgraph. + for (const auto& it : this->graph_[this->curr_subgraph_]) { + this->Run(it.id, it.inputs, it.output); + } + CHECK_GT(graph_.count(this->curr_subgraph_), 0U); + + // Copy the output from a data entry back to TVM runtime argument. + auto out_idx = graph_[this->curr_subgraph_].back().output; + if (args[args.size() - 1].type_code() == kArrayHandle) { + DLTensor* arg = args[args.size() - 1]; + this->data_entry_[out_idx].CopyTo(arg); + } else { + NDArray arg = args[args.size() - 1]; + this->data_entry_[out_idx].CopyTo(arg); + } + *rv = data_entry_.back(); + }); + } else { + LOG(FATAL) << "Unknown subgraph: " << name << "\n"; + return PackedFunc(); + } + } + +As can be seen, ``GetFunction`` is composed of three major parts. The first part copies data from TVM runtime arguments to the corresponding data entries we assigned in the constructor. The second part executes the subgraph with the ``Run`` function (implemented later) and saves the results to another data entry. The third part copies the results from the output data entry back to the corresponding TVM runtime argument for output. + +Implement Run +------------- + +Now let's implement the ``Run`` function.
This function accepts 1) a subgraph node ID, 2) a list of input data entry indices, and 3) an output data entry index. + +.. code-block:: c++ + + void Run(int id, const std::vector<int>& inputs, int output) { + // Make a list of data entry indices. + std::vector<int> args(inputs.begin(), inputs.end()); + args.push_back(output); + + // Initialize data holders. + std::vector<TVMValue> values(args.size()); + std::vector<int> type_codes(args.size()); + + // Initialize a TVM arg setter with TVMValue and its type code. + TVMArgsSetter setter(values.data(), type_codes.data()); + + // Set each argument to its corresponding data entry. + if (op_id_[id] == "add" || op_id_[id] == "sub" || op_id_[id] == "mul") { + for (size_t i = 0; i < args.size(); i++) { + setter(i, data_entry_[args[i]]); + } + } + + // Invoke the corresponding operator function. + if (op_id_[id] == "add") { + Add(values.data(), type_codes.data(), args.size()); + } else if (op_id_[id] == "sub") { + Sub(values.data(), type_codes.data(), args.size()); + } else if (op_id_[id] == "mul") { + Mul(values.data(), type_codes.data(), args.size()); + } else { + LOG(FATAL) << "Unknown op: " << op_id_[id] << "\n"; + } + } + +The ``Run`` function has two main parts. The first part allocates a list of ``TVMValue`` and maps the corresponding data entry blocks to it. These become the arguments of our operator functions. The second part then invokes our operator functions. Although we use the same C functions as the previous example, you can replace ``Add``, ``Sub``, and ``Mul`` with your own engine. You only need to make sure your engine stores the results to the last argument so that they can be transferred back to TVM runtime. + +With the above functions implemented, our customized codegen and runtime can now execute subgraphs. The last step is registering an API (``examplejson_module_create``) to create this module: + +.. 
code-block:: c++ + + TVM_REGISTER_GLOBAL("module.examplejson_module_create") + .set_body_typed([](std::string code){ + auto n = make_object<ExampleJsonModule>(code); + return runtime::Module(n); + }); + +Implement SaveToBinary and LoadFromBinary +----------------------------------------- + +So far we have implemented the main features of a customized runtime so that it can be used like other TVM runtimes. However, when users want to save the built runtime to disk for deployment, TVM has no idea how to save it. This is why we implement ``SaveToBinary`` and ``LoadFromBinary``, which tell TVM how this customized runtime should be persisted and restored. + +We first implement the ``SaveToBinary`` function to allow users to save this module to disk. + +.. code-block:: c++ + + void SaveToBinary(dmlc::Stream* stream) final { + stream->Write(this->graph_json_); + } + +This function is pretty simple. Recall that the only argument the constructor takes is a subgraph representation, meaning that we only need a subgraph representation to construct/recover this customized runtime module. As a result, ``SaveToBinary`` simply writes the subgraph to an output DMLC stream. That is, when users use the ``export_library`` API to export the module, the customized module will be an ExampleJSON stream of a subgraph. + +Similarly, ``LoadFromBinary`` reads the subgraph stream and reconstructs the customized runtime module: + +.. code-block:: c++ + + static Module LoadFromBinary(void* strm) { + dmlc::Stream* stream = static_cast<dmlc::Stream*>(strm); + std::string graph_json; + stream->Read(&graph_json); + auto n = tvm::runtime::make_object<ExampleJsonModule>(graph_json); + return Module(n); + } + +We also need to register this function to enable the corresponding Python API: + +.. 
code-block:: c++ + + TVM_REGISTER_GLOBAL("module.loadbinary_examplejson") + .set_body_typed(ExampleJsonModule::LoadFromBinary); + +The above registration means that when users call the ``tvm.module.load(lib_path)`` API and the exported library has an ExampleJSON stream, our ``LoadFromBinary`` will be invoked to create the same customized runtime module. + +In addition, if you want to support module creation directly from an ExampleJSON file, you can also implement a simple function and register a Python API as follows: + +.. code-block:: c++ + + static Module Create(const std::string& path) { + std::ifstream filep; + filep.open(path, std::ios::in); + std::string graph_json; + std::string line; + while (std::getline(filep, line)) { + graph_json += line; + graph_json += "\n"; + } + filep.close(); + auto n = tvm::runtime::make_object<ExampleJsonModule>(graph_json); + return Module(n); + } + + TVM_REGISTER_GLOBAL("module.loadfile_examplejson") + .set_body([](TVMArgs args, TVMRetValue* rv) { + *rv = ExampleJsonModule::Create(args[0]); + }); + +This means users can manually write/modify an ExampleJSON file and use the Python API ``tvm.module.load("mysubgraph.examplejson", "examplejson")`` to construct a customized module. + +******* +Summary +******* + +In summary, here is a checklist for you to refer to: + +* A codegen class derived from ``ExprVisitor`` and ``CodegenCBase`` (only for C codegen) with the following functions. + + * ``VisitExpr_(const CallNode* call)`` to collect call node information. + * Other visitor functions you need to collect subgraph information. + * ``JIT`` to generate subgraph code. + * Register codegen. + +* A function to create ``CSourceModule`` (for C codegen). + +* A runtime module class derived from ``ModuleNode`` with the following functions (for your graph representation). + + * Constructor. + * ``GetFunction`` to generate a TVM runtime compatible ``PackedFunc``. + * ``Run`` to execute a subgraph. + * Register a runtime creation API. 
+ * ``SaveToBinary`` and ``LoadFromBinary`` to serialize/deserialize customized runtime module. + * Register ``LoadFromBinary`` API to support ``tvm.module.load(your_module_lib_path)``. + * (optional) ``Create`` to support customized runtime module construction from subgraph file in your representation. -Implement Customized Runtime -============================ \ No newline at end of file +* An annotator to annotate a user Relay program to make use of your compiler and runtime (TBA).
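
To make the ExampleJSON flow described in this guide more concrete, the following is a minimal, TVM-independent sketch of the parse-then-run idea behind ``ParseJson`` and ``Run``. It is only an illustration under stated assumptions: ``Node``, ``Graph``, ``Parse``, and ``Execute`` are hypothetical names that do not exist in TVM, flat ``std::vector<float>`` buffers stand in for ``NDArray``, and the input is assumed to be a well-formed, branch-free subgraph with binary elementwise operators only.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the guide's NodeEntry (not a TVM type).
struct Node {
  std::string op;           // "add", "sub", or "mul"
  int output;               // ID of the data entry that receives the result
  std::vector<int> inputs;  // IDs of the data entries this node reads
};

// Hypothetical stand-in for graph_ plus data_entry_ of one subgraph.
struct Graph {
  std::vector<Node> nodes;                 // nodes in execution order
  std::map<int, std::vector<float>> data;  // data entry ID -> flat buffer
};

// Parse one subgraph written in the ExampleJSON line format.
Graph Parse(const std::string& text) {
  Graph g;
  std::istringstream lines(text);
  std::string line;
  while (std::getline(lines, line)) {
    std::istringstream ss(line);
    std::string token;
    int id = 0;
    // Skip empty lines and the "subgraph_N" header line.
    if (!(ss >> token) || token.find("subgraph_") != std::string::npos) continue;
    ss >> id;
    Node node{token, id, {}};
    int64_t total = 1;
    bool in_shape = (token == "input");  // input lines list only the shape
    while (ss >> token) {
      if (token == "shape:") {
        in_shape = true;
      } else if (in_shape) {
        total *= std::stoll(token);
      } else if (token != "inputs:") {
        node.inputs.push_back(std::stoi(token));
      }
    }
    // Allocate a zero-filled buffer for this data entry.
    g.data[id].assign(static_cast<size_t>(total), 0.0f);
    if (node.op != "input") g.nodes.push_back(node);
  }
  return g;
}

// Execute every node in order, writing each result to its output entry.
void Execute(Graph* g) {
  for (const Node& n : g->nodes) {
    assert(n.inputs.size() == 2);
    const std::vector<float>& a = g->data[n.inputs[0]];
    const std::vector<float>& b = g->data[n.inputs[1]];
    std::vector<float>& out = g->data[n.output];
    assert(a.size() == out.size() && b.size() == out.size());
    for (size_t i = 0; i < out.size(); ++i) {
      if (n.op == "add") out[i] = a[i] + b[i];
      else if (n.op == "sub") out[i] = a[i] - b[i];
      else if (n.op == "mul") out[i] = a[i] * b[i];
    }
  }
}
```

A caller would fill the input entries of the ``Graph`` returned by ``Parse``, call ``Execute``, and read the buffer of the last node's ``output`` ID, which mirrors how ``GetFunction`` copies arguments in and the final data entry out.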