[Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc #9038

mbs-octoml · 2021-09-17T20:31:05Z

Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it
depends on a side-input DeviceMap. We'd like to remove that side-input, and
instead recover the Device (and, ultimately, Target) for each (fused) primitive
call from the AST alone.

By doing so we also avoid needing to perform device planning twice:

It needs to be done before lowering so we know which primitives need
to be compiled for which devices.
It then needs to be re-done after lowering and optimization as a prelude
to memory planning.
By baking the device plan into the AST we can simply do device planning before
lowering, and run memory planning later, both as ordinary passes.

While working on that issue we realized we currently have 3 'device planners':

transforms/device_annotation.cc, which supports only a small subset of Relay
and uses a simple top-down algorithm to assign a device to every
sub-expression.
analysis/context_analysis.cc, which makes a galant effort to support most of
Relay, is based on unification rather than a top-down algorithm, but handles
higher order functions by ad-hoc and fragile inlining.
transforms/annotate_target.cc, which works on Targets instead of Devices, but
is otherwise like 'device planning'.
We'd like to bring these together.

In this PR we introduce a new transforms/device_planner.cc intended to replace
transforms/device_annotation.cc and analysis/context_analysis.cc. We don't
delete those two just yet since we need to switch all users off of them in the
next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC
to bring devices and targets together sensibly, but have it firmly in our
sights.

transforms/device_planner.cc is based on analysis/context_analysis.cc, but
is heavily reworked to:

Handle starting from existing "on_device" annotations as well as existing
"device_copy" calls.
Be idempotent, with the idea we'll probably need to re-run it to 'refine'
device planning to account for storge scopes.
Robustly handle all of Relay, particularly higher-order functions. For that
we replace the inlining approach in analysis/context_analysis.cc with a
higher-order unification domain.
Be a little more systematic with defaulting.
Capture the result of the analysis within the AST as new "device_copy" calls
at device boundaries, and new/replaced "on_device" calls wherever the device
for a sub-expression is not already 'obvious' from the sub-expression's
lexical scope.
Provide helper visitors for passes which need to ask for the device for
any sub-expression they are processing and/or preserve device information
on rewrites. Those passes include:
- backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
- backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
- backend/te_compiler.cc (LowerTensorExprMutator)
- backend/vm/lambda_lift.cc (LambdaLifter)
- transforms/memory_alloc.cc (DialectRewriter)
- transforms/to_a_normal_form.cc (Fill)
- backend/vm/compiler.cc (VMFunctionCompiler)
  However we won't change any of those in this PR.

See the draft #8788 for the end game, I'll be peeling PRs out of that.

mbs-octoml · 2021-09-17T20:46:10Z

@electriclilies @mikepapadim @jroesch Here's part 1. Things to watch out for:

How to I may device_copy.cc branch from context_analysis.cc?
Did I go too far with device defaulting?
All the visitor helpers are needed in follow up but not used here, so sorry it's not obvious why I set things up like that. But suggestion for reducing code dup would be welcome.
I tried to make the intro comment comprehensive, let me know what needs more detail.
Anything non-obvious.
Thanks!

mbs-octoml · 2021-09-17T20:47:37Z

Oh, and this is on top of #9012 which is still in-flight, sorry about that.

jroesch · 2021-09-22T07:08:44Z

tests/python/relay/test_pass_plan_devices.py

@@ -0,0 +1,1405 @@
+# Licensed to the Apache Software Foundation (ASF) under one


Now this is a test file!

Not to rain on your parade but it might be worth directly using text format here?

Yeah it's pretty awful and I nearly switched a few times already. Which format should I use? Can it capture the attributes?

+1 for relay text format, i think that would greatly help

jroesch · 2021-09-22T07:12:41Z

src/relay/transforms/device_planner.cc

+ */
+
+/*!
+ * \file src/relay/analysis/device_planner.cc


I read everything except main file, will tackle this part tomorrow.

Thanks. With apologies, but I ended up hoisting out everything except the main into #9077, so I think you could directly LGTM that now?

I'm trying to model good PR hygiene :-)

areusch

@mbs-octoml did an initial pass, don't think i understand everything yet, but i can see the pieces here. thanks for doing this!!

areusch · 2021-09-24T17:49:24Z

include/tvm/relay/attrs/annotation.h

  int device_type;
+  /*!
+   * \brief If true, the result device must also be \p device_type and device planning should


not sure i quite understand--isn't this what the struct-level brief above says it does? what happens if is_fixed == false? if we are going to insert a device copy node, that implies the data must be needed elsewhere. how can we merely attach an attribute to a result saying a copy is simply out of the question?

This is because the association of expression to device is a 'constraint', and by default "on_device" on constrains the inner expr and not the overall expr. However to make plan_devices idempotent with the current approach I need some way to fully specify devices.

areusch · 2021-09-24T17:51:59Z

python/tvm/relay/op/annotation/annotation.py

+
+
+# for testing only
+def function_on_device(function, param_devices, result_device):


should we call this testonly_function_on_device then or place in python/tvm/relay/op/annotation/test_utils.py?

I ended up fully commenting this one in a prior PR.

areusch · 2021-09-24T17:52:51Z

src/relay/op/annotation/annotation.cc

+  return Call(OnDeviceOp(), {std::move(expr)}, Attrs(std::move(attrs)), /*type_args=*/{}, span);
+}
+
+Expr OptOnDevice(Expr expr, DLDeviceType device_type, bool is_fixed) {


opt=Optional?

Ah, old personal naming convention but it clashes with Optimize. Switching to Maybe (in the supporting PR).

areusch · 2021-09-24T17:53:40Z

src/relay/op/annotation/annotation.cc

    .set_attr<FTVMCompute>("FTVMCompute",
                           [](const Attrs& attrs, const Array<te::Tensor>& inputs,
                              const Type& out_type) -> Array<te::Tensor> {
                             return {topi::identity(inputs[0])};
                           });

+OnDeviceProps GetOnDeviceProps(const CallNode* call_node) {
+  if (call_node->op == OnDeviceOp()) {
+    ICHECK_EQ(call_node->args.size(), 1) << "on_device expects one argument";


want to print the args or op node in this case to help debug? same question below

I've gotten all the way through to unit testing the final PR without needing it, so how about I leave it. I tend to add them as I need them.

areusch · 2021-09-24T17:55:40Z

src/relay/op/annotation/annotation.h

+ */
+OnDeviceProps GetOnDeviceProps(const Expr& expr);
+
+/*! \brief Returns true if \p expr is an on_device CallNode. */


this will CHECK-fail if there is an ill-formed on_device CallNode tho, not sure if that should get documented?

Turns out this is dead code now so nuked.

areusch · 2021-09-24T18:12:22Z

src/relay/transforms/device_planner.cc

+/*!
+ * \brief Rewrites "on_device" calls to handle some special cases.
+ */
+class RewriteOnDevices : public ExprMutator {


dang this file is getting a bit long..

Yeah might be worth splitting out into like device_domain.{h, cc} and then putting the core pass parts in their own file.

Good call, will do.

areusch · 2021-09-24T18:13:11Z

src/relay/transforms/device_planner.cc

+      domains_->Unify(func_domain, implied_domain);  // higher-order
+    } catch (const Error& e) {
+      // TODO(mbs): Proper diagnostics.
+      LOG(FATAL) << "Function parameters and result devices do not match those of call. Call:"


!! fatal! probably we need to address the TODO before merging...i think?

I think we just need to handle these when we replace the existing passes with this one, overall would be fine to build action item list from this, land a version and then let mark work on follow up to remove passes.

Yeah this is a big improvement on the context_analysis.cc (which will CHECK fail with no context). Let's do this as CORE-75.

areusch · 2021-09-24T18:15:46Z

src/relay/transforms/device_planner.cc

+    return FunctionOnDevice(func, param_device_types, result_device_type);
+  }
+
+  Expr VisitExpr_(const CallNode* call_node) final {


i roughly understand what's happening, but i think example Relay snippets would help to better understand/maintain each Pass here.

I added some simple examples to each of the internal pass class comments.

areusch · 2021-09-24T18:18:17Z

tests/python/relay/test_pass_plan_devices.py

@@ -0,0 +1,1405 @@
+# Licensed to the Apache Software Foundation (ASF) under one


+1 for relay text format, i think that would greatly help

areusch · 2021-09-24T18:19:20Z

tests/python/relay/test_pass_plan_devices.py

+
+
+if __name__ == "__main__":
+    test_plain()


sys.exit(pytest.main([__file__] + sys.argv[1:]))

one day we will write a helper :)

ping pong on this one

These changes came from changing apache#9038 to use tvm.parser.fromtext instead of manual AST construction. - Demote FunctionOnDeviceAttrs to just a pair of DictAttrs entries so that the parser will understand them on Function definitions. - Connect some special operators to their attributes so parsing understands them at call sites. - Don't silently ignore attributes during parsing. - Implement OptFunctionOnDevice so won't add device annotations for kUnknownDeviceType. - Allow the parser to be given an initial metadata map to support examples which need constants. - More DLOG -> VLOG conversions to reduce debug clutter.

mbs-octoml · 2021-09-26T15:20:35Z

While converting the unit tests to script form I ended up needing #9130.

I'll work through the remaining comments now.

jroesch

I think everything makes sense, big asks would to be splitting up the passes into smaller files with a better separation of the pieces and adding diagnostics and other features. I am totally OK landing this version and building action item list to do before we turn this on as default pass.

jroesch · 2021-09-27T21:12:30Z

src/parser/parser.cc

@@ -1088,7 +1088,7 @@ class Parser {
    VLOG(0) << "Parser::ParseFunctionDef";
    return WithSpan<Function>([&]() {
      PushScope();
-      PushTypeScope();
+      PushTypeScope();  // TODO(mbs): BUG?


I think the second PushTypeScope() is in error and I'll delete it. You unconditionally pop one scope further down.

jroesch · 2021-09-27T21:14:16Z

src/relay/op/annotation/annotation.cc

-  attrs->result_device_type = result_device_type;
-  return WithAttr(std::move(function), attr::kFunctionAttrsKey, Attrs(std::move(attrs)));
+                          Integer result_device_type) {
+  return WithAttr(WithAttr(std::move(function), tvm::attr::kParamDeviceTypes, param_device_types),


Can we maybe add a WithAttr overload that takes a Map instead of calling multiple times?

jroesch · 2021-09-27T21:19:48Z

tests/python/relay/test_pass_plan_devices.py

+
+
+if __name__ == "__main__":
+    test_plain()


ping pong on this one

jroesch · 2021-09-27T21:31:47Z

src/relay/transforms/device_planner.cc

+ * -------
+ * After flowing constraints we apply some defaulting heuristics (using a global default device)
+ * to fix the device for any as-yet unconstrained sub-expressions.
+ *  - Unconstrained function result devices default to the global default device.


This is the "today behavior" technically, we should probably discuss what people expect.

jroesch · 2021-09-27T21:32:28Z

src/relay/transforms/device_planner.cc

+ *  - Unconstrained function result devices default to the global default device.
+ *  - Unconstrained function parameters devices default to the device for the function result.
+ *  - Unconstrained let-bound expression devices default to the device for the overall let.
+ * TODO(mbs): I may have over-innovated here and we simply want to bind all free domaints to


Yeah the code is setup to "innovate" here from my PoV but today's behavior is super conservative everything defaults to default device which is usually host.

jroesch · 2021-09-27T21:33:31Z

src/relay/transforms/device_planner.cc

+ *    incorporate targets and memory scopes into the domain. For example it's ok for the function
+ *    body to be executed on different device ids provided they have the same target and memory
+ *    scope.
+ *  * Might be simpler to just let every type have a device annotation rather than work in


We have discussed this one before

cc @areusch

jroesch · 2021-09-27T21:34:57Z

src/relay/transforms/device_planner.cc

+      if (lhs->device_type_ != rhs->device_type_) {
+        // TODO(mbs): Proper diagnostics.
+        std::ostringstream os;
+        os << "Inconsistent device types " << lhs->device_type_ << " and " << rhs->device_type_;


Yeah I think we can do this, just need to do some plumbing, Mark and I can add tracking for this as part of turning this thing online.

jroesch · 2021-09-27T21:35:39Z

src/relay/transforms/device_planner.cc

+   * callee is a primitive or special operation we handle it specially. Otherwise defers to \p
+   * DomainFor(call->op).
+   *
+   * This special handling is needed:


We could add a hook here, but I would like to first make sure everyone is using the same algorithm before we provide anymore customization.

jroesch · 2021-09-27T21:36:52Z

src/relay/transforms/device_planner.cc

+/*!
+ * \brief Rewrites "on_device" calls to handle some special cases.
+ */
+class RewriteOnDevices : public ExprMutator {


Yeah might be worth splitting out into like device_domain.{h, cc} and then putting the core pass parts in their own file.

jroesch · 2021-09-27T21:37:44Z

src/relay/transforms/device_planner.cc

+      domains_->Unify(func_domain, implied_domain);  // higher-order
+    } catch (const Error& e) {
+      // TODO(mbs): Proper diagnostics.
+      LOG(FATAL) << "Function parameters and result devices do not match those of call. Call:"


I think we just need to handle these when we replace the existing passes with this one, overall would be fine to build action item list from this, land a version and then let mark work on follow up to remove passes.

* Prepare for new plan_devices.cc (part II) These changes came from changing #9038 to use tvm.parser.fromtext instead of manual AST construction. - Demote FunctionOnDeviceAttrs to just a pair of DictAttrs entries so that the parser will understand them on Function definitions. - Connect some special operators to their attributes so parsing understands them at call sites. - Don't silently ignore attributes during parsing. - Implement OptFunctionOnDevice so won't add device annotations for kUnknownDeviceType. - Allow the parser to be given an initial metadata map to support examples which need constants. - More DLOG -> VLOG conversions to reduce debug clutter. * [checkpoint] Keep existing ParseModule ffi to simplify rust bindings * [checkpoint] Address Christopher's comments. * [checkpoint] Andrew's comments from #9038 * [checkpoint] Jared's comments from #9038 * [checkpoint] Woops, forgot rename.

…tation.cc Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it depends on a side-input DeviceMap. We'd like to remove that side-input, and instead recover the Device (and, ultimately, Target) for each (fused) primitive call from the AST alone. By doing so we also avoid needing to perform device planning twice: - It needs to be done before lowering so we know which primitives need to be compiled for which devices. - It then needs to be re-done after lowering and optimization as a prelude to memory planning. By baking the device plan into the AST we can simply do device planning before lowering, and run memory planning later, both as ordinary passes. While working on that issue we realized we currently have 3 'device planners': - transforms/device_annotation.cc, which supports only a small subset of Relay and uses a simple top-down algorithm to assign a device to every sub-expression. - analysis/context_analysis.cc, which makes a galant effort to support most of Relay, is based on unification rather than a top-down algorithm, but handles higher order functions by ad-hoc and fragile inlining. - transforms/annotate_target.cc, which works on Targets instead of Devices, but is otherwise like 'device planning'. We'd like to bring these together. In this PR we introduce a new transforms/device_planner.cc intended to replace transforms/device_annotation.cc and analysis/context_analysis.cc. We don't delete those two just yet since we need to switch all users off of them in the next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC to bring devices and targets together sensibly, but have it firmly in our sights. transforms/device_planner.cc is based on analysis/context_analysis.cc, but is heavily reworked to: 1. Handle starting from existing "on_device" annotations as well as existing "device_copy" calls. 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine' device planning to account for storge scopes. 3. Robustly handle all of Relay, particularly higher-order functions. For that we replace the inlining approach in analysis/context_analysis.cc with a higher-order unification domain. 4. Be a little more systematic with defaulting. 5. Capture the result of the analysis within the AST as new "device_copy" calls at device boundaries, and new/replaced "on_device" calls wherever the device for a sub-expression is not already 'obvious' from the sub-expression's lexical scope. 6. Provide helper visitors for passes which need to ask for the device for any sub-expression they are processing and/or preserve device information on rewrites. Those passes include: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/memory_alloc.cc (DialectRewriter) - transforms/to_a_normal_form.cc (Fill) - backend/vm/compiler.cc (VMFunctionCompiler) However we won't change any of those in this PR. See the draft apache#8788 for the end game. * [checkpoint] Use Relay script for all unit tests. * [checkpoint] Hoist out DeviceDomain and DeviceDomains. * [checkpoint] Hoist out visitors

mbs-octoml · 2021-09-28T21:37:27Z

@areusch, @jroesch thank you and PTAL:

Rebased on main
All unit tests (plus a new one for match :-) in script form
Split out device domains and device-aware visitors into separate files
Cpp unit test for device domains to prompt us to better support cpp unit tests throughout
Addressed above comments.

jroesch · 2021-09-29T17:00:25Z

I am going to land this, if people have follow-ups we can do them in the final PR which actually turns this on.

* main: Fix flaky NMS test by making sure scores are unique (apache#9140) [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc (apache#9038) [LLVM] Make changes needed for opaque pointers (apache#9138) Arm(R) Ethos(TM)-U NPU codegen integration (apache#8849) [CI] Split Integration tests out of first phase of pipeline (apache#9128) [Meta Schedule][M3b] Runner (apache#9111) Fix Google Mock differences between Ubuntu 18.04 and 16.04 (apache#9141) [TIR] add loop partition hint pragma (apache#9121) fix things (apache#9146) [Meta Schedule][M3a] SearchStrategy (apache#9132) [Frontend][PyTorch] support for quantized conv_transpose2d op (apache#9133) [UnitTest] Parametrized test_conv2d_int8_intrinsics (apache#9143) [OpenCL] Remove redundant visit statement in CodeGen. (apache#9144) [BYOC] support arbitrary input dims for add/mul/relu of dnnl c_src codegen (apache#9127) [Relay][ConvertLayout] Support for qnn.conv2d_transpose (apache#9139) add nn.global_avgpool to fq2i (apache#9137) [UnitTests] Enable minimum testing on Vulkan target in CI (apache#9093) [Torch] Support returning quantized weights and bias for BYOC use cases (apache#9135) [Relay] Prepare for new plan_devices.cc (part II) (apache#9130) [microTVM][Zephyr] Add MIMXRT1050 board support (apache#9068)

* Prepare for new plan_devices.cc (part II) These changes came from changing apache#9038 to use tvm.parser.fromtext instead of manual AST construction. - Demote FunctionOnDeviceAttrs to just a pair of DictAttrs entries so that the parser will understand them on Function definitions. - Connect some special operators to their attributes so parsing understands them at call sites. - Don't silently ignore attributes during parsing. - Implement OptFunctionOnDevice so won't add device annotations for kUnknownDeviceType. - Allow the parser to be given an initial metadata map to support examples which need constants. - More DLOG -> VLOG conversions to reduce debug clutter. * [checkpoint] Keep existing ParseModule ffi to simplify rust bindings * [checkpoint] Address Christopher's comments. * [checkpoint] Andrew's comments from apache#9038 * [checkpoint] Jared's comments from apache#9038 * [checkpoint] Woops, forgot rename.

* main: (80 commits) Introduce centralised name transformation functions (apache#9088) [OpenCL] Add vectorization to cuda conv2d_nhwc schedule (apache#8636) [6/6] Arm(R) Ethos(TM)-U NPU codegen integration with `tvmc` (apache#8854) [microTVM] Add wrapper for creating project using a MLF (apache#9090) Fix typo (apache#9156) [Hotfix][Testing] Wait for RPCServer to be established (apache#9150) Update find cublas so it search default path if needed. (apache#9149) [TIR][LowerMatchBuffer] Fix lowering strides when source region has higher dimension than the buffer (apache#9145) Fix flaky NMS test by making sure scores are unique (apache#9140) [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc (apache#9038) [LLVM] Make changes needed for opaque pointers (apache#9138) Arm(R) Ethos(TM)-U NPU codegen integration (apache#8849) [CI] Split Integration tests out of first phase of pipeline (apache#9128) [Meta Schedule][M3b] Runner (apache#9111) Fix Google Mock differences between Ubuntu 18.04 and 16.04 (apache#9141) [TIR] add loop partition hint pragma (apache#9121) fix things (apache#9146) [Meta Schedule][M3a] SearchStrategy (apache#9132) [Frontend][PyTorch] support for quantized conv_transpose2d op (apache#9133) [UnitTest] Parametrized test_conv2d_int8_intrinsics (apache#9143) ...

device_planner.cc from apache#9038, and finally remove DeviceMap from the LowerTE Pass. - We retire analysis/context_analysis.cc and transforms/device_annotation.cc (and their tests). That includes the CollectDeviceInfo, CollectDeviceAnnotationOps and ContextAnalysis entry points. These are all subsumed by the PlanDevices pass and the device aware visitors. - The following passes now use the new 'Device Aware' visitors to recover the device for every Relay sub-expression: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - transforms/memory_alloc.cc (DialectRewriter) - backend/vm/compiler.cc (VMFunctionCompiler) - The following passes/utils must maintain the device information encoded by the device planner within "on_device" annotations and "param_device_types"/"result_device_type" function attributes: - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/to_a_normal_form.cc (Fill) - ir/expr_functior.cc (Bind) - Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals in favor of just asking for the device. Also removed a lot of ad-doc encodings of the 'default' device. - We no longer need to run device-planning twice (before and after lowering). Device planning is also decoupled from memory planning. - The LowerTE Pass no longer needs an expression-to-device side table (which was the problem which kicked this series of PRs off in the first place).

* [Relay] Switch the graph, VM and AOT executors to use the merged device_planner.cc from #9038, and finally remove DeviceMap from the LowerTE Pass. - We retire analysis/context_analysis.cc and transforms/device_annotation.cc (and their tests). That includes the CollectDeviceInfo, CollectDeviceAnnotationOps and ContextAnalysis entry points. These are all subsumed by the PlanDevices pass and the device aware visitors. - The following passes now use the new 'Device Aware' visitors to recover the device for every Relay sub-expression: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - transforms/memory_alloc.cc (DialectRewriter) - backend/vm/compiler.cc (VMFunctionCompiler) - The following passes/utils must maintain the device information encoded by the device planner within "on_device" annotations and "param_device_types"/"result_device_type" function attributes: - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/to_a_normal_form.cc (Fill) - ir/expr_functior.cc (Bind) - Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals in favor of just asking for the device. Also removed a lot of ad-doc encodings of the 'default' device. - We no longer need to run device-planning twice (before and after lowering). Device planning is also decoupled from memory planning. - The LowerTE Pass no longer needs an expression-to-device side table (which was the problem which kicked this series of PRs off in the first place). * [checkpoint] Revert unnecessary changes - Started down multi-target handling in interpreter but didn't finish - Some one-off debug stuff * [checkpoint] TODO's for default device logic

…tation.cc (apache#9038) * [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it depends on a side-input DeviceMap. We'd like to remove that side-input, and instead recover the Device (and, ultimately, Target) for each (fused) primitive call from the AST alone. By doing so we also avoid needing to perform device planning twice: - It needs to be done before lowering so we know which primitives need to be compiled for which devices. - It then needs to be re-done after lowering and optimization as a prelude to memory planning. By baking the device plan into the AST we can simply do device planning before lowering, and run memory planning later, both as ordinary passes. While working on that issue we realized we currently have 3 'device planners': - transforms/device_annotation.cc, which supports only a small subset of Relay and uses a simple top-down algorithm to assign a device to every sub-expression. - analysis/context_analysis.cc, which makes a galant effort to support most of Relay, is based on unification rather than a top-down algorithm, but handles higher order functions by ad-hoc and fragile inlining. - transforms/annotate_target.cc, which works on Targets instead of Devices, but is otherwise like 'device planning'. We'd like to bring these together. In this PR we introduce a new transforms/device_planner.cc intended to replace transforms/device_annotation.cc and analysis/context_analysis.cc. We don't delete those two just yet since we need to switch all users off of them in the next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC to bring devices and targets together sensibly, but have it firmly in our sights. transforms/device_planner.cc is based on analysis/context_analysis.cc, but is heavily reworked to: 1. Handle starting from existing "on_device" annotations as well as existing "device_copy" calls. 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine' device planning to account for storge scopes. 3. Robustly handle all of Relay, particularly higher-order functions. For that we replace the inlining approach in analysis/context_analysis.cc with a higher-order unification domain. 4. Be a little more systematic with defaulting. 5. Capture the result of the analysis within the AST as new "device_copy" calls at device boundaries, and new/replaced "on_device" calls wherever the device for a sub-expression is not already 'obvious' from the sub-expression's lexical scope. 6. Provide helper visitors for passes which need to ask for the device for any sub-expression they are processing and/or preserve device information on rewrites. Those passes include: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/memory_alloc.cc (DialectRewriter) - transforms/to_a_normal_form.cc (Fill) - backend/vm/compiler.cc (VMFunctionCompiler) However we won't change any of those in this PR. See the draft apache#8788 for the end game. * [checkpoint] Use Relay script for all unit tests. * [checkpoint] Hoist out DeviceDomain and DeviceDomains. * [checkpoint] Hoist out visitors * [checkpoint] Woops, left debug-only code in

* [Relay] Switch the graph, VM and AOT executors to use the merged device_planner.cc from apache#9038, and finally remove DeviceMap from the LowerTE Pass. - We retire analysis/context_analysis.cc and transforms/device_annotation.cc (and their tests). That includes the CollectDeviceInfo, CollectDeviceAnnotationOps and ContextAnalysis entry points. These are all subsumed by the PlanDevices pass and the device aware visitors. - The following passes now use the new 'Device Aware' visitors to recover the device for every Relay sub-expression: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - transforms/memory_alloc.cc (DialectRewriter) - backend/vm/compiler.cc (VMFunctionCompiler) - The following passes/utils must maintain the device information encoded by the device planner within "on_device" annotations and "param_device_types"/"result_device_type" function attributes: - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/to_a_normal_form.cc (Fill) - ir/expr_functior.cc (Bind) - Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals in favor of just asking for the device. Also removed a lot of ad-doc encodings of the 'default' device. - We no longer need to run device-planning twice (before and after lowering). Device planning is also decoupled from memory planning. - The LowerTE Pass no longer needs an expression-to-device side table (which was the problem which kicked this series of PRs off in the first place). * [checkpoint] Revert unnecessary changes - Started down multi-target handling in interpreter but didn't finish - Some one-off debug stuff * [checkpoint] TODO's for default device logic

* Prepare for new plan_devices.cc (part II) These changes came from changing apache#9038 to use tvm.parser.fromtext instead of manual AST construction. - Demote FunctionOnDeviceAttrs to just a pair of DictAttrs entries so that the parser will understand them on Function definitions. - Connect some special operators to their attributes so parsing understands them at call sites. - Don't silently ignore attributes during parsing. - Implement OptFunctionOnDevice so won't add device annotations for kUnknownDeviceType. - Allow the parser to be given an initial metadata map to support examples which need constants. - More DLOG -> VLOG conversions to reduce debug clutter. * [checkpoint] Keep existing ParseModule ffi to simplify rust bindings * [checkpoint] Address Christopher's comments. * [checkpoint] Andrew's comments from apache#9038 * [checkpoint] Jared's comments from apache#9038 * [checkpoint] Woops, forgot rename.

…tation.cc (apache#9038) * [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it depends on a side-input DeviceMap. We'd like to remove that side-input, and instead recover the Device (and, ultimately, Target) for each (fused) primitive call from the AST alone. By doing so we also avoid needing to perform device planning twice: - It needs to be done before lowering so we know which primitives need to be compiled for which devices. - It then needs to be re-done after lowering and optimization as a prelude to memory planning. By baking the device plan into the AST we can simply do device planning before lowering, and run memory planning later, both as ordinary passes. While working on that issue we realized we currently have 3 'device planners': - transforms/device_annotation.cc, which supports only a small subset of Relay and uses a simple top-down algorithm to assign a device to every sub-expression. - analysis/context_analysis.cc, which makes a galant effort to support most of Relay, is based on unification rather than a top-down algorithm, but handles higher order functions by ad-hoc and fragile inlining. - transforms/annotate_target.cc, which works on Targets instead of Devices, but is otherwise like 'device planning'. We'd like to bring these together. In this PR we introduce a new transforms/device_planner.cc intended to replace transforms/device_annotation.cc and analysis/context_analysis.cc. We don't delete those two just yet since we need to switch all users off of them in the next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC to bring devices and targets together sensibly, but have it firmly in our sights. transforms/device_planner.cc is based on analysis/context_analysis.cc, but is heavily reworked to: 1. Handle starting from existing "on_device" annotations as well as existing "device_copy" calls. 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine' device planning to account for storge scopes. 3. Robustly handle all of Relay, particularly higher-order functions. For that we replace the inlining approach in analysis/context_analysis.cc with a higher-order unification domain. 4. Be a little more systematic with defaulting. 5. Capture the result of the analysis within the AST as new "device_copy" calls at device boundaries, and new/replaced "on_device" calls wherever the device for a sub-expression is not already 'obvious' from the sub-expression's lexical scope. 6. Provide helper visitors for passes which need to ask for the device for any sub-expression they are processing and/or preserve device information on rewrites. Those passes include: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/memory_alloc.cc (DialectRewriter) - transforms/to_a_normal_form.cc (Fill) - backend/vm/compiler.cc (VMFunctionCompiler) However we won't change any of those in this PR. See the draft apache#8788 for the end game. * [checkpoint] Use Relay script for all unit tests. * [checkpoint] Hoist out DeviceDomain and DeviceDomains. * [checkpoint] Hoist out visitors * [checkpoint] Woops, left debug-only code in

* [Relay] Switch the graph, VM and AOT executors to use the merged device_planner.cc from apache#9038, and finally remove DeviceMap from the LowerTE Pass. - We retire analysis/context_analysis.cc and transforms/device_annotation.cc (and their tests). That includes the CollectDeviceInfo, CollectDeviceAnnotationOps and ContextAnalysis entry points. These are all subsumed by the PlanDevices pass and the device aware visitors. - The following passes now use the new 'Device Aware' visitors to recover the device for every Relay sub-expression: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - transforms/memory_alloc.cc (DialectRewriter) - backend/vm/compiler.cc (VMFunctionCompiler) - The following passes/utils must maintain the device information encoded by the device planner within "on_device" annotations and "param_device_types"/"result_device_type" function attributes: - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/to_a_normal_form.cc (Fill) - ir/expr_functior.cc (Bind) - Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals in favor of just asking for the device. Also removed a lot of ad-doc encodings of the 'default' device. - We no longer need to run device-planning twice (before and after lowering). Device planning is also decoupled from memory planning. - The LowerTE Pass no longer needs an expression-to-device side table (which was the problem which kicked this series of PRs off in the first place). * [checkpoint] Revert unnecessary changes - Started down multi-target handling in interpreter but didn't finish - Some one-off debug stuff * [checkpoint] TODO's for default device logic

mbs-octoml force-pushed the mbs-plan-devices branch from 90b1c6b to 966f413 Compare September 18, 2021 00:10

jroesch reviewed Sep 22, 2021

View reviewed changes

mbs-octoml force-pushed the mbs-plan-devices branch from 966f413 to 2cc7e2f Compare September 22, 2021 19:52

mbs-octoml mentioned this pull request Sep 22, 2021

[Relay] Prepare for merging context_analysis.cc and device_annotation.cc #9077

Merged

areusch reviewed Sep 24, 2021

View reviewed changes

mbs-octoml force-pushed the mbs-plan-devices branch from 2cc7e2f to 65d2377 Compare September 25, 2021 18:29

mbs-octoml mentioned this pull request Sep 26, 2021

[Relay] Prepare for new plan_devices.cc (part II) #9130

Merged

mbs-octoml force-pushed the mbs-plan-devices branch from 65d2377 to 4bbc47f Compare September 26, 2021 15:19

jroesch approved these changes Sep 27, 2021

View reviewed changes

mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Sep 27, 2021

[checkpoint] Andrew's comments from apache#9038

5553be2

mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Sep 27, 2021

[checkpoint] Jared's comments from apache#9038

738bb29

mbs-octoml force-pushed the mbs-plan-devices branch from 4bbc47f to 25577d3 Compare September 28, 2021 21:34

[checkpoint] Woops, left debug-only code in

aab79d3

jroesch merged commit d7a28f9 into apache:main Sep 29, 2021

mbs-octoml deleted the mbs-plan-devices branch September 29, 2021 17:22

mbs-octoml mentioned this pull request Sep 30, 2021

[Relay] Remove DeviceMap from LowerTE #8788

Merged

		@@ -0,0 +1,1405 @@
		# Licensed to the Apache Software Foundation (ASF) under one



		# for testing only
		def function_on_device(function, param_devices, result_device):

[Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc #9038

[Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc #9038

Conversation

mbs-octoml commented Sep 17, 2021

mbs-octoml commented Sep 17, 2021

mbs-octoml commented Sep 17, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

areusch left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mbs-octoml commented Sep 26, 2021

jroesch left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mbs-octoml commented Sep 28, 2021 • edited Loading

jroesch commented Sep 29, 2021

mbs-octoml commented Sep 28, 2021 •

edited

Loading