-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc #9038
Conversation
@electriclilies @mikepapadim @jroesch Here's part 1. Things to watch out for:
|
Oh, and this is on top of #9012 which is still in-flight, sorry about that. |
90b1c6b
to
966f413
Compare
@@ -0,0 +1,1405 @@ | |||
# Licensed to the Apache Software Foundation (ASF) under one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now this is a test file!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not to rain on your parade but it might be worth directly using text format here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's pretty awful and I nearly switched a few times already. Which format should I use? Can it capture the attributes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for relay text format, i think that would greatly help
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
*/ | ||
|
||
/*! | ||
* \file src/relay/analysis/device_planner.cc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read everything except main file, will tackle this part tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. With apologies, but I ended up hoisting out everything except the main into #9077, so I think you could directly LGTM that now?
I'm trying to model good PR hygiene :-)
966f413
to
2cc7e2f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbs-octoml did an initial pass, don't think i understand everything yet, but i can see the pieces here. thanks for doing this!!
include/tvm/relay/attrs/annotation.h
Outdated
int device_type; | ||
/*! | ||
* \brief If true, the result device must also be \p device_type and device planning should |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure i quite understand--isn't this what the struct-level brief above says it does? what happens if is_fixed == false? if we are going to insert a device copy node, that implies the data must be needed elsewhere. how can we merely attach an attribute to a result saying a copy is simply out of the question?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because the association of expression to device is a 'constraint', and by default "on_device" on constrains the inner expr and not the overall expr. However to make plan_devices idempotent
with the current approach I need some way to fully specify devices.
|
||
|
||
# for testing only | ||
def function_on_device(function, param_devices, result_device): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we call this testonly_function_on_device then or place in python/tvm/relay/op/annotation/test_utils.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up fully commenting this one in a prior PR.
return Call(OnDeviceOp(), {std::move(expr)}, Attrs(std::move(attrs)), /*type_args=*/{}, span); | ||
} | ||
|
||
Expr OptOnDevice(Expr expr, DLDeviceType device_type, bool is_fixed) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opt=Optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, old personal naming convention but it clashes with Optimize. Switching to Maybe (in the supporting PR).
.set_attr<FTVMCompute>("FTVMCompute", | ||
[](const Attrs& attrs, const Array<te::Tensor>& inputs, | ||
const Type& out_type) -> Array<te::Tensor> { | ||
return {topi::identity(inputs[0])}; | ||
}); | ||
|
||
OnDeviceProps GetOnDeviceProps(const CallNode* call_node) { | ||
if (call_node->op == OnDeviceOp()) { | ||
ICHECK_EQ(call_node->args.size(), 1) << "on_device expects one argument"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
want to print the args or op node in this case to help debug? same question below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gotten all the way through to unit testing the final PR without needing it, so how about I leave it. I tend to add them as I need them.
src/relay/op/annotation/annotation.h
Outdated
*/ | ||
OnDeviceProps GetOnDeviceProps(const Expr& expr); | ||
|
||
/*! \brief Returns true if \p expr is an on_device CallNode. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will CHECK-fail if there is an ill-formed on_device CallNode tho, not sure if that should get documented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out this is dead code now so nuked.
/*! | ||
* \brief Rewrites "on_device" calls to handle some special cases. | ||
*/ | ||
class RewriteOnDevices : public ExprMutator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dang this file is getting a bit long..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah might be worth splitting out into like device_domain.{h, cc} and then putting the core pass parts in their own file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, will do.
domains_->Unify(func_domain, implied_domain); // higher-order | ||
} catch (const Error& e) { | ||
// TODO(mbs): Proper diagnostics. | ||
LOG(FATAL) << "Function parameters and result devices do not match those of call. Call:" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!! fatal! probably we need to address the TODO before merging...i think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we just need to handle these when we replace the existing passes with this one, overall would be fine to build action item list from this, land a version and then let mark work on follow up to remove passes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is a big improvement on the context_analysis.cc (which will CHECK fail with no context). Let's do this as CORE-75.
return FunctionOnDevice(func, param_device_types, result_device_type); | ||
} | ||
|
||
Expr VisitExpr_(const CallNode* call_node) final { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i roughly understand what's happening, but i think example Relay snippets would help to better understand/maintain each Pass here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some simple examples to each of the internal pass class comments.
@@ -0,0 +1,1405 @@ | |||
# Licensed to the Apache Software Foundation (ASF) under one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for relay text format, i think that would greatly help
|
||
|
||
if __name__ == "__main__": | ||
test_plain() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sys.exit(pytest.main([__file__] + sys.argv[1:]))
one day we will write a helper :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping pong on this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
2cc7e2f
to
65d2377
Compare
These changes came from changing apache#9038 to use tvm.parser.fromtext instead of manual AST construction. - Demote FunctionOnDeviceAttrs to just a pair of DictAttrs entries so that the parser will understand them on Function definitions. - Connect some special operators to their attributes so parsing understands them at call sites. - Don't silently ignore attributes during parsing. - Implement OptFunctionOnDevice so won't add device annotations for kUnknownDeviceType. - Allow the parser to be given an initial metadata map to support examples which need constants. - More DLOG -> VLOG conversions to reduce debug clutter.
65d2377
to
4bbc47f
Compare
While converting the unit tests to script form I ended up needing #9130. I'll work through the remaining comments now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think everything makes sense, big asks would to be splitting up the passes into smaller files with a better separation of the pieces and adding diagnostics and other features. I am totally OK landing this version and building action item list to do before we turn this on as default pass.
src/parser/parser.cc
Outdated
@@ -1088,7 +1088,7 @@ class Parser { | |||
VLOG(0) << "Parser::ParseFunctionDef"; | |||
return WithSpan<Function>([&]() { | |||
PushScope(); | |||
PushTypeScope(); | |||
PushTypeScope(); // TODO(mbs): BUG? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the second PushTypeScope() is in error and I'll delete it. You unconditionally pop one scope further down.
attrs->result_device_type = result_device_type; | ||
return WithAttr(std::move(function), attr::kFunctionAttrsKey, Attrs(std::move(attrs))); | ||
Integer result_device_type) { | ||
return WithAttr(WithAttr(std::move(function), tvm::attr::kParamDeviceTypes, param_device_types), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we maybe add a WithAttr
overload that takes a Map instead of calling multiple times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
|
||
if __name__ == "__main__": | ||
test_plain() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping pong on this one
* ------- | ||
* After flowing constraints we apply some defaulting heuristics (using a global default device) | ||
* to fix the device for any as-yet unconstrained sub-expressions. | ||
* - Unconstrained function result devices default to the global default device. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the "today behavior" technically, we should probably discuss what people expect.
* - Unconstrained function result devices default to the global default device. | ||
* - Unconstrained function parameters devices default to the device for the function result. | ||
* - Unconstrained let-bound expression devices default to the device for the overall let. | ||
* TODO(mbs): I may have over-innovated here and we simply want to bind all free domaints to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the code is setup to "innovate" here from my PoV but today's behavior is super conservative everything defaults to default device which is usually host.
* incorporate targets and memory scopes into the domain. For example it's ok for the function | ||
* body to be executed on different device ids provided they have the same target and memory | ||
* scope. | ||
* * Might be simpler to just let every type have a device annotation rather than work in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have discussed this one before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @areusch
if (lhs->device_type_ != rhs->device_type_) { | ||
// TODO(mbs): Proper diagnostics. | ||
std::ostringstream os; | ||
os << "Inconsistent device types " << lhs->device_type_ << " and " << rhs->device_type_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think we can do this, just need to do some plumbing, Mark and I can add tracking for this as part of turning this thing online.
* callee is a primitive or special operation we handle it specially. Otherwise defers to \p | ||
* DomainFor(call->op). | ||
* | ||
* This special handling is needed: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add a hook here, but I would like to first make sure everyone is using the same algorithm before we provide anymore customization.
/*! | ||
* \brief Rewrites "on_device" calls to handle some special cases. | ||
*/ | ||
class RewriteOnDevices : public ExprMutator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah might be worth splitting out into like device_domain.{h, cc} and then putting the core pass parts in their own file.
domains_->Unify(func_domain, implied_domain); // higher-order | ||
} catch (const Error& e) { | ||
// TODO(mbs): Proper diagnostics. | ||
LOG(FATAL) << "Function parameters and result devices do not match those of call. Call:" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we just need to handle these when we replace the existing passes with this one, overall would be fine to build action item list from this, land a version and then let mark work on follow up to remove passes.
* Prepare for new plan_devices.cc (part II) These changes came from changing #9038 to use tvm.parser.fromtext instead of manual AST construction. - Demote FunctionOnDeviceAttrs to just a pair of DictAttrs entries so that the parser will understand them on Function definitions. - Connect some special operators to their attributes so parsing understands them at call sites. - Don't silently ignore attributes during parsing. - Implement OptFunctionOnDevice so won't add device annotations for kUnknownDeviceType. - Allow the parser to be given an initial metadata map to support examples which need constants. - More DLOG -> VLOG conversions to reduce debug clutter. * [checkpoint] Keep existing ParseModule ffi to simplify rust bindings * [checkpoint] Address Christopher's comments. * [checkpoint] Andrew's comments from #9038 * [checkpoint] Jared's comments from #9038 * [checkpoint] Woops, forgot rename.
…tation.cc Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it depends on a side-input DeviceMap. We'd like to remove that side-input, and instead recover the Device (and, ultimately, Target) for each (fused) primitive call from the AST alone. By doing so we also avoid needing to perform device planning twice: - It needs to be done before lowering so we know which primitives need to be compiled for which devices. - It then needs to be re-done after lowering and optimization as a prelude to memory planning. By baking the device plan into the AST we can simply do device planning before lowering, and run memory planning later, both as ordinary passes. While working on that issue we realized we currently have 3 'device planners': - transforms/device_annotation.cc, which supports only a small subset of Relay and uses a simple top-down algorithm to assign a device to every sub-expression. - analysis/context_analysis.cc, which makes a galant effort to support most of Relay, is based on unification rather than a top-down algorithm, but handles higher order functions by ad-hoc and fragile inlining. - transforms/annotate_target.cc, which works on Targets instead of Devices, but is otherwise like 'device planning'. We'd like to bring these together. In this PR we introduce a new transforms/device_planner.cc intended to replace transforms/device_annotation.cc and analysis/context_analysis.cc. We don't delete those two just yet since we need to switch all users off of them in the next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC to bring devices and targets together sensibly, but have it firmly in our sights. transforms/device_planner.cc is based on analysis/context_analysis.cc, but is heavily reworked to: 1. Handle starting from existing "on_device" annotations as well as existing "device_copy" calls. 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine' device planning to account for storge scopes. 3. Robustly handle all of Relay, particularly higher-order functions. For that we replace the inlining approach in analysis/context_analysis.cc with a higher-order unification domain. 4. Be a little more systematic with defaulting. 5. Capture the result of the analysis within the AST as new "device_copy" calls at device boundaries, and new/replaced "on_device" calls wherever the device for a sub-expression is not already 'obvious' from the sub-expression's lexical scope. 6. Provide helper visitors for passes which need to ask for the device for any sub-expression they are processing and/or preserve device information on rewrites. Those passes include: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/memory_alloc.cc (DialectRewriter) - transforms/to_a_normal_form.cc (Fill) - backend/vm/compiler.cc (VMFunctionCompiler) However we won't change any of those in this PR. See the draft apache#8788 for the end game. * [checkpoint] Use Relay script for all unit tests. * [checkpoint] Hoist out DeviceDomain and DeviceDomains. * [checkpoint] Hoist out visitors
4bbc47f
to
25577d3
Compare
@areusch, @jroesch thank you and PTAL:
|
I am going to land this, if people have follow-ups we can do them in the final PR which actually turns this on. |
* main: Fix flaky NMS test by making sure scores are unique (apache#9140) [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc (apache#9038) [LLVM] Make changes needed for opaque pointers (apache#9138) Arm(R) Ethos(TM)-U NPU codegen integration (apache#8849) [CI] Split Integration tests out of first phase of pipeline (apache#9128) [Meta Schedule][M3b] Runner (apache#9111) Fix Google Mock differences between Ubuntu 18.04 and 16.04 (apache#9141) [TIR] add loop partition hint pragma (apache#9121) fix things (apache#9146) [Meta Schedule][M3a] SearchStrategy (apache#9132) [Frontend][PyTorch] support for quantized conv_transpose2d op (apache#9133) [UnitTest] Parametrized test_conv2d_int8_intrinsics (apache#9143) [OpenCL] Remove redundant visit statement in CodeGen. (apache#9144) [BYOC] support arbitrary input dims for add/mul/relu of dnnl c_src codegen (apache#9127) [Relay][ConvertLayout] Support for qnn.conv2d_transpose (apache#9139) add nn.global_avgpool to fq2i (apache#9137) [UnitTests] Enable minimum testing on Vulkan target in CI (apache#9093) [Torch] Support returning quantized weights and bias for BYOC use cases (apache#9135) [Relay] Prepare for new plan_devices.cc (part II) (apache#9130) [microTVM][Zephyr] Add MIMXRT1050 board support (apache#9068)
* Prepare for new plan_devices.cc (part II) These changes came from changing apache#9038 to use tvm.parser.fromtext instead of manual AST construction. - Demote FunctionOnDeviceAttrs to just a pair of DictAttrs entries so that the parser will understand them on Function definitions. - Connect some special operators to their attributes so parsing understands them at call sites. - Don't silently ignore attributes during parsing. - Implement OptFunctionOnDevice so won't add device annotations for kUnknownDeviceType. - Allow the parser to be given an initial metadata map to support examples which need constants. - More DLOG -> VLOG conversions to reduce debug clutter. * [checkpoint] Keep existing ParseModule ffi to simplify rust bindings * [checkpoint] Address Christopher's comments. * [checkpoint] Andrew's comments from apache#9038 * [checkpoint] Jared's comments from apache#9038 * [checkpoint] Woops, forgot rename.
* main: (80 commits) Introduce centralised name transformation functions (apache#9088) [OpenCL] Add vectorization to cuda conv2d_nhwc schedule (apache#8636) [6/6] Arm(R) Ethos(TM)-U NPU codegen integration with `tvmc` (apache#8854) [microTVM] Add wrapper for creating project using a MLF (apache#9090) Fix typo (apache#9156) [Hotfix][Testing] Wait for RPCServer to be established (apache#9150) Update find cublas so it search default path if needed. (apache#9149) [TIR][LowerMatchBuffer] Fix lowering strides when source region has higher dimension than the buffer (apache#9145) Fix flaky NMS test by making sure scores are unique (apache#9140) [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc (apache#9038) [LLVM] Make changes needed for opaque pointers (apache#9138) Arm(R) Ethos(TM)-U NPU codegen integration (apache#8849) [CI] Split Integration tests out of first phase of pipeline (apache#9128) [Meta Schedule][M3b] Runner (apache#9111) Fix Google Mock differences between Ubuntu 18.04 and 16.04 (apache#9141) [TIR] add loop partition hint pragma (apache#9121) fix things (apache#9146) [Meta Schedule][M3a] SearchStrategy (apache#9132) [Frontend][PyTorch] support for quantized conv_transpose2d op (apache#9133) [UnitTest] Parametrized test_conv2d_int8_intrinsics (apache#9143) ...
device_planner.cc from apache#9038, and finally remove DeviceMap from the LowerTE Pass. - We retire analysis/context_analysis.cc and transforms/device_annotation.cc (and their tests). That includes the CollectDeviceInfo, CollectDeviceAnnotationOps and ContextAnalysis entry points. These are all subsumed by the PlanDevices pass and the device aware visitors. - The following passes now use the new 'Device Aware' visitors to recover the device for every Relay sub-expression: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - transforms/memory_alloc.cc (DialectRewriter) - backend/vm/compiler.cc (VMFunctionCompiler) - The following passes/utils must maintain the device information encoded by the device planner within "on_device" annotations and "param_device_types"/"result_device_type" function attributes: - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/to_a_normal_form.cc (Fill) - ir/expr_functior.cc (Bind) - Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals in favor of just asking for the device. Also removed a lot of ad-doc encodings of the 'default' device. - We no longer need to run device-planning twice (before and after lowering). Device planning is also decoupled from memory planning. - The LowerTE Pass no longer needs an expression-to-device side table (which was the problem which kicked this series of PRs off in the first place).
device_planner.cc from apache#9038, and finally remove DeviceMap from the LowerTE Pass. - We retire analysis/context_analysis.cc and transforms/device_annotation.cc (and their tests). That includes the CollectDeviceInfo, CollectDeviceAnnotationOps and ContextAnalysis entry points. These are all subsumed by the PlanDevices pass and the device aware visitors. - The following passes now use the new 'Device Aware' visitors to recover the device for every Relay sub-expression: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - transforms/memory_alloc.cc (DialectRewriter) - backend/vm/compiler.cc (VMFunctionCompiler) - The following passes/utils must maintain the device information encoded by the device planner within "on_device" annotations and "param_device_types"/"result_device_type" function attributes: - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/to_a_normal_form.cc (Fill) - ir/expr_functior.cc (Bind) - Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals in favor of just asking for the device. Also removed a lot of ad-doc encodings of the 'default' device. - We no longer need to run device-planning twice (before and after lowering). Device planning is also decoupled from memory planning. - The LowerTE Pass no longer needs an expression-to-device side table (which was the problem which kicked this series of PRs off in the first place).
* [Relay] Switch the graph, VM and AOT executors to use the merged device_planner.cc from #9038, and finally remove DeviceMap from the LowerTE Pass. - We retire analysis/context_analysis.cc and transforms/device_annotation.cc (and their tests). That includes the CollectDeviceInfo, CollectDeviceAnnotationOps and ContextAnalysis entry points. These are all subsumed by the PlanDevices pass and the device aware visitors. - The following passes now use the new 'Device Aware' visitors to recover the device for every Relay sub-expression: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - transforms/memory_alloc.cc (DialectRewriter) - backend/vm/compiler.cc (VMFunctionCompiler) - The following passes/utils must maintain the device information encoded by the device planner within "on_device" annotations and "param_device_types"/"result_device_type" function attributes: - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/to_a_normal_form.cc (Fill) - ir/expr_functior.cc (Bind) - Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals in favor of just asking for the device. Also removed a lot of ad-doc encodings of the 'default' device. - We no longer need to run device-planning twice (before and after lowering). Device planning is also decoupled from memory planning. - The LowerTE Pass no longer needs an expression-to-device side table (which was the problem which kicked this series of PRs off in the first place). * [checkpoint] Revert unnecessary changes - Started down multi-target handling in interpreter but didn't finish - Some one-off debug stuff * [checkpoint] TODO's for default device logic
…tation.cc (apache#9038) * [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it depends on a side-input DeviceMap. We'd like to remove that side-input, and instead recover the Device (and, ultimately, Target) for each (fused) primitive call from the AST alone. By doing so we also avoid needing to perform device planning twice: - It needs to be done before lowering so we know which primitives need to be compiled for which devices. - It then needs to be re-done after lowering and optimization as a prelude to memory planning. By baking the device plan into the AST we can simply do device planning before lowering, and run memory planning later, both as ordinary passes. While working on that issue we realized we currently have 3 'device planners': - transforms/device_annotation.cc, which supports only a small subset of Relay and uses a simple top-down algorithm to assign a device to every sub-expression. - analysis/context_analysis.cc, which makes a galant effort to support most of Relay, is based on unification rather than a top-down algorithm, but handles higher order functions by ad-hoc and fragile inlining. - transforms/annotate_target.cc, which works on Targets instead of Devices, but is otherwise like 'device planning'. We'd like to bring these together. In this PR we introduce a new transforms/device_planner.cc intended to replace transforms/device_annotation.cc and analysis/context_analysis.cc. We don't delete those two just yet since we need to switch all users off of them in the next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC to bring devices and targets together sensibly, but have it firmly in our sights. transforms/device_planner.cc is based on analysis/context_analysis.cc, but is heavily reworked to: 1. Handle starting from existing "on_device" annotations as well as existing "device_copy" calls. 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine' device planning to account for storge scopes. 3. Robustly handle all of Relay, particularly higher-order functions. For that we replace the inlining approach in analysis/context_analysis.cc with a higher-order unification domain. 4. Be a little more systematic with defaulting. 5. Capture the result of the analysis within the AST as new "device_copy" calls at device boundaries, and new/replaced "on_device" calls wherever the device for a sub-expression is not already 'obvious' from the sub-expression's lexical scope. 6. Provide helper visitors for passes which need to ask for the device for any sub-expression they are processing and/or preserve device information on rewrites. Those passes include: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/memory_alloc.cc (DialectRewriter) - transforms/to_a_normal_form.cc (Fill) - backend/vm/compiler.cc (VMFunctionCompiler) However we won't change any of those in this PR. See the draft apache#8788 for the end game. * [checkpoint] Use Relay script for all unit tests. * [checkpoint] Hoist out DeviceDomain and DeviceDomains. * [checkpoint] Hoist out visitors * [checkpoint] Woops, left debug-only code in
* [Relay] Switch the graph, VM and AOT executors to use the merged device_planner.cc from apache#9038, and finally remove DeviceMap from the LowerTE Pass. - We retire analysis/context_analysis.cc and transforms/device_annotation.cc (and their tests). That includes the CollectDeviceInfo, CollectDeviceAnnotationOps and ContextAnalysis entry points. These are all subsumed by the PlanDevices pass and the device aware visitors. - The following passes now use the new 'Device Aware' visitors to recover the device for every Relay sub-expression: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - transforms/memory_alloc.cc (DialectRewriter) - backend/vm/compiler.cc (VMFunctionCompiler) - The following passes/utils must maintain the device information encoded by the device planner within "on_device" annotations and "param_device_types"/"result_device_type" function attributes: - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/to_a_normal_form.cc (Fill) - ir/expr_functior.cc (Bind) - Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals in favor of just asking for the device. Also removed a lot of ad-doc encodings of the 'default' device. - We no longer need to run device-planning twice (before and after lowering). Device planning is also decoupled from memory planning. - The LowerTE Pass no longer needs an expression-to-device side table (which was the problem which kicked this series of PRs off in the first place). * [checkpoint] Revert unnecessary changes - Started down multi-target handling in interpreter but didn't finish - Some one-off debug stuff * [checkpoint] TODO's for default device logic
* Prepare for new plan_devices.cc (part II) These changes came from changing apache#9038 to use tvm.parser.fromtext instead of manual AST construction. - Demote FunctionOnDeviceAttrs to just a pair of DictAttrs entries so that the parser will understand them on Function definitions. - Connect some special operators to their attributes so parsing understands them at call sites. - Don't silently ignore attributes during parsing. - Implement OptFunctionOnDevice so won't add device annotations for kUnknownDeviceType. - Allow the parser to be given an initial metadata map to support examples which need constants. - More DLOG -> VLOG conversions to reduce debug clutter. * [checkpoint] Keep existing ParseModule ffi to simplify rust bindings * [checkpoint] Address Christopher's comments. * [checkpoint] Andrew's comments from apache#9038 * [checkpoint] Jared's comments from apache#9038 * [checkpoint] Woops, forgot rename.
…tation.cc (apache#9038) * [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it depends on a side-input DeviceMap. We'd like to remove that side-input, and instead recover the Device (and, ultimately, Target) for each (fused) primitive call from the AST alone. By doing so we also avoid needing to perform device planning twice: - It needs to be done before lowering so we know which primitives need to be compiled for which devices. - It then needs to be re-done after lowering and optimization as a prelude to memory planning. By baking the device plan into the AST we can simply do device planning before lowering, and run memory planning later, both as ordinary passes. While working on that issue we realized we currently have 3 'device planners': - transforms/device_annotation.cc, which supports only a small subset of Relay and uses a simple top-down algorithm to assign a device to every sub-expression. - analysis/context_analysis.cc, which makes a galant effort to support most of Relay, is based on unification rather than a top-down algorithm, but handles higher order functions by ad-hoc and fragile inlining. - transforms/annotate_target.cc, which works on Targets instead of Devices, but is otherwise like 'device planning'. We'd like to bring these together. In this PR we introduce a new transforms/device_planner.cc intended to replace transforms/device_annotation.cc and analysis/context_analysis.cc. We don't delete those two just yet since we need to switch all users off of them in the next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC to bring devices and targets together sensibly, but have it firmly in our sights. transforms/device_planner.cc is based on analysis/context_analysis.cc, but is heavily reworked to: 1. Handle starting from existing "on_device" annotations as well as existing "device_copy" calls. 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine' device planning to account for storge scopes. 3. Robustly handle all of Relay, particularly higher-order functions. For that we replace the inlining approach in analysis/context_analysis.cc with a higher-order unification domain. 4. Be a little more systematic with defaulting. 5. Capture the result of the analysis within the AST as new "device_copy" calls at device boundaries, and new/replaced "on_device" calls wherever the device for a sub-expression is not already 'obvious' from the sub-expression's lexical scope. 6. Provide helper visitors for passes which need to ask for the device for any sub-expression they are processing and/or preserve device information on rewrites. Those passes include: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/memory_alloc.cc (DialectRewriter) - transforms/to_a_normal_form.cc (Fill) - backend/vm/compiler.cc (VMFunctionCompiler) However we won't change any of those in this PR. See the draft apache#8788 for the end game. * [checkpoint] Use Relay script for all unit tests. * [checkpoint] Hoist out DeviceDomain and DeviceDomains. * [checkpoint] Hoist out visitors * [checkpoint] Woops, left debug-only code in
* [Relay] Switch the graph, VM and AOT executors to use the merged device_planner.cc from apache#9038, and finally remove DeviceMap from the LowerTE Pass. - We retire analysis/context_analysis.cc and transforms/device_annotation.cc (and their tests). That includes the CollectDeviceInfo, CollectDeviceAnnotationOps and ContextAnalysis entry points. These are all subsumed by the PlanDevices pass and the device aware visitors. - The following passes now use the new 'Device Aware' visitors to recover the device for every Relay sub-expression: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - transforms/memory_alloc.cc (DialectRewriter) - backend/vm/compiler.cc (VMFunctionCompiler) - The following passes/utils must maintain the device information encoded by the device planner within "on_device" annotations and "param_device_types"/"result_device_type" function attributes: - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/to_a_normal_form.cc (Fill) - ir/expr_functior.cc (Bind) - Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals in favor of just asking for the device. Also removed a lot of ad-doc encodings of the 'default' device. - We no longer need to run device-planning twice (before and after lowering). Device planning is also decoupled from memory planning. - The LowerTE Pass no longer needs an expression-to-device side table (which was the problem which kicked this series of PRs off in the first place). * [checkpoint] Revert unnecessary changes - Started down multi-target handling in interpreter but didn't finish - Some one-off debug stuff * [checkpoint] TODO's for default device logic
Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it
depends on a side-input DeviceMap. We'd like to remove that side-input, and
instead recover the Device (and, ultimately, Target) for each (fused) primitive
call from the AST alone.
By doing so we also avoid needing to perform device planning twice:
to be compiled for which devices.
to memory planning.
By baking the device plan into the AST we can simply do device planning before
lowering, and run memory planning later, both as ordinary passes.
While working on that issue we realized we currently have 3 'device planners':
and uses a simple top-down algorithm to assign a device to every
sub-expression.
Relay, is based on unification rather than a top-down algorithm, but handles
higher order functions by ad-hoc and fragile inlining.
is otherwise like 'device planning'.
We'd like to bring these together.
In this PR we introduce a new transforms/device_planner.cc intended to replace
transforms/device_annotation.cc and analysis/context_analysis.cc. We don't
delete those two just yet since we need to switch all users off of them in the
next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC
to bring devices and targets together sensibly, but have it firmly in our
sights.
transforms/device_planner.cc is based on analysis/context_analysis.cc, but
is heavily reworked to:
"device_copy" calls.
device planning to account for storge scopes.
we replace the inlining approach in analysis/context_analysis.cc with a
higher-order unification domain.
at device boundaries, and new/replaced "on_device" calls wherever the device
for a sub-expression is not already 'obvious' from the sub-expression's
lexical scope.
any sub-expression they are processing and/or preserve device information
on rewrites. Those passes include:
However we won't change any of those in this PR.
See the draft #8788 for the end game, I'll be peeling PRs out of that.