[WIP][Ansor][AutoTVM v2.0] Part 0: Infrastructures for Automatic Schedule Search #5883
Conversation
* Init commit: Code migration Start * Add loop_state.cc/h * Add ComputeDAG basic test
* Split transform_step out * Update GetProducers & GetConsumers * Update UTs * Add UT for CacheReadWrite & Some bug fix
* Add FollowSplit & FollowFusedSplit tests * Update dag.InferBound & its UT * Add search_task, measure and serialization * Update Serialization UT
* Add feature * Add cost_model, meta_tile_rewrite_policy * Add MetaTileRewritePolicy basic UT
* Add Basic Python API for State * Add UTs for State
* Update the return value of state operation * Add task * Copy measure.py & utils.py * Fix LocalBuilder * Fix LocalRunner
…che#8) * Add basic Python support for ansor.auto_schedule * Update AutoSchedule API * Bug fix for get the attach point of a fused iter * Update UT after infer bug fix
* Delete C++ UT hack since Python is ready * Add ndarray.non_empty * Update Serialization python API
* Update c++ code style and unit test * Update python State wrapper and test cases
* Add RPCRunner & OpenCL search test * Add CUDA search test * Add RPCRunner test
* Add basic tutorial
* Add XGBModel & RPCRunnerWarpper * Revert "Add Parallel Granularity Mutation"
* add workload registry * update * update
* add tune_test.py (the old tune_wkl.py) * update * fix measure * fix for gpu
* Bug fix for tutorials * Add PreLoadMeasuredStates * Add search_callback support for task tuner * Code refine for tune_test.py * Update * Update * Update * Update * Bug fix
* Add custom sketch rule * Bug fix
* relay integration
* Add vectorized cooperative_fetching test * Update math simplify for vectorized CF * File rename * Update tune_network * API update
* Add a threading wrapper to fix the test bug * Set default TVM_USE_AUTO_SCHEDULER to false * Update PreLoadMeasuredStates callback
* Add tensorize step
* Start to update api * Add compute_dag to state * API update
* kernel layout rewrite * remove some hacks * add defuse_ops pass and move kernel_layout_rewrite pass after fuse_ops pass * set TVM_RELAY_DISABLE_BUILD_CACHE for task extraction and prepare_layout_rewrite
* It consists of the current loop structure and the history steps to reach this state. */
class StateNode : public Object {
 public:
  std::vector<Stage> stages;  // Current stages and loop structures
vector<Stage> -> Array<Stage>
ObjectRef aux_info);
// Schedule primitives
void reorder(int stage_id, const std::vector<Iterator>& order);
Let us move the schedule primitives to the StateNode instead
This PR is good for getting a global context. Some further comments: to make sure we get a thorough review and a smooth upstream, let us break this PR down further into several PRs; this also helps us to logically organize and think about the overall design architecture:
I agree with @tqchen that it would be helpful to break this up a little into separate PRs. Maybe it can be divided into the three parts discussed in the paper: task scheduling, program sampling, and performance tuning. That would make it much clearer what we're looking at.
@jwfromm The partitioning of the implementation is different from the organization of the paper, so we cannot upstream code according to the paper. We listed the integration steps in our RFC, and we will follow those steps.
I have outlined a proposal for a possible breakdown of this PR in the post above; please see if that makes sense.
Since this PR is incomplete, it might not be trivial for people to review. If we are going to break down this PR, I would suggest putting everything we have into this PR and using it as the reference when sending the other, smaller PRs.
…he#39) * lint fix * clang-format-fix * pylint fix * Update * Recover the double constructor of tvm::PrimExpr * Fix pylint * pylint fix * pylint fix
…pache#40) * Add MutateComputeLocation and MutateParallel in evolutionary search * fix lint
* improve loop state python API (stage_tensors -> stage_ops) * fix
* Bug Fix * Sample example of Custom TensorCore Matmul
After some discussion, we changed our upstream plan. We will distill a minimal version of Ansor and send it as the first PR.
Hi all,
In [RFC] Ansor: An Auto-scheduler for TVM (AutoTVM v2.0), we've introduced the auto-scheduler Ansor. In the RFC, we reached an agreement that we should replace AutoTVM with Ansor.
For most existing templates, current Ansor can directly replace them with better performance and less tuning time.
For other special templates (low-precision, sparse), the plan is to introduce search space customization and gradually rewrite them with Ansor's new API.
This is the first PR according to the integration plan mentioned in the RFC.
This PR contains the infrastructure for search (the definition of state and actions) and small modifications outside the Ansor folder.
Infrastructure for search: A lightweight IR
Automatic scheduling is a search problem. For a search problem, we need to define the states and actions.
The state of schedule search is the loop structure defined by the schedule (i.e., the TVM IR created by `tvm.lower`). The actions are schedule primitives that manipulate the loop structures (e.g., split, reorder, fuse).

To enable flexible manipulation of the loop structures, we implemented a lightweight loop structure IR (intermediate representation) specifically for search, along with all schedule primitives for this IR. Basically, it is a simplified TVM IR. We do not use the existing TVM IR because:
After the search is done, we will lower this IR to TVM IR with TVM schedule primitives.
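As a conceptual illustration of this state/action design (a hedged, pure-Python sketch with hypothetical names such as `State` and `SplitStep`, not Ansor's actual classes or API): a state holds the current loop structure plus the history of transform steps, and every schedule primitive both mutates the loops and records a step so the sequence can later be replayed during lowering.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the "loop structure + transform history" idea.
# The real Ansor data structures live in src/ansor/loop_state.* and
# src/ansor/transform_step.*; the names below are illustrative only.

@dataclass
class SplitStep:
    stage: str
    iter_name: str
    factor: int

    def apply(self, loops):
        # Replace one iterator with an (outer, inner) pair.
        new = []
        for name, extent in loops[self.stage]:
            if name == self.iter_name:
                new.append((name + ".outer", extent // self.factor))
                new.append((name + ".inner", self.factor))
            else:
                new.append((name, extent))
        loops[self.stage] = new

@dataclass
class State:
    # stage name -> list of (iterator name, extent)
    loops: dict
    history: list = field(default_factory=list)

    def split(self, stage, iter_name, factor):
        step = SplitStep(stage, iter_name, factor)
        step.apply(self.loops)           # update the loop-structure preview
        self.history.append(step)        # record the action for later replay

s = State({"C": [("i", 512), ("j", 512)]})
s.split("C", "i", 32)
print(s.loops["C"])    # [('i.outer', 16), ('i.inner', 32), ('j', 512)]
print(len(s.history))  # 1
```

The key design point this sketch captures is that the state is cheap to copy and mutate during search, while the recorded history is what ultimately gets lowered back to real TVM schedule primitives.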
Key data structures

- `ComputeDAG`: the compute declaration graph and its related analysis tools. Related files: `src/ansor/compute_dag.*`, `python/tvm/ansor/compute_dag.py`. This is the entrance data structure of Ansor. Ansor takes a compute declaration described by `tvm.compute` as input and converts it to this data structure for analysis.
- `TransformStep`: this defines the "action" for the search problem, i.e., the schedule primitives for our IR. Related files: `src/ansor/transform_step.*`, `python/tvm/ansor/loop_state.py`. Each step has its corresponding `tvm.te` schedule primitive. We record all `TransformStep`s for every state as its transform history. After the search is done, these transform steps will be lowered with their corresponding TVM schedule primitives.
- `State`: this defines the "state" for the search problem, i.e., the current loop structure and the history transform steps used to reach this state. Related files: `src/ansor/loop_state.*`, `python/tvm/ansor/loop_state.py`. A state consists of a current loop structure and the transform history used to reach it. The loop structure keeps a preview of how the schedule will finally look after lowering (how many iterators there are, the extent of each iterator, the locations of iterators that have been moved by compute_at, ...), which helps the search policy make decisions during the search. The history is a sequence of `TransformStep`s which will finally be mapped to schedule primitives.

Example Walkthrough
While the search policy is implemented in C++, we also provide a Python API for the new IR. This is intended to be used for search space customization. The primitives look very similar to the existing schedule primitives, as shown in `python/tvm/ansor/loop_state.py`. The API design is ongoing and may be updated later.

Take `tests/python/unittest/test_ansor_loop_state.py:test_split_fuse_reorder_annotation()` as an example: we can print out the test states. A state stores all history transform steps required to reach it, and we can print these transform steps as TVM Python API calls.
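The replay idea can be sketched as follows (a hedged illustration with hypothetical names, not the real Ansor replay code): each recorded step corresponds to one TVM schedule primitive, so a list of steps can be dispatched onto any schedule-like object.

```python
# Hypothetical illustration of replaying recorded transform steps onto a
# schedule-like object. In real Ansor, recorded TransformSteps are replayed
# onto a tvm.te schedule after the search finishes; names here are made up.

class RecordingSchedule:
    """Stands in for a tvm.te schedule; it just records primitive calls."""
    def __init__(self):
        self.calls = []

    def split(self, stage, it, factor):
        self.calls.append(("split", stage, it, factor))

    def reorder(self, stage, order):
        self.calls.append(("reorder", stage, tuple(order)))

def replay(steps, sched):
    # Each step is (primitive_name, *args); dispatch to the matching method.
    for name, *args in steps:
        getattr(sched, name)(*args)

steps = [
    ("split", "C", "i", 32),
    ("reorder", "C", ["i.outer", "j", "i.inner"]),
]
sched = RecordingSchedule()
replay(steps, sched)
print(sched.calls)
# [('split', 'C', 'i', 32), ('reorder', 'C', ('i.outer', 'j', 'i.inner'))]
```

Because the history fully determines the final schedule, the same step list can be printed as API calls, replayed, or serialized.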
We can also replay these steps to get a schedule for `tvm.lower` and `tvm.build`, and the steps of a state can be serialized into the log file.
Ansor serializes all transform steps to the log file. This is different from AutoTVM which only serializes parameters.
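To illustrate the difference in log granularity (using a hypothetical record layout, not Ansor's or AutoTVM's actual log schema): a step-level log lets a reader reconstruct the full schedule without the search code, while a parameter-only log is meaningful only together with the template that consumes it.

```python
import json

# Hypothetical log records; the real Ansor and AutoTVM formats differ.
ansor_style_record = {
    "input": "matmul_512",
    "steps": [
        ["SP", "C", "i", [32]],   # split iterator i of stage C by 32
        ["RE", "C", [0, 2, 1]],   # reorder the resulting iterators
    ],
}
autotvm_style_record = {
    "input": "matmul_512",
    # Only meaningful when paired with the template that reads these knobs.
    "config": {"tile_i": 32, "order": 1},
}

line = json.dumps(ansor_style_record)   # one JSON record per log line
restored = json.loads(line)
print(restored["steps"][0])             # ['SP', 'C', 'i', [32]]
```

Serializing whole step sequences is what makes Ansor's logs self-describing, at the cost of larger records than AutoTVM's parameter tuples.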
In the next few PRs, we will introduce the search policy and tutorials for single-op/subgraph schedule search, Relay integration and tutorials for end-to-end network schedule search, and custom rules to support customized search spaces. When Ansor is able to fully support AutoTVM's features, we can gradually deprecate AutoTVM.
This is a joint work by @merrymercy @jcf94 @minminsun @FrozenGene @comaniac @yangjunpro @yidawang .
Changes of original TVM code outside Ansor folders (Will later split these to separate PRs)
include/tvm/runtime/device_api.h
,src/runtime/cuda/cuda_device_api.cc
,src/runtime/opencl/opencl_device_api.cc
)src/te/schedule/schedule_dataflow_rewrite.cc
)include/tvm/runtime/c_runtime_api.h
,include/tvm/runtime/ndarray.h
,python/tvm/runtime/ndarray.py
,src/runtime/ndarray.cc
)src/tir/analysis/verify_gpu_code.cc
)src/tir/transforms/unroll_loop.cc
,tests/python/unittest/test_tir_transform_unroll_loop.py
)src/arith/rewrite_simplify.cc
)src/runtime/rpc/rpc_module.cc
,src/runtime/threading_backend.cc
)#(34) Add call_all_topi_functions to RelayBuildModule (src/relay/backend/build_module.cc
,python/tvm/relay/build_module.py
)