Apache TVM v0.8 Release Note
Overview
Apache TVM v0.8 brings several major, exciting experimental features, including:
- PaddlePaddle frontend
- TVMScript: round-trippable Python-based syntax for TIR
- TorchScript integration
- TensorIR scheduling language
- TensorRT and CUTLASS integration via BYOC
- Int4 TensorCore support in AutoTVM
- MicroTVM Project API and Zephyr, Arduino support
- AOT executor
- Robust Windows support
- Affine analysis infra: iter-affine-map
- Improved Vulkan backend
- CUDA graph support in TVM runtime
In addition, the community has been working together to refactor and evolve the existing infrastructure, including but not limited to:
- Relay compilation engine
- Relay pattern language
- CI and build process
- Refactoring documentation and tutorials
- Stabilizing AutoScheduler
- Stabilizing TVMC command line driver interface
- Stablizing target system
- Frontend coverage, quantization, dynamic shape, training
Full changelog: https://gist.github.com/junrushao1994/c669905dbc41edc2e691316df49d8562.
Accepted RFCs
The community has adopted a formal RFC process. Below is a list of the RFCs accepted since the process was introduced:
- [RFC-0005] Meta schedule (AutoTIR)
- [RFC-0006] Automatic mixed-precision pass and support
- [RFC-0007] Parametrized unit tests
- [RFC-0008] MicroTVM Project API
- [RFC-0009] Unified static memory planner
- [RFC-0010] Target-registered compiler flow customisation
- [RFC-0011] Arm® Ethos-U integration
- [RFC-0014] Pipeline executor
- [RFC-0015] Use CMSIS-NN with TVM
- [RFC-0019] Add PaddlePaddle frontend
- [RFC-0020] Extend metadata in project option
- [RFC-0022] TIR non-scalar constants
- [RFC-0023] Adding annotation field to `tir.allocate` nodes
- [RFC-0025] PyTorchTVM
- [RFC-0027] Formalize TVM documentation organization
- [RFC-0028] Command line composition from internal registry
- [RFC-0029] Migrating target attributes to IRModule
- [RFC-0030] Command line configuration files
- [RFC-0031] C Device API
- [RFC-0036] TVMScript namespace
- [RFC-0041] Update TVMScript block syntax
Features and Improvements
TE, TIR, TVMScript
- TVMScript parser and printer (example after this list) #7630 #9115 #9286
- Scheduleable TIR (S-TIR) infrastructure, analysis and lowering passes #7553 #7765 #7847 #8114 #8121 #7873 #7923 #7962 #7848 #8044 #7806
- S-TIR schedule primitives: `compute-inline`, `reverse-compute-inline`, `fuse`, `split`, `rfactor`, `storage-align`, `vectorize`, `unroll`, `bind`, `reorder`, `cache-read`, `cache-write`, `compute-at`, `reverse-compute-at`, `decompose-reduction` #8170 #8467 #8544 #8693 #8716 #8767 #8863 #8943 #9041
- While loop in TIR #7425 #9004
- Metaprogramming in S-TIR via `specialize` #8354
- Support for return values in TIR #7084 #7932
- Storage scope support in `PointerType` #8017 #8366 #8463
- Creation of S-TIR via TE compute #7987
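To illustrate the round-trip property and the new schedule primitives, here is a minimal sketch. It assumes the `tvm.script` namespace from RFC-0036 and the block syntax from RFC-0041; exact spellings vary slightly across versions.

```python
# A minimal TVMScript round-trip and S-TIR scheduling sketch (illustrative).
import tvm
from tvm.script import tir as T

@T.prim_func
def add_one(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128,), "float32")
    B = T.match_buffer(b, (128,), "float32")
    for i in range(128):
        with T.block("B"):
            vi = T.axis.spatial(128, i)
            B[vi] = A[vi] + T.float32(1)

# Round trip: print the PrimFunc as Python source, then re-parse it.
reparsed = tvm.script.from_source(add_one.script())
assert tvm.ir.structural_equal(add_one, reparsed)

# Drive a few S-TIR schedule primitives on the same function.
sch = tvm.tir.Schedule(add_one)
(i,) = sch.get_loops(sch.get_block("B"))
i_outer, i_inner = sch.split(i, factors=[None, 32])
sch.unroll(i_inner)
print(sch.mod.script())  # the transformed module is itself TVMScript
```

Because S-TIR schedules rewrite an IRModule rather than a TE graph, every intermediate state can be printed and re-parsed the same way.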
AutoTVM, AutoScheduler, Meta Schedule
- `PopenPoolExecutor` replaces Python's native multiprocessing pools, providing more robust multiprocessing and enabling auto-tuning in Jupyter notebooks for AutoTVM and AutoScheduler #6959 #8492 #8913 #8820 #8851
- AutoScheduler improvement and stabilization: task scheduler, layout rewrite, early stopping, dispatching (tuning example after this list) #6945 #6750 #6987 #7156 #8862 #8995 #7571 #7376 #7377 #7344 #7185
- AutoScheduler support for sparse workloads #7313 #7635 #8065
- AutoScheduler support for Vulkan, ROCm, Mali #7626 #7038 #7132
- AutoTVM support for int4 TensorCore #7831 #8402
- Meta Schedule core infrastructure, builder runner and database #8615 #8623 #8642 #8817 #9079 #9132 #9154 #9053 #9059 #9044 #9111 #9061 #9153
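As a quick orientation to the tuning flow above, a minimal AutoScheduler sketch; the toy network, file name, and trial budget are illustrative.

```python
# A minimal AutoScheduler tuning sketch (illustrative toy workload).
import tvm
from tvm import relay, auto_scheduler

data = relay.var("data", shape=(1, 64), dtype="float32")
weight = relay.var("weight", shape=(64, 64), dtype="float32")
mod = tvm.IRModule.from_expr(relay.nn.dense(data, weight))

target = tvm.target.Target("llvm")
tasks, task_weights = auto_scheduler.extract_tasks(mod, {}, target)

# The task scheduler distributes the measurement budget across tasks.
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tuner.tune(auto_scheduler.TuningOptions(
    num_measure_trials=64,
    measure_callbacks=[auto_scheduler.RecordToFile("tuning_records.json")],
))
```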
Operator Coverage
- Operators for Int-8 vision transformer on GPU #7814
- Optimizing NMS and ROI-related kernels on GPU #7257 #7172 #7136 #7796 #7463 #6516 #7440 #7666 #8174
- Support and optimize sparse operators #8605 #7477 #7435 #6889 #6580 #8437
- Sort-related operators and optimization #9184 #7669 #8672 #7611 #7195 #7056 #6978
- Support for einsum operator #6370
- Matmul, dense operators and their optimization #8921 #8527 #8234 #8250 #6616 #8229 #8401 #7404 #8669
- Convolution and pooling operators and their optimization #8620 #8936 #8584 #7075 #7142 #7515 #6999 #6899 #6840 #6137 #6802 #6445 #6711 #6714 #8167 #8222 #8275 #8276 #8422 #8430 #6687 #7928 #8897
- Scatter and gather operators and their optimization #8479 #7600 #7044 #7464 #7233 #6533 #6856 #6854 #7927 #8105
- Prefix scan, cumsum and cumprod #7722 #7303 #7314 #7334 #7123 #6868
- Dynamic shape and shape functions #7414 #6979 #6912 #6898 #6373 #8068 #7490 #7487
- Miscellaneous improvements: operators including reshape, resize, pad, PRNG, transpose, where, softmax, concat, nll_loss, space_to_batch_nd, batch_to_space_nd, slice_like; library support including Thrust, cuDNN, cuBLAS, MIOpen; improved schedules for generic reduction and softmax. #8592 #7375 #7287 #7184 #7131 #7086 #7083 #8030 #6851 #6477 #8346 #6759 #8028 #8056 #8369 #7468 #7458 #7194 #8138 #8543
Training
Relay
- Pattern language and mixed-mode visitor: matching more IR constructs, fuzzy matching, and converting more passes to non-recursive form (pattern example after this list). #8843 #7754 #7355 #7332 #7282 #7151 #7120 #6958 #7507 #8325 #8774 #7817 #7374 #6695 #6704
- Improving or adding passes including ExtractOperators, SimplifyExpr, DynamicToStatic, DefuseOps, ConvertLayout and FoldConstant; added a set of utilities that lets models run efficiently on TensorCores #9253 #9245 #8996 #7827 #9034 #7807 #8755 #7731 #7368 #7603 #7656 #7423 #7354 #6946 #6748 #6720 #6776 #7835 #7895 #8205
- TECompiler and refactoring of compilation workflow #9103 #8974 #8886 #8802 #8501 #8526 #8486 #8597 #7518 #7552 #8914 #9130
- Quantization and automatic-mixed precision #8883 #8810 #8644 #7613 #8069 #8341 #8126 #8460
- Parser, printer and diagnostic #7347 #6274 #6692 #8352 #8000
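A minimal sketch of the pattern language referenced above, matching a conv2d followed by relu; the shapes are illustrative.

```python
# Match conv2d -> relu with the Relay dataflow pattern language.
from tvm import relay
from tvm.relay.dataflow_pattern import is_op, wildcard

pattern = is_op("nn.relu")(is_op("nn.conv2d")(wildcard(), wildcard()))

data = relay.var("data", shape=(1, 3, 32, 32))
weight = relay.var("weight", shape=(8, 3, 3, 3))
expr = relay.nn.relu(relay.nn.conv2d(data, weight))

assert pattern.match(expr)  # the expression matches the pattern
```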
MicroTVM, AOT, Graph Executor and VM
- Pipeline Executor #8702 #9108
- CUDA graph integration in graph executor #7616
- Add `set_output_zero_copy` support in graph executor (a baseline executor example follows this list) #8497
- VM: memory allocation improvements, shape function improvements and misc #7746 #7451 #7413 #7210 #8040 #6938 #8661 #7676 #8285
- AOT compilation and execution #8697 #7785 #8014 #8023 #8096 #8075
- Project API infrastructure: #8380 #8963 #8708 #8019
- MicroTVM: Zephyr, Arduino, RVM and AutoTVM support #9320 #8941 #7804 #7786 #7449 #7891 #7915 #8055 #8037 #8386 #8519 #8748 #8154 #8945 #8624 #8701 #7723 #8715 #7225 #6964 #7813 #7528
- The pure C runtime (CRT) #7398 #7333 #7095 #7225
- Model library format #8270 #8072 #7938
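For context, the baseline graph-executor flow that the zero-copy and CUDA-graph work extends looks roughly like this; the toy module and shapes are illustrative.

```python
# Build and run a tiny Relay module with the graph executor.
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

x = relay.var("x", shape=(1, 16), dtype="float32")
mod = tvm.IRModule.from_expr(relay.nn.softmax(x))
lib = relay.build(mod, target="llvm")

dev = tvm.cpu()
m = graph_executor.GraphModule(lib["default"](dev))
m.set_input("x", np.random.rand(1, 16).astype("float32"))
m.run()
print(m.get_output(0).numpy())
```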
Arithmetic Analysis
- Tighter bounds and more simplification on cast #6771 #7045
- Introducing iterator (quasi-)affine map detection (example after this list) #6667 #7752 #7759
- Inverse of iterator affine map #8384 #8427
- Subspace division in iterator affine map #7760
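A small sketch of iter-affine-map detection, assuming the `tvm.arith.detect_iter_map` binding; its return convention has changed across releases, so treat this as illustrative.

```python
# Detect whether an index expression is a (quasi-)affine map of iterators.
import tvm
from tvm import arith, tir

x = tir.Var("x", "int32")
y = tir.Var("y", "int32")

# Ask the analyzer to prove that x*16 + y is affine in
# x in [0, 8) and y in [0, 16).
res = arith.detect_iter_map(
    [x * 16 + y],
    {x: tvm.ir.Range(0, 8), y: tvm.ir.Range(0, 16)},
)
print(res)  # an empty result means the indices are not affine in the iterators
```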
Frontends
- PaddlePaddle initial support (an import example follows this list) #8645 #9124 #9126 #9295 #9370 #9236 #9283
- ONNX support, including better handling of control flow, coverage of more operators, better dynamic shape support, more tests. #9265 #9178 #9146 #8894 #8966 #8967 #7818 #9000 #9001 #9066 #9028 #9002 #8985 #9019 #9017 #8972 #7802 #7800 #7781 #8919 #9054 #8906 #8933 #8959 #8907 #7771 #8923 #8924 #7755 #7720 #8773 #8872 #7655 #8741 #7633 #8781 #8866 #8867 #7522 #7519 #7489 #7438 #7429 #7364 #7300 #7259 #7243 #7237 #7208 #7189 #7115 #7109 #7089 #7036 #7031 #6839 #6351 #7842 #7844 #6646 #6647 #6681 #6700 #7883 #6726 #6730 #7899 #7900 #7906 #7934 #7956 #8007 #8011 #8084 #8099 #8189 #8191 #8304 #8321 #8337 #8356 #8385 #8502 #8426 #8440 #8456 #8475 #7391 #7394 #8621 #8322 #8323 #8435 #8436 #8455 #7353 #7215
- TensorFlow and TFLite, including more operators, better TensorArray support and quantization #9404 #9256 #8689 #7789 #7736 #8763 #8647 #8648 #8558 #8780 #8538 #7659 #7639 #7531 #7520 #7502 #7496 #7473 #7452 #7442 #7441 #7400 #7320 #7293 #7267 #7159 #7148 #7114 #7113 #7093 #7074 #7048 #7030 #6998 #6984 #6970 #6949 #6933 #6918 #6901 #6885 #6849 #5767 #6589 #6670 #6674 #6675 #7866 #6685 #7885 #6729 #7901 #6774 #6783 #6799 #7951 #8024 #8051 #8060 #8074 #8142 #8179 #8251 #8277 #8335 #8364 #8375 #8431 #8454 #6818 #8483 #9099 #9165
- PyTorch: more operators including activations, inplace operators, RNNs, NMS #9371 #9204 #9185 #9135 #9133 #9015 #8839 #8718 #8699 #8692 #7712 #8753 #7694 #8583 #7675 #7646 #7606 #7592 #7569 #7544 #7549 #7535 #7517 #7465 #7397 #7371 #7348 #7346 #7325 #7231 #7174 #7154 #7137 #7134 #7133 #7128 #7088 #7023 #6900 #6602 #7845 #6659 #6740 #6782 #6784 #7958 #8192 #8397 #8398 #8403 #8447 #6829
- MXNet support. More operators and NLP model coverage in GluonNLP #7568 #7409 #7209 #7191 #7062 #6561 #6699
- Misc: CoreML, Keras, DarkNet, etc. #7667 #6676 #6651 #6963 #7949 #7035 #7446 #8562 #8599
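The frontends above share a similar entry point; for example, a minimal ONNX import sketch, where the model path and input name/shape are illustrative.

```python
# Import an ONNX model into Relay (illustrative path and shapes).
import onnx
from tvm import relay

onnx_model = onnx.load("model.onnx")
mod, params = relay.frontend.from_onnx(
    onnx_model, shape={"input": (1, 3, 224, 224)}
)
```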
Codegen Backends and Runtime
- LLVM backend: recover LLVM support on Windows; support target feature strings in function attributes; atomics in NVPTX and ROCm; compatibility with LLVM 12+ #9305 #9223 #9138 #8860 #8958 #6763 #6698 #6717 #6738 #8293 #6907 #7051
- ROCm 3.9 bitcode file search #6865
- Vulkan and SPIR-V: refactoring and major improvements in codegen and runtime. A critical bug fix in the SPIR-V codegen allows the Vulkan backend to produce correct outputs on more hardware and drivers. Added support for querying device-specific hardware parameters and capabilities, dynamic shapes, irregular ops such as sorting and NMS, UBO, fp16, and vectorization. Complicated models like MaskRCNN now run end to end on Vulkan. #8904 #7833 #7717 #7681 #8746 #8813 #7609 #8882 #7607 #7591 #7574 #7572 #7833 #6662 #7969 #8013 #8048 #8098 #8102 #8107 #8127 #8151 #8196 #8320 #8588 #8332 #8333 #8348 #8528
- Metal: language version upgrade (`MTLLanguageVersion2_3`), better codegen support, int64 support, various bug fixes #7830 #7819 #7714 #7118 #7116 #7105 #7980 #8054 #8175 #8202 #8206 #8313
- OpenCL, VTA, Verilator: refactored code generator, better error messages, various bug fixes #7834 #7777 #7761 #7100 #6125 #6126 #6191 #7834 #8256 #8257 #8731 #8756 #8973
- CUDA: enable `__launch_bounds__`, dynamic shared memory, TensorCore, BF16, half2, NVCC version upgrade #9341 #8678 #7561 #7273 #7146 #7147 #7099 #7065 #7033 #7014 #7907 #7964 #9087 #8135 #8137 #8457 #8466 #8571
- Arm: CMSIS-NN and Ethos-N integration #8653 #7628 #8951 #7506 #7443 #7858 #6982 #8795 #8806 #8833 #9147 #9159 #9160 #9162 #9163 #9167 #9209 #9386 #9387
- Hexagon: build and compilation flow, model launcher, more target options and better runtime #7784 #6718 #8821 #8822 #9033 #8823 #8859 #8865 #8915 #8954 #9024 #9025 #8960 #8986 #9010 #9011 #9189 #9220 #9355 #9356
- WASM: update support for the latest emcc; add an FFI test #6751
BYOC Integration with Vendor Libraries: TensorRT, ACL, VitisAI
- TensorRT initial integration, stabilization, int8 calibration, dynamism support #6395 #7702 #7595 #7581 #7412 #7372 #9047 #8073 #8808 #6905 #7967 #8005 #8172 #8461 #8506 #8607 #7205 #7026 #7016 #7011 #6955 #6872 #7253 #6805 #9324
- Arm Compute Library (ACL) integration #7649 #7206 #6532 #7121 #6724 #8149 #7251 #9396
- Verilator integration #7406 #7351 #7286 #8094
- VitisAI integration #6343 #7350
- BYOC infrastructure enhancement: improving control flow, AnnotateTarget, custom codegen (partitioning example below) #6641 #6655 #6697 #6786 #7977 #8464
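A generic sketch of the BYOC partitioning flow built from the passes named above; "tensorrt" stands in for any registered external codegen target.

```python
# Partition a Relay module for an external BYOC codegen target.
import tvm
from tvm import relay

def partition_for_external_codegen(mod, compiler="tensorrt"):
    seq = tvm.transform.Sequential([
        relay.transform.AnnotateTarget(compiler),  # tag ops the codegen supports
        relay.transform.MergeCompilerRegions(),    # grow contiguous regions
        relay.transform.PartitionGraph(),          # split out external functions
    ])
    return seq(mod)
```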
TVMC
- macOS support #8396
- AutoScheduler support #7070
- Support cross compiler options #7922
- Python scripting (example after this list) #7823 #7698
- More flexible input specification #7366 #7788
- More options: `--disable-pass` and `--config` #7816 #8253
- Allow passing optional arguments to importers #7674
- Model library format (MLF) support #8086 #8331
- More backend and library support: metal, ACL, Vulkan, OpenCL, ROCm, Vitis AI #8282 #7508 #8359 #6831 #8896 #7577
- Support for the new target system #7651 #7654 #6788 #7304 #6855
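A minimal sketch of the TVMC Python scripting API mentioned above; the model path and target are illustrative.

```python
# Load, compile, and run a model via TVMC's Python API.
from tvm.driver import tvmc

model = tvmc.load("model.onnx")               # import a model
package = tvmc.compile(model, target="llvm")  # compile for a target
result = tvmc.run(package, device="cpu")      # run with generated inputs
print(result)
```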
Rust Binding
- Rust bindings installable via Cargo #7503 #6678 #8631 #8665
- Initial support for diagnostic interface #6656
- Fixes for using Python APIs from Rust #7085
- Improve NDArray, GraphRt, Relay, IRModule, Array, Attrs bindings #6563 #6741 #7138 #8353 #7082
- Improve error handling and error messages; fix memory leaks #8289 #6815 #8714 #8725
Misc
- Enhanced CPP-RPC implementation: allow a user-supplied work directory, support the CPP-RPC server on Apple platforms, support adb-shell-style CPP-RPC #7670 #8224 #8223 #7766 #7013
- Use PopenWorker to handle the RPC system #7889 #7757 #7961
- Fold target host into target #7462 #7791 #7534 #8835
- Target-based intrinsic lowering and legalization #7936 #7809
- Add target tags for all existing CUDA GPU models (example after this list) #7410
- Linear Congruential Random Engine #8642
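A small sketch of target tags: a tag expands into a full target with pre-filled attributes. The exact tag inventory depends on the TVM build.

```python
# Construct a target from a GPU model tag instead of spelling out attributes.
import tvm

target = tvm.target.Target("nvidia/geforce-rtx-2080")
print(target.kind, target.attrs)
```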