diff --git a/NEWS.md b/NEWS.md
index 106e056a655f..5554727fe256 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -26,6 +26,1344 @@
 Refer to the Roadmap issue for complete list on on-going version features.
 If you check in something that is not reflected in Roadmap issue, please reply to that issue so it can get added.
+## 0.7
+v0.7 brings many major features. The community worked together to refactor the internal code base to bring a unified IR code structure with a unified IRModule, type system and pass infrastructure. We have also brought many exciting new features; some highlights include:
+
+* Initial automatic scheduling support
+* Initial command line driver interface
+* WebGPU and WebAssembly support
+* Better first-class Rust support in the codebase
+* Initial Hexagon support
+* Bring your own codegen (BYOC) support
+
+The community also continues to bring high-quality improvements to the existing modules including, but not limited to: better frontend coverage, performance, quantization, uTVM and dynamic shape support.
+
+## New Features
+### Automatic Scheduling (Experimental)
+* Phase 0: Ansor minimum system for auto schedule generating #5962
+* Phase 1: Access Analyzer #6103
+* Phase 1: Add `follow_split` and `follow_fused_split` steps #6142
+* Phase 1: Add `pragma`/`storage_align`/`rfactor` steps #6141
+* Phase 1: Add RPC Runner #6077
+* Phase 1: Add `annotation`/`compute_at`/`compute_root`/`compute_inline` steps #6073
+* Phase 1: Add `cache_read`/`cache_write` steps #6107
+* Phase 1: Rename namespace from `auto_schedule` to `auto_scheduler` #6059
+* Phase 1: The base class for cost models #6187
+* Phase 1: Feature extraction for cost models #6190
+* Phase 1: XGBoost Cost Model #6270
+* Phase 2: Basic GPU Sketch Search Policy #6269
+* Phase 2: Evolutionary Search #6310
+* Phase 2: Update heavy operations with `parallel_for` #6348
+* Parallelize the InitPopulation (#6512)
+* Tutorial: Using the template-free auto-scheduler on CPU (#6488)
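+
+A minimal sketch of the template-free auto-scheduling flow, adapted from the CPU tutorial (#6488); the API is experimental and exact signatures may change:
+
+```python
+import tvm
+from tvm import te, auto_scheduler
+
+@auto_scheduler.register_workload
+def matmul(N, L, M, dtype):
+    A = te.placeholder((N, L), name="A", dtype=dtype)
+    B = te.placeholder((L, M), name="B", dtype=dtype)
+    k = te.reduce_axis((0, L), name="k")
+    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
+    return [A, B, C]
+
+target = tvm.target.Target("llvm")
+task = auto_scheduler.create_task(matmul, (128, 128, 128, "float32"), target)
+tune_option = auto_scheduler.TuningOptions(
+    num_measure_trials=10,
+    measure_callbacks=[auto_scheduler.RecordToFile("matmul.json")],
+)
+# Search for a schedule, then build the best one found.
+sch, args = auto_scheduler.auto_schedule(task, tuning_options=tune_option)
+func = tvm.build(sch, args, target)
+```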
+
+### BYOC
+* External codegen support in Relay (#4482), (#4544)
+* Bring Your Own Codegen Guide -- Part 1 #4602
+* Bring Your Own Codegen Guide -- Part 2 #4718
+* Relay annotation and partitioning for external compilers #4570
+* JSON Runtime with DNNL End-to-End Flow #5919
+* Handle one symbol for each runtime #5989
+* Run accelerator specific optimizations #6068
+* Arm Compute Library integration #5915
+* Retire the example json runtime #6177
+* `json_node.h` should include `data_type.h` #6224
+* Improve installation tutorial #6170
+* Add support for dense (fully connected) layer #6254
+* Introduce the Ethos-N BYOC integration #6222
+* Enable remote device via environment variables #6279
+* Improved pooling support #6248
+* Add support for quantized convolution #6335
+* CoreML codegen #5634
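+
+A sketch of the usual BYOC partitioning pipeline, assuming a codegen named "dnnl" is registered and enabled in the build:
+
+```python
+import tvm
+from tvm import relay
+from tvm.relay import transform
+
+x = relay.var("x", shape=(1, 3, 224, 224))
+w = relay.var("w", shape=(16, 3, 3, 3))
+mod = tvm.IRModule.from_expr(relay.nn.conv2d(x, w))
+
+seq = tvm.transform.Sequential([
+    transform.AnnotateTarget("dnnl"),   # mark ops the external codegen claims
+    transform.MergeCompilerRegions(),   # merge adjacent supported regions
+    transform.PartitionGraph(),         # split regions into external functions
+])
+mod = seq(mod)
+```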
+
+### Operator Coverage
+* Add `strided_set` operation (#4303)
+* Add support for conv3d (#4400), pool3d (#4478), 3d upsampling ops (#4584)
+* Add group convolution for VTA (#4421)
+* Add 1d deconvolution op (#4476)
+* Allow batch matmul to be fused into injective ops (#4537)
+* Add native depthtospace and spacetodepth operators (#4566)
+* Add CUDNN conv3d support (#4418)
+* Dilation2D operator support #5033
+* Isfinite operator #4981
+* Unravel Index operator #5082
+* Add thrust support for nms #5116
+* Resize3d, Upsample3d op support #5633
+* Add operator Correlation #5628
+* `affine_grid` and `grid_sample` #5657
+* Sparse to dense operator #5447
+* `Conv3d_transpose` op support added #5737
+* Add op `crop_and_resize` #4417
+* Add bitwise ops #4815
+* Support dynamic NMS (Non Maximum Suppression), symbolic begin, end, and strides for `strided_slice` #4312
+* ReverseSequence operator #5495
+* Conv1D #4639
+* 1D Pooling #4663
+
+### Quantization
+* Channel wise quantization - Quantize & Requantize #4629
+* Support QNN ops. #5066
+* Adding support for QNN subtract op #5153
+* TFLite QNN Tutorial #5595
+* Tutorial: Deploy Quantized Model on CUDA #4667
+* Support asymmetric per-layer quantized operators #6109
+
+### Relay
+* Add convertlayout pass in Relay (#4335, #4600)
+* Added Merge Composite pass #4771
+* Call graph for relay #4922
+* Add inline pass #4927
+* Target annotation for external codegen #4933
+* GradientCell Relay Pass #5039
+* Add MergeCompilerRegions pass #5134
+* Non-recursive Graph Visitor and Rewriter (#4886)
+* [Blocksparse] Pipeline for lowering dense model to sparse-dense (#5377)
+* Relay op strategy #4644
+* Static Tensor Array (#5103)
+* Memory planner (part 1) #5144
+* ONNX codegen #5052
+* Add Parser 2.0 #5932, part 2 #6162
+* Basic block normal form #6152
+* Convert Layout pass. #4664
+* Pattern Language, Matcher, Rewriter, and Function Partitioner #5231
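+
+The new pattern language makes Relay graph matching and partitioning declarative. A small sketch matching a conv2d -> relu chain:
+
+```python
+import tvm
+from tvm import relay
+from tvm.relay.dataflow_pattern import is_op, wildcard
+
+# Describe the dataflow shape to look for.
+pattern = is_op("nn.relu")(is_op("nn.conv2d")(wildcard(), wildcard()))
+
+x = relay.var("x", shape=(1, 3, 32, 32))
+w = relay.var("w", shape=(8, 3, 3, 3))
+expr = relay.nn.relu(relay.nn.conv2d(x, w))
+
+assert pattern.match(expr)
+partitioned = pattern.partition(expr)  # group each match into a composite function
+```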
+
+### Runtime and Backend
+* Add ADTObject POD container type (#4346)
+* TFLite RPC runtime (#4439)
+* Standardized graph runtime export (#4532)
+* MISRA-C compliant TVM runtime #3934
+* Add String container #4628
+* Introduce Virtual Memory Allocator to CRT (#5124)
+* Initial implementation of Hexagon runtime support (#5252)
+* FastRPC interface for Hexagon runtime (#5353)
+* CoreML Runtime (#5283)
+* AutoTVM + uTVM for Cortex-M7 (#5417)
+* Windows Support for cpp_rpc (#4857)
+* Implement TVMDSOOp (TensorFlow custom op) for TVM runtime (#4459)
+* WebGPU support #5545
+* TVM WebAssembly JS Runtime #5506
+* Hexagon driver for offloading kernels to simulator #5492
+* Introduce runtime::Array #5585
+* Allow non-nullable ObjectRef, introduce Optional. (#5314)
+* Introduce static slots for common objects. (#5423)
+* Introduce RValue reference (move) support to TypedPackedFunc (#5271)
+* Introduce MetadataModule to separate code compilation/interpretation and weight initialization #5770
+* Support module based interface runtime #5753
+* Add TVM application extension with WASM runtime #5892
+* Provide a guide for users who have difficulty registering SEqualReduce (#5300)
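+
+The module-based interface runtime (#5753) lets one compiled artifact be exported, loaded, and instantiated through a single module. A sketch of the flow:
+
+```python
+import numpy as np
+import tvm
+from tvm import relay
+from tvm.contrib import graph_runtime
+
+x = relay.var("x", shape=(1, 8))
+mod = tvm.IRModule.from_expr(relay.nn.softmax(x))
+lib = relay.build(mod, target="llvm")
+
+lib.export_library("deploy.so")
+loaded = tvm.runtime.load_module("deploy.so")
+
+# "default" returns a factory that instantiates the graph runtime on a context.
+gmod = graph_runtime.GraphModule(loaded["default"](tvm.cpu()))
+gmod.set_input("x", np.random.rand(1, 8).astype("float32"))
+gmod.run()
+out = gmod.get_output(0).asnumpy()
+```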
+
+### Rust Support
+* Revive the Rust + SGX refactor #4976
+* Improve Rust bindings: Map, Array, String, various IR nodes #6339
+* Rust Refactor Stage 4: Rewrite Rust graph runtime to use new APIs #5830
+* Second stage of Rust Refactor #5527
+* tvm crate stage 3 of Rust refactor #5769
+* Add first stage of updating and rewriting Rust bindings. #5526
+
+### TIR
+* Introduce StructuralHash for the Unified IR. #5160
+* Introduce StructuralEqual Infra for the unified IR. #5154
+* Introduce ExprDeepEqual, Remove IRDeepCompare #5206
+* [TIR] Introduce BufferLoad/Store (#5205)
+* Improved massive build times caused by tir.floormod and tir.floordiv. Fixed Topi testcase. #5666
+* Buffer logger assert removed #6147
+* Enhance VerifyGPUCode #6194
+* HoistIfThenElse added #6066
+* Hybrid Script Support for TIR #6227
+* Migrate Low-level Passes to Pass Manager #5198
+* Block scope hoisting added #6238
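+
+StructuralEqual/StructuralHash replace the old AlphaEqual/AttrsHash machinery and work uniformly across the unified IR. A tiny sketch:
+
+```python
+import tvm
+from tvm import te
+
+x = te.var("x")
+y = te.var("y")
+
+assert tvm.ir.structural_equal(x + 1, x + 1)
+assert not tvm.ir.structural_equal(x + 1, y + 1)
+# Free variables can be mapped by position.
+assert tvm.ir.structural_equal(x + 1, y + 1, map_free_vars=True)
+assert tvm.ir.structural_hash(x + 1) == tvm.ir.structural_hash(x + 1)
+```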
+
+### TE
+* Reverse-mode autodiff without any optimization #5121
+* Tensor Expression Debug Display (TEDD) #4651
+* Optimize and eliminate the Jacobian tensor for te.autodiff #6078
+
+### TVMC (Experimental)
+* TVMC - A command line driver for TVM (Part 1) #6112
+* TVMC - Linting error on onnx command line driver frontend #6536
+* TVMC - Command line driver 'compile' (part 2/4) #6302
+* TVMC - Introduce 'tune' subcommand (part 3/4) #6537
+* TVMC - Introduce 'run' subcommand (part 4/4) #6578
+* TVMC - Getting started tutorial for TVMC #6597
+
+## Feature Improvement
+### Accelerator and Microcontroller Support
+* Cleanup legacy Verilog code (#4576)
+* uTVM support for ARM STM32F746XX boards (#4274)
+* Add --runtime=c, remove `micro_dev` target, enable LLVM backend #6145
+
+### Arithmetic Analysis
+* Linear system and equation solver (#5171)
+* Inequalities solver #5618
+* Improve IntervalSet's floormod (#5367)
+* Remove legacy const pattern functions (#5387)
+* Handle likely in IRMutatorWithAnalyzer #5665
+* ExtendedEuclidean merge impl to int_operator #5625
+* Rewrite simplify fix for Vectorized Cooperative Fetching #5924
+
+### AutoTVM and Graph Tuner
+* Adding ROCM schedules for TOPI (#4507)
+* NHWC conv2d schedule templates for ARM (#3859)
+* Use VM compile to extract autotvm tasks #4328
+* Download fallback schedule file if it does not exist #4671
+* Ignore error when removing tmpdir #4781
+* Fix a bug in generating the search space #4779
+* Minor bug fixes in AutoTVM for QNN graphs #4797
+* Fix autotvm customized template #5034
+* Add opt out operator for `has_multiple_inputs` for graph tuner #5000
+* Customize SI prefix in logging (#5411)
+* Update XGBoost verbosity option #5649
+* Support range in index based tuners #4870
+* Enable random fill and CPU cache flush for AutoTVM and Ansor (#6391)
+* Auto-scheduler tutorial for GPU and necessary refactor/fix (#6512)
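+
+A sketch of the standard AutoTVM loop these improvements feed into, assuming an existing Relay module `mod` with weight dictionary `params`:
+
+```python
+from tvm import autotvm
+
+# Extract tunable tasks from the Relay program.
+tasks = autotvm.task.extract_from_program(mod["main"], target="llvm", params=params)
+measure_option = autotvm.measure_option(
+    builder=autotvm.LocalBuilder(),
+    runner=autotvm.LocalRunner(number=10),
+)
+tuner = autotvm.tuner.XGBTuner(tasks[0])
+tuner.tune(
+    n_trial=20,
+    measure_option=measure_option,
+    callbacks=[autotvm.callback.log_to_file("tuning.log")],
+)
+```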
+
+### BYOC
+* [BYOC] Bind constant tuples in graph partitioner (#5476)
+* [BYOC] Add support for composite functions in BYOC (#5261)
+* [BYOC] Register pattern tables from external codegens (#5262)
+* [BYOC] Enhance partitioning and external codegen (#5310)
+* [BYOC] Refine AnnotateTarget and MergeCompilerRegion Passes (#5277)
+* [BYOC] Use Non-Recursive Visitor/Mutator (#5410)
+* [BYOC] Refine DNNL Codegen (#5288)
+* [BYOC] Add example of Composite + Annotate for DNNL fused op (#5272)
+* [BYOC] Prevent duplicate outputs in subgraph Tuple (#5320)
+* [BYOC] Introduce further operator support (#6355)
+* [BYOC] Support input nodes with multiple entries (#6368)
+* [BYOC] Add maximum support for float32 (#6506)
+
+### Codegen
+* Intrinsic dispatching with OCML instead of LLVM for ROCm (#4499)
+* Make target codegen take IRModule and PrimFunc. #5107
+* Enhance CUDA codegen for SelectNode #4983
+* Vectorization for intrinsics #5101
+* [LLVM] Do not use `x86_vcvtph2ps_256` intrinsic with LLVM 11+ (#5267)
+* [LLVM] Use llvm::ElementCount with LLVM 11+ when creating vectors (#5265)
+* [LLVM] Use llvm::FunctionCallee in IRBuilder::CreateCall with LLVM 11+ (#5338)
+* [LLVM] Include Support/Host.h for declaration of getDefaultTargetTriple (#5268)
+* [LLVM] Replace calls to Type::getVectorNumElements (#5398)
+* [LLVM] Use ArrayRef in calls to CreateShuffleVector (#5399)
+* [LLVM] Use llvm::Align with LLVM 11+ to avoid warnings (#5264)
+* [CodeGen] Cleanup generated code (#5424)
+* Rename `target_id` => `target_kind` #6199
+* 64-bit RPi4b target #6211
+* Creating Target from JSON-like Configuration #6218
+* Add python binding to new JSON target construction #6315
+* Use target class in all codegens #6347
+* Initial support for Hexagon codegen #6261
+* Add tvm::support::hexdump() debug utility #6154
+* Adding AMD codegen unit tests (#4509)
+* Support cuda tensorcore subbyte int data type in auto tensorcore #4546
+* Handle empty LLVMModule in GetFunction #5146
+* Support int4/int8 conv2d tensor core with HWNC layout #6121
+
+### Dynamism Support
+* Add shape function for `zero`, `zeros_like`, `ones`, `ones_like` (#4448), `tile` (#4441)
+* Support symbolic newshape for Reshape #5429
+* Support symbolic TopK, Ones, Zeros and Full #5459
+* Add `shape_of` instruction #5855
+* Symbolic `max_output_size` #5844
+* Dynamic TopK Op #6008
+* Dynamic `broadcast_to`, `zeros`, `ones` #6007
+* Add dynamic reshape grad #6080
+* Keep fixed dim when unifying dynamic shape #5795
+* OneHot operation #6209
+* Add Dynamic Resize Op #6198
+* Dynamic full operator #6260
+* Dynamic upsampling relay op #6273
+* Dynamic Tile Op #5983
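+
+Dynamic shapes are expressed with `relay.Any()` and executed on the VM. A minimal sketch with a dynamic batch dimension:
+
+```python
+import numpy as np
+import tvm
+from tvm import relay
+
+x = relay.var("x", shape=(relay.Any(), 8), dtype="float32")
+mod = tvm.IRModule.from_expr(relay.nn.relu(x))
+
+# Dynamic graphs run on the VM executor rather than the graph runtime.
+exe = relay.create_executor("vm", mod=mod, ctx=tvm.cpu(), target="llvm")
+out = exe.evaluate()(np.random.rand(4, 8).astype("float32"))
+```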
+
+### Frontend and User Interface
+* TFLite parser support for `transpose_conv` (#4440), `unpack` (#4447)
+* LLDB pretty printers for relay (#4453)
+* ONNX to Relay converter op support: expand op (#4483)
+* ONNX `auto_pad` in conv and convtranspose (#4563)
+* TF to Relay converter op support (#4504) (#4551) (#4484)
+* Remove unnecessary cast of constants in ONNX converter (#4573)
+* Add support for tf.Keras networks in Relay Keras frontend #4630
+* Add conv3d #4604
+* Fix incorrect calculations in tf SLICE #4518
+* Dynamically calculate `input_stats` of any `fake_quant` range #4789
+* LSTM Support #4825
+* Add `MIRROR_PAD` operator #4822
+* Use qnn helper function in softmax #4840
+* Add Resize op converter #4838
+* Add support for `TFLite_Detection_PostProcess` #4543
+* Fix tests for tflite unary elemwise operations #4913
+* GaussianDropout/Noise parsing support #4928
+* Add parser support for 'square' operator #4915
+* `make_loss` operator support #4930
+* Add parser support for `l2_normalization` #4966
+* ReadVariableOp operator support #4952
+* Check graph inputs match expected #4992
+* Support multiple outputs #4980
+* TFLite: Using real image for QNN testing. #4816
+* TFLite: `FLOOR_MOD` & `FLOOR_DIV` support #4971
+* PyTorch: Upsampling op support and enable registering a user defined op conversion map #4961
+* PyTorch: fix unordered dictionary problem for python version under 3.6 #4982
+* Operator support NonZero #5073
+* Add support for quantized models via QNN #4977
+* Add initial control flow support #4964
+* Remove FP32 piggy back and use QNN add/mul/concatenate #5061
+* Add missing upcast to uint8 `avg_pool` conversion #5089
+* Add initial 3D op support and test on Resnet 3D #5075
+* Fix conv2d conversion for group conv (group > 1 but != in channels) #5132
+* Add support for `max_pool1d` #5142
+* Add support for split #5174
+* Activation functions support #4978
+* Round op parsing support added #5022
+* DepthToSpace and SpaceToDepth support #5041
+* `TOP_K` op parser support #5051
+* `reduce_any` op parsing support #4926
+* TensorFlow Parser Control Flow Enhancement #5020
+* TensorFlow Frontend support with shared params #5042
+* Support for AddV2 in Relay Tensorflow frontend converter. #5046
+* Conv3d frontend operator support #5080
+* `max_pool3d` and Averagepool3d operator support #5085
+* Support for Atan/Atan2 in Relay Tensorflow frontend converter. #5104
+* Use leaky by default for LeakyReLU #5192
+* Conv3D ONNX support and `conv3D_ncdhw` x86 schedules #4949
+* Add support for FusedBatchNormV3 #5065
+* Activations for pytorch #5194
+* Dropouts And InstanceNorm support added #5203
+* [Frontend] Asymmetric padding of convolution support (#4803)
+* [ONNX]Pool3d & upsample3d op support (#5135)
+* Add TopK to ONNX Frontend (#5441)
+* Add RoiAlign to Onnx frontend (#5454)
+* [PYTORCH]AvgPool3d, MaxPool3d and Squeeze op support (#5220)
+* [PYTORCH]celu, gelu, selu activations (#5263)
+* [Pytorch]layernorm bug fix and testcase updated (#5257)
+* [PYTORCH]LayerNorm support added (#5249)
+* [PYTORCH]GroupNorm op support added (#5358)
+* [PYTORCH]Logical & Bitwise operator support (#5341)
+* [PYTORCH]Tensor creation ops support (#5347)
+* [PYTORCH]cosh,sinh,log2,log10,log1p op support (#5395)
+* [PYTORCH]Rsub, Embedded, OneHot ops support (#5434)
+* [PYTORCH]Abs, Arange, Softplus ops (#5295)
+* [PYTORCH]isNan, isinf, isfinite, ceil, clamp, round ops (#5316)
+* [PYTORCH]Repeat, Reciprocal & Reshape Op support (#5280)
+* [PYTORCH]`Reduce_ops` support added (#5308)
+* [PYTORCH]Take, Topk op support (#5332)
+* [PYTORCH]Unary Ops frontend support. (#5378)
+* [Torch] Support Python list, more realistic recurrent networks (#5306)
+* [PYTORCH]where, addcdiv, addcmul op support (#5383)
+* [Torch] Fix up graph input handling (#5204)
+* [TFLITE]Logical not op support (#5475)
+* [TFLITE]Hard Swish & MobilnetV3 model testing (#5239)
+* [TFLITE]Gather, StridedSlice op support added (#4788)
+* [TFLITE] Match TFLite shape for SSD custom op (#5473)
+* Factor out import of common tflite.Operator in tflite frontend. (#5355)
+* [TFLite] support for FILL and `SPLIT_V` operators (#5330)
+* [TFLite] `L2_POOL_2D` operator (#5452)
+* [TFLite] Add config option to specify FlatBuffers location (#5425)
+* [TENSORFLOW]reduce ops updated (#5180)
+* [TENSORFLOW] Fix `gather_nd` indices (#5279)
+* [TensorFlow]Improve TensorFlow Static Shape Tensor Array (#5243)
+* [KERAS]Minimum & AlphaDropout op support (#5380)
+* [KERAS]Embedding layer (#5444)
+* [CAFFE2]add Mul and ConvTranspose operator (#5302)
+* [MXNET]DepthToSpace & SpaceToDepth Operator (#5408)
+* [MXNET]broadcast and logical op support (#5461)
+* [MXNET] support elemwise logic ops (#5361)
+* [Frontend|MXNet] SwapAxis operator support (#5246)
+* [RELAY] Move frontend utils (#5345)
+* [Pytorch] Fix translation of transpose when axis argument is as a list (#5451)
+* LpPool Support added #5696
+* Skip ADD inside Gemm op when vector is zero #5697
+* ReduceL1, ReduceL2, ReduceSumSquare, ReduceLogSum ops added #5721
+* MaxRoiPool, Mod & Xor op support added #5729
+* Skip multiply with 1.0f constant for GEMM import #5800
+* StatefulPartitionedCall/PartitionedCall Ops support added #5617
+* Don't add cast for batch norm when type isn't changing #5731
+* Conv3d Transpose OP added #5775
+* Expand bug fix #5576
+* Support `max_pool2d_with_indices` #5549
+* Add prim::device op #5584
+* ImplicitTensorToNum support added #5603
+* Matmul fix for `batch_matmul` #5604
+* ReflectionPad2d op #5624
+* Padding op support #5638
+* Minor bug fixes #5683
+* `floor_divide` support for squeezenet #5702
+* ReplicationPad support added #5708
+* aten::norm support added #5776
+* MaxPool3d and AvgPool3d Ops support added #5614
+* Softmin, trunc op support added #5715
+* conv3d and `conv3d_transpose` added #5814
+* Model importer to be compatible with tflite 2.1.0 #5497
+* Nit: Function names made consistent #5515
+* Select op support for tflite frontend #5486
+* `GATHER_ND` #5508
+* Quantize & Dequantize op #5394
+* Fully connected op conversion made in sync with TFLite #5510
+* `ADD_N` operator #5474
+* ONNX, MXNet, PyTorch math ops added #5561
+* abs, round, reciprocal, sign, softsign, `hard_sigmoid` ops support #5587
+* Gather nd bug fix for one dim support in tensorflow #5588
+* Add parser support for shape and range #5329
+* Darknet support batch size for yolo #5688
+* Improve Control Flow and TensorArray #5699
+* MXNet: Add parser for `contrib.box_decode` #5967
+* Onnx: Fix an issue with #5755 and add Batch norm unit tests. #5845
+* Improve TF Parser to keep output nodes for `saved_model` #5794
+* Add parser support for `relu6`, `leaky_relu`, `relu_n1_to_1`, `log_softmax` #4805
+* Fix TF Dynamic input shape #5825
+* Support a few contrib ops in mxnet #5819
+* Check all unsupported ops before raising an exception #5929
+* Add Pytorch advanced indexing #6318
+* Support `index_select` #6295
+* Fix cast to long #6301
+* Fix dtype handling for modules with integer parameters #6311
+* PyTorch frontend support conv1d #6203
+* Add cast to double, fix flatten conversion #6357
+* Fix aten::max and aten::min conversion #6372
+* Match pytorch 1.6 googlenet pretrained model (#6201) #6212
+* Add unbiased variance op and corresponding support in pytorch frontend #6232
+* Implemented PADV2 Operator for TFLite and added support for constant values in PAD. #6167
+* Implemented `ONE_HOT` Operator for TFLite. #6223
+* Implemented `EXPAND_DIMS` Operator for TFLite. #6243
+* Implemented `REVERSE_V2` Operator for TFLite. #6304
+* Implemented `MATRIX_SET_DIAG` Operator for Relay/TOPI and TFLite Frontend. #6303
+* RESHAPE with dynamic shape arg in TFLite frontend #6208
+* Constant input attr added to fully connected operation in TFLite frontend #6228
+* Gather operation with indices as tensor expr in TFLite frontend #6168
+* Added support for tflite quantized maximum and minimum #6018
+* Unary ops support added in frontend #6196
+* Introduce Caffe frontend for TVM #6206
+* Keras softmax and prelu fix under NHWC #6278
+* Add support for MXNet numpy operators #6054
+* Refine tensorflow frontend 1.x & 2.x compatibility #6240
+* Reduceops support added to frontend #6252
+* Update precision in the ONNX `strided_slice`, update precision of ToScalar #6272
+* NHWC import support. #4899
+* Fix node indices attribute error for tensorflow 2.3 #6288
+* Support NMSv4 #6085
+* Support for PyTorch Non-Maximum Suppression #6314
+* MXNet pre-quantized BERT #6039
+* Keep parameter names from PyTorch #5887
+* Refine LSTMBlockCell to support dynamic rnn #5963
+
+### Relay
+* Add function attributes to IR hash (#4479)
+* Relay passes lookup overhead optimization (#4594)
+* Add `half_pixel` option to Resize op #4610
+* Skip example json runtime test when config is not set #4614
+* Test `tensor_array` in vm #4608
+* Improve `memory_allocation` pass to support multiple i/o dynamic kernels #4595
+* Add unit test for `tensor_array_split` #4619
+* Add parser support for unary elemwise ops #4634
+* Add parser support for SLICE #4502
+* Added pool autopadding and simplified converters. #4672
+* Fix meaning of `conv2d_transpose` `output_padding` parameter #4318
+* Use packed func macro for external codegen #4710
+* Fix `_parse_param` bug #4711
+* Add constant input support for elemwise ops #4666
+* Add parser support for squared difference #4652
+* Add type check to dense #4724
+* Invoke tvm::build from relay `compile_engine` and interpreter #4723
+* Broadcast condition, x, and y for Where op #4774
+* Add parser support for relational ops #4695
+* Remove duplicated BindParamByName function in VM compiler #4793
+* Use SimplifyInference for L2 Normalization. #4795
+* Expose vm OptimizeModule to Python #4800
+* Add parser support for logical operators #4642
+* Conv2D padding representation #4787
+* Add support for quantized LOGISTIC #4696
+* Fix VM compiler for while loop with free vars #4889
+* Fix bug in re-processing call node in MergeComposite pass #4879
+* Expose FunctionGetAttr to Python #4905
+* Add a PyTorch to Relay Parser #4497
+* Support data types for CSourceModuleCodegen args and output #4934
+* Clean up and refactor PyTorch frontend #4944
+* Relay pass to use fast exp/tanh #4873
+* BatchNorm support with run-time mean and variance calculation #4990
+* Reduce plevel of conv2d winograd implementation on cuda #4987
+* Add operation tan to TVM #4938
+* Outline and inline lifted functions for external codegen #4996
+* Remove primitive attribute from composite function #5014
+* Refactor Relay Python to use new FFI #5077
+* Fix relay node registration after refactor #5083
+* `Codegen_c.h` should include relay.function #5093
+* Move expr.Function to function.py #5087
+* Propagate constant to subgraphs #5094
+* Adjust strategy plevel to achieve expected performance by default #5118
+* Added an AnnotatedRegion utility class #5030
+* Support TupleGetItem in body of pattern #5106
+* Partition graph codestyle fixes #5202
+* Re-wrote the Graph Partitioner to support multiple outputs #5143
+* Fixes to MergeCompilerRegions #5195
+* Refactor build module to take IRModule #4988
+* Separate analysis and transform passes #5035
+* Relay Node::make to constructor #5128
+* relay::StructuralHash to tvm::StructuralHash #5166
+* Conditions updated to cover better user scenarios #5043
+* Replace UseDefaultCompiler with GetAttr #5088
+* Return empty CSourceModule when no `lowered_funcs` exists in Relay mod #4847
+* Clean up for memory pass to enable heterogeneous execution support. (#5324)
+* Remove re-exports of tvm.transform (#5337)
+* [Refactor] Add memoized expr translator for use by backend codegen (#5325)
+* Legalize - Use Non-recursive Rewriter. (#5296)
+* Add additional check before re-using the cached match #5552
+* Remove kCompiler attr from external functions #5615
+* Pattern Language MergeComposite #5656
+* Support Tuple Output in C/DNNL Codegen #5701
+* Infer types in MergeComposite #5766
+* Convert PatternGrouper to do pre-order, non-recursive analysis #5653
+* Remove constants from partitioned functions #5663
+* Add a check for null function attributes #5674
+* Add ConstantPattern #5689
+* Conditionally Embedding Constants in Partitioned Functions #5693
+* Simplify Pattern API Implementations #5703
+* Add ShapePattern and DataTypePattern #5760
+* Remove unnecessary print #5642
+* Improve Shape Func handling for Tuple inputs #5467
+* Relay updated with String #5578
+* Fix the creation of tuple of tuples in PartitionGraph #5616
+* Preserve type information in Merge Composite #5640
+* Move `compiler_begin`/`end_op` to local static objects #5622
+* Fix `dataflow_pattern`.rewrite() hang if Match in IR #5680
+* Fix segfault in pretty print when ObjectRef is null #5681
+* Move `fallback_device` to config #5690
+* Replace `build_config` with PassContext #5698
+* Clear compile engine after task extraction #5724
+* Add `storage_order` ignore in pooling layer. #5781
+* Tweak cublas/cudnn priority level #5820
+* Skip Unknown Function Symbols #5888
+* Allow every runtime module to handle constants #5885
+* Handle Tuple/TupleGetItem in first order gradient #5946
+* Add resnet-3d & Update network definitions for NHWC layout #5945
+* Use TargetNode::attrs for Target serialization #5993
+* Each option of target str should only contain one '=' #5988
+* Small bug fix for Conv1D imports. #5995
+* Move `invoke_tvm_op` and `shape_func` to vm dialect #5958
+* GRU Layer Support #6020
+* Add pass for getting calibration data from a relay module #5997
+* Merge two consecutive reshape ops #6052
+* Add operation `scatter_add` to relay, based on scatter implementation. #6030
+* i64 indices #5235
+* Port `eliminate_common_subexpr` to non-recursive form #6134
+* Fix interpreter for dynamic shape input of `ndarray_size` #6086
+* Allow to config allocator type and refactor vm code structure #6105
+* Handle `ndarray_size` in FoldConstant #6156
+* Fix conversion of constant nodes with types of int64 or float64 #6159
+* Add ReshapeTensor instruction in the VM to replace the reshape op #6089
+* Support combine multiple dense op just into dense #6062
+* Specify additional layouts in convert layout pass #5422
+* Safe check added for Merge Composite Call Node #5562
+* Non recursive partitioning #5493
+* Make the max number of fused ops configurable #6327
+* Implementation of the dynamic pad operator #6284
+* Change device annotation from post DFS to recursive #6124
+* Make check stricter: disallow inserting function with free vars into module #6313
+* Make check stricter by using Feature. Fixed multiple bugs #6326
+* Resize support for NCHW-convertible layouts #6293
+* Make AutoDiff thread through global function #6336
+* Create Interpreter for each constant subgraph #6195
+* Add Dynamic reshape to a dynamic namespace and add DynamicToStatic Pass #5826
+* Expose relay BindParamsByName to Python #4751
+* Implement pass manager tracing API #4782
+* Move Ops in relay.op.contrib #4942
+* Conditions updated to cover better user scenarios #4951
+* [External codegen] Add test cases for fused ops with manual annotation (#4741)
+* Multiple output support, reshape, split ops added #6296
+
+### Operator Coverage
+* Allow empty tensor for `reshape`, `tile` and `strided_slice` #4618
+* Fix meaning of `conv2d_transpose` `output_padding` parameter #4708
+* Remove cpp upsampling and resize op #4769
+* Upsample operator 'NCHWinic' format support. #4791
+* Injective schedule improvement #4786
+* Enable vectorization on fp16 type #4867
+* Support for Int8 schedules - CUDA/x86 #5031
+* New PR to re-add tan to TVM #5025
+* Register topi schedule for Relay `fast_exp` and `fast_tanh` #5131
+* Move Dilation2d from nn to image namespace #5110
+* Use Thrust sort for argsort and topk #5097
+* Conv2d and Dense ops support on Tensor Core #5099
+* Setting workload correctly for Depthwise Spatial conv ARM. #5182
+* Adding a few missing math intrin #5011
+* Missing vectorize for depthwise conv2d. #5196
+* [TOPI] Using x86 schedules for ARM conv2d (#5334)
+* [TOPI-ARM] Do not alter layout if layout is NHWC (#5350)
+* [OP] Add `fast_erf` implementation (#5241)
+* [Topi] Tensorcore support for Conv3D (#5284)
+* [intrin] a few more math functions (#5468)
+* [Intrinsic] Add log1p, ldexp, atan2, hypot, nextafter, copysign (#5312)
+* [topi] Add operation relay.nn.dilate() which calls topi.nn.dilate() (#5331)
+* [TOPI x86] Adding `unroll_kw` config option for depthwise conv2d. (#5197)
+* [Topi] Breakdown topi.cc into smaller files (#5253)
+* ReduceLogSumExp Operator support #5453
+* Math ops added #5502
+* Enable blocking format in x86 conv2d and fold scale axis #5357
+* Add operation gather to relay. #5716
+* Fix bifrost spatial packing conv2d auto tune #5684
+* Fix reshape usage in ARM schedule #5732
+* Block sparse dense on cuda #5746
+* Improve CUDA softmax scheduling #5600
+* pass-by-value -> pass-by-const-reference #5783
+* Using MKL blas for quantized dense #6115
+* topi -> tvm/topi #6186
+* Use auto-tuner to improve `conv2d_gemm` performance #6117
+* Improve CUDA `conv2d_transpose_nchw` #4762
+* Add CUDA conv2d for NHWC layout #4737
+* `conv3d_ndhwc` schedule #4775
+* Fast exponent #4790
+* Add Scatter to Topi/Relay/ONNX via hybrid script #5619
+* Split MKL from BLAS. #6182
+* Change the meaning of `conv3d_transpose` `output_padding` to match `conv{1,2}d_transpose` #6065
+* Gather op support added #6013
+
+### Runtime and Backend
+* Cythonize NDArray.copyto (#4549)
+* Unified Object System runtime refactor (#4578, #4581, #4603)
+* VM profiler: sort VM stats by time (#4601)
+* Update RPC runtime to allow remote module as arg (#4462)
+* Refactoring system lib and dso lib into library module (#4481)
+* Improve TSIM virtual memory mapping (#4545)
+* Make ADT tag signed #4605
+* Improve TVMBackendPackedCFunc to allow return val #4637
+* EdgeTPU runtime for Coral Boards #4698
+* Fix memory leak when using openMP #4811
+* Fix memory leakage of TVMByteArray #4856
+* Fix `TVM_DLL_EXPORT_TYPED_FUNC` to work on Windows #4955
+* Export GraphRuntime in `tvm_runtime.dll` #5002
+* Update the `type_keys` to reflect the code-org #5074
+* Fix AttrEqual for Array and StrMap, double #5054
+* Fix unused-value warning #5140
+* CRT error handling #5147
+* Bundle deployment with static linking #5158
+* Implemented kDLCPUPinned (cudaMallocHost) #4985
+* Explicitly cast min/max operands #5090
+* `ref_counter` -> `ref_counter_` #5184
+* Expose runtime::String to Python (#5212)
+* [FFI] Refactor runtime.String to subclass str (#5426)
+* [RUNTIME] Auto conversion from str to runtime::String in PackedFunc (#5251)
+* [RUNTIME] Improved Packed FFI for optional. (#5478)
+* [Hexagon] Add `hexagon_posix.cc` to TVM/RT sources in the right place (#5346)
+* Fix workspace #5503
+* Store nullptr PackedFunc as nullptr for better error propagation #5540
+* Improve PackedFunc robustness #5517
+* Seg fault in WorkspacePool's destructor (#5632) #5636
+* Resolve constexpr issue in debug mode. #5651
+* Add `compile_shared` option to linux compile utility fn #5751
+* Call sync in CopyFromRemote and CopyToRemote #5512
+* Fix the multihop cpu case #5522
+* Improve RPCServer AsyncIO support. #5544
+* Modularize the RPC infra #5484
+* Overload string operators #5806
+* Only initialize required module #5926
+* If a param is not in input, we should still consume its data #5990
+* Init TVMPackedFunc's name #6044
+* Enable auto conversion `String->DLDataType` #6214
+* Support random fill #5913
+* Use new to avoid exit-time de-allocation order #6292
+* Add `parallel_for` support to run a loop in parallel #6275
+* Solve ARM BIG.LITTLE heterogeneous multicores #4747
+* [RUNTIME] Quick fix PackedFunc String passing (#5266)
+* Introduce runtime::String::CanConvertFrom #5718
+* Restore the StrMap behavior in JSON/SHash/SEqual #5719
+* Support overriding RPCWatchdog termination behavior on Android and other platforms #6216
+* Set `NDArray::Container.shape_` in NDArray::FromDLPack (#5301)
+* Enable x86 cpu cache flush #5914
+
+### Quantization
+* Conv2D type checking for kernel per-channel scales. #4732
+* Add missing nullptr check #4773
+* Doc fix on convolution and dequantize #4799
+* Conv2D with dilation support. #4796
+* Making `scale`/`zero_points` as expr instead of attrs. #4611
+* Make calibration faster and more memory usage friendly #4589
+* Optimize lowering for requantize and FixedPointMultiply. #4798
+* More doc fix on quantize and convolution #4874
+* Add support for per channel weight scale in dense op #4880
+* Add support for quantized models via QNN #4977 #5013
+* Support 4D padding. #5036
+* [Requantize] Cleanup and Optimize Lowering (#5286)
+* [Topi, ARM] Disable Winograd for quantized tensors. (#5363)
+* Adding support for TFLite QnnSubtract operator. (#5230)
+* Remove developer facing api from frontend exports. (#5375)
+* Add Quantize/Dequantize Partitioning #5940
+* Add support for quantized models via QNN #5016
+* Quantize operation expanded to take const argument #6127
+* FP32 and Quantized Object Detection Model #5479
+* Support CallNode inputs in qnn.concatenate #5360
+* QNN support for TFLite 2.1.0 quantized models #5848
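+
+A sketch of the post-training automatic quantization flow these changes improve, assuming a float32 Relay module `mod` with weights `params`:
+
+```python
+from tvm import relay
+
+# Quantize under a chosen calibration configuration.
+with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
+    qmod = relay.quantize.quantize(mod, params)
+```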
+
+### TE
+* Tighten split's extent #4931
+* Set split node's range to minimum of ext and split factor or split np… #5044
+* Support mixing normal and cross-thread reduction (#5193)
+* Inline -> `te/schedule/operation_inline.h` (#5386)
+* Create loops according to storage scope and thread hierarchies (#5190)
+* Fix import in dump pass ir (#5327)
+* Scalar support for te.extern #6079
+
+### TIR
+* IR readability enhancement (#4501)
+* Introduce tir::PrimFunc #5070
+* Introduce PrimFuncPass. #5139
+* [TIR] Enhance Substitute, python bindings for Substitute/PostOrderVisit (#5400)
+* [TIR] Remove ProducerConsumer and `AllocateNode::new_expr` (#5333)
+* [TRANSFORM] Enable CopyOnWrite for TIR passes. (#5309)
+* [REFACTOR] Migrate LowerTVMBuiltin, InferFragment, LowerThreadAllreduce, ThreadSync to Pass Manager (#5213)
+* [REFACTOR] Remove te::Tensor dependencies from TIR passes. (#5372)
+* [TIR] Refactor MakePackedAPI to target dependent stage. (#5326)
+* [REFACTOR] tvm.hybrid -> te.hybrid (#5223)
+* [REFACTOR] Migrate most of low-level build to use the Pass Manager. (#5225)
+* [REFACTOR] Migrate low-level passes in tvm.lower to the Pass Manager (#5364)
+* [TIR] Migrate VTA TIR passes to the new pass manager. (#5397)
+* [REFACTOR] Migrate all low-level passes to the Pass Manager. (#5233)
+* [REFACTOR] RewriteForTensorCore -> te/schedule (#5379)
+* [REFACTOR] Remove `ir_pass` in favor of analysis/transform. (#5415)
+* Text format printer considering future parsing use #5483
+* Remove buffer params from pass config. #5652
+* std::string -> String Migration in TIR nodes #5596
+* Remove `CallNode.call_type` in favor of attribute. #5937
+* Remove legacy HoistIfThenElse #5944
+* Improve Let/LetStmt support. #5949
+* Refine side effect analysis. #5954
+* `Provide->ProducerStore`, `Realize->ProducerRealize`. #5750
+* Migrate the tvm/tir/expr.h to constructor #5773
+* Migrate tir/stmt.h to use constructor. #5778
+* Cleanup unused classes #5789
+* Add tir prefix to type keys #5802
+* Enforce buffer pointer var type to be consistent with dtype. #6317
+* Create a StringImm reference type #4806
+* Add init member to ReduceNode #6138
+* Add dump and print for debugging (NFC) #5207
+* Streamline Function Attr interface. #5045
+* `alpha_equal` to `structural_equal` #5161
+* Remove AttrsEqual and AttrsHash related code #5169
+* [NODE] General serialization of leaf objects into bytes. (#5299)
+* [POC] Initial stab at `std::string->String` upgrade (#5438)
+* [TIR] Make `lower_warp_memory` support `extent(threadIdx.x) < warp_size` (#5307)
+* [PASS] dtype rewrite for indexing variables (#5092)
+* [PYTHON] Enhance `with_attr` API, cleanup MakeAPILegacy in testcases (#5335)
+* [PYTHON] Make IntImm more like an integer (#5232)
+* [IR] Move to runtime::String (#5276)
+* [IR] kExternalSymbol -> kGlobalSymbol (#5211)
+* [IR] Remove PrimExpr from String (#5311)
+* IRModule is updated with String #5523
+* IR is updated with String #5547
+* Streamline ir/op Registry #5609
+* Migrate IRModule ObjectRef to not-null #5654
+* Migrate BuildConfig to PassContext. #5668
+* relay.op.Op -> tvm.ir.Op #5705
+* Separate ArgTypeCode from DLDataTypeCode #5730
+* Remove legacy `compute_expr.h` #5738
+* Call::Halide => ProducerLoad, DSL/TIR decouple. #5743
+* Migrate all Object construction to constructor. #5784
+* Finish `std::string->String` updates #5793
+* Change Call.name to Call.op(RelayExpr) #5863
+* Range/IntSet API style consistency. #5953
+* Unify StrMapNode and MapNode #5687
+
+### Performance Improvements
+* Int8 GEMM performance enhancement using Cublas (#4550)
+* Speedup TSIM with multi-threading (#4491)
+* Support cudnn softmax (#5214)
+* Add cuDNN grouped convolution support (#5319)
+* Winograd support for Conv3D (#5186)
+* Improve `get_valid_count` and nms performance for CUDA (#5339)
+* Optimizations of `global_ave_pool` for NHWC layout (#5450)
+* Optimization of Conv2d Winograd algorithm on Tensor Core #5485
+* Some performance improvement to VM #5901
+* Optimize x86 `conv3d_ndhwc` using data packing approach. #4866
+* Improve NHWC depthwise convolution for AArch64 #6095
+* Improve quantized convolution performance for armv8 architectures #5754
+
+### Documentation
+* Adding benchmark log format doc (#4366)
+* Add Ninja build system to installation docs (#4554)
+* Doc/comment fixes (#4452, #4463, #4469, #4493, #4397, #4580, #4585, #4591)
+* Fix doc after moving to unified IR #4835
+* Introduction to module serialization #4564
+* ConvertLayout - Call RemoveUnusedFunctions. #4834
+* Fix bugs that override `n_trials` #4842
+* Update the vm doc #4868
+* Refine the example description of `max/min/sum/tag_scope` #4974
+* Fix vta tutorial #4809
+* Introduce how to add hardware backend to FAQ #4898
+* Update API docs to reflect the status after the refactor. #4907
+* Fix sphinx warnings #4917
+* Fix Sphinx Warnings (RST indent, cross-ref, and image scale) #4920
+* Fix Sphinx Warning: the target found for cross-reference #4925
+* Sphinx -- Introduce alias detection. #4954
+* Fix Warnings from #4942 #4959
+* Fix sphinx precheck #4967
+* Move `git_howto` to rst, add Stage documents to te #5055
+* Add doc for Relay op strategy #5078
+* Update relay docs #5112
+* Include a tarball of docs, add a security faq #5119
+* Cleanup docs before rebuild #5127
+* Minimize necessary doc change #5129
+* Various sphinx related fix. #5168
+* Point docs to the ASF site. #5178
+* Use https link #5183
+* Reduce artifacts generated by sphinx gallery #5208
+* Description updated for pooling attributes #5091
+* [DOCS] Migrate some markdowns to rst, fix sphinx3 warnings (#5416)
+* [DOCS] Misc docs improvements (#5222)
+* [DOCS] Bring relay docs to the top-level flat view (#5343)
+* [DOCSTRING]missing function parameters updated (#5228)
+* [DOCS] Migrate HLS documents from md to rst (#5419)
+* [Tutorial, QNN] Add tutorial for loading quantized PyTorch model (#5321)
+* [Docs] VTA install doc migration from md to rst (#5442)
+* [Docs] compiler version in docs (#5281)
+* `TVM_REGISTER_API` -> `TVM_REGISTER_GLOBAL` #4768
+
+### Bug Fixes
+* Add bfloat16 typeflag support (#4525)
+* MSVC / Windows fixes (#4455, #4569)
+* Fix Makefile for `howto_deploy` (#4457)
+* Fix GCC 4.8 compat (#4461)
+* Fix search path to build `libtvm_topi.so` (#4467)
+* Fix for `conv2d_transpose` CUDA compilation (#4472)
+* Fix for LLVM 10.0 codegen (#4480, #4515)
+* Fix alter op layout when calling global var (#4454)
+* Fix `float2half_rn` support for cuda compute capabilities < 53 (#4489)
+* Fix compile errors for OpenCL backends (#4492)
+* Fix serialization precision loss (#4503)
+* Fix hybrid script to support array of tensors (#4494)
+* Fix annotation for multiply op (#4458)
+* Fix Dockerfile for linter CI (#4506)
+* Fix TF resize for dynamic size models (#4510)
+* Fix `bias_add` gradient (#4516)
+* Fix tanH unit test function call (#4517)
+* Fix extra reshape parameter for ONNX (#4524)
+* Fix crash caused by empty TOPI config (#4520)
+* Fix ONNX shape op type to use int64 (#4528)
+* Fix crash in TSIM virtual memory driver (#4527)
+* Replace deprecated python library in setup script (#4533)
+* Fix NMS `max_output_size` loop (#4541)
+* Fix style in IR mutator and IR visitor (#4561)
+* Fix compiler warning (#4559)
+* Fix to get end to end inference on Chisel VTA (#4574)
+* Fix LLVM build by adding missing intrinsics headers (#4575)
+* Fix context creation in quantization (#4582)
+* Fix NDArray SaveDLTensor signature (#4586)
+* Fix dense pack schedule for x86 (#4539)
+* Fix for broadcast tensor of scalar type (#4577)
+* Datatype refactor (#4513, #4560)
+* Add const qualifiers for NDArray container (#4590)
+* Fix TF <= 1.12 compatibility (#4593)
+* Fix for graph debug runtime (#4598)
+* Disable copy constructor for external codegen (#4597)
+* Added declaration of aluBits for TensorAlu #4624
+* Get around limitation of g++-4.8 #4626
+* Bugfix StmtMutator IfThenElse #4609
+* Remove unnecessary rdynamic #4613
+* Resolve constexpr related link error in debug mode #4641
+* Asymmetric padding #4511
+* Reduce data size of asymmetric padding testcase #4658
+* Fix Base64OutStream portability issue #4668
+* Fix `topi.nn.global_pool` layout="NHWC" #4656
+* Also package core.rly #4679
+* fskip of EliminateCommonSubexpr cannot always return false #4620
+* Fix Python syntax error in `start_rpc_server_to_tracker.py` #4682
+* os.path --> osp to match the import #4681
+* GitHub actions/checkout@v1 --> v2 #4680
+* Fix Python syntax error AGAIN in `start_rpc_server_to_tracker.py` #4685
+* Use ==/!= to compare str, bytes, and int literals #4686
+* Rename `start_rpc_server_to_tracker.py` to `start_rpc_server_to_tracker.sh` #4689
+* GitHub Action lint Python code for syntax errors #4688
+* Generate blob use LLVM directly #4657
+* Reduce input size to fix OOM #4653
+* Fix RemoveUnusedFunctions pass #4700
+* Link the math library by default #4713
+* Update mainline version to 0.7.dev0 #4720
+* Add SizeVar representing non-neg valued variable in a tensor shape #4684
+* Fix the compile problem of `cpp_rpc` #4725
+* JSON upgrader to upgrade serialized json. #4730
+* Fallback schedule for Int8 depthwise. #4733
+* Fix dense x86 schedule #4728
+* Fix demo dockerfile build failure #4744
+* Improve CUDA vectorizer #4736
+* Add .asf.yaml for github info #4761
+* Fix padding in pooling op #4738
+* Remove `run_infer_type` duplicates #4766
+* pooling.cc improvements #4767
+* Export `builtin_fp16` on Windows #4731
+* Fix Tensorflow conv3d pad bug, add non-cubic data and kernel tests #4772
+* Bump prebuilt-image version in demo dockerfile #4770
+* Update `tune_simple_template.py` #4778
+* Explicitly link to cublasLt if it exists #4776
+* Fix hasattr by extracting Python error type from Windows error message #4780
+* Replace os.path.exists with try...except...else #4784
+* Make sure to visit the arguments of inlined functions #4783
+* Parse additional exception strings #4785
+* Fix #4670: add bias for fc layer #4801
+* Change color channel from BGR to RGB for darknet preprocessing #4794
+* Fix -Wextra #4804
+* Fixed subprocess creation under windows #4820
+* Improve tol to resolve flaky case #4836
+* Fixed process termination routine in windows #4844
+* `test_cuddn` flaky #4846
+* MXNet parser for QNN dialect #4714
+* Enhance `cc.cross_compiler` #4817
+* Fixed crash caused by reversing bitwise operations #4852
+* Reverse some changes made for `intel_graphics/conv2d.py` in PR #4849 #4853
+* const auto p -> const auto& p #4861
+* Fix onnx import bugs #4750
+* Explicit llvm::StringRef to std::string conversion #4859
+* Update the runtime PackedFunc for module #4871
+* Improve antlr import error message #4888
+* Fix `alpha_equal` bug for attribute check #4897
+* Fix issues in cuda codegen #4876
+* Fixed: Bitwise ops on floats causing wrong code generation and crashes. #4892
+* Fix `tvm.target.generic_func` runtime detection #4910
+* `topi/tests/python/test_topi_sort.py::test_argsort` #4891
+* Use opencv resize method for preprocessing of image in darknet #4883
+* Fix build breaks with StringRef changes #4923
+* Remove unnecessary splitting in the cached chunk #4935
+* Fixing an Infinite Loop case in UnmatchedChecker. #4881
+* Remove SGX toolchain installation from CI Dockerfile #4948
+* Fix tedd tutorial after strategy change #4947
+* Allow customize MKLDNN library location #4814
+* Added CopyFromBytes and CopyToBytes convenience methods to NDArray. Fixed typos. #4970
+* Fix gcn tutorial failure #4994
+* Fix stride default value None in torch.nn.functional.avg_pool #4984
+* Fix ROCm strategy for winograd conv selection #5001
+* Fix `get_valid_count` flaky test for cuda #4901
+* Change Scala Linter scalafmt => scalastyle #4998
+* Kill from tvm import te #5007
+* Chisel fixes and de10nano support #4986
+* Fix gpu not found when running TVM docker #4975
+* Fixes for pylint==2.4.4 #4849
+* Early checking added and new test cases added for schedule fuse #5010
+* Fixed div by zero core dump. Fixed rounding intrinsics on int crash #5026
+* Test case modified for int type #5012
+* Bug Fix for ARM CPUs. Lower strict assumption. #5063
+* Triage the testcases to fit the new namespaces #5071
+* Add colors to `compute_at` edges and thread/block indices. #5111
+* Temporary fix to the stack overflow issue in autotvm task extraction #5019
+* Fix compilation of If-Elses #5040
+* Fix CompilerAttrs #5109
+* Fix the existing test cases before refactoring. #5122
+* Fixed bug where shifting by out-of-bounds value results in no compute code being emitted. #5115
+* Fix for issue #4831. The `data_min_idx` and `data_max_idx` were flipped. #5136
+* Duplicate likely nodes added when loop axis split unevenly #5084
+* Fix incorrect name of calibration mode #5150
+* Remove contrib spatial pack schedule of depthwise convolution #5148
+* Fix annotate pass static variable #5023
+* Fixed ConvTranspose2D parsing #5157
+* Nullptr check #5176
+* ROCm: fix MIOpen convolutions #5179
+* ROCm: fix `dense_rocblas` in strategy, topi #5191
+* Fix CRT static test bug (#5293)
+* Fix perf regression of tir refactor (#5258)
+* Bugfix in tensorflow `space_to_batch_nd` (#5175)
+* Compilation warnings fixed for 32bit and 64bit compilation (#5349)
+* Fix hang in MergeCompilerRegions (#5227)
+* Fix generation of LLVM intrinsics (#5282)
+* Fix setting up hints for getaddrinfo (#2872)
+* Add ConstantNode to IsAtomic (#5457)
+* Fix String SEqual (#5275)
+* Fix fuse over functions that are handled by external codegen (#5365)
+* Fix memory leak when accessing NDArray (#5413)
+* Remove the duplicate PrintIR pass in Relay (#5403)
+* Fix `lower_warp_memory` (#5247)
+* Fix `lower_warp_memory` when there are >1 warp buffers (#5368)
+* Fix intel conv2d auto tune (#5200)
+* Fix FuseBatchNorm output cast error if `need_cast` is True #4894
+* Fix an assertion exposed by loop vectorizer #4916
+* Fix error message #4945
+* Fix for recursive let #5757
+* Fix Calibration Pass to Support Modules with Multiple Functions #5768
+* Fix what looks like bizarre copy-paste issue #6010
+* Fix bug in `transpose_shape_func` #6180
+* Fix bugs in CUDA codegen (#5209)
+* Don't remove() TemporaryFile in `__del__`. (#5414)
+* Fix `test_ir_type`. (#5390)
+* Fix multiple identical inputs bug (#5389)
+* Add cuda target check to dense tensorcore schedule. (#5376)
+* T2 test fixups (#5391)
+* Fix miopen padding (#5433)
+* Misc fixes for ROCm (#5431)
+* Fix copy constructor (#5237)
+* Corrected TVM autotuning on GPU (#5432)
+* Fix vector load (#5226)
+* Minor bugfix in `message_passing.cc` (#5254)
+* Fix a bug when vectorized load&store was involved for… (#5428)
+* Fix to skip node not in graph. (#5238)
+* Fix #5388 [VULKAN] vkBuffer released before memory copy command se… (#5418)
+* Fix a minor error in `device_annotation` (#5291)
+* Fix scalar's ndim is 0 (#5344)
+* Fix the runtime raise error #5586
+* Fixed bug in attribute parsing for pool layers. #5582
+* AutoTVM incorrect measurement #5511
+* Fix a min/max simplify bug #5761
+* Rename `tvm_dso_op` to `libtvm_dso_op` #5714
+* Fix generating types like float44 and float88 #5722
+* Avoid downloading when `TOPHUB_LOCATION` is NONE #5720
+* Codegen llvm: move nvptx-specific intrinsic handling into `codegen_nvptx` #5726
+* ROCm warp shuffles and reductions #5727
+* Fix small bug about `dense_grad` #5695
+* Clarify downstream consistency of TVMArgTypeCode #5742
+* Fix gelu in PyTorch frontend, tighten numerical checks #5763
+* Make batch matrix multiplication on GPU tunable #5752
+* Update vulkan build rule #5777
+* Edit onnx parser to infer values in post order #5755
+* Support symbolic inputs of Fill #5762
+* Support `aten::type_as` in the pytorch frontend #5787
+* Temporarily disable fp16 `type_as` test for PyTorch Frontend #5799
+* Add config switch for nn.dense layer type. #5801
+* Move cpu-only frontend tests to a CPU stage #5807
+* Pin hand landmark network to version 0.7.4. #5813
+* Limit number of threads in all jobs #5815
+* Error msg update #5818
+* Fix relay.build to not change the module argument in place #5822
+* Fix InferType when module contains Prelude #5797
+* Add a combine `batch_matmul` pass #5791
+* RepeatVector, Conv3DTranspose op support added #5833
+* Fix converting serialized quantized models #5839
+* ffi (Object): make class dict visible in instances #5843
+* Additional canonicalization added for AddNode #5846
+* Suppress the warning messages when compile engine selects impls #5821
+* Fix #5849 #5851
+* Introduce POD-C Compliant tvm::Map #5740
+* Add bfloat16 #5601
+* Add Python Classes for all Attrs #5853
+* Fix map assign issue in CI test #5854
+* Introduce Target Id Registry #5838
+* Update `has_dtype/has_shape` to pattern lang doc #5847
+* Add `nn.batch_flatten` as quantizable. #5805
+* Fail early before running invalid dynamic graphs #5856
+* Improve type handling in PyTorch frontend #5834
+* HotFix the python intrin rule #5895
+* Add a few gradients #5899
+* Add Binary Intrinsic ops to TIR Ops in C++ #5900
+* Allow implicit conversion in TVM FFI to tvm::Bool #5907
+* PyTorch frontend: fix handling of duplicate use of a model weight #5897
+* Don't multiply by constant 1 uselessly in dense #5911
+* Support any index matching for TupleGetItem #5909
+* Add MicroTVM tutorial using the STM32F746 discovery board #5655
+* Fix serialization of inf float value #5912
+* Fix CPU Thread Binding for Multiple Sockets #5918
+* CUDA device API & VerifyGPUCode pass update #5898
+* Update install.rst #5858
+* Two small fixes to AMDCPU codegen for LLVM 10+ and ROCm 3.5+ #5920
+* Add LegalizeInvalidAttach to legalize the `compute_at` location after split or fuse #591
+* Don't rewrite expressions used outside of the pattern #5930
+* Add TupleGetItem to CSE #5931
+* Various updates for CoreML codegen #5934
+* Update date in the NOTICE #5943
+* Raise right error in tensorflow split op #5951
+* Add rm xla attributes in tf docs #5950
+* Fix OpenCL `get_valid_counts` errors due to intrinsic `atomic_add` #5857
+* Amendments for gradients #5941
+* Fix the meaning of `conv{1,2}d_transpose` `output_padding` parameter. #5758
+* Fix cross thread reduction #5551
+* Fix TVMArray layout on device #5599
+* Add debug mode to tempdir() #5581
+* Represent alignment information in LLVM IR #5598
+* Fix codegen for warp shuffle intrinsics #5606
+* Fix Topological Order calculation for DFPattern Language #5612
+* Global MaxPool3d and AvgPool3d support #5098
+* Fix build error of iOS RPC #5621
+* isn't a CallNode sometimes #5623
+* Introduce config to PassContext. #5631
+* CMAKE fix #5630
+* Label Pattern Partitions #5627
+* Extend AttrPattern to support CallNode and FunctionNode attributes #5637
+* Increase bss section size. #5660
+* Add buffer name when creating tensor bindings #5670
+* µTVM debug improvements #5648
+* Enable `amd_apu` device on vulkan target #5659
+* Support TupleWrapper as direct ancestor of control flow ops #5639
+* Add tvm.micro pydoc to sphinx #5661
+* Add a regression testcase for #5674 #5677
+* Fix C++ RPC build problem on Linux #5671
+* Add a check Callback to the Pattern Partitioner #5646
+* Call previous excepthook in `tvm_excepthook`. #5675
+* Fix the shift column for `scale_shift_nchw` and `scale_shift_nhwc` in C topi #5679
+* Support more dtypes for TVMDSOOp #5694
+* In `memory_plan`, check if value is not None, instead of just checking value as boolean. #5700
+* Fix flaky `test_topi_pooling.py:test_adaptive_pool` #5736
+* Fix the values for `test_fmod` since it fails way too often otherwise #5723
+* Add Scatter to Topi/Relay/ONNX via hybrid script #5619
+* Clean WASM environment before build #5759
+* Fix #5686: remove an overstrict assert in MakeAllreduce (#5686) #5785
+* Improve Pattern Language Docs #5676
+* Add missing expr visitor for any #6082
+* Remove the tvm web from version update #6122
+* Clear relay cache after every build & Clear warning message cache after autotvm task extraction #6131
+* Avoid unexpected throw in AttrInitEntry #6128
+* Verify that tensor reshape is valid. #6215
+* Use LocalRunner by default in the tutorial tune_relay_cuda.py #6001
+* Undefined names: import os for line 324 & import re for line 308 #6003
+* GitHub Actions upgrade to actions/setup-python@v2 #6002
+* Only pass pythonpath for ci images #6005
+* Auto-convert shuffle with single index to “extract element” #6006
+* Cache object refs in loop partitioner instead of object pointers #6004
+* Fix `test_arith_solve_linear_inequality.py::test_multi_equal` #6014
+* MXNet frontend support for AMP cast op #5976
+* Demo showing how to run a pruned model. #5975
+* Move compiler related registry items to `vta/build_module.py` #6012
+* Pin keras version #6032
+* Fix `arm_cpu/conv2d_alter_op` for NHWC quantized #6027
+* Add creation of Hexagon device in RPC client #6035
+* Terminate basic block after “ret” instruction #6036
+* µTVM CRT modifications for on-device RPC server #5921
+* Create TBAA information based on the underlying buffer type #6046
+* Add support for tflite `arg_min` and `arg_max` #5992
+* Fix `fully_connected` converter when batch size is not 1 #6038
+* Fix a primitive check error #5991
+* Refactor to expose MakeOp functions to C++ #6047
+* Fix `conv2d_gemm` after target structure update #6037
+* Remove use of designated initializers from `hexagon_module.cc` #6055
+* Build crttest and cpptest separately. #6057
+* Fix pytorch frontend prim::Constant issue #6051
+* Update frontend tutorials to new model based runtime interface #6063
+* Remove unnecessary std::cout #6072
+* Fix error message in Buffer::vstore, NFC #6056
+* Fix FSIM Compile Error. #6070
+* Improve vector simplification for float operands #6043
+* Fix LocalBuilder on macOS with python 3.8. #6083
+* Add missing test for fast erf #6058
+* Fixed point multiplication improvements for AArch64 #5980
+* Fix code generation bugs for C/CUDA & Improve VerifyGPUCode pass #6041
+* Delete declaration of unused `op_node` #6102
+* Load configs even if it has no entity #6100
+* Update SGX example Cargo.toml #6067
+* Add default value for option `USE_DNNL_CODEGEN` in the cmake #6099
+* Update installation doc with minor improvements #6104
+* lint: add opencl .cl file type #6092
+* Clean up conversions between TVM and Rust functions #6114
+* Improve reduction schedule on arm CPUs #6110
+* Register Shape Func for Some Operators to Handle Dynamic Shapes #5955
+* Fix variable name conflict with OpenCL keyword #6048
+* Some rust cleanups #6116
+* Add an option to specify an alternate directory to output the build to #6016
+* Add `get_num_inputs` to GraphRuntime #6118
+* TFLite quantized conv test #6084
+* Fix autotvm on the `conv2d_nchw_winograd.mali` operator #6130
+* Add attr option mfloat-abi for arm32 #6123
+* Fix CUDA Library Tuning #6132
+* Add missing RPC sources after refactor #6113
+* Correct `runtime.load_module` #6161
+* Improve error messages in graph tuner, graph runtime, and module loader. #6148
+* Fix some shape mismatches between TF and Relay #6166
+* Improve doc string #6176
+* Fix incorrect function signature in header #6172
+* Fix alignment of note #6181
+* Implemented PADV2 Operator for TFLite and added support for constant values in PAD. #6167
+* Unary ops support added in frontend #6196
+* Change the meaning of `conv3d_transpose` `output_padding` to match `conv{1,2}d_transpose` #6065
+* Fix compile warnings. #6204
+* Fix -mfloat-abi=soft compilation for ARM with OpenCL target #6150
+* Match pytorch 1.6 googlenet pretrained model (#6201) #6212
+* Mod operator, bug fix #6160
+* RESHAPE with dynamic shape arg in TFLite frontend #6208
+* Fix compilation error with cuda 11 #6213
+* Fix `port_end` wrong default value from 9199 to 9099 to keep it consistent with the source code #6220
+* Std op without specified dimensions support #6226
+* Fix crt building and running error #6231
+* Implemented `ONE_HOT` Operator for TFLite. #6223
+* Added casting to hybrid script doc and fixed pass infra doc #6174
+* Fix `conv2d_transpose` output padding #6236
+* Fix cuda half math function is undefined: hpow, htanh #6225
+* Fix division range estimation error in simplifier #6244
+* Fix newer GCC compiler warnings. #6257
+* Support `_contrib_SyncBatchNorm` #6245
+* Fix reduction #6250
+* Add apt repository for clang-11 and llvm-11 #6256
+* Update tutorial to new TARGET as `micro_dev` is no more #6262
+* Fix clang-format #6264
+* Trivial fix, up the rodata section for the discovery board to 512 bytes. #6259
+* Fix cuda half math function is undefined: hpow, htanh #6253
+* Add dilation in x86 NCHWc depthwise conv support #6267
+* Decrease test times by introducing a testing model #6235
+* Add support for parsing the any dimension. #6277
+* Improve error messages for memory verifier and gpu memory verifier #6281
+* Reflect Compile-Time CMake Options into libtvm.so #6280
+* Add cmake options into libinfo #6286
+* Update slice to infer attributes when they are not graph inputs #6276
+* Use rpc.LocalSession for simple tests #6294
+* Fix random failure #6312
+* Fix resize test #6298
+* Fix cython FFI compat with np.int64 #6321
+* Fix relay vm optimize #6322
+* Changed TVMCTVMContext to TVMContext #6306
+* Make it possible to compile with MSVC #6341
+* ROCm changed the name of the library and removed the old one in the ROCm 3.7 release. #6345
+* Compatibility with ROCm before 3.7 #6359
+* Use a clear name that is separate from the ASF brand for the cache #6360
+* Fix `Dockerfile.demo_android` #6361
+* Fix sparse dense schedule on cuda #5803
+* Fix strategy for sparse dense cuda #5782
+* Fix x86 conv2d template when tuning with unpacked layout #5938
+* Fix the filter width parameter in `depthwise_conv2d` #6081
+* Fix reshape usage in ARM schedule #5732
+* Add missing header #4865
+* Simplify reduce expression in te.gradient #6611
+
+## API Changes
+* `tvm.module` -> `tvm.runtime.module`
+* `tvm.module.load` -> `tvm.runtime.load_module`
+* `tvm.module.enabled` -> `tvm.runtime.enabled`
+* `tvm.module.system_lib` -> `tvm.runtime.system_lib`
+* `tvm.relay.Module` -> `tvm.IRModule`
+* `tvm.create_schedule` -> `tvm.te.create_schedule`
+* `tvm.placeholder` -> `tvm.te.placeholder`
+* `tvm.compute` -> `tvm.te.compute`
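+
+To make the migration concrete, here is a minimal sketch of 0.7-style code (the kernel and the `vecadd.so` file name are illustrative, not taken from the notes above); each `# was ...` comment marks the pre-0.7 spelling from the list above:
+
+```python
+import tvm
+from tvm import te
+
+# Declare and schedule a trivial elementwise kernel with the relocated te APIs.
+A = te.placeholder((1024,), name="A")                    # was tvm.placeholder
+B = te.compute((1024,), lambda i: A[i] + 1.0, name="B")  # was tvm.compute
+s = te.create_schedule(B.op)                             # was tvm.create_schedule
+
+# Compile, export, and reload through the runtime namespace.
+mod = tvm.build(s, [A, B], target="llvm")
+mod.export_library("vecadd.so")
+loaded = tvm.runtime.load_module("vecadd.so")            # was tvm.module.load
+```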
+
+## Deprecation
+* Deprecate NNVM (#4535, #4562, #4565, #4571)
+* Deprecate FreeStmt #5890
+* Remove legacy `compute_expr.h` #5738
+* Deprecate OpenGL #5711, #5712
+
 ## 0.6
 ### Relay in Production