[Optimization] Warp level reduction support for CUDA #5498
Not sure what happened to submodules...
I looked into TVM's softmax schedule, and it is roughly 10x slower than cuDNN for inputs of shape m x 1024. Here are the timings in µs that I collected on a Volta GPU:
| m | n | cudnn | tvm_baseline | tvm_warp |
|---|---|---|---|---|
| 64 | 1024 | 6.6 | 57.6 | 6.6 |
| 128 | 1024 | 7.1 | 61.7 | 6.9 |
| 256 | 1024 | 8.2 | 72.3 | 7.3 |
| 512 | 1024 | 10.6 | 101.9 | 10.8 |
| 1024 | 1024 | 22.4 | 173.9 | 24.2 |
| 2048 | 1024 | 36.5 | 338.9 | 42.9 |
| 4096 | 1024 | 65.5 | 851.4 | 78.3 |
| 8192 | 1024 | 123 | 1844.6 | 166.7 |
| 16384 | 1024 | 237.6 | 3820.5 | 283.6 |
| 32768 | 1024 | 468.2 | 7783.3 | 556.4 |
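To make the "10x" figure concrete, the ratios can be recomputed from the table. The snippet below is only a sketch for reading the numbers; the dictionary and the function name `ratios` are made up for illustration and are not part of the patch.

```python
# Timings in microseconds copied from the table above:
# m -> (cudnn, tvm_baseline, tvm_warp), all with n = 1024.
timings_us = {
    64: (6.6, 57.6, 6.6),
    128: (7.1, 61.7, 6.9),
    256: (8.2, 72.3, 7.3),
    512: (10.6, 101.9, 10.8),
    1024: (22.4, 173.9, 24.2),
    2048: (36.5, 338.9, 42.9),
    4096: (65.5, 851.4, 78.3),
    8192: (123.0, 1844.6, 166.7),
    16384: (237.6, 3820.5, 283.6),
    32768: (468.2, 7783.3, 556.4),
}


def ratios(timings):
    """Return {m: (baseline/cudnn, baseline/warp, warp/cudnn)} for each row."""
    return {
        m: (baseline / cudnn, baseline / warp, warp / cudnn)
        for m, (cudnn, baseline, warp) in timings.items()
    }


if __name__ == "__main__":
    for m, (slowdown, speedup, gap) in ratios(timings_us).items():
        print(f"m={m:6d}  baseline/cudnn={slowdown:5.2f}  "
              f"baseline/warp={speedup:5.2f}  warp/cudnn={gap:4.2f}")
```

On these numbers the baseline is between roughly 8x and 17x slower than cuDNN, and the warp version stays within about 1.4x of cuDNN for every m.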
TVM's schedule could be improved in three ways:
- use a single warp to perform reductions (__shfl_down_sync and related intrinsics)
- fuse all computations into a single stage to reduce memory traffic
- vectorize loads/stores.
This patch implements the first item, which is a fairly independent piece of code. The next patch will implement the remaining parts. Please let me know if this patch is sound. Suggestions are appreciated. Thanks!
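For readers unfamiliar with the pattern, here is a minimal Python emulation of a shuffle-down tree reduction across one 32-lane warp. It is not the code from this patch (which changes how TVM lowers the reduction to CUDA intrinsics); `shfl_down` and `warp_reduce` are hypothetical helper names, and the emulation assumes that a lane whose source falls outside the warp keeps its own value.

```python
import operator

WARP_SIZE = 32  # number of lanes in a CUDA warp


def shfl_down(values, delta):
    """Emulate a shuffle-down: lane i reads the value held by lane i + delta.

    Assumption: lanes whose source lane would be out of range keep their own value.
    """
    n = len(values)
    return [values[i + delta] if i + delta < n else values[i] for i in range(n)]


def warp_reduce(values, op):
    """Combine one value per lane with `op`; the full result ends up in lane 0."""
    assert len(values) == WARP_SIZE
    lanes = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Every lane combines its value with the one `offset` lanes above it.
        shifted = shfl_down(lanes, offset)
        lanes = [op(a, b) for a, b in zip(lanes, shifted)]
        offset //= 2
    # Only lane 0 is guaranteed to hold the complete reduction.
    return lanes[0]


if __name__ == "__main__":
    data = list(range(WARP_SIZE))
    print(warp_reduce(data, operator.add))  # sum of 0..31
    print(warp_reduce(data, max))           # maximum of 0..31
```

The loop runs log2(32) = 5 steps and needs no shared memory or block-level barrier, which is the appeal of doing the reduction inside a single warp.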
@tqchen feel free to include other folks.
cc @icemelon9 @Hzfengsy @Shawn-Inspur @jwfromm @antinucleon
If my understanding is correct, you have directly rewritten the compilation rule for ThreadAllReduce. But __shfl_sync is only supported by GPUs with compute capability 7.x or higher, so I wonder whether it may cause trouble for people who are still using a Pascal GPU.
- Added the warp level reduction support
- Upgraded shfl intrinsics to the sync version.
- This is the building block for scheduling softmax like operations.

Signed-off-by: Wei Pan <[email protected]>
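As an illustration of why a warp reduction is a building block for softmax-like operations: a row softmax needs two reductions, a row maximum and a sum of exponentials. The sketch below emulates one warp handling one row in plain Python. It is an assumption-laden illustration, not the TVM schedule; `warp_reduce` and `softmax_row` are hypothetical names, and the row length is assumed to be a multiple of 32.

```python
import math

WARP_SIZE = 32  # lanes per warp


def warp_reduce(lanes, op):
    """Tree-combine one value per lane; stands in for a shuffle-based reduction."""
    lanes = list(lanes)
    offset = len(lanes) // 2
    while offset > 0:
        lanes = [
            op(lanes[i], lanes[i + offset]) if i + offset < len(lanes) else lanes[i]
            for i in range(len(lanes))
        ]
        offset //= 2
    return lanes[0]


def softmax_row(row):
    """Numerically stable softmax of one row, emulating a single warp."""
    assert len(row) % WARP_SIZE == 0
    # Reduction 1: each lane folds a strided slice, then the warp combines the maxima.
    lane_max = [max(row[lane::WARP_SIZE]) for lane in range(WARP_SIZE)]
    row_max = warp_reduce(lane_max, max)
    exps = [math.exp(x - row_max) for x in row]
    # Reduction 2: same pattern for the sum of exponentials.
    lane_sum = [sum(exps[lane::WARP_SIZE]) for lane in range(WARP_SIZE)]
    row_sum = warp_reduce(lane_sum, lambda a, b: a + b)
    return [e / row_sum for e in exps]


if __name__ == "__main__":
    row = [((i * 37) % 101) / 10.0 for i in range(1024)]  # n = 1024 as in the table
    out = softmax_row(row)
    print(sum(out))
```

With n = 1024 each lane covers 32 elements before the cross-lane step, so both reductions stay inside one warp.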
Minor fix-up and cleaned comments
LGTM
ping @Hzfengsy
LGTM
Thanks @wpan11nv @Hzfengsy @roastduck!
* [TFLITE]Select op support for tflite frontend (#5486) * [TFLITE]Select/Where op support for tflite frontend * Review comment fixed * Review comment fixed * [FRONTEND][TFLite] Fully connected op conversion made in sync with TFLite (#5510) * [FRONTEND][TFLite] Fully connected op conversion made in sync with TFLite * [1] Test case added * [2] Review comments handled * [3] Prints removed * [TOPI][Winograd] Optimization of Conv2d Winograd algorithm on Tensor Core (#5485) * Cache PrimExpr instead of raw pointers in bound analyzer (#5533) The objects that the raw pointers point to can be deallocated and new objects can be allocated at the same address, all while these pointers are still in the cache. This can lead to unexpected behavior, for example to calculated bound conflicts with previously cached values. Caching PrimExpr will prevent the objects from being deallocated while the cache is active. * fix a few bugs with shape inference and types in the onnx importer (#5534) * [Frontend][TFLite] ADD_N operator (#5474) * [WEB][RUNTIME] TVM WebAssembly JS Runtime (#5506) * [WEB] Remove the old web runtime * [WEB][RUNTIME] TVM WebAssembly Runtime This PR introduces a brand new TVM web runtime based on the WASM standard API. Main highlights: - The new runtime is rewritten using the Typescript. - The new runtime now directly interfaces with WebAssembly's standard API, instead of relying on emscripten's API. This change will make the js runtime more portable to runtime variants. For example, we could also try to make it interface with the tvm's rust runtime implementation. - System library can be provided through WASI - We also build a hack to enable Emscripten to generate a WASI like bundle for runtime environment on the Web. - The wasm generation now uses the mainlin LLVM. - Dynamic link(dlopen) is not used due to limitation of wasm, instead we rely on the recent new RPC refactor to directly restart a new session for each wasm binary sent to the RPC. 
* Address review comments * Skip tensorcore test * [RELAY][ONNX]ReduceLogSumExp Operator support (#5453) * [RELAY]LogSumExp Op Support * [ONNX]LogSumExp Op Support * [RPC][BUGFIX] Fix remote device sync (#5538) * [Refactor][std::string --> String] IRModule is updated with String (#5523) * [std::string --> String] IRModule is updated with String * [1] Packedfunction updated * [2] Lint error fixed * [3] Remove std::string variant * [RUNTIME] Store nullptr PackedFunc as nullptr for better error propagation (#5540) * [Relay-TFLite] FP32 and Quantized Object Detection Model (#5479) * TFlite e2e FP32 Object detection model * Fix test * [Relay-TFLite] Quantized activations * Flexbuffer parsing * Lint * Relaxing checks. * Github reviews * comments Co-authored-by: Ubuntu <[email protected]> * Changes to cpp_rpc to make it work on Android (+ Hexagon offloading) (#5535) * Changes to cpp_rpc to make it work on Android (+ Hexagon offloading) - Implement getNextString to break up std::string into words. stringstream just doesn't work on Android. - string::find_last_of doesn't look for the last substring, but the last character from a given string. - Use SIGTERM to terminate processes (this isn't necessary, but using SIGKILL is not a good practice). - Convert "./rpc" to a full path. When a module is uploaded and offloaded to Hexagon, the dlopen on Hexagon needs an absolute path (or a path without directories). * Only set the absolute patch on non-Windows platforms Windows has different macros for the maximum path length. * Add Onnx Pad v11 (#5539) * fix restructured text (#5541) * [CRT]fix to reduce RAM size during loading model (#5507) * [CRT]fix to reduce RAM size during loading model * Release graph_json memory immediately after reading * Load platform specific lib for tvmdsoop instead of only so (#5542) * [RPC] Improve RPCServer AsyncIO support. (#5544) * [RPC] Improve RPCServer AsyncIO support. 
When the RPCServer is in the async IO mode, it is possible for the server to directly serve async function that may return its value via a callback in the future. This mode is particular useful to the web environment, where blocking is not an option. This PR introduces the Async support to the RPCSession, allowing the AsyncIO driven servers to serve the async functions. These functions will still be presented as synchronized version on the client side. Followup PR will refactor the web runtime to make use of this feature. * Address comments * [Rust] Add first stage of updating and rewriting Rust bindings. (#5526) * Add tvm-sys * Use as_mut_ptr * Address CR feedback * Update rust/tvm-sys/src/datatype.rs Co-authored-by: Nick Hynes <[email protected]> * Final CR comments * Fix find and replace error in frontend Co-authored-by: Nick Hynes <[email protected]> * [TE] Fix MakeLoopNest for warp memory (#5382) * [TIR][Printer] text format printer considering future parsing use (#5483) * [Optimization] Warp level reduction support for CUDA (#5498) - Added the warp level reduction support - Upgraded shfl intrinsics to the sync version. - This is the building block for scheduling softmax like operations. Signed-off-by: Wei Pan <[email protected]> * A clone of test/python/unittest/test_runtime_micro.py, however (#5546) modified to run specifically on ARM cortex-M hardware, which currently is just the STM32F746 discovery board. Signed-off-by: Tom Gall <[email protected]> * [CI] Install wasmtime for WebAssembly tests (#5494) * Apparently, ONNX Conv with no 'pads' defaults to zero padding (#5548) * [WEB] WebGPU support (#5545) This PR introduces WebGPU support to tvm. The WebGPU runtime is directly built in javascript(as WebGPU uses JS as the first class citizen API) and exposes back to the tvm's runtime via PackedFuncs. One important note is that `ctx.sync` is not async. This is due to the fact that WebGPU is a purely async API and we cannot block in the web environment. 
So the current best way to use the js api is to wrap things in an async function. When copy a GPU array to CPU, `await ctx.sync()` need to be called to wait for copy completion. We use a AsyncIO rpc server to serve the async functions to the clients. * [TOPI][RELAY][TENSORFLOW]Math ops added (#5502) * [TOPI][RELAY][TENSORFLOW]Math ops added * Extra newline removed * CI fix * Review comments fixed * Review comments fixed * [RUNTIME] Hexagon driver for offloading kernels to simulator (#5492) * [RUNTIME] Hexagon driver for offloading kernels to simulator * Add sim_dev as external project when building with Hexagon/sim support * Change target CPU for sim_dev to v60 * [LINT] clang-format the h,cc,m files. (#5557) This PR prepares for our migration to use the clang-format as part of the linter system. * [BYOC, MergeComposite] Add additional check before re-using the cached match (#5552) * Add additional check before re-using the cached match in merge composite * clean up ExtractPattern calls * [WEB] Setup lint, doc, test (#5556) * [CI] Update ci-cpu to bionic (#5555) * [CI] Update ci-cpu to bionic (#5554) * [Fix] Fix conv2d alter op for arm cpu (#5532) * [FRONTEND]onnx, mxnet, pytorch mathops added (#5561) * Fix topi test for tensorcore (#5563) * [Refactor][std::string --> String] IR is updated with String (#5547) * [std::string --> String] GlobalTypeVar is updated with String * [std::string --> String] GlobalVar is updated with String * [std::string --> String][IR] ADT is updated with String * [std::string --> String][IR] OP is updated with String * [std::string --> String][IR] Attrs is updated with String input * [std::string --> String][IR] GlobalVar is updated with String * [std::string --> String][Test] Pyconverter is updated with String change * [DOCKER] Fix vulkansdk in the ci-gpu (#5566) * [CI] reintroduce docker stage for wasm tests (#5565) * [DOCKER] Introduce ci-wasm * Add Jenkinsfile * Rename prepare to prepwasm so it won't run by default * [CI] Update 
ci-lint to use the latest image that contains clang-format (#5568) * [DOCKER] Add clang-format and nodejs to ci-lint (#5567) * [TARGET] Phase out WebGL (#5570) The graphics API is moving towards next generation. Vulkan/Metal on the native and WebGPU on the web. Due to the limited programming model, we cannot get the best compute performance in WebGL. Now that the mainline already have both WebGPU and vulkan support, this PR phases out WebGL. * [LINT] Enable clang-format. (#5572) * [LINT] Enable clang-format. * Add more docs * [CI] Update the ci-gpu to the lastest build with the new vulkansdk. (#5571) * [Relay] enable blocking format in x86 conv2d and fold scale axis (#5357) * [CI] Fix clang-format error (#5577) * Allow ubuntu_install_darknet.sh to work in both 18.04 and 16.04 (#5574) * [PYTORCH]expand bug fix (#5576) * [CI] Enable llvm-11 and llvm-10 in build tests, recover webdocs. (#5579) This PR ties up the last loosen end of the recent CI update. * [PYTORCH] Support max_pool2d_with_indices (#5549) * Use real output name instead of node_name * Add pytorch max_pool2d_with_indices converter. * Add test for maxpool2d with indices * Add explicit assert for single output * Only consume output (not indices) from max pool 2d with indices * undo change * [Relay] Fixed bug in attribute parsing for pool layers. (#5582) * Fixed pooling bug. * Added tests and fixed more cases. 
* [RELAY][TF] Support symbolic newshape for Reshape (#5429) * [RELAY][TF] Support symbolic newshape for Reshape * Only need to pass data * Use MakeReshape() in Reshape() * Change newshape to Expr * Create a template for Array<T> * Fuse reshape when newshape is constant * Make newshape Optional * Use bool() of Optional Co-authored-by: Li Xiaoquan <[email protected]> * Add prim::device op (#5584) * Fix the runtime raise error (#5586) * [RELAY][Convert Layout] Specify additional layouts in convert layout pass (#5422) * [RELAY] Specify additional layouts in convert layout pass * This patch means that you can specify an additional layout, rather than using the layout chosen by default during conversion. * This is specifically useful for external codegen when a 3rd party library needs to target a specific kernel layout for example. Change-Id: I3ef9cf45ead574801870a38af9768f93e29aab10 * Use mapping of op name to list of desired layouts Change-Id: Ibd691a3cb93e73a394f36112668ad52a84c7d5a2 * Fix issue with code block Change-Id: Ibb4e38c05ad4312b7dea845be699b8d5d57e0a94 * Address comments, Improve tutorial Change-Id: Ib824eead329d551c338234de3b2d814693afd0ec * Fix linting Change-Id: Ie9e1891f590b3a7496a56ff8362cdda9d4b5fa75 * Test uses NCHW default layout. Unrelated issue with NHWC. Change-Id: I1c16f0db73db56f5e9536db3fe5eb2624c3b595c * Fix mistake in tutorial Change-Id: I944041245d27af262dc96f1cd8117f1f19272062 * Address multiple comments Change-Id: If33a1e34acd8fc37d1c7797ee189a6448a392672 * Improve tutorial Change-Id: Ib04142c94c7958ab5067947d2ff4c84354e3d0c5 * Fix Clang-format Change-Id: Ieff39e3f0817d22579c68b3287e972a3b0fcfbc8 * Add a quantized conv2 unit test for the tflite front-end (#5558) Signed-off-by: Giuseppe Rossini <[email protected]> * [Relay][Transform] Safe check added for Merge Composite (#5562) * [MXNET]abs, round, reciprocal, sign, softsign, hard_sigmoid (#5587) * [Hexagon] One more fix for concurrency count (#5589) * Fix JSON graph dumping. 
(#5591) * Previously this function placed a JSON-escaped string containing the JSON-encoded graph. * [DOCS] Improve document in reflection (#5593) * Overestimate binary size for microTVM compiled binaries. (#5590) * Overestimate binary size for microTVM compiled binaries. * Currently uTVM binary section sizes are computed by summing the sizes of all symbols in the section. * This method produces errors because it presumes the linker works in a particular way, rather than analyzing the linked output. * As we intend to move away from linking inside TVM (RFC forthcoming), just using this stopgap to make forward progress until then. * address weberlo comments * fix regression (use 64 bit word size) * [TFLite Runtime] Fix bug and re-enable RPC execution test (#5436) * [Relay][VM] Memory planner (part 1) (#5144) * Start on memory planning WIP Move to test_memory_passes.py Work on memory planning Post-rebase and VM changes Plumb through the offsets Basic tests all pass, fix offset to data buffer. Fix compile errors Fix ws Apply suggestions from code review Co-Authored-By: Haichen Shen <[email protected]> Address CR Update src/runtime/vm/vm.cc Co-Authored-By: Haichen Shen <[email protected]> Fix another comment Fix lint Fix Fix Fix Lint is done? Fix More fix Trying to debug No clue Fix lint * Fix docs * Disable aggressive constant eval * It works * Fix lint * Found issue with dynamic * Fix the pass, but runtime segfaults * fix scalar tensor, test_any_elemwise passes * Fix split pass * Fix 0-rank issues * Fix * debug * apply Haichen's patch and clean up * lintgit add . 
* fix serializer and test_tyck_alloc_tensor test * Fix the constant lift pass in presence of closures * Restore old finder * Fix rebase issues * Fix * Fix * Fix issue coercing the shapes incorrectly from i64 to i32 * Fix linting * Fix clang format * Format memory.cc * Fix 0-rank case * Add fix for (0,) shape * Ignore shapes for now * Apply suggestions from code review Co-authored-by: Zhi <[email protected]> * Update src/runtime/vm/executable.cc Co-authored-by: Zhi <[email protected]> * Fix * lint Co-authored-by: Zhi Chen <[email protected]> Co-authored-by: Zhi <[email protected]> * Add ostream formatters for TargetPtr/TargetVal. (#5592) * Pattern Language, Matcher, Rewriter, and Function Paritioner (#5231) * [Reduction] Fix cross thread redunction (#5551) - The predictions were not correctly applied after transformation. This leads to normal reduction itervar appearing outside of the loop, which is undefined. See detailed comments. Signed-off-by: Wei Pan <[email protected]> * Fix TVMArray layout on device (#5599) * [LLVM] Represent alignment information in LLVM IR (#5598) * Add debug mode to tempdir() (#5581) * [PYTORCH]ImplicitTensorToNum support added (#5603) * [PYTORCH]Matmul fix for batch_matmul (#5604) * fix rpc server bug on VTA (#5607) * [REFACTOR][IR] Streamline ir/op Registry (#5609) * [REFACTOR][IR] Streamline ir/op Registry This PR refactors the attrregistry mechanism in the ir/op into a separate common base. The common base will provide a foundation for other attr related registries such as target and pass. We also streamlines the terminology of the registry API. - Use AttrMap for the column maps returned by the registry - Use RegEntry to refer to the registry entry. * Address review comments * [TFLITE]GATHER_ND (#5508) Signed-off-by: Dhruva Ray <[email protected]> * [CUDA] Fix codegen for warp shuffle intrinsics (#5606) * fix shfl intrin * improve test_lower_warp_memory_cuda_half_a_warp * Fix a typo. 
(#5611) Co-authored-by: Zeng Liyong <[email protected]> * fix pattern topological order (#5612) * [BYOC] Remove kCompiler attr from external functions (#5615) Functions destined for external codegen keep their kCompiler attribute which means SkipFunction returns true when running a pass over such functions during the codegen step. This makes sense during graph partitioning, however when lowering the functions for codegen the is no reason to keep this behaviour. Allowing this behaviour will mean a codegen can run a pass on functions only intended for one 3rd party library. Specifically, allowing pre-processing of a series of sub-graphs right before it is passes through codegen. This helps ensure that the functions destined for the 3rd party library are in the expected format. For example, we may want to ensure that these functions have a kernel layout of OHWI because the 3rd party library only supports OHWI. This wouldn't be possible before partitioning the graph as we don't know how the graph will be partitioned ahead of time. 
Change-Id: Ia68b9da335ef1acfc405a8528aac823de60a65c2 * [Relay]Improve Shape Func handling for Tuple inputs (#5467) * Improve Shape Func handling for Tuple inputs * Fix lint * Improve * Fix build * [Relay][Refactor][std::string --> String] Relay updated with String (#5578) * [KERAS]Global MaxPool3d and AvgPool3d support (#5098) * [IOS] Fix build error of iOS RPC (#5621) * [IOS] Fix build error of iOS RPC - Update to C++14 - Use the latest RPC protocol - Resolve CoreML dependency * Fix clang-format error * Fix three typos (#5620) Co-authored-by: Zeng Liyong <[email protected]> * [Frontend][Tensorflow] Gather nd bug fix for one dim support in tensorflow (#5588) * [Frontend][Tensorflow] Gather_nd one dim support added * Test case added * Doc error handled * Review comment handled: reverting new attr introduced * Check added at mxnet frontend * Doc error handled * TFLite test case failure resolved * [MXNET]MaxPool3d and AvgPool3d Ops support added (#5614) * [PYTORCH]ReflectionPad2d op (#5624) * [BYOC][MergeComposite] if root->args[i] isn't a CallNode, then Donwcast<Call> will check fail (#5623) we needn't execute L131 "call_map->Set(arg, new_arg)", because when arg is CallNode and root->args[i] is not CallNode, new_arg will be a null pointer. There is no point in caching null pointer. Signed-off-by: windclarion <[email protected]> * [DOCS] Move the api docs to the api subfolder (#5626) * [DOCS] Move the api docs to the api subfolder * Update numpydoc location * Ignore 403 * make sure folder exists * [RELAY][BYOC] Fix the creation of tuple of tuples in PartitionGraph (#5616) * [RELAY][BYOC] Fix the creation of tuple of tuples in PartitionGraph If the annotated compiler region contains multiple outputs where some of the outputs are tuple output, the current PartitionGraph will create tuple of tuples. This will not be handled by the runtime. This commit flattens the such tuples and re-create them after the call site of the partitioned function. 
Change-Id: I4e7ccbda73c129a9f4ae8705d5c9f2af6ab99ef6 * [RELAY][BYOC] Fix the creation of tuple of tuples in PartitionGraph *code refactor : extracted the passes as a sequential Change-Id: If4bc00b00a96fa244358d602fc1a361498342f46 * [RELAY][BYOC] Fix the creation of tuple of tuples in PartitionGraph *further refactor Change-Id: I69ddd0e835e88ef97da8a3a3b949be3f7b619c02 * [RELAY][BYOC] Fix the creation of tuple of tuples in PartitionGraph *class description comment amended Change-Id: I55720bf0467c96e979e1ab56c40d9d209e0f9456 * [NODE][PASS] Introduce config to PassContext. (#5631) This PR introduces a new config field to the PassContext to allow it store arbitary config values. To make sure that the config is validated, we allow each pass to register the config key they would expect and the corresponding types. We also introduce a CreateObject from Map<str, Object> to allow config creation from a json-nest(like in vscode) in python. We added an example of UnrollLoopConfig. Followup PR should migrate the passes to use the new config field. 
* another cmake fix (#5630) * Fix typo in test script (#5635) * Label Pattern Partitions (#5627) * Label Pattern Partitions with a default label to prevent nested partitions and an optional user supplied-label * Add node names in topological order to Partitioned attribute * respond to review comments * move partition tag into const in attr namespace * [RELAY][PYTORCH]Resize3d, Upsample3d op support (#5633) * [TUTORIAL]TFLite QNN Tutorial (#5595) * [TUTORIAL]TFLite QNN Tutorial * Review comments * Extend AttrPattern to support CallNode and FunctionNode attributes (#5637) * Extend AttrPattern to support CallNode and FunctionNode attributes * Update tutorial and add breaks * add func attr test * [DOCS] Fix the QNN TFLite tutorial build (#5641) * [TUTORIAL] Fix execution error of TFLite quantized tutorial * Assign TensorCore to docs build * [RUNTIME][VULKAN] Seg fault in WorkspacePool's destructor (#5632) (#5636) * [RUNTIME][VULKAN] Seg fault in WorkspacePool's destructor (#5632) * fixed this issue by changing WorkspacePool's destruction order * make line < 100 charactors long * [PYTORCH]Padding support (#5638) * Remove unnecessary print (#5642) * [CI] Allow CI_PYTEST_ADD_OPTIONS to be unbound. (#5644) This patch allows the test script to execute normally when CI_PYTEST_ADD_OPTIONS is not available. * [Runtime] Introduce runtime::Array (#5585) * Introduce runtime::Array * Sync with dmlc-core * Tests added: size, capacity, empty, front, back, push_back, pop_back, insert * 2, erase * 2, resize, reserve, clear * [CI] Add log check to the sphinx gallery docs (#5643) * [CI] Add log check to the sphinx gallery docs This PR add log check to sphinx gallery tutorials to prevent the case when sphinx failed to capture the error in tutorials. * Fix the status * [RELAY][BYOC] Preserve type information in Merge Composite (#5640) Keep the type information when extracting patterns so that it can be used as part of 'check' functions. 
Change-Id: I16cc70c3d013a794d2ceefb5bec815129c7b8825 * Add a check Callback to the Pattern Paritioner (#5646) * add a check callback to the paritioner * fix doc string * fix unit test spelling * add a test with types * [Relay, Topi][OP] Correlation (#5628) * [Relay,Topi] Correlation * fix * move * typo * Update test_topi_correlation.py * HG: Commit message of changeset 6281661. (#5622) [Relay] Move compiler_begin/end_op to local static objects * [AutoTVM] Update XGBoost verbosity option (#5649) * [RUNTIME] Resolve constexpr issue in debug mode. (#5651) static constexpr is a bit weird before c++17. They are not inlined by default and does not have symbols after compilation. It usually isn't a problem when they are inlined(in c++17 they are inlined by default). But will create compilation error when passed to functions that take (const)references. This PR fixes the problem so that we can compile on debugmode. * µtvm debug improvements (#5648) * Forever loop in UTVMDone to aid debugging * Use parameter and callback function as a micro debug hook. * Previously, users had to uncomment a region of code in micro_session.cc and recompile to debug. Now they can pass in a key in the micro.Session config: config = tvm.micro.device....generate_config() config['debug_func'] = _python_launch_gdb with micro.Session(config) as sess: .... * clang-format * Only forever loop on device (on host this blocks unittests) * [REFACTOR][IR] Migrate IRModule ObjectRef to not-null (#5654) * Upgrade XGBoost to latest (#5658) * Increase bss section size. (#5660) * Likely broken in PR 5590. * [PatternLang] Convert PatternGrouper to do pre-order, non-recursive analysis (#5653) * make the PatternGrouper iterate over the input Expr in a non-recursive pre-order fasion * add a comment * [Relay,Topi][OP] affine_grid and grid_sample (#5657) * [Relay,Topi][OP] affine_grid and grid_sample * lint * [TIR][BUILD] Remove buffer params from pass config. 
(#5652) Buffer configurations can be passed during construction and does not need to be part of the build config. This is a refactor step to simplify the BuildConfig for the PassContext migration. * handle likely in IRMutatorWithAnalyzer (#5665) * [TOPI] Improve CUDA softmax scheduling (#5600) - Do not use multiple kernels - Schedule with warp reductions - Fixed a bug on the lower warp memory pass - Fixed warp shuffle intrinsics for the nvptx backend. Signed-off-by: Wei Pan <[email protected]> * [Relay][Op]Support symbolic TopK, Ones, Zeros and Full (#5459) * Support symbolic TopK, Ones, Zeros and Full * Fix pylint * Add docstring for topk shape func * Fix grad * Fix lazy_gradient_init * Fix parser * Fix print ir text * Fix lint * Improve pattern_util * Fix topk * Fix build * Use Optional for attribute * Fix clang-format * Minot fix * Fix pylint * Fix build warning * Fix parser * Move ToScalar * Fix lint * Fix lint * Make topk shape func as data independent when k is constant. * Fix lint * Minor fix * [PYTHON] Add buffer name when creating tensor bindings (#5670) * [REFACTOR][TIR][API-Change] Migrate BuildConfig to PassContext. (#5668) * [REFACTOR][TIR] Migrate BuildConfig to PassContext. This PR migrates the TIR configurations from BuildConfig to the PassContext used by the unified IR. Moving forward, PassContext will be the unified way to configure passes in the TVM stack. Changes - Refactored TVM_PASS_REGISTER_CONFIG_OPTION to take in the reference type. - Removed BuildConfig. - Migrated the passes to use PassContext. 
* Update include/tvm/ir/attrs.h Co-authored-by: Zhi <[email protected]> Co-authored-by: Zhi <[email protected]> * [Doc] Misc doc fix (#5672) * [C++ RPC] Fix C++ RPC build problem on Linux (#5671) * enable amd_apu device on vulkan target (#5659) * [AutoTVM][TOPI] AutoTVM incorrect measurement (#5511) * [AutoTVM][TOPI] AutoTVM incorrect measurement * create new placeholder with converted layout * update _schedule_winograd * [POC][PatternLang]Remove constants from partitioned functions (#5663) * remove constants from partitioned functions * remove print statements * [TF] Support TupleWrapper as direct ancestor of control flow ops (#5639) * add tvm.micro pydoc to sphinx (#5661) * add tvm.micro pydoc to sphinx * making build pass and addressing tqchen comments * add a check for null function attributes (#5674) * [BYOC] Pattern Language MergeComposite (#5656) * Pattern Language MergeComposite * fix DNNL pattern * Use builtin binary operator syntax for demo * Improve unit test * add a testcase for #5674 (#5677) * Call previous excepthook in tvm_excepthook. (#5675) * Call previous excepthook in tvm_excepthook. * Rename prev_excepthook. * Create a tvm_wrap_excepthook to wrap a given excepthook with tvm custom excepthook work and call it on system previous excepthook. * Add docstring. 
* Fix the shift column for scale_shift_nchw and scale_shift_nhwc in C topi (#5679) * [Bugfix] Fix Python debugger segfaults with TVM built with LLVM (#5685) * Import readline before loading libtvm * make lint happy * [DOC] Improve Pattern Language Docs (#5676) * [DOC] Improve Pattern Language Docs * address comments * address comments * [TFLITE]Quantize & Dequantize op (#5394) * [TFLITE]Quantize & Dequantize op * Testcases added * Review comment fixed * [TIR][REFACTOR] std::string -> String Migration in TIR nodes (#5596) * [TIR][REFACTOR] std::string -> String Migration for Var node and SizeVar Node * update json_compact.py * [PatternLang] Add ConstantPattern (#5689) * Add ConstantPattern * update doc * [PYTORCH]Minor bug fixes (#5683) * [PYTORCH]Minor bug fixes * Review comment fix, testcase added * Added testcase for bert model * [Relay] Fix dataflow_pattern.rewrite() hang if Match in IR (#5680) rewrite() quits only if graph stop changing, but ExprMutator always creates new Match node. This patch fixes this. * [RELAY] Fix segfault in pretty print when ObjectRef is null (#5681) * [RELAY] Fix segfault in pretty print when ObjectRef is null Encountered when pretty printing module with function attribute equal to NullValue<ObjectRef>(). Change-Id: I2e7b304859f03038730ba9c3b9db41ebd3e1fbb5 * Add test case Change-Id: I579b20da3f5d49054823392be80aaf78a055f596 * [REFACTOR][RELAY] move fallback_device to config (#5690) * @zhiics -> PPMC (#5692) * [COMMUNITY] @masahi -> PPMC (#5691) * Support more dtypes for TVMDSOOp (#5694) * [ONNX]LpPool Support added (#5696) * In memory_plan, check if value is not None, instead of just checking value as boolean. 
(#5700) * [PatternLang]Conditionally Embedding Constants in Partitioned Functions (#5693) * Embed constants in the partition function if the pattern explicity requests constants fix rst fix pylint * improve comments based on Cody's feedback * [ONNX] Skip ADD inside Gemm op when vector is zero (#5697) * [BYOC] Support Tuple Output in C/DNNL Codegen (#5701) * Support tuple output runtime * fix unit test * [REFACTOR][RELAY] Replace build_config with PassContext (#5698) * [PYTORCH]floor_divide support for squeezenet (#5702) https://github.com/apache/incubator-tvm/issues/5133#issuecomment-636330705 * [AutoTVM][TOPI] Fix bifrost spatial packing conv2d auto tune (#5684) * [AutoTVM][TOPI] Fix bifrost spatial packing conv2d auto tune * [AutoTVM][TOPI] Putting placeholder replacement in compute * Fix winograd kernel replacement * Fix sanity check: Line too long * [Arith] ExtendedEuclidean merge impl to int_operator (#5625) * fix typo: anchor windoes should be anchor windows (#5706) * [REFACTOR][PY] relay.op.Op -> tvm.ir.Op (#5705) * [REFACTOR][PY] relay.op.Op -> tvm.ir.Op * Improve the error check * [PatternLang] Simplify Pattern API Implementations (#5703) * Add syntatic sugar; include pattern to API docs * fix doc warnings * [PYTORCH]ReplicationPad support added (#5708) * Remove deprecated opengl files (#5711) * Remove opengl runtime and cmake (#5712) * [BUGFIX][CRT] Fix Compilation Error in CRT (#5713) * Rename tvm_dso_op to libtvm_dso_op (#5714) * [Object] Unify StrMapNode and MapNode (#5687) * Pass cpptest and py unittest * fix graph runtime * right fix * fix a bug that runtime::String's operator < is actually compare by address * Update container.py * Renaming * Address comments * lint * Replace ObjectHash in object.py * [MXNET]Softmin, trunc op support added (#5715) * Avoid downloading when TOPHUB_LOCATION is NONE (#5720) * [Object][FFI] Introduce runtime::String::CanConvertFrom (#5718) * [Object][FFI] Introduce runtime::String::CanConvertFrom * Update container.h * 
[Object] Restore the StrMap behavior in JSON/SHash/SEqual (#5719) * Fix generating types like float44 and float88 (#5722) * [ONNX]ReduceL1, ReduceL2, ReduceSumSquare, ReduceLogSum ops added (#5721) * [TENSORFLOW]StatefulPartitionedCall/PartitionedCall Ops support added (#5617) * Implemented functionInvocation Unit Test for StatefulPartitionedCall operator(working) and initial changes for placeholder(not working as of now) * Placeholder exercises with tvm * placeholder interim * SPOP Test cases structure * New test cases for spop * miscellaneous test cases for spop * Placeholder samples..working with shapes explicitly passed * Variables test case. Works with the same fix of shape_dict * SPOP Positive test cases first iteration * support output tensors as function args, multiple functions * Corrected Indentation * filewritter is only for debug purpose * support variables in function args * First working iteration of positive spop test cases * Removed commented code, simplified code * Code Reorganization- First working iteration of positive spop test cases * corrected variable name after refactor * Code Reorganization- First working iteration of positive spop test cases * move code inside mapped operator function * Removed extra line * support variables in function args * Removed commented code, simplified code * move code inside mapped operator function * Code Reorganization- First working iteration of positive spop test cases # Conflicts: # tests/python/frontend/tensorflow/test_forward.py * Code Reorganization- First working iteration of positive spop test cases * Function invocation more test cases * Simplified & Merged different Function Invocation Test cases * support invocation of nested callables no need to explicitly handle paratitioned and statefulPartitioned condition in convert_operator function * Simplified and Uniform testcases * support invocation of nested callables no need to explicitly handle paratitioned and statefulPartitioned condition in 
convert_operator function * Simplified and Uniform testcases * removed duplicate and renamed testcase * Negative scenario added for testing operator statefulness. Only Exception to stateful operators are Partitioned & StatefulPartitionedOp which have capability to execute even stateless operators within them * Miscellaneous reorganization changes for spop scenarios * Miscellaneous reorganization changes for spop scenarios * Corrected import of tensorflow modules safely using try except and other code reorganization * Negative scenario for resource variables handled * Documentation update for code * SPOP change in function handling * handle nested subgraph * refactor * get op def compatible with tf 1x & 2x * Fixed liniting issues * added doctsring and few nits * Merged changes for positive test cases and negative test cases * Moved StatefulPartitionedCall test case to the end of the TC list * Fixed some typos and semantics * dmlc-core * dmlc-core * fixes * Addressing Review comments in the PR for SPOP support * Fixed pylint errors * Corrected tensorflow import syntax * Placed the op_def_registry module import outside of for loop * Removed new stateful operators list and combined these operators with missing operators to display as single list. Also removed throwing seperate exception for stateful ops Co-authored-by: Prashant Sail <[email protected]> Co-authored-by: maheshambule <[email protected]> * [AutoTVM, Relay] Clear compile engine after task extraction (#5724) * Fix runtime::String backward compatibility in JSON (#5725) * codegen llvm: move nvptx-specific intrinsic handling into codegen_nvptx (#5726) See discussion in #5600. I'm also throwing in a pointer lifetime fix for the context held by NVPTX because otherwise topi/tests/python/test_topi_softmax.py would sefault for me. With the test, I can also run resnet-18 on the nvptx target in gpu_imagenet_bench.py. 
* [TOPI,RELAY][TFLITE] Sparse to dense operator (#5447) * [Relay][Frontend][TFLite] Add parser support for shape and range Signed-off-by: Dhruva Ray <[email protected]> * [TOPI,RELAY][TFLITE] Sparse to dense operator Signed-off-by: Dhruva Ray <[email protected]> * use param name in documentation Signed-off-by: Dhruva Ray <[email protected]> * sphinx doc errors fixed Signed-off-by: Dhruva Ray <[email protected]> * incorporated review comments Signed-off-by: Dhruva Ray <[email protected]> * Missing a blank line... Signed-off-by: Dhruva Ray <[email protected]> * use get_tensor_expr Signed-off-by: Dhruva Ray <[email protected]> * Accidently removed this function in the rebase... Signed-off-by: Dhruva Ray <[email protected]> * support default value for default_value Signed-off-by: Dhruva Ray <[email protected]> * clang format fixes Signed-off-by: Dhruva Ray <[email protected]> * topi pylint fixes Signed-off-by: Dhruva Ray <[email protected]> * [Frontend][TFLite] Add parser support for shape and range (#5329) * [Relay][Frontend][TFLite] Add parser support for shape and range Signed-off-by: Dhruva Ray <[email protected]> * Incorporated review comments and used new functions Signed-off-by: Dhruva Ray <[email protected]> * Few cosmetic changes Signed-off-by: Dhruva Ray <[email protected]> * Removed an extra line added by rebase... Signed-off-by: Dhruva Ray <[email protected]> * [REFACTOR] Separate ArgTypeCode from DLDataTypeCode (#5730) We use a single enum(TypeCode) to represent ArgTypeCode and DLDataTypeCode. However, as we start to expand more data types, it is clear that argument type code(in the FFI convention) and data type code needs to evolve separately. So that we can add first class for data types without having changing the FFI ABI. This PR makes the distinction clear and refactored the code to separate the two. 
- [PY] Separate ArgTypeCode from DataTypeCode - [WEB] Separate ArgTypeCode from DataTypeCode - [JAVA] Separate ArgTypeCode from DataTypeCode * [ONNX]MaxRoiPool, Mod & Xor op support added (#5729) * ROCm: Add warp shuffles and enable reductions (#5727) Thank you @masahi and @wpan11nv for the feedback * Change 'delete's in Relay VM Instruction dtor to 'delete[]'s (#5735) * Fix reshape usage in ARM Winograd (#5732) * [TEST] Fix flaky topi/tests/python/test_topi_pooling.py:test_adaptive_pool (#5736) * Fix the values for test_fmod since it fails way too often otherwise (#5723) * fix small bug about dense_grad (#5695) * [REFACTOR][ARITH] Remove legacy compute_expr.h (#5738) Replaces most of the ComptuteReduce using foldl. * Add some docs on downstream consistency (#5742) https://github.com/apache/incubator-tvm/pull/5730#issuecomment-639567636 * sequential cpp test (#5745) * [REFACTOR][TE][TIR] Call::Halide => ProducerLoad, DSL/TIR decouple. (#5743) In the HalideIR's design, DSL components and IR are mixed together. For example, Call::Halide can containa reference to a function which is constructed in the tensor expression language. While this coupled design simplifies certain aspect of the DSL construction, it prevents the TIR to evolve as a clean standalone IR: - The additional tensor expression provided in the function is opaque to the IR and may become obsolete as we transform them. - The duplication of the information in the DSL tensor and IR makes it hard to design a stand-alone text format (when there are elements shared in the tensor expression and normal statements). This PR aims to clearly de-couple the TIR from high-level DSL structures(tensor expression), while still provide clear extensions to build DSLs on top of the TIR. We introduce a DataProducer as a base class for high level tensor expressions objects that produce data. 
We then introduce ProducerLoad to replace the Call::Halide usage, so that the Call node can always be self contained and used for low-level calls. The high-level tensor expression DSL can still generate a PrimExpr that contains a ProducerLoad. These PrimExprs contains fragments of information that can be combined together to generate a low-level TIR PrimFunc. We also state clearly that DataProducer **should not** appear in any TIR PrimFunc. Instead, the high-level DSL layer should lowered DataProducers to Buffers and TIR statements that produces these buffers. We can further provide verifications to validate such invariance. Changes: - Introduce DataProducer to serve as a base class for Tensor in tensor expressions. - Migrate use of Call::Halide to ProducerLoad - Migrate the other usages of Calls. We will also create follow-up PRs to migrate the remaining two DSL related IR nodes(Realize/Provide) to use the DataProducer. * Don't add cast for TF batch norm when type isn't changing (#5731) * [ARITH][BACKPORT-0.6] fix a min/max simplify bug (#5749) * fix a min/max simplify bug * fix cpplint * turn into oposite when c1val<0 and add more case * fix c1=0 Co-authored-by: xqdan <[email protected]> * [TOPI][Relay][OP] support dynamic NMS(Non Maximum Suppression), symbolic begin, end, and strides for strided_slice (#4312) * [TOPI][Relay][OP] Dynamic NMS and strided_slice * Incorporate comments * fix nnvm compatibility issues * fix InferCorrectLayout * Minor fix * fix for fuse * Workaround to pass batch_size into hybrid function to handle dynamic shape * Seperate rearrange * fix lint * fix ci, comments * change attr to Optional<T> * clang format * remove empty lines * partial ignore for end of strided_slice * pylint * add out_indices for gpu get_valid_counts * change to slice_mode * clang-format, fix comments * fix comment * change slice_mode to string * fix CI * update docstring Co-authored-by: Yao Wang <[email protected]> * Update dmlc_tvm_commit_id.txt * Update TRT 
Integration to reflect upstream changes * Sync submodules * Fix jenkinsfile * git-clang-format against origin/dev instead of origin/master * Fix formatting. * Remove is_empty in export_lib (used for old trt) * Disable test_forward_qnn_mobilenet_v2_net * Add Scatter to Topi/Relay/ONNX via hybrid script (#5619) * I can construct scatter but not embed it in a Relay Graph * working 1-4 dimension scatter * add scatter to ONNX fix lint * isolate tests to cpu backend * Fix i386 test * fix gpu tolerance * use elemwise_shape_func for scatter * fix incorrect rebase * [Minor][Test] Clean WASM environment before build (#5759) * [Bugfix] Fix reshape (#5739) * Fix reshape * fix doc warning * fix ci * address comments * [REFACTOR][TIR] Provide->ProducerStore, Realize->ProducerRealize. (#5750) This PR finishes up the final step for DSL/TIR de-coupling to refactor Provide/Realize to use the DataProducer. As in the case of ProducerLoad, ProducerStore/Realize are not supposed to appear in a valid TIR function and are only used by high-level DSLs as intermediate structures. 
* [Rust] Second stage of Rust Refactor (#5527) * Add tvm-rt crate * Backport changes from frontend branch * Format * Add ASF headers * Address self-code review * Replace with helper * Fix lint * Fix * Clean up repro debugging * WIP * Remove global resgistry to fix one memory issue * Fix * Format * Format * Update rust/tvm-rt/README.md Co-authored-by: Jason Knight <[email protected]> * Format * Duplicate TVM macros * Split macros * Restore old macro for old crates * Repair macros * Fix format * Format Co-authored-by: Jason Knight <[email protected]> * [topi] block sparse dense on cuda (#5746) * [Relay] Fix for recursive let (#5757) * Make let processing iterative * Try again * Fix pretty printer overflow * cleanup * fix lint * Fix text printer Co-authored-by: Jared Roesch <[email protected]> Co-authored-by: Jared Roesch <[email protected]> * [TOPI][RELAY][PYTORCH]Conv3d_transpose op support added (#5737) * [TOPI][RELAY][PYTORCH]Conv3d_transpose op support added * Test cases in topi/relay * conv3d_transpose_ncdhw_python added * Review comments fixed * Fix gelu in PyTorch frontend, tighten numerical checks (#5763) Previously, the PyTorch frontend approximated gelu with fastgelu. To provide a more faithful conversion, we implement gelu instead. We also tighten the numerical comparisons between PyTorch and TVM-from-PyTorch to 1e-5. The object detection models need an increased tolerance of 1e-4 to pass. I had to throw in a few fixes for missing conversions (probably due to working with very new PyTorch). I must admit the GoogLeNet/NasNet test didn't run on my machine, probably due to problems at my end. * Add ShapePattern and DataTypePattern (#5760) * Make batch matrix multiplication on GPU tunable (#5752) This is primarily aimed at the AMD GPU backend and done as part of a project for AMD, but should work for all users of the GPU schedule. * [TIR][REFACTOR][API-Change] Migrate the tvm/tir/expr.h to construct style. 
(#5773) This PR migrate tvm/tir/expr.h to the new constructor style that is consistent with the rest of the codebase and changes the affected files accordingly. * [TIR][REFACTOR][API-Change] Migrate tir/stmt.h to use constructor. (#5778) This PR migrate tvm/tir/stmt.h to the new constructor style that is consistent with the rest of the codebase and changes the affected files accordingly. * [Frontend][TensorFlow] Improve Control Flow and TensorArray (#5699) * Improve TF parser control flow and tensor array * Fix tf tensor array scatter * Add ssd test * Add back static ta test * Minor fix for frontend and test_forward * SplitRel for dynamic shape * Fix test ssd * Fix loop var naming issue * Minor improve * Fix format * Fix clang format * Fix tensor array in pytorch frontend * Fix stack size issue for ssd test * Address comments * Fix slice size * Fix build * Rebase * [DOC][FIX] Fix some typos in git-clang-format.sh (#5786) * fix #5686: remove a overstrict assert in MakeAllreduce (#5686) (#5785) * [RUNTIME] Add compile_shared option to linux compile utility fn (#5751) * feat: Add compile_shared option to linux compile fn * feat: Add compile_shared option for linux compile util fn * fix: Fix minrpc testcase use executable compilation * fix: Fix binutil case where call create_shared to create executable Co-authored-by: baoxinqi <[email protected]> * [REFACTOR][API-Change] Migrate all Object construction to constructor. (#5784) This PR migrates all the remaining object constructions to the new constructor style that is consistent with the rest of the codebase and changes the affected files accordingly. Other changes: - ThreadScope::make -> ThreadScope::Create - StorageScope::make -> StorageScope::Create * [Topi] pass-by-value -> pass-by-const-reference (#5783) * [topi][relay] Add operation gather to relay. 
(#5716) * [CODEGEN][CONTRIB] CoreML codegen (#5634) * [CODEGEN][CONTRIB] CoreML codegen * import coremltools only when it is necessary * fix pylint errors * don't import contrib.coreml when using runtime lib * skip coreml codegen test in CI * don't register relay.ext.coremlcompiler in __init__.py * move tvm/contrib/coreml.py to tvm/contrib/target/coreml.py * use existing transformers for graph partitioning * skip test only when coremltools is not available * add check for annotation * move _register_coreml_op to python/tvm/relay/op/contrib/coreml.py * skip compile when xcode is unavailable * relay.op.Op -> tvm.ir.Op * set USE_COREML on * refine test * fix calibration pass to support multiple functions (#5768) Co-authored-by: Ubuntu <[email protected]> * [cmake] update vulkan rules (#5777) * Add ignore storage_order attribute to onnx pooling parser. (#5781) * [BYOC][FIX] Infer types in MergeComposite (#5766) If InferType isn't run between partitioning passes, function calls are inserted which don't have a type. This can result in failures for patterns which want to check types. This works around it simply by running InferType after every partitioning. Change-Id: Ie0887f0564a41eb0913bfe42a362e8effe9681b9 * [FRONTEND]Darknet support batch size for yolo (#5688) Fix the issue reported in https://discuss.tvm.ai/t/yolov3-tiny-batch-input-test-failed/6796 * Update dmlc_tvm_commid_id.txt * Skip tflite test_forward_mediapipe_hand_landmark * Increase stack limit for failing tflite tests. Skip TF tests which require TF 1.x * [PYTORCH]aten::norm support added (#5776) * [TENSORFLOW]Conv3d Transpose OP added (#5775) * [TENSORFLOW]Conv3d Transpose OP added * Testcase updated, tf cpu supports only ndhwc * [TF] Support symbolic inputs of Fill (#5762) * [TF] Support symbolic inputs of Fill * Rebase and simplify. 
Value has been converted to constant if it is tf.Constant * [COMMUNITY] @wpan11nv -> Reviewer (#5790) * Edit onnx parser to infer values in post order (#5755) * edit onnx parser to infer values in post order to speed up onnx imports with many calls to infer_value * fix pylint * [TIR][REFACTOR] Cleanup unused classes (#5789) * Fix tf parser (#5794) * support aten::type_as in the pytorch frontend (#5787) * support aten::type_as in the pytorch frontend * use _convert_data_type to convert torch type to tvm type and add more types in the type_as test * [TIR][REFACTIR] Update TIR nodes std::string->String. (#5793) This PR updates the remaining TIR node's member to use String instead of std::string. * [TEST] Temporary disable fp16 type_as test for PyTorch Frontend (#5799) * [ONNX] Skip multiply with 1.0f constant for GEMM import (#5800) * [ONNX] Skip ADD inside Gemm op when vector is zero * [ONNX] Skip multiply with 1.0f constant for GEMM import * [TIR][REFACTOR] Add tir prefix to type keys (#5802) * [QUANTIZE] Add config switch for nn.dense layer type. (#5801) * [topi] fix sparse dense schedule on cuda (#5803) * Allow RPCWrappedFunc to rewrite runtime::String as std::string (#5796) * [topi] fix strategy for sparse dense cuda (#5782) * [CI] Move cpu-only frontend tests to a CPU stage (#5807) * [MXNET]conv3d and conv3d_transpose addedx (#5814) * Pin hand landmark network to version 0.7.4. (#5813) * Versions above 0.7.4 are broken due to changes in the quantization operations in the model, which are current not supported by TVM. Fixes #5774. 
* [CI] Limit number of threads in all jobs (#5815) * Update dmlc_tvm_commit_id.txt * Disable tensorflow.test_forward_sdd because stack limit of 100mb is exceeded by WellFormedChecker Co-authored-by: Samuel <[email protected]> Co-authored-by: ANSHUMAN TRIPATHY <[email protected]> Co-authored-by: wsl-inspur <[email protected]> Co-authored-by: Krzysztof Parzyszek <[email protected]> Co-authored-by: Matthew Brookhart <[email protected]> Co-authored-by: Mahesh Ambule <[email protected]> Co-authored-by: Tianqi Chen <[email protected]> Co-authored-by: Animesh Jain <[email protected]> Co-authored-by: Ubuntu <[email protected]> Co-authored-by: Thierry Moreau <[email protected]> Co-authored-by: tobe <[email protected]> Co-authored-by: Jared Roesch <[email protected]> Co-authored-by: Nick Hynes <[email protected]> Co-authored-by: Tang, Shizhi <[email protected]> Co-authored-by: Bohan Hou <[email protected]> Co-authored-by: Wei Pan <[email protected]> Co-authored-by: Tom Gall <[email protected]> Co-authored-by: MORITA Kazutaka <[email protected]> Co-authored-by: masahi <[email protected]> Co-authored-by: Haichen Shen <[email protected]> Co-authored-by: Ramana Radhakrishnan <[email protected]> Co-authored-by: Menooker <[email protected]> Co-authored-by: Josh Fromm <[email protected]> Co-authored-by: lixiaoquan <[email protected]> Co-authored-by: Li Xiaoquan <[email protected]> Co-authored-by: Candy <[email protected]> Co-authored-by: LiangLiu <[email protected]> Co-authored-by: lhutton1 <[email protected]> Co-authored-by: Giuseppe Rossini <[email protected]> Co-authored-by: Andrew Reusch <[email protected]> Co-authored-by: Liangfu Chen <[email protected]> Co-authored-by: Michal Piszczek <[email protected]> Co-authored-by: Zhi Chen <[email protected]> Co-authored-by: Zhi <[email protected]> Co-authored-by: Dhruva Ray <[email protected]> Co-authored-by: Liyong Zeng <[email protected]> Co-authored-by: Zeng Liyong <[email protected]> Co-authored-by: Yao Wang <[email protected]> 
Co-authored-by: windclarion <[email protected]> Co-authored-by: manupa-arm <[email protected]> Co-authored-by: Wuwei Lin <[email protected]> Co-authored-by: Yi Wang <[email protected]> Co-authored-by: Cody Yu <[email protected]> Co-authored-by: Junru Shao <[email protected]> Co-authored-by: mbaret <[email protected]> Co-authored-by: hlu1 <[email protected]> Co-authored-by: Philip Hyunsu Cho <[email protected]> Co-authored-by: Zhao Wu <[email protected]> Co-authored-by: Mei Ye <[email protected]> Co-authored-by: Neo Chien <[email protected]> Co-authored-by: notoraptor <[email protected]> Co-authored-by: Balint Cristian <[email protected]> Co-authored-by: Rand Xie <[email protected]> Co-authored-by: abergeron <[email protected]> Co-authored-by: Deepak <[email protected]> Co-authored-by: Prashant Sail <[email protected]> Co-authored-by: maheshambule <[email protected]> Co-authored-by: Thomas Viehmann <[email protected]> Co-authored-by: akosik-anyvision <[email protected]> Co-authored-by: handar423 <[email protected]> Co-authored-by: xqdan <[email protected]> Co-authored-by: xqdan <[email protected]> Co-authored-by: Yong Wu <[email protected]> Co-authored-by: Jason Knight <[email protected]> Co-authored-by: Zijing Gu <[email protected]> Co-authored-by: Jared Roesch <[email protected]> Co-authored-by: majiang31312 <[email protected]> Co-authored-by: wrongtest <[email protected]> Co-authored-by: baoxinqi <[email protected]> Co-authored-by: Yi-Hsiang (Sean) Lai <[email protected]> Co-authored-by: Ubuntu <[email protected]> Co-authored-by: Bing Xu <[email protected]> Co-authored-by: Leandro Nunes <[email protected]>
- Added the warp level reduction support
- Upgraded shfl intrinsics to the sync version.
- This is the building block for scheduling softmax like operations.

Signed-off-by: Wei Pan <[email protected]>
- Added warp-level reduction support.
- Added new shfl_sync intrinsics.
- This is the building block for scheduling softmax-like operations.
Signed-off-by: Wei Pan [email protected]
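
For readers unfamiliar with the technique, here is a minimal host-side sketch of a warp-level sum reduction built on a shuffle-down primitive. It is plain Python that only models the data movement of CUDA's `__shfl_down_sync`; it is not the code this patch generates, and the names `shfl_down` and `warp_reduce_sum` are hypothetical, chosen for illustration. The model assumes a full warp of 32 active lanes, and that a lane reading past the end of the warp keeps its own value.

```python
WARP_SIZE = 32  # assumption: full warp of 32 lanes, all active


def shfl_down(values, offset):
    """Model of a shuffle-down: lane i reads the value held by lane i + offset.

    Lanes whose source index falls outside the warp keep their own value.
    """
    n = len(values)
    return [values[i + offset] if i + offset < n else values[i] for i in range(n)]


def warp_reduce_sum(values):
    """Tree reduction over one warp; lane 0 ends up holding the full sum."""
    assert len(values) == WARP_SIZE
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Every lane adds the value shuffled down from `offset` lanes above it.
        shifted = shfl_down(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
    # Only lane 0 is guaranteed to hold the complete sum; the other lanes
    # hold partial sums that a real kernel would simply ignore.
    return vals[0]


if __name__ == "__main__":
    print(warp_reduce_sum(list(range(WARP_SIZE))))
```

The reduction takes log2(32) = 5 shuffle steps and needs no shared memory or block-wide barrier, which is why it is attractive for the per-row reductions in softmax-like operators.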