DeepSpeed-Triton for Inference #3748
Merged
Conversation
Co-authored-by: Stephen Youn <[email protected]>
stephen-youn requested review from RezaYazdaniAminabadi, jeffra, mrwyattii, awan-10, cmikeh2, arashb, tjruwase and loadams as code owners on June 14, 2023 at 00:33
* Add residual_add triton op
* add support of gptj style models to triton residual_add kernel
* fix the residual_add tests
* Add support of end to end run for residual_add triton kernels
* Fix the MLP output tensor's shape
* Fix the output tensor of residual_add_func python call
* triton matmul kernels with python wrapper class added with pytests
* clean-up and make it read the autotune table when importing
* fixed import problems with the naming
* enable update_autotune_table for every forward in matmul
* an int4-into-int8 weight packing function added; test parameters with alignment only (i.e. an integer multiple of block_size in the matmul kernel), this will be further investigated
* lint
* quantization added; int8-packed-int4-fp16 matmul-block-deq added; illegal cuda mem access bug in triton matmul kernel fixed (i.e. a mem boundary problem)
* add torch block quantization
* dual quantization matmul added
* cleanup, fix for lint
* documentation lint fix
* README added
* typo
* updated the kernel to have fused bias addition and activation too
* Add residual_add triton op
* modified quantization to take additional bits, more than int8
* enable triton residual_add kernel in DS MLP
* Add flash attention kernel and glue code
* additional scale-norm added for weight
* a temporary example for quantization added
* comments
* use the exact same ds quantizer as reference
* added scale-norm (i.e. scale-of-scale) to both triton/torch versions
* snr check with fused-deq-gemm for block_deq and dual_block_deq
* makes matmul kernels work for a6000 with smaller mem; w8a8/w4a8 with sym block quantization on activation and row (or col)-wise quantization on weight works (snr test added)
* Add layer norm triton kernel
* Add gelu triton kernel
* Add softmax triton kernel
* Rename flash attn api
* add triton gemm kernels
* fix formatting of triton kernels
* Add matmul triton kernels
* Updated Triton Gelu to use non-approx computation
* Updated Triton Gemm for f16 bias-add parity
* Add DS triton encoder layer
* Updated Softmax to work around block size 1
* fix the issue caused by merge conflict
* Add triton layer norm unittests
* dual-qblock snr verified too
* Add triton gelu kernel unittests
* Add triton softmax kernel unittests
* fix flash kernels formatting (#382)
* Add triton dependency to unittests workflow (#381)
* w8a8 and w8a4 matmul with block quantization verified
* Allow Gemm & MatMul to take arbitrary dimensions
* Add triton matmul kernel unittests
* fix triton dependency in github CI workflows
* Fix matmul launching grid
* fix formatting
* Add triton gemm kernel unittests
* modified dual-qblock to support wider scale_bits with int64 acc and vec-ops, which caused perf degradation; the workaround is a "v2" kernel added with internal shift ops but not enabled yet
* fix residual in gemm_3d kernel
* Add flash attention triton kernel unit tests
* test_matmul and test_gemm pass (but with smaller coverage, as mentioned in the code); float32 can be supported later
* added 'triton_gemm_eval.py', a temporary script to evaluate accuracy of the triton matmul against the torch matmul
* typo
* typo
* root-caused the parity error with fused_gelu: it is not with gelu but with residual-addition; disabled residual-addition and it still needs debugging
* location of residual addition in reference modified to be after the activation
* fixed index typo in the snr plot
* Fix triton attention kernel unit tests
* fix formatting
* added batch support in matmul; row/col-wise quantization matmul debugged
* fixed bugs in the unit tests after the batch support change and so on; test_int8_int8_fp_matmul_dual_block_deq still fails and needs further debugging though
* weight-only quantization example and test added to check_snr
* matmul_ext basic check added as unit test under tests/unit
* move triton ops under inference/triton
* restore triton_ops.py
* import path correction
* restore ds_mlp and ds_attention
* shaping bug with batching in matmul_ext fixed; changed the gelu computation to use libdevice.erf instead of the sigmoid approximation (otherwise the roberta unit test fails; see the sketch after this list)
* triton ops added, with an option in config to use them through op_binding
* Triton transformer added: InferenceTransformerFactory, TritonTransformer, TritonSelfAttention, TritonMLP and so forth
* Triton wrapper classes added
* added simple triton eval scripts
* rename the new benchmark script for triton-bert
* added triton attention, triton layer-norm/softmax
* adds tests to measure attention perf in triton and others
* changed triton flash attn function name
* attention set to use triton non-flash by default
* enable triton for bert
* made update_autotable false by default because it degrades the perf
* temp commit with debugging/profiling code
* temporary debugging/profiling code lines added, to be cleaned up later
* clean-up
* unit tests for triton inference ops are now passing
* removed unnecessary triton kernels
* test_inference passes
* removed debugging/profiling code
* triton==2.0.0.dev20221202
* clean-up for formatting check pass; added layer_norm test without residual-add
* set triton version requirement
* further clean-up
* removed redundant files
* readme for triton matmul
* clean-up and add more tests for triton-matmul
* typo
* removed more obsolete triton kernels and tests
* removed unnecessary TransformerInferenceFactory class
* removed obsolete test
* formatting check, cleanup
* formatting fix: added copyright to the head
* formatting: missing license added
* add pytest skip condition to test_matmul_ext
* formatting fix
* formatting
* added --forked option to inference_ops unit pytests
* Revert "added --forked option to inference_ops unit pytests" (this reverts commit 743b86d354b041172b06e4a8505f43ddd4c2544a)
* changed the pytest mark for softmax to be inference_ops
* formatting fix
* cleanup comments
* add missing import
* keep only fp16 matmuls because int8 is out of this PR's scope; int8-based gemm kernels will be added later
* removed the previous matmul_ext test
* triton quantization kernel removed too
* clean up comments
* added comments for license
* triton matmul always reads the autotune table when imported and writes the final table when closing
* modified triton kernels to have a new transposed_model arg
* added license note to files
* set default mlp kernel to be cuda, as it's better than the triton kernel with bert
* adds changes missed from the prev commit
* added license notes; increased DEEPSPEED_TEST_TIMEOUT from 600 to 900 for triton compilation
* added unit test for triton attention
* moved tests in layer_norm.py to test_layer_norm.py
* removed commented code lines
* removed triton from the main requirements, as commented in the PR
* follow PascalCase convention in class naming, as suggested in PR review
* changes to make deepspeed work without triton; specifically, resolves the error with importing any triton ops; added code that checks the availability of triton and skips the tests if it's absent
* added a feature to run triton autotune at initialization, i.e., at op-building phase
* fix for the lint/formatting; added " # noqa: F401"
* move triton-bert-benchmark.py to microsoft/DeepSpeedExamples
* modify the code as suggested in the PR
* make DEEPSPEED_TEST_TIMEOUT in unit tests back to 600s
* made an option to skip triton-autotune in config
* lint fix for formatting
* removed repeated has_triton when importing triton; also the change for the PR comment
* removed duplicated triton_autotune arg passing
* upgrade to triton 2.0; pydantic.validator for use_triton
* move triton-specific op mapping into model_implementation, as commented in the PR
* removed commented lines
* cite where the file came from, as commented in the PR review
* change for the recent merge with master
* qkv-gemm change to make distilbert work after the merge with master
* format fix
* fix triton attention qkv passing for non-pre-norm; requirements all use triton 2.0.0
* skip autotune in test_matmul and test_attention with triton
* formatting with pre-commit
* add config for v100 test in matmul_4d kernel (smaller shared mem requirement)
* inject triton kernels only in bert and report it through log_dist; set triton to the latest from requirements
* reduced the config and added mem check for matmul_4d
* added README.md tutorial page for triton-deepspeed
* typo in README
* refine README
* refine readme
* refine readme
* refine readme
* refine readme
* "Fix apex install bugs #3741"
---------
Co-authored-by: Arash Bakhtiari <[email protected]>
Co-authored-by: Stephen Youn <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Ethan Doe <[email protected]>
Co-authored-by: yidoe <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
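One item above is worth unpacking: the GELU computation was switched from the sigmoid-style approximation to the exact erf formulation (via libdevice) because the approximation broke the roberta parity test. The sketch below illustrates what such an erf-based GELU looks like in Triton. It is a minimal illustration, not the kernel from this PR: the names `_gelu_kernel` and `gelu` are hypothetical, and note that erf is exposed as `tl.libdevice.erf` in the pinned triton 2.0.0 but as `tl.math.erf` (used here) in later releases.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _gelu_kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask).to(tl.float32)
    # Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2))) -- per the commit notes,
    # the sigmoid-based approximation failed the roberta parity test.
    y = 0.5 * x * (1.0 + tl.math.erf(x * 0.7071067811865476))
    tl.store(y_ptr + offsets, y.to(y_ptr.dtype.element_ty), mask=mask)


def gelu(x: torch.Tensor) -> torch.Tensor:
    """Elementwise exact GELU for a contiguous CUDA tensor (fp16/fp32)."""
    y = torch.empty_like(x)
    n_elements = x.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    _gelu_kernel[grid](x, y, n_elements, BLOCK_SIZE=1024)
    return y
```

Passing the tensors directly works because Triton treats torch tensors as pointers to their first element; non-contiguous inputs would need `.contiguous()` first.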
stephen-youn changed the title from "[squash] styoun/triton fp16 transformer (#530)" to "DeepSpeed-Triton for Inference" on Jun 22, 2023
jeffra reviewed on Jun 22, 2023
jeffra reviewed on Jun 22, 2023
jeffra approved these changes on Jun 22, 2023
cmikeh2 approved these changes on Jun 22, 2023
awan-10 reviewed on Jun 22, 2023
…/DeepSpeed into staging-triton-bert-v1
awan-10 approved these changes on Jun 22, 2023
jeffra added a commit that referenced this pull request on Jun 23, 2023
Co-authored-by: Stephen Youn <[email protected]>
Co-authored-by: Arash Bakhtiari <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Ethan Doe <[email protected]>
Co-authored-by: yidoe <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
jeffra added a commit that referenced this pull request on Jun 23, 2023
Co-authored-by: Stephen Youn <[email protected]>
Co-authored-by: Arash Bakhtiari <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Ethan Doe <[email protected]>
Co-authored-by: yidoe <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
tjruwase added a commit that referenced this pull request on Jul 3, 2023
* zero++ tutorial PR (#3783)
* [Fix] _conv_flops_compute when padding is a str and stride=1 (#3169): fix conv_flops_compute when padding is a str and stride=1; fix error; change type of paddings to tuple; fix padding calculation; apply formatting check (Co-authored-by: Cheng Li <[email protected]> and Olatunji Ruwase <[email protected]>)
* fix interpolate flops compute (#3782)
* use `Flops Profiler` to test `model.generate()` (#2515): Update profiler.py; pre-commit run --all-files; Delete .DS_Store (Co-authored-by: Jeff Rasley <[email protected]> and Cheng Li <[email protected]>)
* revert PR #3166, it disabled grad clip for bf16
* ensure no loss scaling for non-fp16 dtypes
* revert PR #3611 (#3786)
* bump to 0.9.6
* ZeRO++ chinese blog (#3793): zeropp chinese blog; try better quality images; make title larger; even larger...; various fixes; center captions; more fixes; fix format
* remove staging trigger (#3792)
* DeepSpeed-Triton for Inference (#3748) (Co-authored-by: Stephen Youn <[email protected]>, Arash Bakhtiari <[email protected]>, Cheng Li <[email protected]>, Ethan Doe <[email protected]>, yidoe <[email protected]>, Jeff Rasley <[email protected]>)
* ZeRO++ (#3784) (Co-authored-by: HeyangQin <[email protected]>, GuanhuaWang <[email protected]>, cmikeh2 <[email protected]>, Ammar Ahmad Awan <[email protected]>, Jeff Rasley <[email protected]>, Michael Wyatt <[email protected]>, Olatunji Ruwase <[email protected]>, Reza Yazdani <[email protected]>)
* adding zero++ to navigation panel of deepspeed.ai (#3796)
* Add ZeRO++ Japanese blog (#3797): add ZeRO++ Japanese blog; add links (Co-authored-by: HeyangQin <[email protected]> and Conglong Li <[email protected]>)
* Bug Fixes for autotuner and flops profiler (#1880): fix autotuner when backward is not called; fix format (Co-authored-by: Olatunji Ruwase <[email protected]>)
* Missing strided copy for gated MLP (#3788) (Co-authored-by: Ammar Ahmad Awan <[email protected]>, Jeff Rasley <[email protected]>, Logan Adams <[email protected]>)
* Requires grad checking. (#3789) (Co-authored-by: Jeff Rasley <[email protected]>)
* bump to 0.10.0
* Fix Bug in transform.cu (#3534): Bug fix; Fixed formatting error (Co-authored-by: Logan Adams <[email protected]>)
* bug fix: triton importing error (#3799) (Co-authored-by: Stephen Youn <[email protected]> and Jeff Rasley <[email protected]>)
---------
Co-authored-by: Heyang Qin <[email protected]>
Co-authored-by: Bill Luo <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Guorun <[email protected]>
Co-authored-by: stephen youn <[email protected]>
Co-authored-by: Stephen Youn <[email protected]>
Co-authored-by: Arash Bakhtiari <[email protected]>
Co-authored-by: Ethan Doe <[email protected]>
Co-authored-by: yidoe <[email protected]>
Co-authored-by: GuanhuaWang <[email protected]>
Co-authored-by: cmikeh2 <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Conglong Li <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Joe Mayer <[email protected]>
Co-authored-by: Ramya Ramineni <[email protected]>
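The last item in this log, "bug fix: triton importing error (#3799)", follows up on the earlier commits that make DeepSpeed usable when Triton is not installed. Below is a minimal sketch of that kind of availability guard; the helper name is illustrative, not necessarily what DeepSpeed uses internally.

```python
# Hypothetical availability guard, mirroring the "check the availability of
# triton and skip the tests if it's absent" commits in this PR.
import importlib.util


def has_triton() -> bool:
    """Return True if the triton package can be imported."""
    return importlib.util.find_spec("triton") is not None


if has_triton():
    import triton  # noqa: F401
else:
    triton = None  # callers must check before touching any triton ops
```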
@stephen-youn Do you have plans to support Triton for text generation models? Thanks.
zhangir-azerbayev pushed a commit to EleutherAI/DeeperSpeed that referenced this pull request on Aug 4, 2023
…osoft#3790)
github-merge-queue bot pushed a commit that referenced this pull request on Aug 21, 2023
* init commit for mixed precision lora
* fix format
* patch _allgather_params & minor fixes
* make sure initial quantization is finished
* make sure dequantization is finished
* skip quantization for small parameters
* fix format
* remove unused async_op
* lazy load of quantizer kernels
* add mixed precision lora tutorial
* cleanup mics
* cleanup mics
* replace get_accelerator().current_device()
* add kwargs to mics
* fix format
* separate code and tutorial
* fix _all_gather in zero3
---------
Co-authored-by: Bill Luo <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Guorun <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: stephen youn <[email protected]>
Co-authored-by: Stephen Youn <[email protected]>
Co-authored-by: Arash Bakhtiari <[email protected]>
Co-authored-by: Ethan Doe <[email protected]>
Co-authored-by: yidoe <[email protected]>
Co-authored-by: GuanhuaWang <[email protected]>
Co-authored-by: cmikeh2 <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Conglong Li <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Joe Mayer <[email protected]>
Co-authored-by: Ramya Ramineni <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request on Sep 11, 2023
* INT4 weight only quantization (#479): INT4 weight only quantization; pre commit; fix UT (repeated); add zero3 test; quantize small weight first to prevent oom; fold quantization config into ds_config; Fix license & refactor ds_config & rebase master
* Moving quantization into post_init_method and add int4 dequantization kernel (#522): Add experimental int4 dequantize kernel; move quantization into post_init_method; fix
* Refactor: move int4 code to deepspeed/inference (#528): Move int4 code to deepspeed/inference; fixes
* Fix dequant bug
* Address PR feedback
* Use super() __exit__
* Fix unit tests
---------
Co-authored-by: Donglin Zhuang <[email protected]>
Co-authored-by: Heyang Qin <[email protected]>
Co-authored-by: Bill Luo <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Guorun <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: stephen youn <[email protected]>
Co-authored-by: Stephen Youn <[email protected]>
Co-authored-by: Arash Bakhtiari <[email protected]>
Co-authored-by: Ethan Doe <[email protected]>
Co-authored-by: yidoe <[email protected]>
Co-authored-by: GuanhuaWang <[email protected]>
Co-authored-by: cmikeh2 <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Conglong Li <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Joe Mayer <[email protected]>
Co-authored-by: Ramya Ramineni <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request on Sep 18, 2023
* DeepSpeed4Science (#569): Integrating evoformer attention; add cutlass version check; update error message; add benchmark; update evoformer_attn.py and run_evoformer_test.py; support more GPU archs; add copyright; add tests; fix bugs; update benchmark; fix nvcc macro; clean code; fix formatting; fix yaml import and requirement; skip unit test when not compatible; skip evoformer attn in pre-compile-ops; fix cutlass check; update and refactor tutorial; revise
* Updated the Megatron-DS section (#565): minor fixes
* separate evoformer tutorial
* Revised the ds4science landing page (#566): Updated the Megatron-DS section; Revised the landing page; Removing unused file; fix links image position; modify main page; fix doc (Co-authored-by: Shiyang Chen <[email protected]> and Minjia Zhang <[email protected]>)
---------
Co-authored-by: Heyang Qin <[email protected]>
Co-authored-by: Bill Luo <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Guorun <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: stephen youn <[email protected]>
Co-authored-by: Stephen Youn <[email protected]>
Co-authored-by: Arash Bakhtiari <[email protected]>
Co-authored-by: Ethan Doe <[email protected]>
Co-authored-by: yidoe <[email protected]>
Co-authored-by: GuanhuaWang <[email protected]>
Co-authored-by: cmikeh2 <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Joe Mayer <[email protected]>
Co-authored-by: Ramya Ramineni <[email protected]>
Co-authored-by: Shiyang Chen <[email protected]>
Co-authored-by: Minjia Zhang <[email protected]>
This PR introduces Triton to DeepSpeed (see the README).
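The commit log above mentions a use_triton flag validated with pydantic.validator and a config option to skip Triton autotuning at initialization. Below is a minimal usage sketch under those assumptions; the flag names are taken from the commit notes, and the commits inject the Triton path only for BERT-style models, so the exact interface and supported architectures should be confirmed against the README.

```python
# Hypothetical usage sketch: enabling the Triton inference kernels for a
# BERT-style model. `use_triton` and `triton_autotune` are assumed from the
# commit notes above; consult the README for the exact interface.
import torch
import deepspeed
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased").half().cuda()

engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,              # the PR keeps the Triton path fp16-only
    replace_with_kernel_inject=True,  # inject DeepSpeed transformer kernels
    use_triton=True,                  # route kernels through Triton
    triton_autotune=True,             # run matmul autotuning at build time
)

inputs = tokenizer("DeepSpeed is [MASK].", return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = engine(**inputs).logits
```

Per the commit notes, the injection is reported through log_dist, and models other than BERT fall back to the CUDA kernels.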