remove mpi_cxx from multi-gpu build #705

dwyatte · 2023-07-05T16:56:42Z

As noted in this comment linking against mpi_cxx breaks https://github.com/triton-inference-server/fastertransformer_backend since it builds against main

This reverts a single line from #660 to get builds working again while development on fastertransformer and the triton backend is restructured

dwyatte · 2023-07-05T16:57:08Z

@byshiue would you mind having a look in lieu of making any further changes to https://github.com/triton-inference-server/fastertransformer_backend

* Update beam_search_topk_kernels.cu fix: fix bug of beam search * fix: change int of some kernels to int64_t to prevent overflow * fix: gpt tensor shapes inconsistency (NVIDIA#505) Signed-off-by: AkiyamaYummy <[email protected]> * Update gpt_guide.md (NVIDIA#529) * fix: fix bug of gpt buffer and gpt gemm overflow * Update T5DecodingWeight.cc fix: fix loading bug of t5 * [Enhancement]add pytorch backend support for gptneox (NVIDIA#550) * add pytorch backend support for gptneox Signed-off-by: AkiyamaYummy <[email protected]> * fix early stopping invalid * 1) Some unused parameters and logic have been removed. 2) Revisions that would affect pipeline parallelism have been reverted. 3) The code has been made capable of direct validation on TabbyML/NeoX-1.3B. Signed-off-by: AkiyamaYummy <[email protected]> * Change the names of classes, removing 'parallel' from their names Signed-off-by: AkiyamaYummy <[email protected]> * Format the code. Signed-off-by: AkiyamaYummy <[email protected]> * Only print results when rank is 0. Signed-off-by: AkiyamaYummy <[email protected]> * Add dist.init_process_group(). Signed-off-by: AkiyamaYummy <[email protected]> * update docs Signed-off-by: AkiyamaYummy <[email protected]> --------- Signed-off-by: AkiyamaYummy <[email protected]> * Update cublasMMWrapper.cc Fix the CUBLAS_VERSION checking of cublasMMWrapper * Update cublasMMWrapper.cc * fix overflow in softmax_kernel when process long seqlen and big batch_size (NVIDIA#524) * Update unfused_attention_kernels.cu fix bug of softmax kernel * [Enhancement]create huggingface_gptneox_convert.py (NVIDIA#569) * create huggingface_gptneox_convert.py Signed-off-by: AkiyamaYummy <[email protected]> * adjust HF's multi bin files Signed-off-by: AkiyamaYummy <[email protected]> * update gptneox_guide.md Signed-off-by: AkiyamaYummy <[email protected]> --------- Signed-off-by: AkiyamaYummy <[email protected]> * perf(bloom): improve performance of huggingface_bloom_convert.py, decrease the time cost and the mem using (NVIDIA#568) Co-authored-by: r.yang <[email protected]> * Fix/gpt early stop (NVIDIA#584) * fix: fix bug of early stopping of gpt * [bugfix] Fix 2-shot All Reduce correctness issue (indexing bug). (NVIDIA#672) FasterTransformer 2-shot all reduce is implemented as a reduce-scatter + all-gather. There is an indexing bug in the all-gather step. Prior to this change, 2-shot all reduce was only producing correct results on device 0. Now, all devices have the correct results. * fix: swap tensor bug (NVIDIA#683) * Support size_per_head=112 (NVIDIA#660) * fix multi-gpu build * add support for size_per_head=112 for gpt decoder * remove mpi_cxx from multi-gpu build for now (NVIDIA#705) --------- Signed-off-by: AkiyamaYummy <[email protected]> Co-authored-by: byshiue <[email protected]> Co-authored-by: _yummy_ <[email protected]> Co-authored-by: Ying Sheng <[email protected]> Co-authored-by: zhangxin81 <[email protected]> Co-authored-by: 杨睿 <[email protected]> Co-authored-by: r.yang <[email protected]> Co-authored-by: Rahul Kindi <[email protected]> Co-authored-by: Perkz Zheng <[email protected]> Co-authored-by: Daya Khudia <[email protected]> Co-authored-by: Dean Wyatte <[email protected]>

* Merge with main (#1) * Update beam_search_topk_kernels.cu fix: fix bug of beam search * fix: change int of some kernels to int64_t to prevent overflow * fix: gpt tensor shapes inconsistency (NVIDIA#505) Signed-off-by: AkiyamaYummy <[email protected]> * Update gpt_guide.md (NVIDIA#529) * fix: fix bug of gpt buffer and gpt gemm overflow * Update T5DecodingWeight.cc fix: fix loading bug of t5 * [Enhancement]add pytorch backend support for gptneox (NVIDIA#550) * add pytorch backend support for gptneox Signed-off-by: AkiyamaYummy <[email protected]> * fix early stopping invalid * 1) Some unused parameters and logic have been removed. 2) Revisions that would affect pipeline parallelism have been reverted. 3) The code has been made capable of direct validation on TabbyML/NeoX-1.3B. Signed-off-by: AkiyamaYummy <[email protected]> * Change the names of classes, removing 'parallel' from their names Signed-off-by: AkiyamaYummy <[email protected]> * Format the code. Signed-off-by: AkiyamaYummy <[email protected]> * Only print results when rank is 0. Signed-off-by: AkiyamaYummy <[email protected]> * Add dist.init_process_group(). Signed-off-by: AkiyamaYummy <[email protected]> * update docs Signed-off-by: AkiyamaYummy <[email protected]> --------- Signed-off-by: AkiyamaYummy <[email protected]> * Update cublasMMWrapper.cc Fix the CUBLAS_VERSION checking of cublasMMWrapper * Update cublasMMWrapper.cc * fix overflow in softmax_kernel when process long seqlen and big batch_size (NVIDIA#524) * Update unfused_attention_kernels.cu fix bug of softmax kernel * [Enhancement]create huggingface_gptneox_convert.py (NVIDIA#569) * create huggingface_gptneox_convert.py Signed-off-by: AkiyamaYummy <[email protected]> * adjust HF's multi bin files Signed-off-by: AkiyamaYummy <[email protected]> * update gptneox_guide.md Signed-off-by: AkiyamaYummy <[email protected]> --------- Signed-off-by: AkiyamaYummy <[email protected]> * perf(bloom): improve performance of huggingface_bloom_convert.py, decrease the time cost and the mem using (NVIDIA#568) Co-authored-by: r.yang <[email protected]> * Fix/gpt early stop (NVIDIA#584) * fix: fix bug of early stopping of gpt * [bugfix] Fix 2-shot All Reduce correctness issue (indexing bug). (NVIDIA#672) FasterTransformer 2-shot all reduce is implemented as a reduce-scatter + all-gather. There is an indexing bug in the all-gather step. Prior to this change, 2-shot all reduce was only producing correct results on device 0. Now, all devices have the correct results. * fix: swap tensor bug (NVIDIA#683) * Support size_per_head=112 (NVIDIA#660) * fix multi-gpu build * add support for size_per_head=112 for gpt decoder * remove mpi_cxx from multi-gpu build for now (NVIDIA#705) --------- Signed-off-by: AkiyamaYummy <[email protected]> Co-authored-by: byshiue <[email protected]> Co-authored-by: _yummy_ <[email protected]> Co-authored-by: Ying Sheng <[email protected]> Co-authored-by: zhangxin81 <[email protected]> Co-authored-by: 杨睿 <[email protected]> Co-authored-by: r.yang <[email protected]> Co-authored-by: Rahul Kindi <[email protected]> Co-authored-by: Perkz Zheng <[email protected]> Co-authored-by: Daya Khudia <[email protected]> Co-authored-by: Dean Wyatte <[email protected]> * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit * commit --------- Signed-off-by: AkiyamaYummy <[email protected]> Co-authored-by: Asim Shankar <[email protected]> Co-authored-by: byshiue <[email protected]> Co-authored-by: _yummy_ <[email protected]> Co-authored-by: Ying Sheng <[email protected]> Co-authored-by: zhangxin81 <[email protected]> Co-authored-by: 杨睿 <[email protected]> Co-authored-by: r.yang <[email protected]> Co-authored-by: Rahul Kindi <[email protected]> Co-authored-by: Perkz Zheng <[email protected]> Co-authored-by: Daya Khudia <[email protected]> Co-authored-by: Dean Wyatte <[email protected]>

remove mpi_cxx from multi-gpu build for now

af44084

dwyatte mentioned this pull request Jul 5, 2023

Support size_per_head=112 #660

Merged

byshiue merged commit f8e42aa into NVIDIA:main Jul 6, 2023

azahed98 pushed a commit to togethercomputer/FasterTransformer that referenced this pull request Jul 20, 2023

remove mpi_cxx from multi-gpu build for now (NVIDIA#705)

0f2d00b

MinghaoYan mentioned this pull request Sep 5, 2023

Compatibility issue with CUDA 12.2 #730

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

remove mpi_cxx from multi-gpu build #705

remove mpi_cxx from multi-gpu build #705

dwyatte commented Jul 5, 2023

dwyatte commented Jul 5, 2023 •

edited

Loading

remove mpi_cxx from multi-gpu build #705

remove mpi_cxx from multi-gpu build #705

Conversation

dwyatte commented Jul 5, 2023

dwyatte commented Jul 5, 2023 • edited Loading

dwyatte commented Jul 5, 2023 •

edited

Loading