cuda : improve multi-GPU performance using cuBLAS #3814
Conversation
I don't understand the purpose of this PR. I would very much expect either FP16 or FP32 to be optimal, regardless of whether or not the tensor is split. As I said before, I would expect the best implementation to be converting the hidden state to FP16 on the main GPU and then distributing that.
@slaren when you implemented the use of FP16 cuBLAS GEMM, did you consider the case of >1 GPU? Because to me it looks like FP32 is still being used for multiple GPUs on master.
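For readers following along, here is a minimal sketch of the approach described above, assuming hypothetical buffer names (`src1_f32`, `src1_f16`, `src1_f16_dev`) rather than the actual ggml-cuda.cu variables: convert the hidden state to FP16 once on the main device, then broadcast the half-precision buffer to the other GPUs.

```cpp
// Minimal sketch (not code from this PR): convert the FP32 hidden state to
// FP16 once on the main GPU, then distribute the half-precision copy to the
// other devices with peer-to-peer copies. Buffer names are hypothetical.
#include <cuda_fp16.h>
#include <cuda_runtime.h>

__global__ void f32_to_f16(const float * x, half * y, int64_t n) {
    const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = __float2half(x[i]);
    }
}

static void distribute_hidden_state_f16(
        const float * src1_f32,   // hidden state on the main device (FP32)
        half * src1_f16,          // FP16 staging buffer on the main device
        half ** src1_f16_dev,     // per-device FP16 destination buffers
        int64_t n, int main_device, int device_count, cudaStream_t stream) {
    cudaSetDevice(main_device);
    const int block = 256;
    const int grid  = (int) ((n + block - 1) / block);
    f32_to_f16<<<grid, block, 0, stream>>>(src1_f32, src1_f16, n);

    for (int id = 0; id < device_count; ++id) {
        if (id == main_device) continue;
        // the FP16 copy halves the amount of data moved between GPUs
        cudaMemcpyPeerAsync(src1_f16_dev[id], id, src1_f16, main_device,
                            n * sizeof(half), stream);
    }
}
```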
```cpp
#ifdef GGML_CUDA_FORCE_MMQ
#define MUL_MAT_SRC1_COL_STRIDE 128
#else
// with tensor cores, we copy the entire hidden state to the devices in one go
#define MUL_MAT_SRC1_COL_STRIDE 4096
#endif
```
This needs a check for compute capability/AMD.
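For illustration, a hedged sketch of what such a check could look like as a runtime helper instead of the compile-time `#define`; the `GGML_USE_HIPBLAS` guard and the Volta threshold encoding (major*100 + minor*10) are assumptions here, not code from this PR.

```cpp
// Hypothetical sketch of runtime stride selection; the HIP macro name and the
// Volta threshold encoding are assumptions, not code from this PR.
#include <cuda_runtime.h>

static int mul_mat_src1_col_stride(int device) {
#if defined(GGML_USE_HIPBLAS)
    (void) device;
    return 128;  // assume no tensor-core fast path on AMD
#else
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    const int cc = prop.major * 100 + prop.minor * 10;
    // copy the whole hidden state in one go only when tensor cores are available
    return cc >= 700 ? 4096 : 128;
#endif
}
```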
The reason to do it like this is because on the main device, the F16 -> F32 conversion of …
Yes, but I have limited access to multi-GPU systems and can only try simple things like the change in this PR. Implementing F16 hidden state distribution would require more effort and resources. The long-term goal is to keep all results in F16 anyway, so that problem will likely be resolved as we start to support that.
I explicitly disabled the FP16 mat mul when using multiple GPUs. This is what this PR should fix.
This PR makes prompt processing much faster when using both my Tesla P40 and GTX 970.
I ran some benchmarks with batched-bench on a 7b q4_0 model with mmq disabled (since my GTX 970 is too old).

master @ 2f9ec7e

PR
How can this work on P40 at all since it has no tensor cores?
The MUL_MAT_SRC1_COL_STRIDE change applies to all GPUs. Johannes says that it needs a compute capability check, but I'm not sure if he means for performance reasons or for correctness. The perplexity seems fine.
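As a rough illustration of why the stride is a performance knob rather than a correctness one (this is not the actual ggml-cuda.cu loop; the buffer names are made up): src1 is broadcast to the non-main devices in blocks of MUL_MAT_SRC1_COL_STRIDE columns, so a larger stride simply means fewer, larger copies.

```cpp
// Rough illustration (not the real ggml-cuda.cu code) of how the column
// stride chunks the broadcast of src1: each block of `col_stride` columns is
// copied to every non-main device before the per-device GEMMs run.
#include <algorithm>
#include <cstdint>
#include <cuda_runtime.h>

static void broadcast_src1_blocks(
        const float * src1_main,  // src1 on the main device, ne10 floats per column
        float ** src1_dev,        // per-device destination buffers
        int64_t ne10, int64_t ne11, int64_t col_stride,
        int main_device, int device_count, cudaStream_t * streams) {
    for (int64_t col = 0; col < ne11; col += col_stride) {
        const int64_t ncols  = std::min(col_stride, ne11 - col);
        const size_t  nbytes = (size_t) ncols * ne10 * sizeof(float);
        for (int id = 0; id < device_count; ++id) {
            if (id == main_device) continue;
            cudaMemcpyPeerAsync(src1_dev[id] + col * ne10, id,
                                src1_main    + col * ne10, main_device,
                                nbytes, streams[id]);
        }
    }
}
```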
@cebtenzzre If you apply this patch to the current PR:

```diff
diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 9c7cf357..8e27a76d 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -6366,7 +6366,7 @@ inline void ggml_cuda_op_mul_mat_cublas(
     // ldc == nrows of the matrix that cuBLAS writes into
     const int ldc = dst->backend == GGML_BACKEND_GPU && id == g_main_device ? ne0 : row_diff;
 
-    const bool is_split = row_diff != src0->ne[1];
+    const bool is_split = false;
 
     if (compute_capability >= CC_VOLTA && (src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type)) && ggml_is_contiguous(src0)) {
         // convert src0 and src1 to fp16, multiply as fp16, convert dst to fp32
```
and run on single GPU (P40 or GTX970) without …
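For context on the FP16-vs-FP32 discussion, a simplified, self-contained sketch of the two cuBLAS paths involved (not the actual ggml_cuda_op_mul_mat_cublas code; the layout and argument names are simplified): the tensor-core path runs the GEMM in half precision with cublasGemmEx, while the fallback multiplies in single precision with cublasSgemm.

```cpp
// Simplified sketch (not the actual ggml-cuda.cu code) of the two GEMM paths
// discussed in this thread. Matrices are assumed column-major, C = A^T * B,
// with A being k x m, B being k x n and C being m x n.
#include <cublas_v2.h>
#include <cuda_fp16.h>

static cublasStatus_t mul_mat_f16_or_f32(
        cublasHandle_t handle, int m, int n, int k,
        const half * A_f16, const half * B_f16, half * C_f16,    // FP16 buffers
        const float * A_f32, const float * B_f32, float * C_f32, // FP32 buffers
        bool use_f16) {
    if (use_f16) {
        const half alpha = 1.0f;
        const half beta  = 0.0f;
        // half-precision GEMM on tensor cores; the FP16 result would be
        // converted back to FP32 afterwards by the caller
        return cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                            &alpha, A_f16, CUDA_R_16F, k,
                                    B_f16, CUDA_R_16F, k,
                            &beta,  C_f16, CUDA_R_16F, m,
                            CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    }
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    // single-precision fallback
    return cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                       &alpha, A_f32, k,
                               B_f32, k,
                       &beta,  C_f32, m);
}
```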
Tested on A10 * 2. It seems that the speed of both S_PP and S_TG has been improved.

Master, 2 GPUs (6e08281):

```
CUDA_VISIBLE_DEVICES=0,1 ./batched-bench Mistral-7B-Instruct-v0.1/ggml-model-f16.gguf 14592 0 99 1 100 128 1,2,3,4,5,6,7,8,16,32,64
```

PR, 2 GPUs:

```
CUDA_VISIBLE_DEVICES=0,1 ./batched-bench Mistral-7B-Instruct-v0.1/ggml-model-f16.gguf 14592 0 99 1 100 128 1,2,3,4,5,6,7,8,16,32,64
```

But it is still slower than on 1 GPU.

Master, 1 GPU (6e08281):

```
CUDA_VISIBLE_DEVICES=0 ./batched-bench Mistral-7B-Instruct-v0.1/ggml-model-f16.gguf 14592 0 99 1 100 128 1,2,3,4,5,6,7,8,16,32,64
```
@cebtenzzre program correctness should not depend on …
I did dual P40s with this PR + patch and had a regression in token speed (down to 5.5 t/s). No real improvement for prompt processing or anything. That is without forcing MMV on; with it, it behaves like master. Dual 3090s had a speed regression from 18.93 t/s down to 17.0 t/s and crashed on long context, as I never took the patch off.
Tested on dual T4:

```
CUDA_VISIBLE_DEVICES=0,1 ./batched-bench ./CodeLlama-7b-hf/ggml-model-f16.gguf 14592 0 99 1 100 128 1,2,3,4,5,6,7,8,16,32,64
```

master:

```
main: n_kv_max = 14592, is_pp_shared = 0, n_gpu_layers = 99, mmq = 1
llama_print_timings: load time = 9256.74 ms
```

PR:

```
main: n_kv_max = 14592, is_pp_shared = 0, n_gpu_layers = 99, mmq = 1
llama_print_timings: load time = 8965.64 ms
```
FWIW, using the GTX 970 disables mmq due to lack of dp4a, and I cannot fit a 7b q4_0 on it since it has only 4 GB of VRAM. But as of 2f9ec7e, there is no clear way to disable mmq in order to make a fair comparison.
This part makes multi-GPU faster for me with no clear drawbacks.
ref #3479
Not sure if this has any positive effect. Looking for feedback.

```
LLAMA_CUBLAS=1 make -j && CUDA_VISIBLE_DEVICES=0,1 ./batched-bench models/codellama-7b/ggml-model-f16.gguf 14592 0 99 1 100 128 1,2,3,4,5,6,7,8,16,32,64
```