[SYCL] iq2_s #6052
Conversation
Tested on A100 GPU and Intel PVC.
Yes, this is known for other quants as well; I will fix it.
What's the performance change with this PR?
Currently breaks both supported target vendors.
@@ -15435,7 +15630,7 @@ static void ggml_sycl_mul_mat(const ggml_tensor * src0, const ggml_tensor * src1
 #ifdef GGML_SYCL_FORCE_DMMV
     const bool use_mul_mat_vec_q = false;
 #else
-    const bool use_mul_mat_vec_q = min_compute_capability >= VER_4VEC && ggml_is_quantized(src0->type) && ggml_nrows(src1) == 1;
+    const bool use_mul_mat_vec_q = min_compute_capability >= VER_4VEC && ggml_is_quantized(src0->type);
Suggested change:
-    const bool use_mul_mat_vec_q = min_compute_capability >= VER_4VEC && ggml_is_quantized(src0->type);
+    const bool use_mul_mat_vec_q = min_compute_capability >= VER_4VEC && ggml_is_quantized(src0->type) && ggml_nrows(src1) == 1;
This is a breaking change for both Intel and Nvidia targets: ggml_sycl_op_mul_mat_vec_q asserts GGML_ASSERT(ggml_nrows(src1) == 1);
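For context, a minimal self-contained sketch of why dropping the nrows condition trips that assert downstream; the stand-in types and values below are illustrative only, not the upstream implementation:

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the ggml types referenced above (illustration only).
struct ggml_tensor { int64_t ne[2]; int type; };
static int64_t ggml_nrows(const ggml_tensor * t) { return t->ne[1]; }
static bool ggml_is_quantized(int /*type*/) { return true; }
constexpr int VER_4VEC = 610;

// The vector kernel wrapper assumes a single src1 row, as in ggml-sycl.cpp.
static void ggml_sycl_op_mul_mat_vec_q(const ggml_tensor * src1) {
    assert(ggml_nrows(src1) == 1 && "mmvq expects a single-row src1");
}

int main() {
    const int min_compute_capability = 800;
    ggml_tensor src0 = {{256, 16}, 0};
    ggml_tensor src1 = {{256, 16}, 1};  // 16 rows: a batched input

    // Dispatch without the ggml_nrows(src1) == 1 condition, as in this PR:
    const bool use_mul_mat_vec_q =
        min_compute_capability >= VER_4VEC && ggml_is_quantized(src0.type);

    if (use_mul_mat_vec_q) {
        ggml_sycl_op_mul_mat_vec_q(&src1);  // aborts: nrows is 16, not 1
    }
    return 0;
}
```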
Yes, actually the assertion should not be mandatory, and requiring ggml_nrows(src1) == 1 won't allow the IQ stages to run.
I've tested with the assert suppressed. It fails on both Intel and Nvidia targets; in both cases the function lacks a case for your new quantization type.
Intel (Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.28454]):
There is a failure in ggml_sycl_op_dequantize_mul_mat_vec of MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): GGML_ASSERT: /home/aidanbelton/source/temp/llama.cpp/ggml-sycl.cpp:14217: false
Nvidia (NVIDIA A100-PCIE-40GB 8.0 [CUDA 12.2]):
There is a failure in ggml_sycl_op_mul_mat_q of MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1]): GGML_ASSERT: /home/aidanbelton/source/temp/llama.cpp/ggml-sycl.cpp:14019: false. The issue is a missing case for the quantization type.
Actually, in both cases it should go to neither of those methods. It is strange that it falls back to either the dequantize_mul_mat_vec or the op_mul_mat_q path. For this type only the vectorised mmvq path should be called.
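To make the intended routing concrete, a hedged sketch of the dispatch described above; the helper is_iq_quant, the placeholder enum, and the fallback labels are assumptions for illustration, not the code in ggml-sycl.cpp:

```cpp
#include <cstdio>

// Illustrative subset of quantization types (placeholder enum, not ggml.h).
enum ggml_type_sketch { Q4_0, IQ2_S, F32 };

// Hypothetical helper: the IQ family has no dmmv/mmq kernels in this
// backend, so those paths must never be taken for it.
static bool is_iq_quant(ggml_type_sketch t) {
    return t == IQ2_S;  // would also cover iq2_xxs, iq2_xs, iq3_*, iq1_s, ...
}

static const char * pick_mul_mat_path(ggml_type_sketch src0_type, int src1_rows) {
    if (is_iq_quant(src0_type)) {
        // Single-row inputs take mmvq; batched inputs must not reach dmmv/mmq.
        return src1_rows == 1 ? "mmvq" : "dequantize + generic mat-mul";
    }
    return src1_rows == 1 ? "dmmv or mmvq" : "mmq";
}

int main() {
    std::printf("%s\n", pick_mul_mat_path(IQ2_S, 1));   // mmvq
    std::printf("%s\n", pick_mul_mat_path(IQ2_S, 16));  // no dmmv/mmq fallback
    std::printf("%s\n", pick_mul_mat_path(Q4_0, 16));   // mmq
    return 0;
}
```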
@abhilash1910 I tested a basic model, llama2-7b-Q4, with your code and it fails. Your solution changes more legacy code than needed instead of only affecting the IQ2_S data type. Also, in ggml_sycl_mul_mat() the fp32 and fp16 branches are removed; I'm afraid this will impact legacy (Q4) performance. I suggest you update the code with the minimum change needed to affect the IQ2_S type only.
In this case, the logic flow is kept unchanged w.r.t. the fp32/fp16 dtypes. I don't think it will impact legacy functionality; it rather sends the src types to the correct branch (dmmv/mmq or mmvq) based on the branching rules implemented in the CUDA source. This would help enable other quants through the mmvq pathway as well.
This PR is meant to support IQ2_S, so I suggest keeping the scope as minimal as possible. If you change the logic in ggml_sycl_mul_mat(), please run CI and llama-bench to make sure performance on Intel GPUs is not impacted; that's more work. SYCL support for IQ2_S is one task, refactoring ggml_sycl_mul_mat() is another.
I agree, the conditional branching is complex; I have reverted to the older legacy code. Testing is in progress. If refactoring is needed later, I will do it in a separate PR. Yes, conversion to fp16 won't help much; only the changes for IQ2 and adjacent quants apply here.
Some mul_mat quantization tests are failing on both Intel and Nvidia hardware.
Intel failing quantization types (Data Center Max 1100): iq2_xxs, iq2_xs, iq2_s, iq3_xxs, iq1_s, iq3_s
Nvidia failing quantization types (A100): iq1_s
Failing Tests
Failing on PVC:
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq2_xs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq2_xs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq2_xs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq2_xs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq2_xs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
Failing on NVidia:
MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 14.391691029 > 0.000500000 FAIL
MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = 17.976997390 > 0.000500000 FAIL
MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = 9.404314750 > 0.000500000 FAIL
MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = 8.215036844 > 0.000500000 FAIL
MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = 10.141738006 > 0.000500000 FAIL
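For reference, the NMSE these tests report is a normalised mean squared error of the backend result against a reference backend; a minimal sketch of that metric follows (the exact normalisation used by test-backend-ops is assumed here, not quoted):

```cpp
#include <cstdio>
#include <vector>

// Normalised mean squared error of a backend result against a reference.
// A NaN/inf in the result propagates to NMSE = inf, as in the PVC logs.
static double nmse(const std::vector<double> & result,
                   const std::vector<double> & reference) {
    double err = 0.0, ref = 0.0;
    for (size_t i = 0; i < result.size(); ++i) {
        const double d = result[i] - reference[i];
        err += d * d;
        ref += reference[i] * reference[i];
    }
    return err / ref;
}

int main() {
    const std::vector<double> reference = {1.0, 2.0, 3.0};
    const std::vector<double> result    = {1.0, 2.0, 2.9};
    // Small deviation: NMSE well below the 5e-4 threshold in the logs above.
    std::printf("NMSE = %.9f\n", nmse(result, reference));
    return 0;
}
```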
@@ -4436,6 +4436,24 @@ static void dequantize_block_q6_K(const void * __restrict__ vx, dst_t * __restri
 #endif
 }

+inline bool ggml_sycl_supports_mmq(enum ggml_type type) {
This is an unused function; I'm not sure it should be added.
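If it were kept, a plausible shape for this predicate would be a switch over the quant types that have mmq kernels; the case list below is illustrative only (it assumes the repo's ggml.h enums), not the PR's actual selection:

```cpp
#include "ggml.h"  // for enum ggml_type (assumes the repo's headers)

// Illustrative only: which quantized types an mmq kernel might accept.
inline bool ggml_sycl_supports_mmq(enum ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_Q5_0:
        case GGML_TYPE_Q5_1:
        case GGML_TYPE_Q8_0:
            return true;   // classic block quants with mmq tile kernels
        default:
            return false;  // IQ-family types would go through mmvq instead
    }
}
```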
//GGML_ASSERT(ggml_nrows(src1) == 1);
//GGML_ASSERT(ne10 % QK8_1 == 0);
Delete these commented-out lines.
Co-authored-by: AidanBeltonS <[email protected]>
Yes, I see the failing tests on the iGPU; the iq1_s failure is known, as it is buggy.
@abhilash1910 Thanks for your support!
Fixed in #6521, thanks @NeoZhangJianyu.
Initial support for IQ2_S. Requires validation before merge.
@NeoZhangJianyu @airMeng @ggerganov @AidanBeltonS