
[GPU] Implement per-token FC dyn-quan #27763

Open

byungilm wants to merge 17 commits into master from validate_per_token_dyn_quan
Conversation

byungilm
Contributor

@byungilm byungilm commented Nov 26, 2024

Details:

  • item1
  • ...

Tickets:

  • 158513

@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Nov 26, 2024
@byungilm byungilm force-pushed the validate_per_token_dyn_quan branch 3 times, most recently from ca6918d to ae38233 on November 29, 2024 01:59
@byungilm byungilm changed the title [GPU][TEMP] Implement per-token FC dyn-quan [GPU] Implement per-token FC dyn-quan Nov 29, 2024
@byungilm byungilm force-pushed the validate_per_token_dyn_quan branch from 73dfeff to 2812544 on December 5, 2024 01:53
@byungilm byungilm marked this pull request as ready for review December 9, 2024 21:20
@byungilm byungilm requested review from a team as code owners December 9, 2024 21:20
@byungilm byungilm self-assigned this Dec 9, 2024

#if FC_KERNEL_DYNAMIC_QUANTIZE
KERNEL(quantize_input)(
    const __global INPUT0_TYPE* input,
    __global DQ_TYPE* quantized_input,
-   __global INPUT0_TYPE* quan_var
+   __global float* quan_var
Contributor

Is it necessary to change this to float?

Contributor Author

quan_var contains activation_sum. In the per-token case, the sum may not fit in the range of half.
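For context, a minimal back-of-the-envelope sketch (not from the PR; the sizes and magnitudes are hypothetical) of why a per-token activation sum can overflow half, whose maximum finite value is 65504:

#include <cstdio>

int main() {
    // fp16 (half) has a maximum finite value of 65504.
    const float half_max = 65504.0f;

    // Hypothetical per-token reduction: hidden size 4096, average |activation| ~ 32.
    const int   ifm_size = 4096;
    const float avg_abs_activation = 32.0f;

    // The sum accumulated over a whole token easily exceeds half_max.
    const float activation_sum = ifm_size * avg_abs_activation;  // 131072
    std::printf("activation_sum = %.0f, overflows half: %s\n",
                activation_sum, activation_sum > half_max ? "yes" : "no");
    return 0;
}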


#if FC_KERNEL_DYNAMIC_QUANTIZE
KERNEL(quantize_input)(
    const __global INPUT0_TYPE* input,
    __global DQ_TYPE* quantized_input,
-   __global INPUT0_TYPE* quan_var
+   __global float* quan_var
Contributor

Did you review the performance impact of processing the data in a single work-item?

Contributor Author

Yes, and I have listed optimizing the quantizing kernel as a follow-up. Still, the quantizing kernel is much smaller than the FC kernel.

-   de_quantize_scale[bi * 2] = quan_var[scale_offset * 2];
-   de_quantize_scale[bi * 2 + 1] = quan_var[scale_offset * 2 + scale_pitch * 2];
+   de_quantize_scale[bi * 2] = TO_INPUT0_TYPE(quan_var[scale_offset * 2]);
+   de_quantize_scale[bi * 2 + 1] = TO_INPUT0_TYPE(quan_var[scale_offset * 2 + scale_pitch * 2]);
Contributor

quan_var is converted to INPUT0_TYPE here. Do we need to use float for quan_var?

Contributor Author

The main issue is activation_sum, which can exceed the half range.

// DYNAMIC_QUANTIZE
-static size_t get_dynamic_quantize_group_size(const fully_connected_params& params) {
+static size_t get_dynamic_quantize_group_size(const fully_connected_params& params, bool print_log = false) {
Contributor

Please do not introduce an unused argument like print_log.

zp_group_size = params.weights.IFM().v / params.decompression_zero_point.Feature().v;

// Per-token dyn-quan
if (dynamic_quantization_group_size >= min_quantize_grp_size && is_per_token_dynamic_quantize(params)) {
Contributor

Why do we need dynamic_quantization_group_size >= min_quantize_grp_size here? I guess is_per_token_dynamic_quantize already covers that.

Contributor Author

It is there so the debugging feature 'dynamic_quantize_layers_without_onednn' is not ignored: that feature can set dynamic_quantization_group_size to 0 at this line.
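To make that interaction concrete, here is a minimal sketch (hypothetical names and values, not the plugin's actual code) of the selection order being discussed: the debug option can force the requested group size to 0, so the explicit minimum-size check is still needed even when per-token quantization would otherwise apply.

#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for the group-size selection discussed above.
size_t choose_group_size(size_t requested_group_size,   // may be 0 when disabled by the debug option
                         size_t min_quantize_grp_size,
                         bool   per_token_supported,
                         size_t ifm_size) {
    if (requested_group_size >= min_quantize_grp_size && per_token_supported)
        return ifm_size;                // per-token: one group spans the whole IFM
    if (requested_group_size >= min_quantize_grp_size)
        return requested_group_size;    // grouped dynamic quantization
    return 0;                           // dynamic quantization disabled
}

int main() {
    // With the debug option forcing the requested size to 0, per-token must not be selected.
    std::printf("%zu\n", choose_group_size(0, 32, true, 4096));    // prints 0
    std::printf("%zu\n", choose_group_size(128, 32, true, 4096));  // prints 4096 (per-token)
    return 0;
}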

@@ -105,7 +155,7 @@ static size_t get_dynamic_quantize_group_size(const fully_connected_params& params
 }

 static bool should_dynamic_quantize(const fully_connected_params& params, bool print_log = false) {
-    size_t dynamic_quantization_group_size = get_dynamic_quantize_group_size(params);
+    size_t dynamic_quantization_group_size = get_dynamic_quantize_group_size(params, print_log);
Contributor

If this is a meaningful log, just print it with TRACE_DETAIL.

Contributor Author

Applied


jit.AddConstant(MakeJitConstant("INPUT_LOAD_SIZE", act_load_size));
Contributor

Not sure whether INPUT_LOAD_SIZE is necessary, but what about using a single name instead of three: act_load_size, INPUT_LOAD_SIZE, and QUAN_BLOCK_SIZE?

Contributor Author

Applied

@@ -840,17 +895,19 @@ void FullyConnected_bf_tiled::GetUpdateDispatchDataFunc(KernelData& kd) const {
 kd.kernels[0].skip_execution = false;
 size_t input_f = get_input_bf_size(prim_params).second;
 size_t input_size = input_f * dispatchData.tile_m * dispatchData.gws[2];
+size_t quan_var_size = (input_size / quantize_grp_size) * 4 * 2;
Contributor

What about (input_size / quantize_grp_size) * sizeof(float) * 2 for better readability?
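As an illustration of what that expression sizes (hypothetical numbers, not the PR's code): quan_var holds one scale and one activation_sum per quantization group, stored as float, which is where the 2 and the sizeof(float) come from.

#include <cstdio>
#include <cstddef>

int main() {
    const size_t input_size        = 8 * 4096;  // hypothetical flattened activation size
    const size_t quantize_grp_size = 4096;      // per-token: one group per row/token
    const size_t groups            = input_size / quantize_grp_size;

    // One (scale, activation_sum) pair per group, each stored as float.
    const size_t quan_var_size = groups * sizeof(float) * 2;
    std::printf("groups = %zu, quan_var bytes = %zu\n", groups, quan_var_size);
    return 0;
}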


-if (kd.internalBufferSizes[0] < input_size) {
+if (kd.internalBufferSizes[0] < input_size ||
+    kd.internalBufferSizes[1] < quan_var_size || true) {
Contributor

Unnecessary '|| true'.
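A sketch of how the check presumably reads once the stray condition is dropped (simplified types, names borrowed from the diff above):

#include <cstddef>
#include <vector>

// Reallocate only when either internal buffer is too small; the "|| true"
// in the diff above would force a reallocation on every call.
bool needs_realloc(const std::vector<size_t>& internal_buffer_sizes,
                   size_t input_size, size_t quan_var_size) {
    return internal_buffer_sizes[0] < input_size ||
           internal_buffer_sizes[1] < quan_var_size;
}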


GPU_DEBUG_TRACE_DETAIL << "FC dyn-quantize by per-token. Actual dyn_quan_group_size(" << dynamic_quantization_group_size
<< ") : From scale_group_size (" << scale_group_size << ", zp_group_size(" << zp_group_size
<< "), zp_group_num(" << zp_group_num << "), ifm_size (" << get_input_bf_size(params).second << ")" << std::endl;
Contributor

  • Please use LOG instead of TRACE_DETAIL

Contributor Author

Applied

+ Resolved accuracy issue
+ Cleared OOR error

Signed-off-by: Min, Byungil <[email protected]>
Signed-off-by: Min, Byungil <[email protected]>
Signed-off-by: Min, Byungil <[email protected]>
Signed-off-by: Min, Byungil <[email protected]>
+ Fixed CI issue
+ Added unit-tests

Signed-off-by: Min, Byungil <[email protected]>
Signed-off-by: Min, Byungil <[email protected]>
Signed-off-by: Min, Byungil <[email protected]>
@byungilm byungilm force-pushed the validate_per_token_dyn_quan branch from c7b123f to d83debc on December 12, 2024 11:48
@byungilm
Contributor Author

Resolved some conflicts in fc_gpu_bf_tile.cpp.

@byungilm
Contributor Author

byungilm commented Dec 12, 2024

  • Reverted the data-type change of 'quan_var'. (It assumes that the activation sum of a symmetric 8-bit model input will not overflow the half range.)
  • Changed the important log about setting the group size from TRACE_DETAIL to the _LOG level.
  • Moved the scale calculation out of the inner loops when quantization is per-token (see the sketch after this list).
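A hedged sketch (not the PR's kernel code; plain C++ standing in for the OpenCL kernel, weight scale/zero-point handling omitted) of the last point: with per-token quantization there is a single (scale, activation_sum) pair per token, so the scale lookup can be hoisted out of the accumulation loop.

#include <cstdint>
#include <cstddef>

// Hypothetical per-token dequantized dot product for one output element.
float dequantized_dot(const int8_t* quantized_row, const int8_t* weights,
                      const float* quan_var, size_t row, size_t ifm) {
    const float scale = quan_var[row * 2];          // per-token scale, read once (hoisted)
    int32_t acc = 0;
    for (size_t k = 0; k < ifm; ++k)
        acc += int32_t(quantized_row[k]) * int32_t(weights[k]);
    return float(acc) * scale;                      // apply the single per-token scale at the end
}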
