
[GPU] Implement per-token FC dyn-quan #27763

Open

byungilm wants to merge 17 commits into master from validate_per_token_dyn_quan
Conversation

byungilm
Contributor

@byungilm byungilm commented Nov 26, 2024

Details:

  • item1
  • ...

Tickets:

  • 158513

@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Nov 26, 2024
@byungilm byungilm force-pushed the validate_per_token_dyn_quan branch 3 times, most recently from ca6918d to ae38233 on November 29, 2024 01:59
@byungilm byungilm changed the title [GPU][TEMP] Implement per-token FC dyn-quan [GPU] Implement per-token FC dyn-quan Nov 29, 2024
@byungilm byungilm force-pushed the validate_per_token_dyn_quan branch from 73dfeff to 2812544 on December 5, 2024 01:53
@byungilm byungilm marked this pull request as ready for review December 9, 2024 21:20
@byungilm byungilm requested review from a team as code owners December 9, 2024 21:20
@byungilm byungilm self-assigned this Dec 9, 2024

#if FC_KERNEL_DYNAMIC_QUANTIZE
KERNEL(quantize_input)(
    const __global INPUT0_TYPE* input,
    __global DQ_TYPE* quantized_input,
-   __global INPUT0_TYPE* quan_var
+   __global float* quan_var
Contributor

Is it necessary to change this to float?

Contributor Author

quan_var contains activation_sum. In the per-token case, the sum may not fit in the range of half.
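For context, a minimal back-of-the-envelope sketch (not from the PR; the sizes and magnitudes are hypothetical) of why a per-token activation sum can overflow half, whose maximum finite value is 65504:

#include <cstdio>

int main() {
    // fp16 (half) has a maximum finite value of 65504.
    const float half_max = 65504.0f;

    // Hypothetical per-token reduction: hidden size 4096, average |activation| ~ 32.
    const int   ifm_size = 4096;
    const float avg_abs_activation = 32.0f;

    // The sum accumulated over a whole token easily exceeds half_max.
    const float activation_sum = ifm_size * avg_abs_activation;  // 131072
    std::printf("activation_sum = %.0f, overflows half: %s\n",
                activation_sum, activation_sum > half_max ? "yes" : "no");
    return 0;
}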


#if FC_KERNEL_DYNAMIC_QUANTIZE
KERNEL(quantize_input)(
    const __global INPUT0_TYPE* input,
    __global DQ_TYPE* quantized_input,
-   __global INPUT0_TYPE* quan_var
+   __global float* quan_var
Contributor

Did you review the performance impact of processing the data in a single work-item?

Contributor Author

Yes, and I have listed optimizing the quantizing kernel as a follow-up. Still, the quantizing kernel is much smaller than the FC kernel.

-   de_quantize_scale[bi * 2] = quan_var[scale_offset * 2];
-   de_quantize_scale[bi * 2 + 1] = quan_var[scale_offset * 2 + scale_pitch * 2];
+   de_quantize_scale[bi * 2] = TO_INPUT0_TYPE(quan_var[scale_offset * 2]);
+   de_quantize_scale[bi * 2 + 1] = TO_INPUT0_TYPE(quan_var[scale_offset * 2 + scale_pitch * 2]);
Contributor

quan_var is converted to INPUT0_TYPE here. Do we need to use float for quan_var?

Contributor Author

The main issue is activation_sum, which can exceed the half range.

// DYNAMIC_QUANTIZE
-static size_t get_dynamic_quantize_group_size(const fully_connected_params& params) {
+static size_t get_dynamic_quantize_group_size(const fully_connected_params& params, bool print_log = false) {
Contributor

Please do not introduce an unused argument like print_log.

zp_group_size = params.weights.IFM().v / params.decompression_zero_point.Feature().v;

// Per-token dyn-quan
if (dynamic_quantization_group_size >= min_quantize_grp_size && is_per_token_dynamic_quantize(params)) {
Contributor

Why do we need dynamic_quantization_group_size >= min_quantize_grp_size here? I guess is_per_token_dynamic_quantize already covers that.

Contributor Author

It is there so the debugging feature 'dynamic_quantize_layers_without_onednn' is not ignored: that feature can set dynamic_quantization_group_size to 0 at this line.
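To make that interaction concrete, here is a minimal sketch (hypothetical names and values, not the plugin's actual code) of the selection order being discussed: the debug option can force the requested group size to 0, so the explicit minimum-size check is still needed even when per-token quantization would otherwise apply.

#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for the group-size selection discussed above.
size_t choose_group_size(size_t requested_group_size,   // may be 0 when disabled by the debug option
                         size_t min_quantize_grp_size,
                         bool   per_token_supported,
                         size_t ifm_size) {
    if (requested_group_size >= min_quantize_grp_size && per_token_supported)
        return ifm_size;                // per-token: one group spans the whole IFM
    if (requested_group_size >= min_quantize_grp_size)
        return requested_group_size;    // grouped dynamic quantization
    return 0;                           // dynamic quantization disabled
}

int main() {
    // With the debug option forcing the requested size to 0, per-token must not be selected.
    std::printf("%zu\n", choose_group_size(0, 32, true, 4096));    // prints 0
    std::printf("%zu\n", choose_group_size(128, 32, true, 4096));  // prints 4096 (per-token)
    return 0;
}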

@@ -105,7 +155,7 @@ static size_t get_dynamic_quantize_group_size(const fully_connected_params& params
 }

 static bool should_dynamic_quantize(const fully_connected_params& params, bool print_log = false) {
-    size_t dynamic_quantization_group_size = get_dynamic_quantize_group_size(params);
+    size_t dynamic_quantization_group_size = get_dynamic_quantize_group_size(params, print_log);
Contributor

If this is a meaningful log, just print it with TRACE_DETAIL.

Contributor Author

Applied


jit.AddConstant(MakeJitConstant("INPUT_LOAD_SIZE", act_load_size));
Contributor

Not sure whether INPUT_LOAD_SIZE is necessary, but what about using a single name instead of three: act_load_size, INPUT_LOAD_SIZE, and QUAN_BLOCK_SIZE?

Contributor Author

Applied

@@ -840,17 +895,19 @@ void FullyConnected_bf_tiled::GetUpdateDispatchDataFunc(KernelData& kd) const {
 kd.kernels[0].skip_execution = false;
 size_t input_f = get_input_bf_size(prim_params).second;
 size_t input_size = input_f * dispatchData.tile_m * dispatchData.gws[2];
+size_t quan_var_size = (input_size / quantize_grp_size) * 4 * 2;
Contributor

What about (input_size / quantize_grp_size) * sizeof(float) * 2 for better readability?
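As an illustration of what that expression sizes (hypothetical numbers, not the PR's code): quan_var holds one scale and one activation_sum per quantization group, stored as float, which is where the 2 and the sizeof(float) come from.

#include <cstdio>
#include <cstddef>

int main() {
    const size_t input_size        = 8 * 4096;  // hypothetical flattened activation size
    const size_t quantize_grp_size = 4096;      // per-token: one group per row/token
    const size_t groups            = input_size / quantize_grp_size;

    // One (scale, activation_sum) pair per group, each stored as float.
    const size_t quan_var_size = groups * sizeof(float) * 2;
    std::printf("groups = %zu, quan_var bytes = %zu\n", groups, quan_var_size);
    return 0;
}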


-if (kd.internalBufferSizes[0] < input_size) {
+if (kd.internalBufferSizes[0] < input_size ||
+    kd.internalBufferSizes[1] < quan_var_size || true) {
Contributor

Unnecessary '|| true'.
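A sketch of how the check presumably reads once the stray condition is dropped (simplified types, names borrowed from the diff above):

#include <cstddef>
#include <vector>

// Reallocate only when either internal buffer is too small; the "|| true"
// in the diff above would force a reallocation on every call.
bool needs_realloc(const std::vector<size_t>& internal_buffer_sizes,
                   size_t input_size, size_t quan_var_size) {
    return internal_buffer_sizes[0] < input_size ||
           internal_buffer_sizes[1] < quan_var_size;
}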


GPU_DEBUG_TRACE_DETAIL << "FC dyn-quantize by per-token. Actual dyn_quan_group_size(" << dynamic_quantization_group_size
<< ") : From scale_group_size (" << scale_group_size << ", zp_group_size(" << zp_group_size
<< "), zp_group_num(" << zp_group_num << "), ifm_size (" << get_input_bf_size(params).second << ")" << std::endl;
Contributor

  • Please use LOG instead of TRACE_DETAIL

Contributor Author

Applied

+ Resolved accuracy issue
+ Cleared OOR error

Signed-off-by: Min, Byungil <[email protected]>
Signed-off-by: Min, Byungil <[email protected]>
Signed-off-by: Min, Byungil <[email protected]>
Signed-off-by: Min, Byungil <[email protected]>
+ Fixed CI issue
+ Added unit-tests

Signed-off-by: Min, Byungil <[email protected]>
Signed-off-by: Min, Byungil <[email protected]>
Signed-off-by: Min, Byungil <[email protected]>
@byungilm byungilm force-pushed the validate_per_token_dyn_quan branch from c7b123f to d83debc on December 12, 2024 11:48
@byungilm
Contributor Author

Resolved some conflicts in fc_gpu_bf_tile.cpp.

@byungilm
Contributor Author

byungilm commented Dec 12, 2024

  • Reverted the data-type change of 'quan_var'. (It assumes that the activation sum of a symmetric 8-bit model input will not overflow the half range.)
  • Changed the important log about setting the group size from TRACE_DETAIL to the _LOG level.
  • Moved the scale calculation out of the inner loops when quantization is per-token (see the sketch after this list).
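A hedged sketch (not the PR's kernel code; plain C++ standing in for the OpenCL kernel, weight scale/zero-point handling omitted) of the last point: with per-token quantization there is a single (scale, activation_sum) pair per token, so the scale lookup can be hoisted out of the accumulation loop.

#include <cstdint>
#include <cstddef>

// Hypothetical per-token dequantized dot product for one output element.
float dequantized_dot(const int8_t* quantized_row, const int8_t* weights,
                      const float* quan_var, size_t row, size_t ifm) {
    const float scale = quan_var[row * 2];          // per-token scale, read once (hoisted)
    int32_t acc = 0;
    for (size_t k = 0; k < ifm; ++k)
        acc += int32_t(quantized_row[k]) * int32_t(weights[k]);
    return float(acc) * scale;                      // apply the single per-token scale at the end
}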
