[GPU] Implement per-token FC dyn-quan #27763
base: master
Conversation
Force-pushed from ca6918d to ae38233
Force-pushed from 73dfeff to 2812544
#if FC_KERNEL_DYNAMIC_QUANTIZE
KERNEL(quantize_input)(
    const __global INPUT0_TYPE* input,
    __global DQ_TYPE* quantized_input,
-   __global INPUT0_TYPE* quan_var
+   __global float* quan_var
Is it necessary to change it to float?
quan_var contains activation_sum. In the per-token case, the sum may not fit within the range of half.
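For illustration, a minimal standalone C++ sketch of the concern (the IFM count and per-element magnitude below are hypothetical, not values from this PR): summing a whole token's activations can easily exceed the FP16 maximum of 65504, so a half-precision quan_var slot would saturate, while a float slot stays representable.

#include <cstdio>

int main() {
    const int ifm = 4096;          // hypothetical per-token element count
    float activation_sum = 0.0f;   // float accumulator, as in the changed kernel signature
    for (int i = 0; i < ifm; ++i)
        activation_sum += 30.0f;   // modest per-element magnitude (illustrative)
    // activation_sum == 122880, well past the FP16 maximum of 65504,
    // so storing it in a half-precision slot would overflow.
    printf("activation_sum = %f\n", activation_sum);
    return 0;
}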
#if FC_KERNEL_DYNAMIC_QUANTIZE
KERNEL(quantize_input)(
    const __global INPUT0_TYPE* input,
    __global DQ_TYPE* quantized_input,
-   __global INPUT0_TYPE* quan_var
+   __global float* quan_var
did you review the perf impact of processing data in a single work-item?
Yes, so I have listed optimizing the quantizing kernel as a follow-up. Still, the quantizing kernel is much smaller than the FC kernel.
-   de_quantize_scale[bi * 2] = quan_var[scale_offset * 2];
-   de_quantize_scale[bi * 2 + 1] = quan_var[scale_offset * 2 + scale_pitch * 2];
+   de_quantize_scale[bi * 2] = TO_INPUT0_TYPE(quan_var[scale_offset * 2]);
+   de_quantize_scale[bi * 2 + 1] = TO_INPUT0_TYPE(quan_var[scale_offset * 2 + scale_pitch * 2]);
quan_var is converted to INPUT0_TYPE here. Do we need to use float for quan_var?
The main issue is activation_sum.
  // DYNAMIC_QUANTIZE
- static size_t get_dynamic_quantize_group_size(const fully_connected_params& params) {
+ static size_t get_dynamic_quantize_group_size(const fully_connected_params& params, bool print_log = false) {
Please do not introduce an unused argument like print_log.
zp_group_size = params.weights.IFM().v / params.decompression_zero_point.Feature().v;

// Per-token dyn-quan
if (dynamic_quantization_group_size >= min_quantize_grp_size && is_per_token_dynamic_quantize(params)) {
Why do we need dynamic_quantization_group_size >= min_quantize_grp_size here? I guess is_per_token_dynamic_quantize already includes that check.
It is there so as not to ignore the debugging feature 'dynamic_quantize_layers_without_onednn': that feature can make dynamic_quantization_group_size 0 at this line.
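For illustration, a hedged standalone C++ sketch of that control flow (the helper and parameter names are illustrative, not the real kernel-selector code): the explicit size check keeps the debug knob effective because the knob can force the requested group size to 0.

#include <cstddef>

// Illustrative helper: requested_group_size may be 0 when the debugging feature
// 'dynamic_quantize_layers_without_onednn' disables dyn-quan for this layer.
std::size_t pick_group_size(std::size_t requested_group_size, std::size_t min_quantize_grp_size,
                            std::size_t ifm_size, bool per_token_ok) {
    // The size check cannot be folded into the per-token predicate alone:
    // with the debug knob set, requested_group_size is 0 and the
    // per-token branch must not be taken.
    if (requested_group_size >= min_quantize_grp_size && per_token_ok)
        return ifm_size;            // per-token: one group spans the whole token
    return requested_group_size;    // otherwise keep the requested (possibly 0) size
}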
@@ -105,7 +155,7 @@ static size_t get_dynamic_quantize_group_size(const fully_connected_params& params
  }

  static bool should_dynamic_quantize(const fully_connected_params& params, bool print_log = false) {
-     size_t dynamic_quantization_group_size = get_dynamic_quantize_group_size(params);
+     size_t dynamic_quantization_group_size = get_dynamic_quantize_group_size(params, print_log);
if this is a meaningful log, just print it with TRACE_DETAIL.
Applied
jit.AddConstant(MakeJitConstant("INPUT_LOAD_SIZE", act_load_size));
Not sure whether INPUT_LOAD_SIZE is necessary, but what about using a single name instead of three? act_load_size, INPUT_LOAD_SIZE, and QUAN_BLOCK_SIZE.
Applied
@@ -840,17 +895,19 @@ void FullyConnected_bf_tiled::GetUpdateDispatchDataFunc(KernelData& kd) const {
      kd.kernels[0].skip_execution = false;
      size_t input_f = get_input_bf_size(prim_params).second;
      size_t input_size = input_f * dispatchData.tile_m * dispatchData.gws[2];
+     size_t quan_var_size = (input_size / quantize_grp_size) * 4 * 2;
What about (input_size / quantize_grp_size) * sizeof(float) * 2 for better readability?
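A short sketch of the suggested form, assuming quan_var keeps two float values (a de-quantization scale and an activation_sum) per quantization group, which is what the factor of 2 appears to account for:

#include <cstddef>

std::size_t compute_quan_var_size(std::size_t input_size, std::size_t quantize_grp_size) {
    // two floats (scale + activation_sum) per quantization group
    return (input_size / quantize_grp_size) * sizeof(float) * 2;
}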
- if (kd.internalBufferSizes[0] < input_size) {
+ if (kd.internalBufferSizes[0] < input_size ||
+     kd.internalBufferSizes[1] < quan_var_size || true) {
The true here is unnecessary.
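For reference, a hedged sketch of the guard with the stray '|| true' dropped (names mirror the diff above; the surrounding dispatch-data update is assumed):

#include <cstddef>

bool needs_realloc(std::size_t buf0_size, std::size_t buf1_size,
                   std::size_t input_size, std::size_t quan_var_size) {
    // re-allocate only when either internal buffer is too small
    return buf0_size < input_size || buf1_size < quan_var_size;
}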
src/plugins/intel_gpu/tests/unit/test_cases/fully_connected_gpu_test.cpp (outdated comment, resolved)
GPU_DEBUG_TRACE_DETAIL << "FC dyn-quantize by per-token. Actual dyn_quan_group_size(" << dynamic_quantization_group_size
                       << ") : From scale_group_size (" << scale_group_size << ", zp_group_size(" << zp_group_size
                       << "), zp_group_num(" << zp_group_num << "), ifm_size (" << get_input_bf_size(params).second << ")" << std::endl;
Please use LOG instead of TRACE_DETAIL.
Applied
Signed-off-by: Min, Byungil <[email protected]>
+ Resolved accuracy issue + Cleared OOR error Signed-off-by: Min, Byungil <[email protected]>
+ Fixed CI issue + Added unit-tests Signed-off-by: Min, Byungil <[email protected]>
Force-pushed from c7b123f to d83debc
Resolved some conflicts from fc_gpu_bf_tile.cpp
Signed-off-by: Min, Byungil <[email protected]>
Details:
Tickets: