[GPU] Dynamic quantization for OneDNN FC #25372
Conversation
Force-pushed from 25369b9 to ff7d28a
Force-pushed from f35d5e0 to 3dd6acd
```diff
@@ -571,7 +571,7 @@ static constexpr Property<ExecutionMode> execution_mode{"EXECUTION_MODE_HINT"};
  * might result in better accuracy, but the drawback is worse performance. Group size equal 0 means dynamic
  * quantization optimization is disabled.
  */
-static constexpr Property<uint64_t, PropertyMutability::RW> dynamic_quantization_group_size{
+static constexpr Property<int64_t, PropertyMutability::RW> dynamic_quantization_group_size{
```
I'd suggest changing it in a way similar to the num_streams property, i.e. introduce a special Num class with uint64_t underneath and add two special values: PER_TOKEN{UINT64_MAX} and DISABLED{0}.
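(For reference, the num_streams pattern referred to here looks roughly like the following; this is a paraphrase from memory of ov/runtime/properties.hpp, not an exact copy.)

```cpp
namespace streams {
struct Num {
    constexpr Num() : num{-1} {}
    constexpr Num(const int32_t num_) : num{num_} {}
    constexpr operator int32_t() const { return num; }
    int32_t num = 0;
};
// Named special values instead of bare magic numbers:
static constexpr Num AUTO{-1};
static constexpr Num NUMA{-2};
static constexpr Property<Num, PropertyMutability::RW> num{"NUM_STREAMS"};
}  // namespace streams
static constexpr Property<streams::Num, PropertyMutability::RW> num_streams{"NUM_STREAMS"};
```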
num_streams is encapsulated within a namespace, but you are not suggesting introducing a new namespace for group_size, right? If we do so, it will break API compatibility. Then it would look like the code below. I cannot use a general name like Num because it is not within a namespace. Does this look good?
```cpp
struct GSNum {
    constexpr GSNum() : gs_num{0} {}
    constexpr GSNum(const uint64_t num_) : gs_num{num_} {}
    constexpr operator uint64_t() const {
        return gs_num;
    }
    uint64_t gs_num = 0;
};
static constexpr GSNum PER_TOKEN{UINT64_MAX};
static constexpr GSNum DISABLED{0};
static constexpr Property<GSNum, PropertyMutability::RW> dynamic_quantization_group_size{
    "DYNAMIC_QUANTIZATION_GROUP_SIZE"};
```
Maybe something like this?
```cpp
namespace dynamic_quantization {
struct GroupSize {
    constexpr GroupSize() : gs_num{0} {}
    constexpr GroupSize(const uint64_t num_) : gs_num{num_} {}
    constexpr operator uint64_t() const {
        return gs_num;
    }
    uint64_t gs_num = 0;
};
static constexpr GroupSize PER_TOKEN{UINT64_MAX};
static constexpr GroupSize DISABLED{0};
static constexpr Property<GroupSize, PropertyMutability::RW> group_size{"DYNAMIC_QUANTIZATION_GROUP_SIZE"};
}  // namespace dynamic_quantization

// keep it for compatibility
static constexpr Property<dynamic_quantization::GroupSize, PropertyMutability::RW> dynamic_quantization_group_size{"DYNAMIC_QUANTIZATION_GROUP_SIZE"};

// ...
core.set_property(ov::hint::dynamic_quantization_group_size(32));
core.set_property(ov::hint::dynamic_quantization_group_size(ov::hint::dynamic_quantization::DISABLED));
core.set_property(ov::hint::dynamic_quantization_group_size(ov::hint::dynamic_quantization::PER_TOKEN));

core.set_property(ov::hint::dynamic_quantization::group_size(32));
core.set_property(ov::hint::dynamic_quantization::group_size(ov::hint::dynamic_quantization::DISABLED));
core.set_property(ov::hint::dynamic_quantization::group_size(ov::hint::dynamic_quantization::PER_TOKEN));
```
OK, I see. So we will have two hints for group_size. By the way, as this will require changes to both the CPU and GPU plugins, can I follow up on this change in the next PR? In this PR, I will just use the fixed value UINT64_MAX internally for per-token.
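(A minimal sketch of the interim convention mentioned above, with hypothetical helper names that are not part of this PR: UINT64_MAX is interpreted as per-token and 0 as disabled.)

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helpers; the names are illustrative only.
constexpr uint64_t PER_TOKEN_GROUP_SIZE = std::numeric_limits<uint64_t>::max();

constexpr bool is_per_token(uint64_t group_size) {
    return group_size == PER_TOKEN_GROUP_SIZE;
}

constexpr bool is_disabled(uint64_t group_size) {
    return group_size == 0;  // group size 0 means dynamic quantization is disabled
}
```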
I don't mind. But please revert the current changes in the properties and then make a PR with the proper solution.
```cpp
namespace op {

/// \brief Operator performing Dynamic Quantize
class DynamicQuantize : public ov::op::Op {
```
Could you move it to the common part? The CPU plugin will likely reuse it in the future.
applied, thanks!
src/plugins/intel_gpu/include/intel_gpu/op/dynamic_quantize.hpp (review comment resolved; outdated)
src/plugins/intel_gpu/src/plugin/transformations/op/dynamic_quantize.cpp (review comment resolved; outdated)
```cpp
set_output_type(0, ov::element::Type_t::i8, out_shapes[0]);
set_output_type(1, ov::element::Type_t::f16, out_shapes[1]);
```
I think the scales data type should be an op parameter.
Could you explain more about that? I expect it to be the same as the input data type. Currently only fp16 input is supported, which is why it is fixed to fp16.
If we want to make this op generic and reuse it later for CPU, then we need to support output type parameterization, as they will want to use another type (f32 or bf16).
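(A self-contained sketch of the parameterization being requested; this is illustrative code, not the actual DynamicQuantize op interface: the scale element type becomes a constructor argument instead of a hard-coded f16.)

```cpp
#include <iostream>

enum class ElementType { i8, f16, bf16, f32 };

class DynamicQuantizeSketch {
public:
    explicit DynamicQuantizeSketch(ElementType scale_dt = ElementType::f16)
        : m_scale_dt(scale_dt) {}

    // Output 0 is always the quantized i8 data; output 1 (scales) uses the configurable type.
    ElementType quantized_output_type() const { return ElementType::i8; }
    ElementType scale_output_type() const { return m_scale_dt; }

private:
    ElementType m_scale_dt;
};

int main() {
    DynamicQuantizeSketch gpu_op;                    // GPU default: f16 scales
    DynamicQuantizeSketch cpu_op(ElementType::f32);  // a CPU plugin could request f32 or bf16
    std::cout << (cpu_op.scale_output_type() == ElementType::f32) << "\n";
    return 0;
}
```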
...ugins/intel_gpu/src/kernel_selector/kernels/dynamic_quantize/dynamic_quantize_kernel_opt.cpp (review comment resolved; outdated)
...ugins/intel_gpu/src/kernel_selector/kernels/dynamic_quantize/dynamic_quantize_kernel_opt.cpp (review comment resolved; outdated)
src/plugins/intel_gpu/src/kernel_selector/cl_kernels/dynamic_quantize_gpu_opt.cl (review comment resolved; outdated)
src/plugins/intel_gpu/src/graph/impls/onednn/fully_connected_onednn.cpp (review comment resolved; outdated)
Signed-off-by: Kim, Mingyu <[email protected]>; Signed-off-by: Min, Byungil <[email protected]>
Optimize dynamic_quantize_opt kernel (Signed-off-by: Min, Byung-il <[email protected]>)
Basic test passes; the dyn_quan test is newly introduced with an accuracy issue; corner_cases fails.
Force-pushed from 28a4669 to 334be2d
Overall, LGTM
src/plugins/intel_gpu/src/plugin/transformations/dynamic_quantize_fully_connected.cpp (review comment resolved; outdated)
### Details:
- Integrate OneDNN dynamic quantization
- Only per-token quantization is enabled

### Tickets:
- 144522

---------
Signed-off-by: Kim, Mingyu <[email protected]>
Signed-off-by: Min, Byungil <[email protected]>
Signed-off-by: Min, Byung-il <[email protected]>
Co-authored-by: Min, Byung-il <[email protected]>
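(For context, a minimal standalone sketch of per-token int8 dynamic quantization; illustrative only, not the OpenCL/oneDNN kernel added in this PR. Each token row gets its own scale derived from its absolute maximum.)

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize each row (token) of a [num_tokens x hidden] fp32 matrix to int8,
// producing one scale per token.
void quantize_per_token(const std::vector<float>& src, size_t num_tokens, size_t hidden,
                        std::vector<int8_t>& dst, std::vector<float>& scales) {
    dst.resize(num_tokens * hidden);
    scales.resize(num_tokens);
    for (size_t t = 0; t < num_tokens; ++t) {
        float absmax = 0.f;
        for (size_t i = 0; i < hidden; ++i)
            absmax = std::max(absmax, std::fabs(src[t * hidden + i]));
        const float scale = absmax > 0.f ? absmax / 127.f : 1.f;
        scales[t] = scale;
        for (size_t i = 0; i < hidden; ++i)
            dst[t * hidden + i] = static_cast<int8_t>(std::lround(src[t * hidden + i] / scale));
    }
}
```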