[GPU] Dynamic quantization for OneDNN FC #25372
Conversation
Force-pushed from 25369b9 to ff7d28a
Force-pushed from f35d5e0 to 3dd6acd
```diff
@@ -571,7 +571,7 @@ static constexpr Property<ExecutionMode> execution_mode{"EXECUTION_MODE_HINT"};
  * might result in better accuracy, but the drawback is worse performance. Group size equal 0 means dynamic
  * quantization optimization is disabled.
  */
-static constexpr Property<uint64_t, PropertyMutability::RW> dynamic_quantization_group_size{
+static constexpr Property<int64_t, PropertyMutability::RW> dynamic_quantization_group_size{
```
I'd suggest changing it in a way similar to the num_streams property, i.e. introduce a special Num class with uint64_t underneath and add two special values: PER_TOKEN{UINT64_MAX} and DISABLED{0}.
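(For reference, the num_streams pattern referred to here looks roughly like the following; this is a paraphrase from memory of ov/runtime/properties.hpp, not an exact copy.)

```cpp
namespace streams {
struct Num {
    constexpr Num() : num{-1} {}
    constexpr Num(const int32_t num_) : num{num_} {}
    constexpr operator int32_t() const { return num; }
    int32_t num = 0;
};
// Named special values instead of bare magic numbers:
static constexpr Num AUTO{-1};
static constexpr Num NUMA{-2};
static constexpr Property<Num, PropertyMutability::RW> num{"NUM_STREAMS"};
}  // namespace streams
static constexpr Property<streams::Num, PropertyMutability::RW> num_streams{"NUM_STREAMS"};
```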
num_streams is encapsulated within a namespace, but you are not suggesting introducing a new namespace for group_size, right? If we do so, it will break API compatibility. Then it would look like the code below. I cannot use a general name like Num because it is not within a namespace. Does this look good?
```cpp
struct GSNum {
    constexpr GSNum() : gs_num{0} {}
    constexpr GSNum(const uint64_t num_) : gs_num{num_} {}
    constexpr operator uint64_t() const {
        return gs_num;
    }
    uint64_t gs_num = 0;
};
static constexpr GSNum PER_TOKEN{UINT64_MAX};
static constexpr GSNum DISABLED{0};
static constexpr Property<GSNum, PropertyMutability::RW> dynamic_quantization_group_size{
    "DYNAMIC_QUANTIZATION_GROUP_SIZE"};
```
Maybe something like this?
```cpp
namespace dynamic_quantization {
struct GroupSize {
    constexpr GroupSize() : gs_num{0} {}
    constexpr GroupSize(const uint64_t num_) : gs_num{num_} {}
    constexpr operator uint64_t() const {
        return gs_num;
    }
    uint64_t gs_num = 0;
};
static constexpr GroupSize PER_TOKEN{UINT64_MAX};
static constexpr GroupSize DISABLED{0};
static constexpr Property<GroupSize, PropertyMutability::RW> group_size{"DYNAMIC_QUANTIZATION_GROUP_SIZE"};
}  // namespace dynamic_quantization

// keep it for compatibility
static constexpr Property<dynamic_quantization::GroupSize, PropertyMutability::RW> dynamic_quantization_group_size{"DYNAMIC_QUANTIZATION_GROUP_SIZE"};

// ...
core.set_property(ov::hint::dynamic_quantization_group_size(32));
core.set_property(ov::hint::dynamic_quantization_group_size(ov::hint::dynamic_quantization::DISABLED));
core.set_property(ov::hint::dynamic_quantization_group_size(ov::hint::dynamic_quantization::PER_TOKEN));

core.set_property(ov::hint::dynamic_quantization::group_size(32));
core.set_property(ov::hint::dynamic_quantization::group_size(ov::hint::dynamic_quantization::DISABLED));
core.set_property(ov::hint::dynamic_quantization::group_size(ov::hint::dynamic_quantization::PER_TOKEN));
```
OK, I see. So we will have two hints for group_size. By the way, as this will require changes to both the CPU and GPU plugins, can I follow up on this change in the next PR? In this PR, I will just use the fixed value UINT64_MAX internally for per-token.
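(A minimal sketch of the interim convention mentioned above, with hypothetical helper names that are not part of this PR: UINT64_MAX is interpreted as per-token and 0 as disabled.)

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helpers; the names are illustrative only.
constexpr uint64_t PER_TOKEN_GROUP_SIZE = std::numeric_limits<uint64_t>::max();

constexpr bool is_per_token(uint64_t group_size) {
    return group_size == PER_TOKEN_GROUP_SIZE;
}

constexpr bool is_disabled(uint64_t group_size) {
    return group_size == 0;  // group size 0 means dynamic quantization is disabled
}
```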
I don't mind. But please revert the current changes in the properties and then make a PR with the proper solution.
```cpp
namespace op {

/// \brief Operator performing Dynamic Quantize
class DynamicQuantize : public ov::op::Op {
```
Could you move it to the common part? The CPU plugin will likely reuse it in the future.
applied, thanks!
src/plugins/intel_gpu/include/intel_gpu/op/dynamic_quantize.hpp (review comment resolved; outdated)
src/plugins/intel_gpu/src/plugin/transformations/op/dynamic_quantize.cpp (review comment resolved; outdated)
```cpp
set_output_type(0, ov::element::Type_t::i8, out_shapes[0]);
set_output_type(1, ov::element::Type_t::f16, out_shapes[1]);
```
I think the scales data type should be an op parameter.
Could you explain more about that? I expect it to be the same as the input data type. Currently only fp16 input is supported, which is why it is fixed to fp16.
If we want to make this op generic and reuse it later for CPU, then we need to support output type parameterization, as they will want to use another type (f32 or bf16).
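(A self-contained sketch of the parameterization being requested; this is illustrative code, not the actual DynamicQuantize op interface: the scale element type becomes a constructor argument instead of a hard-coded f16.)

```cpp
#include <iostream>

enum class ElementType { i8, f16, bf16, f32 };

class DynamicQuantizeSketch {
public:
    explicit DynamicQuantizeSketch(ElementType scale_dt = ElementType::f16)
        : m_scale_dt(scale_dt) {}

    // Output 0 is always the quantized i8 data; output 1 (scales) uses the configurable type.
    ElementType quantized_output_type() const { return ElementType::i8; }
    ElementType scale_output_type() const { return m_scale_dt; }

private:
    ElementType m_scale_dt;
};

int main() {
    DynamicQuantizeSketch gpu_op;                    // GPU default: f16 scales
    DynamicQuantizeSketch cpu_op(ElementType::f32);  // a CPU plugin could request f32 or bf16
    std::cout << (cpu_op.scale_output_type() == ElementType::f32) << "\n";
    return 0;
}
```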
...ugins/intel_gpu/src/kernel_selector/kernels/dynamic_quantize/dynamic_quantize_kernel_opt.cpp (review comment resolved; outdated)
...ugins/intel_gpu/src/kernel_selector/kernels/dynamic_quantize/dynamic_quantize_kernel_opt.cpp (review comment resolved; outdated)
src/plugins/intel_gpu/src/kernel_selector/cl_kernels/dynamic_quantize_gpu_opt.cl (review comment resolved; outdated)
src/plugins/intel_gpu/src/graph/impls/onednn/fully_connected_onednn.cpp (review comment resolved; outdated)
Signed-off-by: Kim, Mingyu <[email protected]>; Signed-off-by: Min, Byungil <[email protected]>
Optimize dynamic_quantize_opt kernel (Signed-off-by: Min, Byung-il <[email protected]>)
Basic test passes; the dyn_quan test is newly introduced with an accuracy issue; corner_cases fails.
Force-pushed from 28a4669 to 334be2d
Overall, LGTM
src/plugins/intel_gpu/src/plugin/transformations/dynamic_quantize_fully_connected.cpp (review comment resolved; outdated)
### Details:
- Integrate OneDNN dynamic quantization
- Only per-token quantization is enabled

### Tickets:
- 144522

---------
Signed-off-by: Kim, Mingyu <[email protected]>
Signed-off-by: Min, Byungil <[email protected]>
Signed-off-by: Min, Byung-il <[email protected]>
Co-authored-by: Min, Byung-il <[email protected]>
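(For context, a minimal standalone sketch of per-token int8 dynamic quantization; illustrative only, not the OpenCL/oneDNN kernel added in this PR. Each token row gets its own scale derived from its absolute maximum.)

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize each row (token) of a [num_tokens x hidden] fp32 matrix to int8,
// producing one scale per token.
void quantize_per_token(const std::vector<float>& src, size_t num_tokens, size_t hidden,
                        std::vector<int8_t>& dst, std::vector<float>& scales) {
    dst.resize(num_tokens * hidden);
    scales.resize(num_tokens);
    for (size_t t = 0; t < num_tokens; ++t) {
        float absmax = 0.f;
        for (size_t i = 0; i < hidden; ++i)
            absmax = std::max(absmax, std::fabs(src[t * hidden + i]));
        const float scale = absmax > 0.f ? absmax / 127.f : 1.f;
        scales[t] = scale;
        for (size_t i = 0; i < hidden; ++i)
            dst[t * hidden + i] = static_cast<int8_t>(std::lround(src[t * hidden + i] / scale));
    }
}
```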