
[GPU] Dynamic quantization for OneDNN FC #25372

Merged — 29 commits, Aug 8, 2024

Conversation

isanghao
Contributor

@isanghao isanghao commented Jul 4, 2024

Details:

  • Integrate OneDNN dynamic quantization
  • Only per-token quantization is enabled

Tickets:

  • 144522

@isanghao isanghao added do_not_merge category: GPU OpenVINO GPU plugin labels Jul 4, 2024
@github-actions github-actions bot added category: inference OpenVINO Runtime library - Inference category: CPP API OpenVINO CPP API bindings labels Jul 25, 2024
@isanghao isanghao marked this pull request as ready for review July 25, 2024 08:32
@isanghao isanghao requested review from a team as code owners July 25, 2024 08:32
@isanghao isanghao changed the title [GPU][WIP] Dynamic quantization for OneDNN FC [GPU] Dynamic quantization for OneDNN FC Jul 25, 2024
@@ -571,7 +571,7 @@ static constexpr Property<ExecutionMode> execution_mode{"EXECUTION_MODE_HINT"};
* might result in better accuracy, but the drawback is worse performance. Group size equal 0 means dynamic
* quantization optimization is disabled.
*/
-static constexpr Property<uint64_t, PropertyMutability::RW> dynamic_quantization_group_size{
+static constexpr Property<int64_t, PropertyMutability::RW> dynamic_quantization_group_size{
Contributor

I'd suggest changing it in a way similar to the num_streams property, i.e. introduce a special Num class with uint64_t underneath and add two special values: PER_TOKEN{UINT64_MAX} and DISABLED{0}

Contributor Author

num_streams is encapsulated in a namespace, but you are not suggesting introducing a new namespace for group_size, right? If we did, it would break API compatibility. Then it would look like below. I cannot use a general word like Num because it is not within a namespace. Does this look good?


struct GSNum {
    constexpr GSNum() : gs_num{0} {}

    constexpr GSNum(const uint64_t num_) : gs_num{num_} {}

    constexpr operator uint64_t() const {
        return gs_num;
    }

    uint64_t gs_num = 0;
};

static constexpr GSNum PER_TOKEN{UINT64_MAX};

static constexpr GSNum DISABLED{0};

static constexpr Property<GSNum, PropertyMutability::RW> dynamic_quantization_group_size{
    "DYNAMIC_QUANTIZATION_GROUP_SIZE"};

Contributor

Maybe something like this?

namespace dynamic_quantization {
struct GroupSize {
    constexpr GroupSize() : gs_num{0} {}

    constexpr GroupSize(const uint64_t num_) : gs_num{num_} {}

    constexpr operator uint64_t() const {
        return gs_num;
    }

    uint64_t gs_num = 0;
};

static constexpr GroupSize PER_TOKEN{UINT64_MAX};

static constexpr GroupSize DISABLED{0};

static constexpr Property<GroupSize, PropertyMutability::RW> group_size{"DYNAMIC_QUANTIZATION_GROUP_SIZE"};

} // namespace dynamic_quantization

// keep it for compatibility
static constexpr Property<dynamic_quantization::GroupSize, PropertyMutability::RW> dynamic_quantization_group_size{"DYNAMIC_QUANTIZATION_GROUP_SIZE"}; 

// ...

  core.set_property(ov::hint::dynamic_quantization_group_size(32));
  core.set_property(ov::hint::dynamic_quantization_group_size(ov::hint::dynamic_quantization::DISABLED));
  core.set_property(ov::hint::dynamic_quantization_group_size(ov::hint::dynamic_quantization::PER_TOKEN));

  core.set_property(ov::hint::dynamic_quantization::group_size(32));
  core.set_property(ov::hint::dynamic_quantization::group_size(ov::hint::dynamic_quantization::DISABLED));
  core.set_property(ov::hint::dynamic_quantization::group_size(ov::hint::dynamic_quantization::PER_TOKEN));

Contributor Author

OK, I see. So we will have two hints for group_size. By the way, since this will require changes in both the CPU and GPU plugins, can I follow up on this change in the next PR? In this PR, I will just use the fixed value UINT64_MAX internally for per-token.
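For reference, a minimal standalone sketch of how a plugin could interpret the sentinel values internally (`resolve_group_size` is a hypothetical helper name, not actual plugin code; only the UINT64_MAX and 0 semantics come from the discussion above):

```cpp
#include <algorithm>
#include <cstdint>

// Sentinel values as discussed: UINT64_MAX selects per-token quantization
// (one scale per whole row), 0 disables dynamic quantization entirely.
constexpr uint64_t PER_TOKEN = UINT64_MAX;
constexpr uint64_t DISABLED = 0;

// Hypothetical helper: map the requested group size to the effective group
// size for a row of `row_len` elements.
inline uint64_t resolve_group_size(uint64_t requested, uint64_t row_len) {
    if (requested == DISABLED)
        return 0;                         // optimization off
    if (requested == PER_TOKEN)
        return row_len;                   // one group spanning the row
    return std::min(requested, row_len);  // clamp oversized groups to the row
}
```

With this mapping, a plugin never has to special-case the sentinel past the point where the property is read.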

Contributor

I don't mind. But please revert the current changes in the properties and then make a PR with the proper solution

namespace op {

/// \brief Operator performing Dynamic Quantize
class DynamicQuantize : public ov::op::Op {
Contributor

Could you move it to the common part? The CPU plugin will likely reuse it in the future

Contributor Author

applied, thanks!

Comment on lines 28 to 29
set_output_type(0, ov::element::Type_t::i8, out_shapes[0]);
set_output_type(1, ov::element::Type_t::f16, out_shapes[1]);
Contributor

I think the scale data type should be an op parameter

Contributor Author

Could you explain more about that? I expect it to be the same as the input data type. Currently it supports fp16 input only, which is why it is fixed as fp16.

Contributor

If we want to make this op generic and reuse it later for the CPU plugin, then we need to support output type parameterization, as they will want to use another type (f32 or bf16)
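As a standalone illustration of that suggestion (plain C++, not the actual ov::op API — `ElemType` and `DynamicQuantizeConfig` are hypothetical names), parameterizing the scale output type instead of hard-coding f16 might look like:

```cpp
// Hypothetical element-type tags standing in for ov::element::Type_t.
enum class ElemType { i8, f16, f32, bf16 };

// Hypothetical op configuration: the quantized data output stays i8, but
// the scale output type is a constructor parameter, so a CPU plugin could
// request f32 or bf16 while the GPU plugin keeps f16.
struct DynamicQuantizeConfig {
    ElemType data_type;   // quantized data output, fixed to i8 here
    ElemType scale_type;  // parameterized scale output type

    explicit DynamicQuantizeConfig(ElemType scale)
        : data_type(ElemType::i8), scale_type(scale) {}
};

// In set_output_type terms, output 1 would use cfg.scale_type rather
// than a hard-coded ov::element::Type_t::f16.
inline ElemType scale_output_type(const DynamicQuantizeConfig& cfg) {
    return cfg.scale_type;
}
```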

@isanghao isanghao requested a review from a team as a code owner August 1, 2024 14:20
@isanghao isanghao requested review from itikhono and removed request for a team August 1, 2024 14:20
@github-actions github-actions bot added the category: transformations OpenVINO Runtime library - Transformations label Aug 1, 2024
@github-actions github-actions bot removed category: inference OpenVINO Runtime library - Inference category: CPP API OpenVINO CPP API bindings labels Aug 2, 2024
Contributor

@vladimir-paramuzov vladimir-paramuzov left a comment

Overall, LGTM

@isanghao isanghao added this pull request to the merge queue Aug 8, 2024
Merged via the queue into openvinotoolkit:master with commit 7d6ffd3 Aug 8, 2024
138 checks passed
@isanghao isanghao deleted the dyn_quan_onednn branch August 8, 2024 05:56
mory91 pushed a commit to mory91/openvino that referenced this pull request Aug 13, 2024
### Details:
 - Integrate OneDNN dynamic quantization
 - Only per-token quantization is enabled

### Tickets:
 - 144522

---------

Signed-off-by: Kim, Mingyu <[email protected]>
Signed-off-by: Min, Byungil <[email protected]>
Signed-off-by: Min, Byung-il <[email protected]>
Co-authored-by: Min, Byung-il <[email protected]>
Labels
category: GPU OpenVINO GPU plugin category: transformations OpenVINO Runtime library - Transformations