
[GPU] Dynamic quantization for OneDNN FC #25372

Merged: 29 commits, Aug 8, 2024 (diff below shows changes from the first 8 commits)

Commits:
2fe52fb
[GPU] FC dynamic quantization with OneDNN
isanghao Jun 8, 2024
bc3c4e8
Modify dynamic quantize kernels
byungilm Jul 2, 2024
e7ea310
code cleanup & accuracy fix
isanghao Jul 5, 2024
99a6f1c
[GPU] restric dynamic_quantization condition for unittest pass
isanghao Jul 23, 2024
4d4e1a1
New test for dynamic quantization
isanghao Jul 23, 2024
f931fc9
[GPU] option cleanup for per-token quantization
isanghao Jul 26, 2024
3dd6acd
minor fix
isanghao Jul 26, 2024
b3629b4
code cleanup
isanghao Jul 26, 2024
c48e41f
update onednn version
isanghao Jul 27, 2024
b64e974
changing group size to size_t
isanghao Jul 27, 2024
0268aba
code cleanup for review
isanghao Jul 27, 2024
f3cd46e
fix for code review
isanghao Jul 27, 2024
daac4a5
update for code review
isanghao Jul 27, 2024
d4c3c0d
introduce gsnum
isanghao Aug 1, 2024
f01a4a2
move dyn_quan to common op
isanghao Aug 1, 2024
1df4e1f
macro for FC mask
isanghao Aug 1, 2024
845e2a5
group_size is made as vector now
isanghao Aug 2, 2024
2b878a4
update for code review
isanghao Aug 2, 2024
35ea02f
update for code review
isanghao Aug 3, 2024
dd9eda0
reverted property change
isanghao Aug 3, 2024
4d5a520
group_size -> group_sizes
isanghao Aug 3, 2024
f02ba03
ci fix
isanghao Aug 3, 2024
4380f49
style fix
isanghao Aug 6, 2024
b6a15c6
cpplint fix
isanghao Aug 6, 2024
334be2d
build fix
isanghao Aug 6, 2024
796ba79
style fix
isanghao Aug 7, 2024
5492244
fix for review
isanghao Aug 8, 2024
109687c
fix for style
isanghao Aug 8, 2024
43665f5
change group_size format to uint64_t
isanghao Aug 8, 2024
2 changes: 1 addition & 1 deletion src/inference/include/openvino/runtime/properties.hpp
@@ -571,7 +571,7 @@ static constexpr Property<ExecutionMode> execution_mode{"EXECUTION_MODE_HINT"};
* might result in better accuracy, but the drawback is worse performance. Group size equal 0 means dynamic
* quantization optimization is disabled.
*/
static constexpr Property<uint64_t, PropertyMutability::RW> dynamic_quantization_group_size{
static constexpr Property<int64_t, PropertyMutability::RW> dynamic_quantization_group_size{
Contributor:

I'd suggest changing it in a way similar to the num_streams property, i.e. introduce a special Num class with uint64_t underneath and add two special values: PER_TOKEN{UINT64_MAX} and DISABLED{0}.

Contributor (Author):

num_streams is encapsulated in a namespace, but you are not suggesting introducing a new namespace for group_size, right? If we did, it would break API compatibility. Then it would look like the code below. I cannot use a general word like Num because it would not be within a namespace. Does this look good?


struct GSNum  {
    constexpr GSNum() : gs_num{0} {};

    constexpr GSNum(const uint64_t num_) : gs_num{num_} {}

    constexpr operator uint64_t() const {
        return gs_num;
    }

    uint64_t gs_num = 0;
};

static constexpr GSNum PER_TOKEN{UINT64_MAX};

static constexpr GSNum DISABLED{0};

static constexpr Property<GSNum, PropertyMutability::RW> dynamic_quantization_group_size{
    "DYNAMIC_QUANTIZATION_GROUP_SIZE"};

Contributor:

Maybe something like this?

namespace dynamic_quantization {
struct GroupSize  {
    constexpr GroupSize() : gs_num{0} {};

    constexpr GroupSize(const uint64_t num_) : gs_num{num_} {}

    constexpr operator uint64_t() const {
        return gs_num;
    }

    uint64_t gs_num = 0;
};

static constexpr GroupSize PER_TOKEN{UINT64_MAX};

static constexpr GroupSize DISABLED{0};

static constexpr Property<GroupSize, PropertyMutability::RW> group_size{"DYNAMIC_QUANTIZATION_GROUP_SIZE"};

} // namespace dynamic_quantization

// keep it for compatibility
static constexpr Property<dynamic_quantization::GroupSize, PropertyMutability::RW> dynamic_quantization_group_size{"DYNAMIC_QUANTIZATION_GROUP_SIZE"}; 

// ...

  core.set_property(ov::hint::dynamic_quantization_group_size(32));
  core.set_property(ov::hint::dynamic_quantization_group_size(ov::hint::dynamic_quantization::DISABLED));
  core.set_property(ov::hint::dynamic_quantization_group_size(ov::hint::dynamic_quantization::PER_TOKEN));

  core.set_property(ov::hint::dynamic_quantization::group_size(32));
  core.set_property(ov::hint::dynamic_quantization::group_size(ov::hint::dynamic_quantization::DISABLED));
  core.set_property(ov::hint::dynamic_quantization::group_size(ov::hint::dynamic_quantization::PER_TOKEN));
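As a side note, the proposed wrapper can be exercised standalone. The snippet below is a minimal sketch copied from the discussion above (GroupSize, PER_TOKEN, and DISABLED are the proposed names, not yet part of the OpenVINO API), confirming that the implicit uint64_t conversion keeps plain-integer call sites working:

```cpp
#include <cassert>
#include <cstdint>

// Minimal standalone sketch of the proposed wrapper (names are from the
// discussion above and are not yet part of the OpenVINO API).
struct GroupSize {
    constexpr GroupSize() : gs_num{0} {}
    constexpr GroupSize(uint64_t num) : gs_num{num} {}

    // Implicit conversion keeps existing integer call sites compiling unchanged.
    constexpr operator uint64_t() const { return gs_num; }

    uint64_t gs_num = 0;
};

static constexpr GroupSize PER_TOKEN{UINT64_MAX};  // one scale per token
static constexpr GroupSize DISABLED{0};            // dynamic quantization off
```

Because the conversion operator is constexpr, both named constants and raw integers flow through the same Property<GroupSize> unchanged.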

Contributor (Author):

OK, I see. So we will have two hints for group_size. By the way, since this will require changes to both the CPU and GPU plugins, can I follow up on this change in the next PR? In this PR, I will just use the fixed value UINT64_MAX internally for per-token.

Contributor:

I don't mind. But please revert the current changes in the properties and then make a PR with the proper solution.

"DYNAMIC_QUANTIZATION_GROUP_SIZE"};

/**
37 changes: 37 additions & 0 deletions src/plugins/intel_gpu/include/intel_gpu/op/dynamic_quantize.hpp
@@ -0,0 +1,37 @@
// Copyright (C) 2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#pragma once

#include "openvino/op/op.hpp"

namespace ov {
namespace intel_gpu {
namespace op {

/// \brief Operator performing Dynamic Quantize
class DynamicQuantize : public ov::op::Op {
Contributor:

Could you move it to the common part? The CPU plugin will likely reuse it in the future.

Contributor (Author):

applied, thanks!

public:
OPENVINO_OP("DynamicQuantize", "gpu_opset");

DynamicQuantize() = default;
/// \brief Constructs an DynamicQuantize operation.
///
/// \param data Input tensor with data
DynamicQuantize(const Output<Node>& data, int64_t group_size);
Contributor:

Let's change the group_size parameter to std::vector<uint64_t> with rank equal to the rank of the input tensor:
group_size[i] == 1 -- per-element scales
1 < group_size[i] < dim[i] -- grouped case
group_size[i] >= dim[i] -- single scale per channel

Contributor (Author):

Then is it necessary to propagate this change to properties too?

Contributor:

I don't think so. Even if you change the property value to a vector, that will not make it generic, as different operations (or even instances of the same op) can have different ranks, so a single property will never align with that. My view is that we know the context where we insert the DynamicQuantize op (currently, before a compressed FC), so we can determine the expected tensor rank, the quantization requirements for the current op, and how to apply this group_size parameter properly.

Contributor (Author):

Could you elaborate on why you think this direction is better? I guess one advantage would be explicitly expressing which axis the group_size parameter applies to. But if the non-innermost axes will always be 1, I'm not sure it is worth the complexity.


void validate_and_infer_types() override;

std::shared_ptr<Node> clone_with_new_inputs(const ov::OutputVector& new_args) const override;
int64_t get_group_size() { return m_group_size; };

private:
int64_t m_group_size;
};

std::vector<ov::PartialShape> shape_infer(const DynamicQuantize* op, std::vector<ov::PartialShape> input_shapes);

} // namespace op
} // namespace intel_gpu
} // namespace ov
@@ -29,7 +29,20 @@ class FullyConnectedCompressed : public FullyConnected {
const ov::Output<Node> &decompression_scale,
const ov::element::Type output_type = ov::element::undefined);

FullyConnectedCompressed(const OutputVector& inputs,
bool has_zp = true,
bool has_activation_scale = false,
const ov::element::Type output_type = ov::element::undefined);

std::shared_ptr<Node> clone_with_new_inputs(const ov::OutputVector& new_args) const override;

bool get_has_zp() const { return m_has_zp; }
bool get_has_activation_scale() const { return m_has_activation_scale; }


protected:
bool m_has_zp;
bool m_has_activation_scale;
};

} // namespace op
@@ -287,3 +287,4 @@ REGISTER_FACTORY(internal, Placeholder);
REGISTER_FACTORY(internal, SDPA);
REGISTER_FACTORY(internal, IndirectSDPA);
REGISTER_FACTORY(internal, RoPE);
REGISTER_FACTORY(internal, DynamicQuantize);
@@ -0,0 +1,57 @@
// Copyright (C) 2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#pragma once
#include "primitive.hpp"

namespace cldnn {

/// @brief Dynamic Quantize primitive
/// @details Performs dynamic quantization
struct dynamic_quantize : public primitive_base<dynamic_quantize> {
CLDNN_DECLARE_PRIMITIVE(dynamic_quantize);

dynamic_quantize() : primitive_base("", {}), group_size(0) {}

/// @brief Constructs dynamic_quantize primitive
/// @param id This primitive id
/// @param input Input primitive id
/// @param group_size Quantization group size
/// @param data_types Output data types of the quantized data and its scales
/// @param output_size Output data size of the primitive
dynamic_quantize(const primitive_id& id,
const input_info& input,
const int64_t group_size,
const std::vector<optional_data_type> data_types = {optional_data_type(data_types::f16), optional_data_type(data_types::i8)})
: primitive_base(id, {input}, 2, data_types)
, group_size(group_size) {}

int64_t group_size = 0;

size_t hash() const override {
size_t seed = primitive::hash();
seed = hash_combine(seed, group_size);
return seed;
}

bool operator==(const primitive& rhs) const override {
if (!compare_common_params(rhs))
return false;

auto rhs_casted = downcast<const dynamic_quantize>(rhs);

return group_size == rhs_casted.group_size;
}

void save(BinaryOutputBuffer& ob) const override {
primitive_base<dynamic_quantize>::save(ob);
ob << group_size;
}

void load(BinaryInputBuffer& ib) override {
primitive_base<dynamic_quantize>::load(ib);
ib >> group_size;
}
};
} // namespace cldnn
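For reference, the per-token computation this primitive represents (one scale spanning the whole innermost axis of each token) can be sketched in plain C++. This is an illustrative float32 host-side sketch only; the actual GPU kernel produces f16 scales on the device:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative host-side sketch of per-token dynamic quantization (the real
// work happens in the GPU kernel and emits f16 scales, not f32).
// For each token (row): scale = max|x| / 127, q = round(x / scale) as int8.
void quantize_per_token(const std::vector<float>& act, size_t tokens, size_t channels,
                        std::vector<int8_t>& out, std::vector<float>& scales) {
    out.resize(tokens * channels);
    scales.resize(tokens);
    for (size_t t = 0; t < tokens; ++t) {
        float max_abs = 0.f;
        for (size_t c = 0; c < channels; ++c)
            max_abs = std::max(max_abs, std::fabs(act[t * channels + c]));
        const float scale = max_abs > 0.f ? max_abs / 127.f : 1.f;
        scales[t] = scale;
        for (size_t c = 0; c < channels; ++c) {
            long q = std::lround(act[t * channels + c] / scale);
            // Clamp to guard against rounding just past the int8 range.
            out[t * channels + c] = static_cast<int8_t>(std::min(127L, std::max(-127L, q)));
        }
    }
}
```

The resulting scales tensor is what this PR wires into the compressed fully_connected primitive as the new activation_scale input.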
@@ -95,11 +95,46 @@ struct fully_connected : public primitive_base<fully_connected> {
compressed_weights(true),
decompression_scale(decompression_scale),
decompression_zero_point(decompression_zero_point),
dynamic_quantized_activation(false),
input_size(input_size),
weights_rank(weights_rank) {
OPENVINO_ASSERT(!decompression_scale.empty(), "[GPU] Compressed fully connected requires at least decompression scale input");
}

/// @brief Constructs fully connected compressed layer.
/// @param id This primitive id.
/// @param input Input primitive id.
/// @param weights Primitive id containing weights data.
/// @param bias Primitive id containing bias data.
/// @param compression_scale Primitive id containing scale factors for weights decompression.
/// @param compression_zero_point Primitive id containing zero points for weights decompression.
/// @param activation_scale Primitive id containing scale factor for activation.
fully_connected(const primitive_id& id,
const input_info& input,
const primitive_id& weights,
const primitive_id& bias,
const primitive_id& decompression_scale,
const primitive_id& decompression_zero_point,
const input_info& activation_scale,
const data_types data_type,
const size_t input_size = 2,
const size_t weights_rank = 2)
: primitive_base(id, { input }, 1, {optional_data_type{data_type}}),
weights(weights),
bias(bias),
compressed_weights(true),
decompression_scale(decompression_scale),
decompression_zero_point(decompression_zero_point),
dynamic_quantized_activation(false),
activation_scale(activation_scale),
input_size(input_size),
weights_rank(weights_rank) {
if (activation_scale.is_valid())
dynamic_quantized_activation = true;

OPENVINO_ASSERT(!decompression_scale.empty(), "[GPU] Compressed fully connected requires at least decompression scale input");
}

/// @brief Primitive id containing weights data.
primitive_id weights;
/// @brief Primitive id containing bias data.
@@ -108,6 +143,8 @@ struct fully_connected : public primitive_base<fully_connected> {
bool compressed_weights = false;
primitive_id decompression_scale = "";
primitive_id decompression_zero_point = "";
bool dynamic_quantized_activation = false;
input_info activation_scale = {"", 0};
optional_value<float> decompression_zero_point_scalar = optional_value<float>();

/// @brief Primitive dimension size.
@@ -123,6 +160,7 @@ struct fully_connected : public primitive_base<fully_connected> {
seed = hash_combine(seed, compressed_weights);
seed = hash_combine(seed, !decompression_scale.empty());
seed = hash_combine(seed, !decompression_zero_point.empty());
seed = hash_combine(seed, activation_scale.is_valid());
seed = hash_combine(seed, decompression_zero_point_scalar.has_value());
seed = hash_combine(seed, decompression_zero_point_scalar.value_or(0.0f));
return seed;
@@ -140,6 +178,7 @@ struct fully_connected : public primitive_base<fully_connected> {
compressed_weights == rhs_casted.compressed_weights &&
decompression_scale.empty() == rhs_casted.decompression_scale.empty() &&
decompression_zero_point.empty() == rhs_casted.decompression_zero_point.empty() &&
activation_scale.is_valid() == rhs_casted.activation_scale.is_valid() &&
decompression_zero_point_scalar.value_or(0.0f) == rhs_casted.decompression_zero_point_scalar.value_or(0.0f);
}

@@ -150,8 +189,10 @@ struct fully_connected : public primitive_base<fully_connected> {
ob << compressed_weights;
ob << decompression_scale;
ob << decompression_zero_point;
ob << activation_scale;
ob << input_size;
ob << weights_rank;
ob << dynamic_quantized_activation;

if (decompression_zero_point_scalar.has_value()) {
ob << true;
@@ -169,8 +210,10 @@ struct fully_connected : public primitive_base<fully_connected> {
ib >> compressed_weights;
ib >> decompression_scale;
ib >> decompression_zero_point;
ib >> activation_scale;
ib >> input_size;
ib >> weights_rank;
ib >> dynamic_quantized_activation;

bool has_value;
ib >> has_value;
@@ -197,6 +240,9 @@ struct fully_connected : public primitive_base<fully_connected> {
if (!decompression_zero_point.empty())
ret.push_back(decompression_zero_point);

if (activation_scale.is_valid())
ret.push_back(activation_scale);

return ret;
}
};
63 changes: 63 additions & 0 deletions src/plugins/intel_gpu/src/graph/dynamic_quantize.cpp
@@ -0,0 +1,63 @@
// Copyright (C) 2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#include "intel_gpu/op/dynamic_quantize.hpp"
#include "dynamic_quantize_inst.h"

#include "primitive_type_base.h"
#include "json_object.h"
#include <string>

namespace cldnn {
GPU_DEFINE_PRIMITIVE_TYPE_ID(dynamic_quantize);

layout dynamic_quantize_inst::calc_output_layout(dynamic_quantize_node const& node, kernel_impl_params const& impl_param) {
auto desc = impl_param.typed_desc<dynamic_quantize>();
auto input_layout = impl_param.get_input_layout();
auto output_type = data_types::i8;
auto output_format = input_layout.format;

return layout(output_type, output_format, input_layout.get_tensor());
}

template<typename ShapeType>
std::vector<layout> dynamic_quantize_inst::__calc_output_layouts(layout &act_layout, int64_t group_size) {
ov::intel_gpu::op::DynamicQuantize op;
auto output_format = act_layout.format;

std::vector<ShapeType> input_shapes = {
act_layout.get<ShapeType>(),
};

auto output_shapes = shape_infer(&op, input_shapes);

return { layout(output_shapes[0], data_types::i8, output_format), layout(output_shapes[1], data_types::f16, output_format) };
}

template std::vector<layout> dynamic_quantize_inst::__calc_output_layouts<ov::PartialShape>(layout &act_layout, int64_t group_size);

template<typename ShapeType>
std::vector<layout> dynamic_quantize_inst::calc_output_layouts(dynamic_quantize_node const& /*node*/, const kernel_impl_params& impl_param) {
auto desc = impl_param.typed_desc<dynamic_quantize>();
auto input_layout = impl_param.get_input_layout();
return __calc_output_layouts<ov::PartialShape>(input_layout, 0 /* TODO: handle group_size here */);
}

template std::vector<layout> dynamic_quantize_inst::calc_output_layouts<ov::PartialShape>(dynamic_quantize_node const& node,
const kernel_impl_params& impl_param);

std::string dynamic_quantize_inst::to_string(dynamic_quantize_node const& node) {
auto desc = node.get_primitive();
auto node_info = node.desc_to_json();

std::stringstream primitive_description;

node_info->dump(primitive_description);

return primitive_description.str();
}

dynamic_quantize_inst::typed_primitive_inst(network& network, dynamic_quantize_node const& node) : parent(network, node) {}

} // namespace cldnn
3 changes: 3 additions & 0 deletions src/plugins/intel_gpu/src/graph/fully_connected.cpp
@@ -277,6 +277,9 @@ std::string fully_connected_inst::to_string(fully_connected_node const& node) {
fc_info.add("decompression zp value", desc->decompression_zero_point_scalar.value());
}
}
if (desc->dynamic_quantized_activation) {
fc_info.add("activation scale id", desc->activation_scale.pid);
}

node_info->add("fully connected info", fc_info);
node_info->dump(primitive_description);
@@ -409,6 +409,8 @@ void prepare_primitive_fusing::fuse_bias(program &p) {
fc_with_bias_prim->decompression_zero_point = desc->decompression_zero_point;
if (desc->decompression_zero_point_scalar.has_value())
fc_with_bias_prim->decompression_zero_point_scalar = desc->decompression_zero_point_scalar.value();
fc_with_bias_prim->activation_scale = desc->activation_scale;
fc_with_bias_prim->dynamic_quantized_activation = desc->dynamic_quantized_activation;
}
auto& new_fc_node = p.get_or_create(fc_with_bias_prim);
fuse_bias_f(fc, new_fc_node, bias_node, eltw_node);
73 changes: 73 additions & 0 deletions src/plugins/intel_gpu/src/graph/impls/ocl/dynamic_quantize.cpp
@@ -0,0 +1,73 @@
// Copyright (C) 2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#include "openvino/core/validation_util.hpp"
#include "primitive_base.hpp"
#include "dynamic_quantize/dynamic_quantize_kernel_ref.h"
#include "dynamic_quantize/dynamic_quantize_kernel_selector.h"
#include "dynamic_quantize_inst.h"

namespace cldnn {
namespace ocl {

struct dynamic_quantize_impl : typed_primitive_impl_ocl<dynamic_quantize> {
using parent = typed_primitive_impl_ocl<dynamic_quantize>;
using parent::parent;
using kernel_selector_t = kernel_selector::dynamic_quantize_kernel_selector;
using kernel_params_t = kernel_selector::dynamic_quantize_params;

DECLARE_OBJECT_TYPE_SERIALIZATION(cldnn::ocl::dynamic_quantize_impl);

std::unique_ptr<primitive_impl> clone() const override {
return make_unique<dynamic_quantize_impl>(*this);
}

void load(BinaryInputBuffer& ib) override {
parent::load(ib);
if (is_dynamic()) {
auto& kernel_selector = kernel_selector_t::Instance();
auto kernel_impl = kernel_selector.GetImplementation(_kernel_data.kernelName);
kernel_impl->GetUpdateDispatchDataFunc(_kernel_data);
}
}

static kernel_params_t get_kernel_params(const kernel_impl_params& impl_param, bool is_shape_agnostic = false) {
/// TODO: handle group_size here
auto params = get_default_params<kernel_selector::dynamic_quantize_params>(impl_param, is_shape_agnostic);
params.outputs.push_back(convert_data_tensor(impl_param.get_output_layout(1)));

return params;
}

void update_dispatch_data(const kernel_impl_params& impl_param) override {
auto kernel_params = get_kernel_params(impl_param, true);
(_kernel_data.update_dispatch_data_func)(kernel_params, _kernel_data);
}
};

namespace detail {

attach_dynamic_quantize_impl::attach_dynamic_quantize_impl() {
auto types = {
data_types::f16,
data_types::i8
};

auto formats = {
format::bfyx,
};

implementation_map<dynamic_quantize>::add(impl_types::ocl,
shape_types::any,
typed_primitive_impl_ocl<dynamic_quantize>::create<dynamic_quantize_impl>,
types,
formats);
}

} // namespace detail
} // namespace ocl
} // namespace cldnn

BIND_BINARY_BUFFER_WITH_TYPE(cldnn::ocl::dynamic_quantize_impl)
BIND_BINARY_BUFFER_WITH_TYPE(cldnn::dynamic_quantize)
1 change: 1 addition & 0 deletions src/plugins/intel_gpu/src/graph/impls/ocl/register.cpp
@@ -23,6 +23,7 @@ void register_implementations() {
REGISTER_OCL(depth_to_space);
REGISTER_OCL(detection_output);
REGISTER_OCL(dft);
REGISTER_OCL(dynamic_quantize);
REGISTER_OCL(batch_to_space);
REGISTER_OCL(experimental_detectron_detection_output);
REGISTER_OCL(experimental_detectron_generate_proposals_single_image);