[GPU] Dynamic quantization for OneDNN FC #25372
Changes from 8 commits
@@ -0,0 +1,37 @@
```cpp
// Copyright (C) 2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#pragma once

#include "openvino/op/op.hpp"

namespace ov {
namespace intel_gpu {
namespace op {

/// \brief Operator performing Dynamic Quantize
class DynamicQuantize : public ov::op::Op {
```
**Review comment:** Could you move it to the common part? The CPU plugin will likely reuse it in the future.

**Reply:** Applied, thanks!
```cpp
public:
    OPENVINO_OP("DynamicQuantize", "gpu_opset");

    DynamicQuantize() = default;
    /// \brief Constructs a DynamicQuantize operation.
    ///
    /// \param data Input tensor with data
    /// \param group_size Quantization group size
    DynamicQuantize(const Output<Node>& data, int64_t group_size);
```
**Review comment:** Let's change the group_size parameter to a vector.

**Reply:** Then is it necessary to propagate this change to the properties too?

**Review comment:** I don't think so. Even if you change the value of the property to a vector, it will not make it generic, as different operations (or even instances of the same op) can have different ranks, so a single property will never be aligned with that. My vision is that we know the context where we insert the DynamicQuantize op (currently, before a compressed FC), so we can understand the expected tensor rank, the quantization requirements for the current op, and how to apply this group_size parameter properly.

**Reply:** Could you elaborate on why you think this direction is better? I guess one advantage would be explicitly expressing which axis the group_size applies to. But if the non-innermost axes will always be 1, I'm not sure whether it is worth the complexity.
```cpp
    void validate_and_infer_types() override;

    std::shared_ptr<Node> clone_with_new_inputs(const ov::OutputVector& new_args) const override;
    int64_t get_group_size() { return m_group_size; }

private:
    int64_t m_group_size;
};

std::vector<ov::PartialShape> shape_infer(const DynamicQuantize* op, std::vector<ov::PartialShape> input_shapes);

}  // namespace op
}  // namespace intel_gpu
}  // namespace ov
```
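The `shape_infer` helper above returns one shape per output. As a hedged sketch of what those two outputs should look like under per-token quantization (my assumption; `expected_dq_shapes` is an illustrative helper, not part of this PR):

```cpp
#include <vector>
#include "openvino/core/partial_shape.hpp"

// Illustrative only: expected output shapes of DynamicQuantize for per-token
// quantization, i.e. one group spanning the whole innermost axis.
std::vector<ov::PartialShape> expected_dq_shapes(const ov::PartialShape& data_shape) {
    ov::PartialShape quantized = data_shape;  // output 0: i8 data, same shape as the input
    ov::PartialShape scales = data_shape;     // output 1: one f16 scale per group
    scales[scales.size() - 1] = 1;            // per-token: innermost dimension collapses to 1
    return {quantized, scales};
}
```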
@@ -0,0 +1,57 @@
```cpp
// Copyright (C) 2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#pragma once
#include "primitive.hpp"

namespace cldnn {

/// @brief Dynamic Quantize primitive
/// @details Performs dynamic quantization
struct dynamic_quantize : public primitive_base<dynamic_quantize> {
    CLDNN_DECLARE_PRIMITIVE(dynamic_quantize);

    dynamic_quantize() : primitive_base("", {}), group_size(0) {}

    /// @brief Constructs dynamic_quantize primitive
    /// @param id This primitive id
    /// @param input Input primitive id
    /// @param group_size Quantization group size
    /// @param data_types Output data types of the two outputs (quantized data and scale)
    dynamic_quantize(const primitive_id& id,
                     const input_info& input,
                     const int64_t group_size,
                     const std::vector<optional_data_type> data_types = {optional_data_type(data_types::f16),
                                                                         optional_data_type(data_types::i8)})
        : primitive_base(id, {input}, 2, data_types)
        , group_size(group_size) {}

    int64_t group_size = 0;

    size_t hash() const override {
        size_t seed = primitive::hash();
        seed = hash_combine(seed, group_size);
        return seed;
    }

    bool operator==(const primitive& rhs) const override {
        if (!compare_common_params(rhs))
            return false;

        auto rhs_casted = downcast<const dynamic_quantize>(rhs);

        return group_size == rhs_casted.group_size;
    }

    void save(BinaryOutputBuffer& ob) const override {
        primitive_base<dynamic_quantize>::save(ob);
        ob << group_size;
    }

    void load(BinaryInputBuffer& ib) override {
        primitive_base<dynamic_quantize>::load(ib);
        ib >> group_size;
    }
};
}  // namespace cldnn
```
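For orientation, a minimal usage sketch showing how this primitive would be wired into a topology. The ids, shapes, group_size value, and header paths are assumptions for illustration, not taken from this diff:

```cpp
#include "intel_gpu/graph/topology.hpp"
#include "intel_gpu/primitives/input_layout.hpp"
#include "intel_gpu/primitives/dynamic_quantize.hpp"  // assumed location of the new header

cldnn::topology make_dq_topology() {
    // f16 activations of shape [1, 4096]
    cldnn::layout act_layout{ov::PartialShape{1, 4096}, cldnn::data_types::f16, cldnn::format::bfyx};

    cldnn::topology topology;
    topology.add(cldnn::input_layout("act", act_layout));
    // A group_size equal to the row width is effectively per-token quantization;
    // the primitive produces two outputs: quantized data and per-group scales.
    topology.add(cldnn::dynamic_quantize("dyn_quan", cldnn::input_info("act"), /*group_size=*/4096));
    return topology;
}
```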
@@ -0,0 +1,63 @@
```cpp
// Copyright (C) 2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#include "intel_gpu/op/dynamic_quantize.hpp"
#include "dynamic_quantize_inst.h"

#include "primitive_type_base.h"
#include "json_object.h"
#include <string>

namespace cldnn {
GPU_DEFINE_PRIMITIVE_TYPE_ID(dynamic_quantize);

layout dynamic_quantize_inst::calc_output_layout(dynamic_quantize_node const& node, kernel_impl_params const& impl_param) {
    auto desc = impl_param.typed_desc<dynamic_quantize>();
    auto input_layout = impl_param.get_input_layout();
    auto output_type = data_types::i8;
    auto output_format = input_layout.format;

    return layout(output_type, output_format, input_layout.get_tensor());
}

template<typename ShapeType>
std::vector<layout> dynamic_quantize_inst::__calc_output_layouts(layout& act_layout, int64_t group_size) {
    ov::intel_gpu::op::DynamicQuantize op;
    auto output_format = act_layout.format;

    std::vector<ShapeType> input_shapes = {
        act_layout.get<ShapeType>(),
    };

    auto output_shapes = shape_infer(&op, input_shapes);

    return { layout(output_shapes[0], data_types::i8, output_format),
             layout(output_shapes[1], data_types::f16, output_format) };
}

template std::vector<layout> dynamic_quantize_inst::__calc_output_layouts<ov::PartialShape>(layout& act_layout, int64_t group_size);

template<typename ShapeType>
std::vector<layout> dynamic_quantize_inst::calc_output_layouts(dynamic_quantize_node const& /*node*/, const kernel_impl_params& impl_param) {
    auto desc = impl_param.typed_desc<dynamic_quantize>();
    auto input_layout = impl_param.get_input_layout();

    return __calc_output_layouts<ov::PartialShape>(input_layout, 0 /* TODO: handle group_size here */);
}

template std::vector<layout> dynamic_quantize_inst::calc_output_layouts<ov::PartialShape>(dynamic_quantize_node const& node,
                                                                                          const kernel_impl_params& impl_param);

std::string dynamic_quantize_inst::to_string(dynamic_quantize_node const& node) {
    auto desc = node.get_primitive();
    auto node_info = node.desc_to_json();

    std::stringstream primitive_description;

    node_info->dump(primitive_description);

    return primitive_description.str();
}

dynamic_quantize_inst::typed_primitive_inst(network& network, dynamic_quantize_node const& node) : parent(network, node) {}

}  // namespace cldnn
```
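The two output layouts above (i8 data plus f16 scales) correspond to symmetric per-group dynamic quantization, where scales are computed from the activation values at runtime. A minimal scalar sketch of that math (the general technique, not this PR's kernel):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative only: symmetric dynamic quantization of one group of float
// activations into i8 data plus a single scale derived from the group's range.
void quantize_group(const std::vector<float>& x, std::vector<int8_t>& q, float& scale) {
    float max_abs = 0.0f;
    for (float v : x)
        max_abs = std::max(max_abs, std::fabs(v));
    scale = max_abs / 127.0f;                  // one scale per group (the f16 output)
    const float inv_scale = scale == 0.0f ? 0.0f : 1.0f / scale;
    q.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i)      // quantized values (the i8 output)
        q[i] = static_cast<int8_t>(std::lround(x[i] * inv_scale));
}
```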
@@ -0,0 +1,73 @@
```cpp
// Copyright (C) 2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#include "openvino/core/validation_util.hpp"
#include "primitive_base.hpp"
#include "dynamic_quantize/dynamic_quantize_kernel_ref.h"
#include "dynamic_quantize/dynamic_quantize_kernel_selector.h"
#include "dynamic_quantize_inst.h"

namespace cldnn {
namespace ocl {

struct dynamic_quantize_impl : typed_primitive_impl_ocl<dynamic_quantize> {
    using parent = typed_primitive_impl_ocl<dynamic_quantize>;
    using parent::parent;
    using kernel_selector_t = kernel_selector::dynamic_quantize_kernel_selector;
    using kernel_params_t = kernel_selector::dynamic_quantize_params;

    DECLARE_OBJECT_TYPE_SERIALIZATION(cldnn::ocl::dynamic_quantize_impl);

    std::unique_ptr<primitive_impl> clone() const override {
        return make_unique<dynamic_quantize_impl>(*this);
    }

    void load(BinaryInputBuffer& ib) override {
        parent::load(ib);
        if (is_dynamic()) {
            auto& kernel_selector = kernel_selector_t::Instance();
            auto kernel_impl = kernel_selector.GetImplementation(_kernel_data.kernelName);
            kernel_impl->GetUpdateDispatchDataFunc(_kernel_data);
        }
    }

    static kernel_params_t get_kernel_params(const kernel_impl_params& impl_param, bool is_shape_agnostic = false) {
        /// TODO: handle group_size here
        auto params = get_default_params<kernel_selector::dynamic_quantize_params>(impl_param, is_shape_agnostic);
        params.outputs.push_back(convert_data_tensor(impl_param.get_output_layout(1)));

        return params;
    }

    void update_dispatch_data(const kernel_impl_params& impl_param) override {
        auto kernel_params = get_kernel_params(impl_param, true);
        (_kernel_data.update_dispatch_data_func)(kernel_params, _kernel_data);
    }
};

namespace detail {

attach_dynamic_quantize_impl::attach_dynamic_quantize_impl() {
    auto types = {
        data_types::f16,
        data_types::i8
    };

    auto formats = {
        format::bfyx,
    };

    implementation_map<dynamic_quantize>::add(impl_types::ocl,
                                              shape_types::any,
                                              typed_primitive_impl_ocl<dynamic_quantize>::create<dynamic_quantize_impl>,
                                              types,
                                              formats);
}

}  // namespace detail
}  // namespace ocl
}  // namespace cldnn

BIND_BINARY_BUFFER_WITH_TYPE(cldnn::ocl::dynamic_quantize_impl)
BIND_BINARY_BUFFER_WITH_TYPE(cldnn::dynamic_quantize)
```
**Review comment:** I'd suggest changing it in a way similar to the num_streams property, i.e. introduce a special `Num` class with `uint64_t` underneath and add two special values, `PER_TOKEN{UINT64_MAX}` and `DISABLED{0}`.

**Reply:** num_streams is encapsulated in a namespace, but you are not suggesting introducing a new namespace for group_size, right? If we do so, it will break API compatibility. Then it would look like the snippet below. I cannot use a general word like `Num` because it is not within a namespace. Does this look good?

**Review comment:** Maybe something like this?

**Reply:** OK, I see. So we will have two hints for group_size. By the way, as this will require changes in both the CPU and GPU plugins, can I follow up on this change in the next PR? In this PR, I will just use the fixed value UINT64_MAX internally for per-token.

**Review comment:** I don't mind. But please revert the current changes in the properties and then make a PR with the proper solution.
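As a rough illustration of the proposal's shape, here is a hedged sketch modeled on `ov::streams::Num`. The names (`GroupSize`, `dynamic_quantization_group_size`) and the property key are assumptions for illustration, not an API from this PR:

```cpp
#include <cstdint>
#include "openvino/runtime/properties.hpp"

namespace ov {
namespace hint {

// Hypothetical uint64_t wrapper so the property can carry sentinel values,
// in the style of ov::streams::Num. Not part of this PR.
struct GroupSize {
    uint64_t num = 0;
    constexpr GroupSize() = default;
    constexpr GroupSize(uint64_t n) : num(n) {}
    constexpr operator uint64_t() const { return num; }
};

// Special values suggested in the review thread:
inline constexpr GroupSize PER_TOKEN{UINT64_MAX};  // one quantization group per token
inline constexpr GroupSize DISABLED{0};            // dynamic quantization turned off

// Hypothetical property key, for illustration only:
inline constexpr Property<GroupSize> dynamic_quantization_group_size{"DYNAMIC_QUANTIZATION_GROUP_SIZE"};

}  // namespace hint
}  // namespace ov
```

A plugin reading the hint would then map `PER_TOKEN` to per-row grouping internally, which matches the interim plan above of using the fixed value UINT64_MAX in this PR.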