[GPU] activations scaling to resolve accuracy issues for infer precision of f16 #27265
Conversation
```cpp
 * @brief This property scales down activations to prevent overflows when inference precision is f16.
 * @ingroup ov_runtime_cpp_prop_api
 */
static constexpr Property<float, PropertyMutability::RW> activations_scale_factor{"ACTIVATIONS_SCALE_FACTOR"};
```
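For context, a minimal usage sketch of how a user could set this property from the C++ API (the model path and the scale value are purely illustrative; the property lands in `ov::hint` as shown later in this PR):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // hypothetical model path; the scale factor is chosen experimentally
    auto model = core.read_model("model.xml");
    auto compiled = core.compile_model(model, "GPU",
                                       ov::hint::inference_precision(ov::element::f16),
                                       ov::hint::activations_scale_factor(8.0f));
    return 0;
}
```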
Please add python bindings for this property
How are users supposed to understand which value to set?
Mainly experimentally, for now. In the future, we plan to have an RT Info attribute of ov::Model which can be set from optimum pipelines or NNCF (if they add a calibration flow at some point), and this attribute will be converted to a plugin property.
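A rough sketch of how that could look on the plugin side (the rt_info key path and the helper name are assumptions, not the final design): prefer an explicitly set property value and otherwise fall back to a scale factor stored in the model's rt_info.

```cpp
#include <memory>
#include <openvino/openvino.hpp>

// Hypothetical helper: use the user-set property if present, otherwise read the
// scale factor from the model's rt_info (key path is an assumption).
float resolve_activations_scale_factor(const std::shared_ptr<ov::Model>& model, float user_value) {
    if (user_value > 0.f)
        return user_value;
    if (model->has_rt_info({"runtime_options", "ACTIVATIONS_SCALE_FACTOR"}))
        return model->get_rt_info<float>({"runtime_options", "ACTIVATIONS_SCALE_FACTOR"});
    return -1.f;  // negative value: scaling disabled
}
```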
Maybe we need to merge this feature later, then?
The property is enough to solve issues in notebooks or in customers' pipelines. The features that I mentioned are needed for a better user experience, but they are not mandatory to deliver improvements to end users.
@ilya-lavrenov please take a look
@e-ddykim, please consider this PR: huggingface/optimum-intel#994
```cpp
float activations_scale_factor = config.get_property(ov::hint::activations_scale_factor);

if (activations_scale_factor > 0.f && infer_precision == ov::element::f16 && !enableInt8) {
```
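For readers following the thread, a toy illustration of the underlying idea (this is not the actual ScaleDownSingleLayer implementation; the helper names are made up): activations are multiplied by 1/scale before a layer and by scale after it, so intermediate f16 values stay within range.

```cpp
#include <memory>
#include <openvino/op/constant.hpp>
#include <openvino/op/multiply.hpp>

// Toy helpers: wrap a tensor with a scale-down / scale-up Multiply.
std::shared_ptr<ov::Node> scale_down(const ov::Output<ov::Node>& value, float scale) {
    auto factor = ov::op::v0::Constant::create(value.get_element_type(), {}, {1.0f / scale});
    return std::make_shared<ov::op::v1::Multiply>(value, factor);
}

std::shared_ptr<ov::Node> scale_up(const ov::Output<ov::Node>& value, float scale) {
    auto factor = ov::op::v0::Constant::create(value.get_element_type(), {}, {scale});
    return std::make_shared<ov::op::v1::Multiply>(value, factor);
}
```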
Why is `!enableInt8` needed? What if we run a model with hybrid quantization?
When `enableInt8` is `true`, activations of Convolution and MatMul are int8, so I thought that activations scaling cannot be applied in this case. Actually, I hit an issue when I tested with a resnet50-int8 model. But I agree with your comment that we need to support hybrid quantized models. I think we can do it better after `ScaleDownSingleLayer` is replaced with updated LPT passes in the future.
As an option, we can move the activation scaling pipeline after the main LPT pipeline and match `ScaleDownSingleLayer` only on nodes which are not in low precision.
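A possible shape for that suggestion (illustrative only, not the actual pass code): a predicate the matcher could use to skip nodes whose inputs LPT already kept in low precision.

```cpp
#include <memory>
#include <openvino/core/node.hpp>
#include <openvino/core/type/element_type.hpp>

// Illustrative helper: true if any input of the node is already int8/uint8,
// i.e. LPT kept it in low precision, so activations scaling should skip it.
bool keeps_low_precision_activations(const std::shared_ptr<ov::Node>& node) {
    for (const auto& input : node->inputs()) {
        const auto et = input.get_element_type();
        if (et == ov::element::i8 || et == ov::element::u8)
            return true;
    }
    return false;
}
```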
```diff
@@ -61,7 +61,7 @@ void ExecutionConfig::set_default() {
         std::make_tuple(ov::hint::kv_cache_precision, ov::element::undefined),
         std::make_tuple(ov::intel_gpu::hint::enable_kernels_reuse, false),
         std::make_tuple(ov::weights_path, ""),
-        std::make_tuple(ov::hint::activations_scale_factor, 0.f),
+        std::make_tuple(ov::hint::activations_scale_factor, -1.f),
```
Don't we need to re-enable scale factor reading from RT info?
The current implementation of activations scaling causes a significant performance drop for LLMs on the oneDNN path, but most LLM IRs already have rt_info now. So I think it would be safer to re-enable it after resolving the performance issue.
But it means that models which really need scaling (flux, sd) won't work out of the box. How big is the perf drop for LLMs with current impl?
My test results showed about a 2x performance drop, and the drop was bigger on faster devices. So I'm working to resolve this issue, and I hope to fix it before the next release timeline.
Overall, LGTM. Please enable scaling by default for dGPU and support models with hybrid quantization later