[GPU] activations scaling to resolve accuracy issues for infer precision of f16 #27265
Conversation
```cpp
 * @brief This property scales down activations to prevent overflows when inference precision is f16.
 * @ingroup ov_runtime_cpp_prop_api
 */
static constexpr Property<float, PropertyMutability::RW> activations_scale_factor{"ACTIVATIONS_SCALE_FACTOR"};
```
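For context, a minimal usage sketch of how a user could set this property from the C++ API (the model path and the scale value are purely illustrative; the property lands in `ov::hint` as shown later in this PR):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // hypothetical model path; the scale factor is chosen experimentally
    auto model = core.read_model("model.xml");
    auto compiled = core.compile_model(model, "GPU",
                                       ov::hint::inference_precision(ov::element::f16),
                                       ov::hint::activations_scale_factor(8.0f));
    return 0;
}
```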
Please add python bindings for this property
How are users supposed to understand which value to set?
Mainly experimentally, for now. In the future, we plan to have an RT Info attribute of ov::Model which can be set from optimum pipelines or NNCF (if they add a calibration flow at some point), and this attribute will be converted to a plugin property.
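A rough sketch of how that could look on the plugin side (the rt_info key path and the helper name are assumptions, not the final design): prefer an explicitly set property value and otherwise fall back to a scale factor stored in the model's rt_info.

```cpp
#include <memory>
#include <openvino/openvino.hpp>

// Hypothetical helper: use the user-set property if present, otherwise read the
// scale factor from the model's rt_info (key path is an assumption).
float resolve_activations_scale_factor(const std::shared_ptr<ov::Model>& model, float user_value) {
    if (user_value > 0.f)
        return user_value;
    if (model->has_rt_info({"runtime_options", "ACTIVATIONS_SCALE_FACTOR"}))
        return model->get_rt_info<float>({"runtime_options", "ACTIVATIONS_SCALE_FACTOR"});
    return -1.f;  // negative value: scaling disabled
}
```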
Maybe we need to merge this feature later, then?
The property is enough to solve issues in notebooks or in customers' pipelines. The features that I mentioned are needed for a better user experience, but they are not mandatory to deliver improvements to end users.
@ilya-lavrenov please take a look
@e-ddykim, please consider this PR: huggingface/optimum-intel#994
```cpp
float activations_scale_factor = config.get_property(ov::hint::activations_scale_factor);

if (activations_scale_factor > 0.f && infer_precision == ov::element::f16 && !enableInt8) {
```
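For readers following the thread, a toy illustration of the underlying idea (this is not the actual ScaleDownSingleLayer implementation; the helper names are made up): activations are multiplied by 1/scale before a layer and by scale after it, so intermediate f16 values stay within range.

```cpp
#include <memory>
#include <openvino/op/constant.hpp>
#include <openvino/op/multiply.hpp>

// Toy helpers: wrap a tensor with a scale-down / scale-up Multiply.
std::shared_ptr<ov::Node> scale_down(const ov::Output<ov::Node>& value, float scale) {
    auto factor = ov::op::v0::Constant::create(value.get_element_type(), {}, {1.0f / scale});
    return std::make_shared<ov::op::v1::Multiply>(value, factor);
}

std::shared_ptr<ov::Node> scale_up(const ov::Output<ov::Node>& value, float scale) {
    auto factor = ov::op::v0::Constant::create(value.get_element_type(), {}, {scale});
    return std::make_shared<ov::op::v1::Multiply>(value, factor);
}
```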
Why is `!enableInt8` needed? What if we run a model with hybrid quantization?
When `enableInt8` is `true`, activations of Convolution and MatMul are int8, so I thought that activations scaling cannot be applied in this case. Actually, I hit an issue when I tested with a resnet50-int8 model. But I agree with your comment that we need to support hybrid quantized models. I think we can do it better after `ScaleDownSingleLayer` is replaced with updated LPT passes in the future.
As an option, we can move the activation scaling pipeline after the main LPT pipeline and match `ScaleDownSingleLayer` only on nodes which are not in low precision.
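A possible shape for that suggestion (illustrative only, not the actual pass code): a predicate the matcher could use to skip nodes whose inputs LPT already kept in low precision.

```cpp
#include <memory>
#include <openvino/core/node.hpp>
#include <openvino/core/type/element_type.hpp>

// Illustrative helper: true if any input of the node is already int8/uint8,
// i.e. LPT kept it in low precision, so activations scaling should skip it.
bool keeps_low_precision_activations(const std::shared_ptr<ov::Node>& node) {
    for (const auto& input : node->inputs()) {
        const auto et = input.get_element_type();
        if (et == ov::element::i8 || et == ov::element::u8)
            return true;
    }
    return false;
}
```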
```diff
@@ -61,7 +61,7 @@ void ExecutionConfig::set_default() {
         std::make_tuple(ov::hint::kv_cache_precision, ov::element::undefined),
         std::make_tuple(ov::intel_gpu::hint::enable_kernels_reuse, false),
         std::make_tuple(ov::weights_path, ""),
-        std::make_tuple(ov::hint::activations_scale_factor, 0.f),
+        std::make_tuple(ov::hint::activations_scale_factor, -1.f),
```
Don't we need to re-enable scale factor reading from RT info?
The current implementation of activations scaling causes a significant performance drop for LLMs on the oneDNN path, but most LLM IRs already have rt_info now. So I think it would be safer to re-enable it after resolving the performance issue.
But it means that models which really need scaling (flux, sd) won't work out of the box. How big is the perf drop for LLMs with current impl?
My test results showed about a 2x performance drop, and the drop was bigger on faster devices. So I'm working to resolve this issue, and I hope to fix it before the next release timeline.
Overall, LGTM. Please enable scaling by default for dGPU and support models with hybrid quantization later