[GPU] optimize ReduceMax pattern #24073

Merged

Conversation

riverlijunjie
Contributor

@riverlijunjie riverlijunjie commented Apr 17, 2024

Details

  • Optimize the ReduceMax pattern to avoid scheduling the whole primitive onto a single EU

    Sometimes a ReduceMax op is used to reduce a 3D/4D tensor to a scalar output, which causes all of the computation to run on a single EU because there is only one output element. This results in very poor performance for some models.
    For example, in the Grounding DINO model, ReduceMax cost 59.24 ms and consumed 49% of the whole model's execution time.

    To break this bottleneck, this PR engages more EUs by performing the ReduceMax one dimension at a time. We also noticed that the ReduceMax op selects the ref-kernel rather than the opt-kernel, which may cause additional performance issues. However, since ReduceMax does not require much computation, the ref-kernel should be sufficient. The key problem is that only one EU is scheduled to perform the whole ReduceMax computation, which is the root cause of the poor performance.


 Test result shows:
      ReduceMax improved from 59.24 ms to 2.25 ms; fps improved from 8.24 to 15.55 (+88% improvement)


Tickets:

  • 145690

…n single EU

Many ReduceMax layers are used to convert 3D/4D shape data to a scalar output, which causes
all computation to run on a single EU while the other 511 EUs are idle. This results in very poor
performance for ReduceMax primitive execution.
For example, in the Grounding DINO model:
     ReduceMax cost 59.24 ms and consumed 49% of the whole model's execution time.

To break this bottleneck, this PR engages more EUs by performing the ReduceMax one dimension at a time.
Test results show:
     ReduceMax improved from 59.24 ms to 2.25 ms; fps from 8.24 to 15.55 (+88% improvement)
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Apr 17, 2024
@riverlijunjie riverlijunjie force-pushed the river/gpu_reducemax_optimization branch from b174f45 to c3a53b8 Compare April 22, 2024 01:30
@riverlijunjie riverlijunjie marked this pull request as ready for review April 22, 2024 04:57
@riverlijunjie riverlijunjie requested review from a team as code owners April 22, 2024 04:57
peterchen-intel pushed a commit to peterchen-intel/openvino that referenced this pull request May 27, 2024
…le EU

openvinotoolkit#24073
Update to set the same keep_dims as the original ReduceMax op
Update condition
Contributor

@isanghao isanghao left a comment

Hi River, this PR is basically running reduce in a cascaded manner, right?
This may work for the target pattern, but I'm afraid we may see side effects in some cases. Did you check performance across a broad set of networks for both iGPU and dGPU?
For example, the behavior (reduce primitive selection) differs between iGPU and dGPU. Also, there is a case where the optimized reduce kernel is chosen. I'm afraid this code will interfere with the existing code in a complicated way, so I'd suggest first checking why reduce_ref is chosen. Based on that, we can choose the proper way to fix it.

@riverlijunjie
Contributor Author

Hi River, this PR is basically running reduce in a cascaded manner, right? This may work for the target pattern, but I'm afraid we may see side effects in some cases. Did you check performance across a broad set of networks for both iGPU and dGPU? For example, the behavior (reduce primitive selection) differs between iGPU and dGPU. Also, there is a case where the optimized reduce kernel is chosen. I'm afraid this code will interfere with the existing code in a complicated way, so I'd suggest first checking why reduce_ref is chosen. Based on that, we can choose the proper way to fix it.

This issue was only found in the Grounding DINO model; I didn't find the same issue in other models. I will check why reduce_ref is chosen.

Contributor

This PR will be closed in a week because of 2 weeks of no activity.

@github-actions github-actions bot added the Stale label Jun 11, 2024
@wenjiew wenjiew added no_stale Do not mark as stale and removed Stale labels Jun 11, 2024
@riverlijunjie
Contributor Author


@isanghao The Reduce primitive has a dynamic shape, so it always chooses the OCL kernel, right?


As for the cldnn kernel selector, it contains two kernel implementations: reduce_ref and reduce_b_fs_yx_fsv16. However, reduce_b_fs_yx_fsv16 doesn't support dynamic shapes, so the selector has to fall back to the reduce_ref kernel.


It seems that for a reduce with a dynamic shape, only the reduce_ref kernel can be chosen. Is that right?

For this case (the Grounding DINO model), a ReduceMax to a 1x1x1 output leads the OCL GWS to become (1,1,1), which is very inefficient on the GPU. Is there any better solution for it?


@isanghao
Contributor

isanghao commented Jun 25, 2024

Hi @riverlijunjie, thanks for the detailed investigation. I think it would work as a temporary workaround, but the formal fix would be to implement (or improve) a regular optimized kernel with dynamic shape support. If you would like to merge it as-is, could you add a functional test to confirm the accuracy of such a case?

@riverlijunjie
Contributor Author

Hi @riverlijunjie, thanks for the detailed investigation. I think it would work as a temporary workaround, but the formal fix would be to implement (or improve) a regular optimized kernel with dynamic shape support. If you would like to merge it as-is, could you add a functional test to confirm the accuracy of such a case?

Thanks for the comments, @isanghao. This fix solves a VLM performance issue; I will add functional tests for it.

@riverlijunjie
Contributor Author


@isanghao This PR has a unit test; do we also need to add a functional test? It seems transformation tests are located in the unit tests. If needed, could you please give an example of how to add functional tests? Thanks!


ov::intel_gpu::ConvertReduceMaxScalarOutput::ConvertReduceMaxScalarOutput() {
// Check all Reduce nodes
auto m = std::make_shared<ov::pass::pattern::Matcher>(ov::pass::pattern::wrap_type<ov::op::v1::ReduceMax>(),
Contributor

I believe it's applicable to any reduction mode

Contributor Author

yes, good idea!

auto reduce_shape = reduce_max->input_value(1).get_partial_shape();
if (reduce_shape.is_dynamic() || reduce_shape.size() != 1 || reduce_shape.to_shape()[0] != input_shape.size() ||
reduce_shape.to_shape()[0] <= 1) {
return false;
Contributor

Please move all applicability checks to predicate for reduce op

Contributor Author

done

namespace ov {
namespace intel_gpu {

class ConvertReduceMaxScalarOutput : public ov::pass::MatcherPass {
Contributor

Please add a short description for this pass

Contributor Author

done

class ConvertReduceMaxScalarOutput : public ov::pass::MatcherPass {
public:
OPENVINO_RTTI("ConvertReduceMaxScalarOutput", "0");
ConvertReduceMaxScalarOutput();
Contributor

I think a DecomposeReduce... name would better reflect the purpose of the pass

Contributor Author

Will rename to DecomposeReduceForScalarOutput

Comment on lines 65 to 78
for (size_t i = 0; i < input_shape.size() - 1; i++) {
// Reduce one dimension by one dimension to avoid 1 EU do all work.
if (input_shape[i].is_dynamic() || (input_shape[i].is_static() && input_shape[i].get_length() >= 4)) {
if (!reduce_)
reduce_ = std::make_shared<ov::op::v1::ReduceMax>(
reduce_max->input_value(0),
ov::op::v0::Constant::create(ov::element::i64, ov::Shape{1}, {i}),
true);
else
reduce_ = std::make_shared<ov::op::v1::ReduceMax>(
reduce_->get_default_output(),
ov::op::v0::Constant::create(ov::element::i64, ov::Shape{1}, {i}),
true);
}
Contributor

I think it can be simplified like this:

            auto input = reduce_max->input_value(0);
            for (size_t i = 0; i < input_shape.size() - 1; i++) {
                // Reduce one dimension at a time to avoid one EU doing all the work.
                if (input_shape[i].is_dynamic() || (input_shape[i].is_static() && input_shape[i].get_length() >= 4)) {
                    reduce_ = std::make_shared<ov::op::v1::ReduceMax>(
                        input,
                        ov::op::v0::Constant::create(ov::element::i64, ov::Shape{1}, {i}),
                        true);
                    input = reduce_->get_default_output();
                }
            }

Contributor Author

done

}

std::shared_ptr<ov::op::v1::ReduceMax> reduce_ = nullptr, reduce = nullptr;
if (dynamic_shape == false) {
Contributor

nit: !dynamic_shape

Contributor Author

done


const auto input_shape = reduce_max->input_value(0).get_partial_shape();
auto reduce_shape = reduce_max->input_value(1).get_partial_shape();
if (reduce_shape.is_dynamic() || reduce_shape.size() != 1 || reduce_shape.to_shape()[0] != input_shape.size() ||
Contributor

Would be good to check that reduce_shape.is_static() before doing .to_shape()

Contributor Author

done

using namespace testing;
using namespace ov::intel_gpu;

static std::shared_ptr<ov::Model> BuildFunction(const ov::PartialShape& input_shape,
Contributor

nit: build_model

Contributor Author

done

size_t reduce_count = 0;
for (auto& ops : func->get_ops()) {
std::string type_name(ops->get_type_name());
if (type_name.find("ReduceMax") != std::string::npos) {
Contributor

Could you implement tests in a similar way to other transformation unit tests? I.e. Something like this:

TEST_F(TransformationTestsF, SplitReduceMaxTest1) {
    {
        model = build_model(...);
        manager.register_pass<ConvertReduceMaxScalarOutput>();
    }
    {
        // build expected model
        model_ref = std::make_shared<ov::Model>(ov::ResultVector{...}, ov::ParameterVector{...});
    }
}

Checking the reduce count doesn't guarantee that the transformation works correctly.

Contributor Author

ok, will change to such test cases.

Contributor Author

done

@isanghao
Contributor

isanghao commented Aug 5, 2024

no perf issue from dgpu daily test


const auto input_shape = reduce_orig->input_value(0).get_partial_shape();
const auto reduce_shape = reduce_orig->input_value(1).get_partial_shape();
if (reduce_shape.to_shape()[0] != input_shape.size())
Contributor

nit: that could also be part of the reduce predicate, I believe

Contributor Author

done

using ReduceType = cldnn::reduce_mode;

#define create_reduce(arg, reduction, keep_dims, reduce_type) \
if (reduce_type == reduce_mode::sum) \
Contributor

Please don't use cldnn types in transformation unit tests. You can pass the op type to this macro instead of an enum value:

#define create_reduce(arg, reduction, keep_dims, ReduceType) \
    reduce = std::make_shared<ReduceType>(arg, reduction, keep_dims);

Also, I'd suggest replacing the macro with a template.

Contributor Author

done

@isanghao isanghao dismissed their stale review August 16, 2024 04:12

outdated review

@isanghao isanghao added this pull request to the merge queue Aug 16, 2024
Merged via the queue into openvinotoolkit:master with commit 8b82aae Aug 16, 2024
122 checks passed