[GPU] optimize ReduceMax pattern #24073

Merged

Conversation

riverlijunjie
Contributor

@riverlijunjie riverlijunjie commented Apr 17, 2024

Details

  • Optimize the ReduceMax pattern to avoid scheduling the whole primitive onto a single EU

    Sometimes a ReduceMax op is used to reduce a 3D/4D tensor to a scalar output, which causes all of the computation to run on a single EU because there is only one output element. This results in very poor performance for some models.
    For example, in the Grounding DINO model, ReduceMax cost 59.24 ms and consumed 49% of the whole model's execution time.

    To break this bottleneck, this PR engages more EUs by performing the ReduceMax one dimension at a time. We also noticed that the ReduceMax op selects the ref-kernel rather than the opt-kernel, which may cause additional performance issues. However, since ReduceMax does not require much computation, the ref-kernel should be sufficient. The key problem is that only one EU is scheduled to perform the whole ReduceMax computation, which is the root cause of the poor performance.


 Test result shows:
      ReduceMax improved from 59.24 ms to 2.25 ms; fps improved from 8.24 to 15.55 (+88% improvement)


Tickets:

  • 145690

…n single EU

Many ReduceMax layers are used to convert 3D/4D shape data to a scalar output, which causes
all computation to run on a single EU while the other 511 EUs are idle. This results in very poor
performance for ReduceMax primitive execution.
For example, in the Grounding DINO model:
     ReduceMax cost 59.24 ms and consumed 49% of the whole model's execution time.

To break this bottleneck, this PR engages more EUs by performing the ReduceMax one dimension at a time.
Test results show:
     ReduceMax improved from 59.24 ms to 2.25 ms; fps from 8.24 to 15.55 (+88% improvement)
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Apr 17, 2024
@riverlijunjie riverlijunjie force-pushed the river/gpu_reducemax_optimization branch from b174f45 to c3a53b8 Compare April 22, 2024 01:30
@riverlijunjie riverlijunjie marked this pull request as ready for review April 22, 2024 04:57
@riverlijunjie riverlijunjie requested review from a team as code owners April 22, 2024 04:57
peterchen-intel pushed a commit to peterchen-intel/openvino that referenced this pull request May 27, 2024
…le EU

openvinotoolkit#24073
Update to set the same keep_dims as the original ReduceMax op
Update condition
Contributor

@isanghao isanghao left a comment

Hi River, this PR is basically running reduce in a cascaded manner, right?
This may work for the target pattern, but I'm afraid we may see side effects in some cases. Did you check performance across a broad set of networks for both iGPU and dGPU?
For example, the behavior (reduce primitive selection) differs between iGPU and dGPU. Also, there is a case where the optimized reduce kernel is chosen. I'm afraid this code will interfere with the existing code in a complicated way, so I'd suggest first checking why reduce_ref is chosen. Based on that, we can choose the proper way to fix it.

@riverlijunjie
Contributor Author

Hi River, this PR is basically running reduce in a cascaded manner, right? This may work for the target pattern, but I'm afraid we may see side effects in some cases. Did you check performance across a broad set of networks for both iGPU and dGPU? For example, the behavior (reduce primitive selection) differs between iGPU and dGPU. Also, there is a case where the optimized reduce kernel is chosen. I'm afraid this code will interfere with the existing code in a complicated way, so I'd suggest first checking why reduce_ref is chosen. Based on that, we can choose the proper way to fix it.

This issue was only found in the Grounding DINO model; I didn't find the same issue in other models. I will check why reduce_ref is chosen.

Contributor

This PR will be closed in a week because of 2 weeks of no activity.

@github-actions github-actions bot added the Stale label Jun 11, 2024
@wenjiew wenjiew added no_stale Do not mark as stale and removed Stale labels Jun 11, 2024
@riverlijunjie
Contributor Author


@isanghao The Reduce primitive has a dynamic shape, so it always chooses the OCL kernel, right?


As for the cldnn kernel selector, it contains two kernel implementations: reduce_ref and reduce_b_fs_yx_fsv16. However, reduce_b_fs_yx_fsv16 doesn't support dynamic shapes, so the selector has to fall back to the reduce_ref kernel.


It seems that for a reduce with a dynamic shape, only the reduce_ref kernel can be chosen. Is that right?

For this case (the Grounding DINO model), a ReduceMax to a 1x1x1 output leads the OCL GWS to become (1,1,1), which is very inefficient on the GPU. Is there any better solution for it?


@isanghao
Contributor

isanghao commented Jun 25, 2024

Hi @riverlijunjie, thanks for the detailed investigation. I think it would work as a temporary workaround, but the formal fix would be to implement (or improve) a regular optimized kernel with dynamic shape support. If you would like to merge it as-is, could you add a functional test to confirm the accuracy of such a case?

@riverlijunjie
Contributor Author

Hi @riverlijunjie, thanks for the detailed investigation. I think it would work as a temporary workaround, but the formal fix would be to implement (or improve) a regular optimized kernel with dynamic shape support. If you would like to merge it as-is, could you add a functional test to confirm the accuracy of such a case?

Thanks for the comments, @isanghao. This fix solves a VLM performance issue; I will add functional tests for it.

@riverlijunjie
Contributor Author


@isanghao This PR has a unit test; do we also need to add a functional test? It seems transformation tests are located in the unit tests. If needed, could you please give an example of how to add functional tests? Thanks!


ov::intel_gpu::ConvertReduceMaxScalarOutput::ConvertReduceMaxScalarOutput() {
// Check all Reduce nodes
auto m = std::make_shared<ov::pass::pattern::Matcher>(ov::pass::pattern::wrap_type<ov::op::v1::ReduceMax>(),
Contributor

I believe it's applicable to any reduction mode

Contributor Author

yes, good idea!

auto reduce_shape = reduce_max->input_value(1).get_partial_shape();
if (reduce_shape.is_dynamic() || reduce_shape.size() != 1 || reduce_shape.to_shape()[0] != input_shape.size() ||
reduce_shape.to_shape()[0] <= 1) {
return false;
Contributor

Please move all applicability checks to predicate for reduce op

Contributor Author

done

namespace ov {
namespace intel_gpu {

class ConvertReduceMaxScalarOutput : public ov::pass::MatcherPass {
Contributor

Please add a short description for this pass

Contributor Author

done

class ConvertReduceMaxScalarOutput : public ov::pass::MatcherPass {
public:
OPENVINO_RTTI("ConvertReduceMaxScalarOutput", "0");
ConvertReduceMaxScalarOutput();
Contributor

I think a DecomposeReduce... name would better reflect the purpose of the pass

Contributor Author

Will rename to DecomposeReduceForScalarOutput

Comment on lines 65 to 78
for (size_t i = 0; i < input_shape.size() - 1; i++) {
// Reduce one dimension by one dimension to avoid 1 EU do all work.
if (input_shape[i].is_dynamic() || (input_shape[i].is_static() && input_shape[i].get_length() >= 4)) {
if (!reduce_)
reduce_ = std::make_shared<ov::op::v1::ReduceMax>(
reduce_max->input_value(0),
ov::op::v0::Constant::create(ov::element::i64, ov::Shape{1}, {i}),
true);
else
reduce_ = std::make_shared<ov::op::v1::ReduceMax>(
reduce_->get_default_output(),
ov::op::v0::Constant::create(ov::element::i64, ov::Shape{1}, {i}),
true);
}
Contributor

I think it can be simplified like this:

            auto input = reduce_max->input_value(0);
            for (size_t i = 0; i < input_shape.size() - 1; i++) {
                // Reduce one dimension at a time to avoid one EU doing all the work.
                if (input_shape[i].is_dynamic() || (input_shape[i].is_static() && input_shape[i].get_length() >= 4)) {
                    reduce_ = std::make_shared<ov::op::v1::ReduceMax>(
                        input,
                        ov::op::v0::Constant::create(ov::element::i64, ov::Shape{1}, {i}),
                        true);
                    input = reduce_->get_default_output();
                }
            }

Contributor Author

done

}

std::shared_ptr<ov::op::v1::ReduceMax> reduce_ = nullptr, reduce = nullptr;
if (dynamic_shape == false) {
Contributor

nit: !dynamic_shape

Contributor Author

done


const auto input_shape = reduce_max->input_value(0).get_partial_shape();
auto reduce_shape = reduce_max->input_value(1).get_partial_shape();
if (reduce_shape.is_dynamic() || reduce_shape.size() != 1 || reduce_shape.to_shape()[0] != input_shape.size() ||
Contributor

Would be good to check that reduce_shape.is_static() before doing .to_shape()

Contributor Author

done

using namespace testing;
using namespace ov::intel_gpu;

static std::shared_ptr<ov::Model> BuildFunction(const ov::PartialShape& input_shape,
Contributor

nit: build_model

Contributor Author

done

size_t reduce_count = 0;
for (auto& ops : func->get_ops()) {
std::string type_name(ops->get_type_name());
if (type_name.find("ReduceMax") != std::string::npos) {
Contributor

Could you implement tests in a similar way to other transformation unit tests? I.e. Something like this:

TEST_F(TransformationTestsF, SplitReduceMaxTest1) {
    {
        model = build_model(...);
        manager.register_pass<ConvertReduceMaxScalarOutput>();
    }
    {
        // build expected model
        model_ref = std::make_shared<ov::Model>(ov::ResultVector{...}, ov::ParameterVector{...});
    }
}

Checking the reduce count doesn't guarantee that the transformation works correctly.

Contributor Author

ok, will change to such test cases.

Contributor Author

done

@isanghao
Contributor

isanghao commented Aug 5, 2024

no perf issue from dgpu daily test


const auto input_shape = reduce_orig->input_value(0).get_partial_shape();
const auto reduce_shape = reduce_orig->input_value(1).get_partial_shape();
if (reduce_shape.to_shape()[0] != input_shape.size())
Contributor

nit: that could also be part of the reduce predicate, I believe

Contributor Author

done

using ReduceType = cldnn::reduce_mode;

#define create_reduce(arg, reduction, keep_dims, reduce_type) \
if (reduce_type == reduce_mode::sum) \
Contributor

Please don't use cldnn types in transformation unit tests. You can pass the op type to this macro instead of an enum value:

#define create_reduce(arg, reduction, keep_dims, ReduceType) \
    reduce = std::make_shared<ReduceType>(arg, reduction, keep_dims);

Also, I'd suggest replacing the macro with a template.

Contributor Author

done

@isanghao isanghao dismissed their stale review August 16, 2024 04:12

outdated review

@isanghao isanghao added this pull request to the merge queue Aug 16, 2024
Merged via the queue into openvinotoolkit:master with commit 8b82aae Aug 16, 2024
122 checks passed