[GPU] optimize ReduceMax pattern #24073
Conversation
…n single EU Many ReduceMax layers are used to convert 3D/4D shape data to a scalar output, which leads to all computation being executed on a single EU while the other 511 EUs are idle. This causes very poor performance for ReduceMax primitive execution. For example, in the Grounding DINO model, ReduceMax cost 59.24 ms and consumed 49% of the whole model's execution time. To break this bottleneck, this PR applies more EUs to execute this primitive by doing ReduceMax one dimension at a time. Test results show ReduceMax improves from 59.24 ms to 2.25 ms, and fps from 8.24 to 15.55 (+88% improvement).
b174f45 to c3a53b8
…le EU openvinotoolkit#24073 Update to set the same keep_dims as the original ReduceMax op. Update condition.
Hi River, this PR basically runs reduce in a cascaded manner, right?
This may work for the target pattern, but I'm afraid we may see side effects in some cases. Did you check performance on a broad set of networks for both iGPU and dGPU?
For example, the behavior (reduce primitive selection) differs between iGPU and dGPU. There is also a case where an optimized reduce kernel is chosen. I'm afraid this code will interfere with the existing code in a complicated way, so I'd suggest first checking why reduce_ref is chosen. Based on that, we can choose the proper way to fix it.
This issue was only found in the GroundingDINO model; I didn't find the same issue in other models. I will check why reduce_ref is chosen.
This PR will be closed in a week because of 2 weeks of no activity.
@isanghao The Reduce primitive has a dynamic shape, so it always chooses the OCL kernel, right? The cldnn kernel selector contains two kernel implementations, and it seems that for reduce with a dynamic shape it can only choose the reference kernel (reduce_ref). For this case (GroundingDINO model), ReduceMax to a 1x1x1 output leads to the OCL GWS becoming (1,1,1), which is very inefficient for GPU. Is there any better solution for it?
Hi @riverlijunjie, thanks for the detailed investigation. I think it would work as a temporary workaround, but the formal fix would be to implement (or improve) a regular optimized kernel with dynamic shape support. If you would like to merge it as-is, could you add a functional test to confirm the accuracy of such a case?
Thanks for the comments, @isanghao. This fix solves one VLM performance issue; I will add functional tests for it.
@isanghao This PR has a unit test; do we also need to add a functional test as well? It seems transformation tests are located in the unit tests; if needed, could you please give an example of how to add functional tests? Thanks!
ov::intel_gpu::ConvertReduceMaxScalarOutput::ConvertReduceMaxScalarOutput() {
    // Check all Reduce nodes
    auto m = std::make_shared<ov::pass::pattern::Matcher>(ov::pass::pattern::wrap_type<ov::op::v1::ReduceMax>(),
I believe it's applicable to any reduction mode
yes, good idea!
auto reduce_shape = reduce_max->input_value(1).get_partial_shape();
if (reduce_shape.is_dynamic() || reduce_shape.size() != 1 || reduce_shape.to_shape()[0] != input_shape.size() ||
    reduce_shape.to_shape()[0] <= 1) {
    return false;
Please move all applicability checks to the predicate for the reduce op.
done
namespace ov {
namespace intel_gpu {

class ConvertReduceMaxScalarOutput : public ov::pass::MatcherPass {
Please add a short description for this pass
done
class ConvertReduceMaxScalarOutput : public ov::pass::MatcherPass {
public:
    OPENVINO_RTTI("ConvertReduceMaxScalarOutput", "0");
    ConvertReduceMaxScalarOutput();
I think a DecomposeReduce... name would better reflect the purpose of the pass.
Will rename to DecomposeReduceForScalarOutput
for (size_t i = 0; i < input_shape.size() - 1; i++) {
    // Reduce one dimension by one dimension to avoid 1 EU do all work.
    if (input_shape[i].is_dynamic() || (input_shape[i].is_static() && input_shape[i].get_length() >= 4)) {
        if (!reduce_)
            reduce_ = std::make_shared<ov::op::v1::ReduceMax>(
                reduce_max->input_value(0),
                ov::op::v0::Constant::create(ov::element::i64, ov::Shape{1}, {i}),
                true);
        else
            reduce_ = std::make_shared<ov::op::v1::ReduceMax>(
                reduce_->get_default_output(),
                ov::op::v0::Constant::create(ov::element::i64, ov::Shape{1}, {i}),
                true);
    }
I think it can be simplified like this:

auto input = reduce_max->input_value(0);
for (size_t i = 0; i < input_shape.size() - 1; i++) {
    // Reduce one dimension by one dimension to avoid 1 EU do all work.
    if (input_shape[i].is_dynamic() || (input_shape[i].is_static() && input_shape[i].get_length() >= 4)) {
        reduce_ = std::make_shared<ov::op::v1::ReduceMax>(
            input,
            ov::op::v0::Constant::create(ov::element::i64, ov::Shape{1}, {i}),
            true);
        input = reduce_->get_default_output();
    }
}
done
}

std::shared_ptr<ov::op::v1::ReduceMax> reduce_ = nullptr, reduce = nullptr;
if (dynamic_shape == false) {
nit: !dynamic_shape
done
const auto input_shape = reduce_max->input_value(0).get_partial_shape();
auto reduce_shape = reduce_max->input_value(1).get_partial_shape();
if (reduce_shape.is_dynamic() || reduce_shape.size() != 1 || reduce_shape.to_shape()[0] != input_shape.size() ||
Would be good to check that reduce_shape.is_static() before doing .to_shape().
done
using namespace testing;
using namespace ov::intel_gpu;

static std::shared_ptr<ov::Model> BuildFunction(const ov::PartialShape& input_shape,
nit: build_model
done
size_t reduce_count = 0;
for (auto& ops : func->get_ops()) {
    std::string type_name(ops->get_type_name());
    if (type_name.find("ReduceMax") != std::string::npos) {
Could you implement tests in a similar way to other transformation unit tests? I.e. something like this:

TEST_F(TransformationTestsF, SplitReduceMaxTest1) {
    {
        model = build_model(...);
        manager.register_pass<ConvertReduceMaxScalarOutput>();
    }
    {
        // build expected model
        model_ref = std::make_shared<ov::Model>(ov::ResultVector{...}, ov::ParameterVector{...});
    }
}

Checking the reduce count doesn't guarantee that the transformation works correctly.
ok, will change to such test cases.
done
No perf issue from dGPU daily test.
const auto input_shape = reduce_orig->input_value(0).get_partial_shape();
const auto reduce_shape = reduce_orig->input_value(1).get_partial_shape();
if (reduce_shape.to_shape()[0] != input_shape.size())
nit: that can also be a part of reduce predicate I believe
done
using ReduceType = cldnn::reduce_mode;

#define create_reduce(arg, reduction, keep_dims, reduce_type) \
    if (reduce_type == reduce_mode::sum) \
Please don't use cldnn things in transformation unit tests. You can pass the op type to this macro instead of an enum value:

#define create_reduce(arg, reduction, keep_dims, ReduceType) \
    reduce = std::make_shared<ReduceType>(arg, reduction, keep_dims);

Also, I'd suggest replacing the macro with a template.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Details
Optimize the ReduceMax pattern to avoid scheduling the whole primitive on a single EU.
Sometimes a ReduceMax op is used to convert a 3D/4D tensor to a scalar output, which leads to all computation being executed on a single EU because there is only one output element. This causes very poor performance for some models.
For example, in the Grounding DINO model:
ReduceMax cost 59.24 ms and consumed 49% of the whole model's execution time.
To break this bottleneck, this PR applies more EUs to execute this primitive by doing ReduceMax one dimension at a time. We also noticed that the ReduceMax op selects the ref-kernel rather than the opt-kernel, which may also cause some performance issues. But since ReduceMax doesn't need much computation, the ref-kernel should be enough. The key problem is that only one EU is scheduled to do the whole ReduceMax computation, which is the root cause of the poor performance.
Tickets: