
[Snippets] Added Softmax Support #13449

Closed

Conversation

a-sidorova
Contributor

@a-sidorova a-sidorova commented Oct 12, 2022

Details:

  • Added partial Softmax support to Snippets: a config parameter disables Softmax in the Snippets pipeline to avoid performance regressions, while tests enable it for validation
  • Added support for Reshape around Softmax via the SoftmaxReshapeElimination pass, which removes the Reshape ops (a rough sketch follows below)
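A minimal sketch of how such a pass could bypass a Reshape -> Softmax -> Reshape chain (an illustration only, not the PR's implementation; the class name is made up and the axis/shape validity checks a real pass needs are omitted):

#include <openvino/core/graph_util.hpp>
#include <openvino/op/reshape.hpp>
#include <openvino/op/softmax.hpp>
#include <openvino/pass/graph_rewrite.hpp>
#include <openvino/pass/pattern/op/wrap_type.hpp>

class SoftmaxReshapeEliminationSketch : public ov::pass::MatcherPass {
public:
    SoftmaxReshapeEliminationSketch() {
        auto softmax_pattern = ov::pass::pattern::wrap_type<ov::op::v1::Softmax, ov::op::v8::Softmax>();

        ov::matcher_pass_callback callback = [=](ov::pass::pattern::Matcher& m) {
            const auto softmax = m.get_match_root();

            // The Softmax input must be produced by a Reshape, and the Softmax
            // must have a single consumer which is also a Reshape.
            const auto reshape_before = ov::as_type_ptr<ov::op::v1::Reshape>(softmax->get_input_node_shared_ptr(0));
            const auto consumers = softmax->output(0).get_target_inputs();
            if (!reshape_before || consumers.size() != 1)
                return false;
            const auto reshape_after =
                ov::as_type_ptr<ov::op::v1::Reshape>(consumers.begin()->get_node()->shared_from_this());
            if (!reshape_after)
                return false;

            // Bypass both Reshapes: Softmax reads the original input and its output
            // replaces the output of the second Reshape.
            softmax->input(0).replace_source_output(reshape_before->input_value(0));
            ov::replace_output_update_name(reshape_after->output(0), softmax->output(0));
            return true;
        };
        register_matcher(std::make_shared<ov::pass::pattern::Matcher>(softmax_pattern, "SoftmaxReshapeEliminationSketch"), callback);
    }
};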

Tickets:

  • 92363
  • 95636

@a-sidorova a-sidorova force-pushed the feature/snippets/softmax branch 2 times, most recently from b819ce0 to f3760b3 Compare October 12, 2022 15:13
@a-sidorova a-sidorova force-pushed the feature/snippets/softmax branch 3 times, most recently from 36ff257 to 3f3ff64 Compare October 31, 2022 13:40
@a-sidorova a-sidorova force-pushed the feature/snippets/softmax branch 2 times, most recently from 250aa6f to 42b0f4d Compare November 4, 2022 14:19
@a-sidorova a-sidorova marked this pull request as ready for review November 4, 2022 14:20
@a-sidorova a-sidorova requested review from a team as code owners November 4, 2022 14:20
@a-sidorova a-sidorova force-pushed the feature/snippets/softmax branch from 42b0f4d to a8ba02a Compare November 7, 2022 11:07
@a-sidorova a-sidorova added the category: CPU OpenVINO CPU plugin label Nov 7, 2022
@a-sidorova a-sidorova added this to the 2022.3 milestone Nov 7, 2022
@a-sidorova a-sidorova force-pushed the feature/snippets/softmax branch from a0a3d7c to 7fe52a9 Compare November 7, 2022 15:41
@dmitry-gorokhov
Contributor

@IvanNovoselov could you please start the review?

Contributor

@IvanNovoselov IvanNovoselov left a comment

Almost done, but there might be more comments to come.

@@ -117,7 +117,7 @@ class Generator {
* @param m model in canonical for for table-based code generation
* @return pointer to generated code
*/
code generate(std::shared_ptr<ov::Model>& m, const void* compile_params = nullptr) const;
code generate(std::shared_ptr<ov::Model>& m, const void* compile_params = nullptr, const bool need_check_rt_info = false) const;
Contributor

Do we really need this flag?

Contributor Author

We discussed this offline earlier: for reduction ops we should insert a Fill op in the tail loop when needed, and to avoid a load-time regression we check rt_info for that.
I understand it's not the most elegant code, but we don't have an interface for passing any configuration into tail loop generation; this is the one way I found.
Maybe we should introduce a common generation config, i.e. a shared struct { bool, bool, ... }, so future checks can be added as well. What do you think? A rough sketch of the idea follows below.
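For illustration, a minimal sketch of that idea, assuming hypothetical names such as GeneratorConfig and need_fill_tail_register (this is not the code that landed in the PR):

// in Generator (snippets generator header):
struct GeneratorConfig {
    // insert Fill ops in the tail loop for reduction ops (the Softmax case)
    bool need_fill_tail_register = false;
    // future per-generation flags can be added here without changing generate() again
};

code generate(std::shared_ptr<ov::Model>& m,
              const GeneratorConfig& config = {},
              const void* compile_params = nullptr) const;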

Contributor

Yes, I totally see your point.
We could just pass the whole Snippets config; that would be a more extendable solution, and there is probably no need for a special generation config at this point.
The original reason for this remark is that we check both rt_info and the Snippets config. Do we need both? Please also see the related comment in generator.cpp.

Contributor Author

As discussed, we now pass the whole config. Thanks!

src/common/snippets/src/op/buffer.cpp (outdated, resolved)
src/common/snippets/src/op/buffer.cpp (outdated, resolved)
src/common/snippets/include/snippets/op/buffer.hpp (outdated, resolved)
src/common/snippets/include/snippets/op/fill.hpp (outdated, resolved)
// we should always reset data ptr after this loop because in the next Loop this ptr is used
const auto finalization_offsets_max =
std::vector<int64_t>{ calculate_required_finalization_offsets(inner_size, data->get_shape()[inner_dim]), 0, 0 };
const auto forse_finalization_offsets_max = std::vector<bool>{true, false, false};
Contributor

To continue the force_finalization_offsets discussion, here is a good example: why can't we just set the last two finalization_offsets to zero instead of setting force_finalization_offsets to false?

Contributor Author

As we discussed offline, I disabled one of the evaluation optimizations for subgraphs with buffers to remove force_finalization.

// we should always reset data ptr after this loop because in the next Loop this ptr is used
const auto finalization_offsets_max =
std::vector<int64_t>{ calculate_required_finalization_offsets(inner_size, data->get_shape()[inner_dim]), 0, 0 };
const auto forse_finalization_offsets_max = std::vector<bool>{true, false, false};
Contributor

To continue the force_finalization_offsets discussion: why don't we just set the last two finalization_offsets to zero here?

Contributor Author

As we discussed offline, I disabled one of the evaluation optimizations for subgraphs with buffers to remove force_finalization.

* @brief Defines if Softmax can be tokenized in Snippets
* @ingroup ie_dev_api_plugin_api
*/
DECLARE_CONFIG_KEY(SNIPPETS_SOFTMAX_ENABLE);
Contributor

Do we need this key exclusively to run the Softmax tests? If so, it's worth adding a TODO along the lines of "remove when Softmax tokenization is fully supported in Snippets", together with the Softmax support ticket ID.
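For example, the TODO could look like this (the ticket number is taken from this PR's description and may not be the exact Softmax support ticket meant here):

// TODO: Remove when Softmax tokenization is fully supported in Snippets (ref. ticket 92363)
DECLARE_CONFIG_KEY(SNIPPETS_SOFTMAX_ENABLE);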

@@ -159,6 +159,14 @@ void Config::readProperties(const std::map<std::string, std::string> &prop) {
IE_THROW() << "Wrong value for property key " << CPUConfigParams::KEY_CPU_DENORMALS_OPTIMIZATION
<< ". Expected only YES/NO";
}
} else if (key == PluginConfigInternalParams::KEY_SNIPPETS_SOFTMAX_ENABLE) {
Contributor

The same as for src/inference/dev_api/cpp_interfaces/interface/ie_internal_plugin_config.hpp: maybe leave a comment with a ticket reference?

// CPU Plugin support Swish in Subgraph via conversion to SwichCPU which assumes second input to be constant
if (ov::is_type<const ov::op::v4::Swish>(n)) {
if (n->inputs().size() > 1 && !ov::is_type<const ov::op::v0::Constant>(n->get_input_node_shared_ptr(1)))
return true;
} else if (ov::is_type<const ov::op::v1::Softmax>(n) || ov::is_type<const ov::op::v8::Softmax>(n)) {
return !_tokenizeSoftmaxSnippets;
Contributor

This will allow tokenization even if has_only_const_inputs || bad_input_rank || bad_output_rank is true. That looks like undesired behavior (the same problem exists for Swish). Could you fix it, please?
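One way the check could be restructured so that the common checks still apply (a sketch; has_only_const_inputs, bad_input_rank and bad_output_rank are taken from the comment above, and it is assumed that returning true means the node must not be tokenized):

if (ov::is_type<const ov::op::v4::Swish>(n)) {
    // reject: a non-constant second input is not supported by SwishCPU
    if (n->inputs().size() > 1 && !ov::is_type<const ov::op::v0::Constant>(n->get_input_node_shared_ptr(1)))
        return true;
} else if (ov::is_type<const ov::op::v1::Softmax>(n) || ov::is_type<const ov::op::v8::Softmax>(n)) {
    // reject only when Softmax tokenization is disabled; otherwise fall through to the common checks
    if (!_tokenizeSoftmaxSnippets)
        return true;
}
return has_only_const_inputs || bad_input_rank || bad_output_rank;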

Contributor Author

Good catch! Thanks a lot

@a-sidorova a-sidorova force-pushed the feature/snippets/softmax branch from 7fe52a9 to c32e7d6 Compare November 9, 2022 11:40
Contributor

@IvanNovoselov IvanNovoselov left a comment

The second part

auto params = m->get_parameters();
auto results = m->get_results();
auto in = params.size();
auto out = results.size();
auto buffer = static_cast<size_t>(std::any_of(ops.begin(), ops.end(),
Contributor

Maybe num_buffers or num_buffer_ops?

Please also note that in the transpose PR I suggest passing a pointer to the body directly to op::Kernel, so it will be accessible from the KernelEmitter. In that case we could count the buffers inside the emitter. Let's discuss this offline.
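The renaming could look roughly like this (a sketch; std::count_if is used under the assumption that more than one Buffer op may appear in the body, and the ngraph::snippets::op::Buffer name is assumed from the files touched in this PR):

const auto num_buffer_ops = static_cast<size_t>(
    std::count_if(ops.begin(), ops.end(), [](const std::shared_ptr<ov::Node>& op) {
        return ov::is_type<ngraph::snippets::op::Buffer>(op);
    }));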

const auto& forse_finalization_offsets = loop->get_forse_finalization_offsets();
std::vector<int64_t> new_finalization_offsets(loop->get_finalization_offsets());
for (auto i = 0; i < new_finalization_offsets.size(); i++) {
new_finalization_offsets[i] += increment * apply_increments[i] * (forse_finalization_offsets[i] || force_ptr_increment);
Contributor

Let's consider the default scenario: all forse_finalization_offsets[] = true, so new_finalization_offsets will be incremented even if force_ptr_increment = false. This contradicts the previous default behavior, where new_finalization_offsets were incremented only if force_ptr_increment || loop->has_outer_loop.
Or is forse_finalization_offsets[] some kind of per-port analog of loop->has_outer_loop()? If so, are forse_finalization_offsets[] set to false when there is no outer loop?
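For reference, the previous default behavior described above would look roughly like this (a simplified sketch reconstructed from this comment, not a quote of the old code):

// offsets were adjusted only for nested loops or when the increment was explicitly forced
if (force_ptr_increment || loop->has_outer_loop()) {
    std::vector<int64_t> new_finalization_offsets(loop->get_finalization_offsets());
    for (size_t i = 0; i < new_finalization_offsets.size(); i++)
        new_finalization_offsets[i] += increment * apply_increments[i];
    loop->set_finalization_offsets(new_finalization_offsets);
}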

Contributor Author

As we discussed offline, I disabled one of the evaluation optimizations for subgraphs with buffers to remove force_finalization.

src/common/snippets/src/generator.cpp (resolved)
new_finalization_offsets[i] += increment * apply_increments[i];
}
loop->set_finalization_offsets(new_finalization_offsets);
const auto increment = loop->get_increment();
Contributor

A general comment on op-specific optimizations:
Sometimes, to enable an op, we need to extend existing functionality, which is perfectly fine. However, at first it can be useful to keep usage of this extended functionality localized. That way it is easier for us to understand the limitations of the existing functionality and to develop a more general and extendable solution in the future.
For example, I generally like your implementation of tail_transformations: we check a flag set during Softmax tokenization (or decomposition) and trigger a specific pipeline. I think we should try to align with this paradigm for now.

src/common/snippets/include/snippets/op/subgraph.hpp (outdated, resolved)
prepare_table();

const auto shape = n->get_input_partial_shape(0);
is_scalar = shape.is_static() && shape.get_shape().back() == 1;
Contributor

Not 100% sure it's safe to rely on node shapes, since we won't reshape every time (see the transpose PR), e.g. when dimensions are collapsed. Could you check, please?

const size_t vec_size = vlen / sizeof(float);
h->sub(h->rsp, vlen);
h->uni_vmovups(h->ptr[h->rsp], src_vmm);
h->uni_vmovups(dst_xmm, table_val("float_min"));
Contributor

Since we are using aux_gpr here anyway and the table has only one value, maybe it's better to mov the immediate into aux_gpr and then broadcast it? Then we could omit the table entirely. What do you think?
Of course, this makes sense only if we won't extend this emitter to use multiple table values.
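A sketch of what that could look like, with assumed register and helper names (this is not code from the PR):

#include <cpu/x64/jit_generator.hpp>  // dnnl::impl::cpu::x64::jit_generator and Xbyak types
#include <cstring>
#include <limits>

// Put the constant's bit pattern into a GPR as an immediate and broadcast it,
// so the single-entry table (and its load) can be dropped.
void broadcast_float_min(dnnl::impl::cpu::x64::jit_generator* h,
                         const Xbyak::Reg64& aux_gpr,
                         const Xbyak::Xmm& aux_xmm,
                         const Xbyak::Xmm& dst_vmm) {
    const float float_min = -std::numeric_limits<float>::max();
    int32_t bits = 0;
    std::memcpy(&bits, &float_min, sizeof(bits));  // reinterpret the float as its bit pattern
    h->mov(aux_gpr.cvt32(), bits);                 // 32-bit immediate into the GPR (upper bits zeroed)
    h->uni_vmovq(aux_xmm, aux_gpr);                // GPR -> low lane of an XMM
    h->uni_vbroadcastss(dst_vmm, aux_xmm);         // broadcast to all lanes of the destination
}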

Contributor Author

Sounds logical, but if I understand correctly we would need an additional aux_vmm to store the broadcasted value. As I understand it, the table is a convenient tool for constants. Please correct me if I'm wrong. Thanks!

src/plugins/intel_cpu/src/nodes/subgraph.cpp (outdated, resolved)
new_shapes.emplace_back(std::move(ns));
}
// Before body reshaping we should scale axis of Softmax
auto ops = snippet->get_body()->get_ops();
Contributor

We agreed not to perform reshaping and to pass plugin shapes in rt_info. Please refer to the transpose PR.
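A sketch of the rt_info approach mentioned here, with an assumed key name ("PluginShape"); see the transpose PR for the actual mechanism:

// attach the plugin-provided shapes to the body Parameters instead of reshaping the body
const auto& body_params = snippet->get_body()->get_parameters();
for (size_t i = 0; i < body_params.size() && i < new_shapes.size(); ++i)
    body_params[i]->get_rt_info()["PluginShape"] = new_shapes[i];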

updateSrcDstPtrs(call_args);

std::vector<jit_snippets_call_args> per_thread_call_args(parallel_get_max_threads(), call_args);
if (buffer_scratchpad_size > 0) {
Contributor

Not entirely sure why we need to copy the jit_snippets call args nthreads times even when we don't need a buffer. Suggestion: let's keep the original signature of schedule_6d and create a schedule_6d_per_thread_buffer (also with the same signature) that creates the vector of call args inside. We did pretty much the same in the dynamism PR.
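A rough sketch of that suggestion; the member and field names (buffer_scratchpad, buffer_scratchpad_ptr) are assumptions, not the actual plugin code:

// keeps schedule_6d untouched; this variant builds the per-thread call args itself
void Snippet::schedule_6d_per_thread_buffer(const jit_snippets_call_args& call_args) const {
    std::vector<jit_snippets_call_args> per_thread_args(parallel_get_max_threads(), call_args);
    for (size_t ithr = 0; ithr < per_thread_args.size(); ++ithr) {
        // every thread points at its own chunk of the shared scratchpad (assumed member)
        per_thread_args[ithr].buffer_scratchpad_ptr =
            buffer_scratchpad.data() + ithr * buffer_scratchpad_size;
    }
    // ... then run the same parallel 6d loop as schedule_6d, passing per_thread_args[ithr]
}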

Contributor Author

Added. Thanks!

@a-sidorova a-sidorova force-pushed the feature/snippets/softmax branch 4 times, most recently from 5c6ced9 to 437242d Compare November 10, 2022 14:15
@dmitry-gorokhov dmitry-gorokhov removed this from the 2022.3 milestone Nov 16, 2022
@a-sidorova a-sidorova force-pushed the feature/snippets/softmax branch 3 times, most recently from a664c06 to 0aaea67 Compare November 24, 2022 10:38
@a-sidorova a-sidorova force-pushed the feature/snippets/softmax branch 2 times, most recently from feeaf07 to 2caa808 Compare November 29, 2022 06:24
* [Snippets] Dynamic loop snapshot

* [Snippets] Explicit Loop implementation
@a-sidorova a-sidorova force-pushed the feature/snippets/softmax branch from 2caa808 to 26b5889 Compare November 30, 2022 09:51
@a-sidorova a-sidorova force-pushed the feature/snippets/softmax branch 2 times, most recently from a025b91 to 026360c Compare December 5, 2022 09:23
@a-sidorova a-sidorova force-pushed the feature/snippets/softmax branch 2 times, most recently from f916fe2 to 70a7e56 Compare December 6, 2022 10:34
@a-sidorova a-sidorova force-pushed the feature/snippets/softmax branch from 70a7e56 to 53dc219 Compare December 13, 2022 12:47
[Snippets] Added support for Reshape around Softmax

applied comment part

Added config parameter to disable MHA ops tokenization

Buffer 2D

Loops
@a-sidorova a-sidorova force-pushed the feature/snippets/softmax branch from 53dc219 to 9d2d721 Compare December 13, 2022 12:50
@a-sidorova a-sidorova force-pushed the feature/snippets/softmax branch from 339a1b8 to 1cac5d5 Compare December 15, 2022 06:06
@a-sidorova
Contributor Author

Closed because the changes were merged into a local branch.

@a-sidorova a-sidorova closed this Dec 23, 2022
Labels
category: CPU OpenVINO CPU plugin