[Snippets] Added Softmax Support #13449
Conversation
Force-pushed b819ce0 to f3760b3
Force-pushed 36ff257 to 3f3ff64
Force-pushed 250aa6f to 42b0f4d
Force-pushed 42b0f4d to a8ba02a
Force-pushed a0a3d7c to 7fe52a9
@IvanNovoselov could you please start the review?
Almost done, but there might be some more comments.
@@ -117,7 +117,7 @@ class Generator {
     * @param m model in canonical for for table-based code generation
     * @return pointer to generated code
     */
-    code generate(std::shared_ptr<ov::Model>& m, const void* compile_params = nullptr) const;
+    code generate(std::shared_ptr<ov::Model>& m, const void* compile_params = nullptr, const bool need_check_rt_info = false) const;
Do we really need this flag?
We discussed it offline before: for reduction ops we should insert a Fill op in the tail loop if needed, and to avoid a load-time regression we check rt_info.
I understand it's not very beautiful code, but we don't have an interface for passing a config to tail loop generation; this is one way I found.
Maybe we should introduce a common generation config, i.e. a common struct { bool, bool, ... }, for possible new checks in the future. What do you think?
Yes, I totally see your point.
We can just pass the whole snippets config; that would be a more extensible solution. Probably there is no need for a special generation config at this point.
The original reason for this remark is that we check both rt_info and the snippets config. Do we need both? Please also see the related comment in generator.cpp.
As discussed, we now pass the whole config. Thanks!
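For illustration, a minimal sketch of the direction discussed here, assuming a hypothetical GenerationConfig aggregate; the actual snippets config type that was finally passed is not shown in this thread:

// Sketch only: GenerationConfig and its field name are illustrative, not the actual snippets API;
// "code" and "ov::Model" are the types from the quoted Generator header.
struct GenerationConfig {
    // when true, the generator checks rt_info and inserts a Fill op into the tail loop for reduction ops
    bool check_rt_info_for_fill = false;
    // further per-generation options can be added here without changing the generate() signature again
};
// code generate(std::shared_ptr<ov::Model>& m, const GenerationConfig& config, const void* compile_params = nullptr) const;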
// we should always reset data ptr after this loop because in the next Loop this ptr is used
const auto finalization_offsets_max =
    std::vector<int64_t>{ calculate_required_finalization_offsets(inner_size, data->get_shape()[inner_dim]), 0, 0 };
const auto forse_finalization_offsets_max = std::vector<bool>{true, false, false};
To continue the force_finalization_offsets discussion, here is a good example: why can't we just set the last two finalization_offsets to zero instead of setting force_finalization_offsets to false?
As we discussed offline, I disabled one of the evaluation optimizations for subgraphs with buffers to remove force_finalization_offsets.
// we should always reset data ptr after this loop because in the next Loop this ptr is used
const auto finalization_offsets_max =
    std::vector<int64_t>{ calculate_required_finalization_offsets(inner_size, data->get_shape()[inner_dim]), 0, 0 };
const auto forse_finalization_offsets_max = std::vector<bool>{true, false, false};
To continue the force_finalization_offsets discussion: why don't we just set the last two finalization_offsets to zero here?
As we discussed offline, I disabled one of the evaluation optimizations for subgraphs with buffers to remove force_finalization_offsets.
 * @brief Defines if Softmax can be tokenized in Snippets
 * @ingroup ie_dev_api_plugin_api
 */
DECLARE_CONFIG_KEY(SNIPPETS_SOFTMAX_ENABLE);
Do we need this key exclusively to run Softmax tests? If so, it's worth adding a TODO like: remove when Softmax tokenization is fully supported in Snippets, plus the Softmax support ticket ID.
src/plugins/intel_cpu/src/config.cpp (outdated)
@@ -159,6 +159,14 @@ void Config::readProperties(const std::map<std::string, std::string> &prop) {
            IE_THROW() << "Wrong value for property key " << CPUConfigParams::KEY_CPU_DENORMALS_OPTIMIZATION
                       << ". Expected only YES/NO";
        }
    } else if (key == PluginConfigInternalParams::KEY_SNIPPETS_SOFTMAX_ENABLE) {
The same as for src/inference/dev_api/cpp_interfaces/interface/ie_internal_plugin_config.hpp: maybe leave a comment with a ticket reference?
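For illustration, a sketch of the kind of TODO note suggested in these two comments; the actual ticket ID is not given in this thread, so it stays a placeholder:

// TODO: remove KEY_SNIPPETS_SOFTMAX_ENABLE when Softmax tokenization is fully supported
//       in Snippets (Softmax support ticket: <ticket ID>)
DECLARE_CONFIG_KEY(SNIPPETS_SOFTMAX_ENABLE);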
src/plugins/intel_cpu/src/plugin.cpp (outdated)
// CPU Plugin support Swish in Subgraph via conversion to SwichCPU which assumes second input to be constant
if (ov::is_type<const ov::op::v4::Swish>(n)) {
    if (n->inputs().size() > 1 && !ov::is_type<const ov::op::v0::Constant>(n->get_input_node_shared_ptr(1)))
        return true;
} else if (ov::is_type<const ov::op::v1::Softmax>(n) || ov::is_type<const ov::op::v8::Softmax>(n)) {
    return !_tokenizeSoftmaxSnippets;
This will allow tokenization even if has_only_const_inputs || bad_input_rank || bad_output_rank is true. It looks like undesired behavior (the same problem exists with Swish). Could you fix it, please?
Good catch! Thanks a lot
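For illustration, a sketch of the kind of fix agreed on here, assuming the has_only_const_inputs / bad_input_rank / bad_output_rank checks named in the comment above are available in the same scope; this is illustrative, not the exact code that landed:

} else if (ov::is_type<const ov::op::v1::Softmax>(n) || ov::is_type<const ov::op::v8::Softmax>(n)) {
    // keep the common restrictions even when Softmax tokenization is enabled
    return !_tokenizeSoftmaxSnippets || has_only_const_inputs || bad_input_rank || bad_output_rank;
}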
Force-pushed 7fe52a9 to c32e7d6
The second part
auto params = m->get_parameters();
auto results = m->get_results();
auto in = params.size();
auto out = results.size();
auto buffer = static_cast<size_t>(std::any_of(ops.begin(), ops.end(),
num_buffers or num_buffer_ops maybe?
Please also note that in the transpose PR I suggest passing a pointer to the body directly to op::Kernel, so it will be accessible from KernelEmitter. We could count the buffers inside the emitter in that case. Let's discuss it offline.
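For illustration, a sketch of the suggested renaming, assuming the Buffer op lives in ngraph::snippets::op as in the snippets code of that time (requires <algorithm>):

// count buffer ops explicitly and give the result a descriptive name
const auto num_buffer_ops = static_cast<size_t>(
    std::count_if(ops.begin(), ops.end(), [](const std::shared_ptr<ov::Node>& op) {
        return ov::is_type<ngraph::snippets::op::Buffer>(op);
    }));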
const auto& forse_finalization_offsets = loop->get_forse_finalization_offsets();
std::vector<int64_t> new_finalization_offsets(loop->get_finalization_offsets());
for (auto i = 0; i < new_finalization_offsets.size(); i++) {
    new_finalization_offsets[i] += increment * apply_increments[i] * (forse_finalization_offsets[i] || force_ptr_increment);
Let's consider the default scenario: all forse_finalization_offsets[] = true, so new_finalization_offsets will be incremented even if force_ptr_increment = false. This contradicts the previous default behavior, where new_finalization_offsets were incremented only if force_ptr_increment || loop->has_outer_loop().
Or is forse_finalization_offsets[] some kind of per-port analog of loop->has_outer_loop()? If so, is forse_finalization_offsets[] set to false when there is no outer loop?
As we discussed offline, I disabled one of the evaluation optimizations for subgraphs with buffers to remove force_finalization_offsets.
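For reference, a sketch of the previous default behavior described in the comment above (only the condition matters; the loop body is the one from the quoted code):

// previous default: finalization offsets are adjusted only when forced or when there is an outer loop
if (force_ptr_increment || loop->has_outer_loop()) {
    for (size_t i = 0; i < new_finalization_offsets.size(); i++)
        new_finalization_offsets[i] += increment * apply_increments[i];
}
loop->set_finalization_offsets(new_finalization_offsets);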
    new_finalization_offsets[i] += increment * apply_increments[i];
}
loop->set_finalization_offsets(new_finalization_offsets);
const auto increment = loop->get_increment();
A general comment on op-specific optimizations: sometimes, to enable an op, we need to extend existing functionality, which is perfectly fine. However, at first it can be useful to keep the usage of this extended functionality localized. That way it is easier for us to understand the limitations of the existing functionality and to develop a more general and extensible solution in the future.
For example, I generally like your implementation of tail_transformations: we check for a flag set during softmax tokenization (or decomposition) and trigger a specific pipeline. I think we should try to align with this paradigm for now.
prepare_table();

const auto shape = n->get_input_partial_shape(0);
is_scalar = shape.is_static() && shape.get_shape().back() == 1;
Not 100% sure that it's safe to rely on node shapes, since we won't do reshape every time (see the transpose PR), e.g. when dimensions are collapsed. Could you check, please?
const size_t vec_size = vlen / sizeof(float);
h->sub(h->rsp, vlen);
h->uni_vmovups(h->ptr[h->rsp], src_vmm);
h->uni_vmovups(dst_xmm, table_val("float_min"));
Since we are using aux_gpr here anyway and the table has only one value, maybe it's better to mov the immediate into aux_gpr and then broadcast? Then we can omit the table. What do you think?
Of course, it only makes sense if we won't extend this emitter with more table values.
Sounds logical, but if I understand correctly we would need an additional aux_vmm to store the broadcasted value. As I understand it, the table is a convenient tool for constants. Please correct me if I'm wrong. Thanks!
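For illustration, a heavily hedged sketch of the mov-immediate-then-broadcast idea, assuming AVX2, a free xmm/vmm pair (exactly the aux_vmm concern raised above), and that "float_min" is the lowest finite float, -FLT_MAX:

// sketch only: aux_gpr, aux_xmm and aux_vmm are placeholders for free registers
h->mov(aux_gpr.cvt32(), 0xff7fffffu);   // bit pattern of -FLT_MAX
h->vmovd(aux_xmm, aux_gpr.cvt32());     // move the pattern into the low xmm lane
h->vbroadcastss(aux_vmm, aux_xmm);      // broadcast to the full vector register

As noted in the reply, this does tie up an extra vector register, so the table-based constant may well remain the simpler option.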
    new_shapes.emplace_back(std::move(ns));
}
// Before body reshaping we should scale axis of Softmax
auto ops = snippet->get_body()->get_ops();
We agreed not to perform reshaping and to pass plugin shapes in rt_info. Please refer to the transpose PR.
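For illustration, a hypothetical sketch of the rt_info approach mentioned above; the key name and the container stored are illustrative, not the ones used in this PR:

// attach the plugin-provided shapes to the body's rt_info instead of reshaping it;
// rt_info values are ov::Any, so an arbitrary copyable container can be stored
snippet->get_body()->get_rt_info()["plugin_shapes"] = new_shapes;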
updateSrcDstPtrs(call_args);

std::vector<jit_snippets_call_args> per_thread_call_args(parallel_get_max_threads(), call_args);
if (buffer_scratchpad_size > 0) {
Not entirely sure why we need to copy the jit_snippets call args nthreads times even when we don't need a buffer. Suggestion: let's keep the original signature of schedule_6d and create a schedule_6d_per_thread_buffer (also with the same signature) that creates the vector of call args inside. We did pretty much the same in the dynamism PR.
Added. Thanks!
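For illustration, a hypothetical sketch of the agreed wrapper; the function name and the scratchpad pointer field are illustrative, while jit_snippets_call_args and parallel_get_max_threads() come from the quoted code:

// schedule_6d keeps its original signature; the buffer-aware variant builds per-thread call args internally
std::vector<jit_snippets_call_args> per_thread_call_args(parallel_get_max_threads(), call_args);
for (size_t ithr = 0; ithr < per_thread_call_args.size(); ++ithr) {
    // give every thread its own slice of the shared scratchpad (field name is hypothetical)
    per_thread_call_args[ithr].buffer_scratchpad_ptr = buffer_scratchpad + ithr * buffer_scratchpad_size;
}
// ...then run the same 6D parallel loop as schedule_6d, passing per_thread_call_args[parallel_get_thread_num()] to the kernel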
Force-pushed 5c6ced9 to 437242d
Force-pushed a664c06 to 0aaea67
Force-pushed feeaf07 to 2caa808
Commits: [Snippets] Dynamic loop snapshot; [Snippets] Explicit Loop implementation
Force-pushed 2caa808 to 26b5889
Force-pushed a025b91 to 026360c
Force-pushed f916fe2 to 70a7e56
Force-pushed 70a7e56 to 53dc219
Commits: [Snippets] Added support for Reshape around Softmax; applied comment part; Added config parameter to disable MHA ops tokenization; Buffer 2D Loops
Force-pushed 53dc219 to 9d2d721
Force-pushed 339a1b8 to 1cac5d5
Closed because of merging into a local branch.
Details:
- Added Softmax support to Snippets partially: added the corresponding config parameter to disable Softmax in the Snippets pipeline to avoid performance regressions, and enabled it in tests for validation
- Support for Reshape around Softmax via the SoftmaxReshapeElimination pass that removes the Reshape ops

Tickets: