-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Snippets] Added SplitLoops support for dynamic loops #25957
[Snippets] Added SplitLoops support for dynamic loops #25957
Conversation
25d2119
to
34257fd
Compare
34257fd
to
e339bae
Compare
// In dynamic case it should be handled by RuntimeConfigurator | ||
if (!linear_ir->is_dynamic()) | ||
result.kernel_executor_table->update_state(linear_ir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explanation: In dynamic case, some configs of kernel executors should be updated by RuntimeConfigurator
firstly. And only then kernels can be compiled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is that we don't really need the executor table in the static pipeline, just to conform with the dynamic case. So do you think we can make the RuntimeConfigurator handle the static case as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So do you think we can make the RuntimeConfigurator handle the static case as well?
We already do it but before JIT compilation. It's made so because also RuntimeConfigurator
calculates master_shape
and data_offsets
which are used in code generation.
Moreovoer, I think it's good decision to call RuntimeConfigurator
before codegen in static pipeline because in the future we will be able to apply the optimization SplitDimensionM
using RuntimeConfigurator
-> LoopEnd
expressions will be updated and the corresponding emitters will have the updated parameters.
I'm not sure but seems like that in general each jit emitter with kernel executor table can check: if it can compile kernel now, let's generate it. Then we don't need to manually call result.kernel_executor_table->update_state(linear_ir);
in static case 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also create a convention that kernel executor emitters should be registered only in constructors.
Then we can we create emitters in the RuntimeConfig for the static case and call update_state
there. Then trigger code emission in the generator. What do you think, is it feasible?
My aim here is to restrict kernel executor table management (ideally) to one class - RuntimeConfig.
No changes for this PR, but we need to understand how to create a more universal static/dynamic design. We discussed it earlier, but this discrepancy between static and dynamic code generation still bothering me. We should at least keep all this if (linear_ir->is_dynamic())
in one place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: update 148644
UPD: done
e339bae
to
c0f4282
Compare
for (const auto& p : m_map) { | ||
if (const auto inner_splitted_loop_info = ov::as_type_ptr<InnerSplittedUnifiedLoopInfo>(p.second)) { | ||
const auto outer = inner_splitted_loop_info->get_outer_splitted_loop_info(); | ||
if (utils::one_of(outer, loop_info_upper, loop_info_lower)) | ||
inner_splitted_loop_info->set_outer_splitted_loop_info(m_map[to]); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we can inform LoopInfo about dependent loops to avoid this kind of traversal?
Since we already introduced loops referring loops (inner splitted), it seems that adding a backward reference won't complicate the design significantly (e.g. we need to pass the map when cloning anyways).
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that. But for me it looked complex, to be honest.
Let's imagine that we want to introduce backward reference and for that we need to create new class OuterSplittedLoopInfo
.
Before InsertSpecificIterations
pass this class should be derived from UnifiedLoopInfo
. After InsertSpecificIteration
- ExpandedLoopInfo
(because InnerSplittedUnifiedLoopInfo
has work_amount
equal to increment
of ExpandedLoopInfo
after loop decompositions).
Or it might be multiple Inheritance: OuterSplittedUnifiedLoopInfo : UnifiedLoopInfo, OuterSplittedLoopInfo
and OuterSplittedExpandedLoopInfo : ExpandedLoopInfo, OuterSplittedLoopInfo
where the class OuterSplittedLoopInfo
just contains pointer to inner splitted loop.
Moreover, looks like OuterSplittedExpandedLoopInfo
is not needed for now🤔
I'd say that it's not clear at the moment. And for now we have to just handle this case only in FuseLoops
when fuse outer loops and nowhere more.
By the way I hope we will support SplittedLoops via dimension linking feature or similar something. Also we decided to think about new loop handling. So probably we will create something more interesting and powerful 😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm talking about a broader context: how ideal loop info should look like.
And since loops can be dependent on each other, it makes sense to reflect these dependencies in the class structure.
We do not necessarily need to create a separate OuterSplittedExpandedLoopInfo
class, instead we can store a vector of pointers to connected loops directly in UnifiedLoopInfo
. If no loops are connected, the vector will be empty.
src/common/snippets/src/lowered/pass/validate_expanded_loops.cpp
Outdated
Show resolved
Hide resolved
// In dynamic case it should be handled by RuntimeConfigurator | ||
if (!linear_ir->is_dynamic()) | ||
result.kernel_executor_table->update_state(linear_ir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is that we don't really need the executor table in the static pipeline, just to conform with the dynamic case. So do you think we can make the RuntimeConfigurator handle the static case as well?
src/common/snippets/include/snippets/lowered/pass/split_loops.hpp
Outdated
Show resolved
Hide resolved
if (!outer_loop_info->get_handlers().get_passes<SpecificLoopIterType::FIRST_ITER>().empty()) { | ||
outer_loop_info->register_pass_to_handler<SpecificLoopIterType::FIRST_ITER, TransformInnerSplitLoop>(); | ||
} | ||
outer_loop_info->register_pass_to_handler<SpecificLoopIterType::MAIN_BODY, TransformInnerSplitLoop>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we register the TransformInnerSplitLoop
we know exactly what loop id should be transformed (since we just splitted it), but instead of using this information directly in the transform pass, we try to restore it by checking all LoopEnd, casting to InnerSplittedUnifiedLoopInfo etc.
Why can't we just pass the relevant loop id here and check it inside the pass?
Theoretically, we can even pass a mode low-level information, like iterator to LoopEnd (since the list iterators are invalidated only on deletion) or pass ov::Node pointer and use get_expr_by_node which is essentially free. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that this loop_id
won't be changed during the following passes 🤔
I think it's not safe. At the moment, we don't change loop_id
of inner splitted loops.
However if the inner splitted loop is suddenly fused to another loop (L82) or there is another pass in pipeline after SplitLoops
and before InsertSpecificIterations
(for example NormalizeLoopID
- but at the moment it's called after InsertSpecificIterations
) which can update the value of loop_id
, it will lead to bugs.
The developer MUST know that after SplitLoops
all LoopIDs
shouldn't be changed. And only I and you will know about it while we remember. We can create some flag in LinearIR config which not allow to change these values. However, we still call FuseLoops
after SplitLoops
(L82) and LoopID of outer loops will be changed.
Because of that I think it's not good idea to map on LoopID since it's changeable attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see your point.
Still it looks like a design flaw: loops can be linked together (which BTW would also work in this case if the upper splitted loop knew about inner loops), but transformations can't link/reference any loops. It doesn't seem fair 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: update LoopEnd-> loop info ticket
UPD: done
c0f4282
to
b0d06e4
Compare
@IvanNovoselov after our offline discussion I fixed some comments. Could you please take a look one more time? Than you in advance |
[Snippets] Fixed LoopInfo cloning [Snippets] WA for ComputeBufferAllocationSize::get_allocation_size [Snippets] Fixed dynamic increment and single increment for Eltwise Loops [Snippets] Added dynamic SplitLoops support [Snippets] Added debug prints to RuntimeConfig [Snippets] Fixed 0 case in get_finalization_offset [Snippets] Updated ValidateExpandedLoopInfo + Updated SnippetsUnitTest [Snippets] Supported InnerSplittedLoopInfo in SplitLoops as Unified [Snippets] DISABLED EXTRACT LOOP INVARIANTS [Snippets] Fixed ExtractLoopInvariants [Snippets] Fixed FuseLoops::can_be_fused [Snippets] Fixes
2336757
to
f1275e9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job 👍
### Details: - *The PR enables dynamic FP32 MHA tokenization on x64 platforms 🎉* - *`std::vector.resize()` which was used for buffer scratchpad allocation is very expensive operation due to default constructor of elements. This PR replace `std::vector.resize()` with CPU Node Scratchpad memory which can be shared between nodes. Also since each thread must have the own scratchpad memory, we allocated `size * threads_max` - however, in execution thread count can be less (depends on parallel work amount). Now we allocate only `size * n_threads` where `nthreads` is real count of working threads.* - *Fixed dimension K validation in `BrgemmBlocking` pass: one of inputs can have dynamic value of this dimension* - *Fixed `utils::broadcast_merge_dim()` and supported broadcasting of integer values in IterHandlers. Added unit tests for `utils::broadcast_merge_dim()`* ### Tickets: - *149900* ### Prerequisites: - [x] #25326 - [x] #25378 - [x] #25623 - [x] #25638 - [x] #25745 - [x] #25957 - [x] #25733
Details:
InnerSplittedUnifiedLoopInfo
which describes inner splitted Loop afterSplitLoops
. Work amount of thus Loop is equal to increment of outer splitted loop. Removed the attributeis_work_amount_const
fromLoopInfo
SplitLoops
optimization for dynamic loops/increment = 1
for dynamic innermost Loops with only Eltwise ops inside - now all dynamicLast
Loop iterations hasincrement = work_amount
(evaluate_once = true
) by default. If there is innermost Loop with only Eltiwse ops, the passSetLoopIncrementOne
will be registered in last iteration handler by defaultExtractLoopInvariants
for splitted loops (which may have the samedim_idx
) - now we separately create vector of ordered loop IDs by execution and iterate through this vector.ValidateExpandedLoopInfo
to support validation of loops iterations of which are executed not consistently (like outer splitted loop iterations)work_amount = 0
ComputeBufferAllocationSize::get_allocation_size
(the ticket 149219) to handle Buffers in cloned OuterSplittedLoops afterInsertSpecificIterations
to_string()
forRuntimeConfig
Tickets:
Prerequisites: