[Snippets] Added SplitLoops support for dynamic loops #25957

a-sidorova · 2024-08-07T13:16:30Z

Details:

Introduced the class InnerSplittedUnifiedLoopInfo which describes inner splitted Loop after SplitLoops. Work amount of thus Loop is equal to increment of outer splitted loop. Removed the attribute is_work_amount_const from LoopInfo
Enabled SplitLoops optimization for dynamic loops/
Fixed finalization offset calculation for dynamic case
Fixed increment = 1 for dynamic innermost Loops with only Eltwise ops inside - now all dynamic Last Loop iterations has increment = work_amount (evaluate_once = true) by default. If there is innermost Loop with only Eltiwse ops, the pass SetLoopIncrementOne will be registered in last iteration handler by default
Fixed LoopManager cloning: now all LoopInfo are cloned recursively and fully - with UnifiedLoopInfo in ExpandedLoopInfo. Before we didn't clone them.
Fixed ExtractLoopInvariants for splitted loops (which may have the same dim_idx) - now we separately create vector of ordered loop IDs by execution and iterate through this vector.
Fixed ValidateExpandedLoopInfo to support validation of loops iterations of which are executed not consistently (like outer splitted loop iterations)
Fixed buffer scratchpad size calculation in RuntimeConfigurator: no need to calculate allocation size of Buffers which are in Loops with work_amount = 0
Added workaround in ComputeBufferAllocationSize::get_allocation_size (the ticket 149219) to handle Buffers in cloned OuterSplittedLoops after InsertSpecificIterations
Added debug method to_string() for RuntimeConfig

Tickets:

141735

Prerequisites:

[Snippets] Added generic check for broadcasting #25772

a-sidorova · 2024-08-12T06:36:23Z

src/common/snippets/src/generator.cpp

+    // In dynamic case it should be handled by RuntimeConfigurator
+    if (!linear_ir->is_dynamic())
+        result.kernel_executor_table->update_state(linear_ir);


Explanation: In dynamic case, some configs of kernel executors should be updated by RuntimeConfigurator firstly. And only then kernels can be compiled.

The thing is that we don't really need the executor table in the static pipeline, just to conform with the dynamic case. So do you think we can make the RuntimeConfigurator handle the static case as well?

So do you think we can make the RuntimeConfigurator handle the static case as well?

We already do it but before JIT compilation. It's made so because also RuntimeConfigurator calculates master_shape and data_offsets which are used in code generation.
Moreovoer, I think it's good decision to call RuntimeConfigurator before codegen in static pipeline because in the future we will be able to apply the optimization SplitDimensionM using RuntimeConfigurator -> LoopEnd expressions will be updated and the corresponding emitters will have the updated parameters.
I'm not sure but seems like that in general each jit emitter with kernel executor table can check: if it can compile kernel now, let's generate it. Then we don't need to manually call result.kernel_executor_table->update_state(linear_ir); in static case 🤔

We can also create a convention that kernel executor emitters should be registered only in constructors.
Then we can we create emitters in the RuntimeConfig for the static case and call update_state there. Then trigger code emission in the generator. What do you think, is it feasible?
My aim here is to restrict kernel executor table management (ideally) to one class - RuntimeConfig.

No changes for this PR, but we need to understand how to create a more universal static/dynamic design. We discussed it earlier, but this discrepancy between static and dynamic code generation still bothering me. We should at least keep all this if (linear_ir->is_dynamic()) in one place.

todo: update 148644
UPD: done

src/common/snippets/src/lowered/loop_info.cpp

src/common/snippets/src/lowered/loop_manager.cpp

IvanNovoselov · 2024-08-15T12:09:20Z

src/common/snippets/src/lowered/loop_manager.cpp

+    for (const auto& p : m_map) {
+        if (const auto inner_splitted_loop_info = ov::as_type_ptr<InnerSplittedUnifiedLoopInfo>(p.second)) {
+            const auto outer = inner_splitted_loop_info->get_outer_splitted_loop_info();
+            if (utils::one_of(outer, loop_info_upper, loop_info_lower))
+                inner_splitted_loop_info->set_outer_splitted_loop_info(m_map[to]);
+        }
+    }


Do you think we can inform LoopInfo about dependent loops to avoid this kind of traversal?
Since we already introduced loops referring loops (inner splitted), it seems that adding a backward reference won't complicate the design significantly (e.g. we need to pass the map when cloning anyways).
What do you think?

I thought about that. But for me it looked complex, to be honest.
Let's imagine that we want to introduce backward reference and for that we need to create new class OuterSplittedLoopInfo.
Before InsertSpecificIterations pass this class should be derived from UnifiedLoopInfo. After InsertSpecificIteration - ExpandedLoopInfo (because InnerSplittedUnifiedLoopInfo has work_amount equal to increment of ExpandedLoopInfo after loop decompositions).
Or it might be multiple Inheritance: OuterSplittedUnifiedLoopInfo : UnifiedLoopInfo, OuterSplittedLoopInfo and OuterSplittedExpandedLoopInfo : ExpandedLoopInfo, OuterSplittedLoopInfo where the class OuterSplittedLoopInfo just contains pointer to inner splitted loop.
Moreover, looks like OuterSplittedExpandedLoopInfo is not needed for now🤔

I'd say that it's not clear at the moment. And for now we have to just handle this case only in FuseLoops when fuse outer loops and nowhere more.

By the way I hope we will support SplittedLoops via dimension linking feature or similar something. Also we decided to think about new loop handling. So probably we will create something more interesting and powerful 😃

Yes, I'm talking about a broader context: how ideal loop info should look like.
And since loops can be dependent on each other, it makes sense to reflect these dependencies in the class structure.
We do not necessarily need to create a separate OuterSplittedExpandedLoopInfo class, instead we can store a vector of pointers to connected loops directly in UnifiedLoopInfo. If no loops are connected, the vector will be empty.

src/common/snippets/src/lowered/pass/brgemm_blocking.cpp

src/common/snippets/src/lowered/pass/validate_expanded_loops.cpp

IvanNovoselov · 2024-08-15T16:08:11Z

src/common/snippets/src/generator.cpp

+    // In dynamic case it should be handled by RuntimeConfigurator
+    if (!linear_ir->is_dynamic())
+        result.kernel_executor_table->update_state(linear_ir);


The thing is that we don't really need the executor table in the static pipeline, just to conform with the dynamic case. So do you think we can make the RuntimeConfigurator handle the static case as well?

src/common/snippets/include/snippets/lowered/pass/split_loops.hpp

IvanNovoselov · 2024-08-15T17:00:50Z

src/common/snippets/src/lowered/pass/split_loops.cpp

+    if (!outer_loop_info->get_handlers().get_passes<SpecificLoopIterType::FIRST_ITER>().empty()) {
+        outer_loop_info->register_pass_to_handler<SpecificLoopIterType::FIRST_ITER, TransformInnerSplitLoop>();
+    }
+    outer_loop_info->register_pass_to_handler<SpecificLoopIterType::MAIN_BODY, TransformInnerSplitLoop>();


When we register the TransformInnerSplitLoop we know exactly what loop id should be transformed (since we just splitted it), but instead of using this information directly in the transform pass, we try to restore it by checking all LoopEnd, casting to InnerSplittedUnifiedLoopInfo etc.
Why can't we just pass the relevant loop id here and check it inside the pass?
Theoretically, we can even pass a mode low-level information, like iterator to LoopEnd (since the list iterators are invalidated only on deletion) or pass ov::Node pointer and use get_expr_by_node which is essentially free. What do you think?

I'm not sure that this loop_id won't be changed during the following passes 🤔
I think it's not safe. At the moment, we don't change loop_id of inner splitted loops.
However if the inner splitted loop is suddenly fused to another loop (L82) or there is another pass in pipeline after SplitLoops and before InsertSpecificIterations (for example NormalizeLoopID - but at the moment it's called after InsertSpecificIterations) which can update the value of loop_id, it will lead to bugs.
The developer MUST know that after SplitLoops all LoopIDs shouldn't be changed. And only I and you will know about it while we remember. We can create some flag in LinearIR config which not allow to change these values. However, we still call FuseLoops after SplitLoops (L82) and LoopID of outer loops will be changed.
Because of that I think it's not good idea to map on LoopID since it's changeable attribute.

Ok, I see your point.
Still it looks like a design flaw: loops can be linked together (which BTW would also work in this case if the upper splitted loop knew about inner loops), but transformations can't link/reference any loops. It doesn't seem fair 🤔

todo: update LoopEnd-> loop info ticket

UPD: done

src/common/snippets/src/lowered/pass/split_loops.cpp

src/common/snippets/src/lowered/loop_info.cpp

a-sidorova · 2024-08-19T10:27:56Z

@IvanNovoselov after our offline discussion I fixed some comments. Could you please take a look one more time? Than you in advance

src/common/snippets/src/lowered/loop_info.cpp

src/common/snippets/src/runtime_configurator.cpp

[Snippets] Fixed LoopInfo cloning [Snippets] WA for ComputeBufferAllocationSize::get_allocation_size [Snippets] Fixed dynamic increment and single increment for Eltwise Loops [Snippets] Added dynamic SplitLoops support [Snippets] Added debug prints to RuntimeConfig [Snippets] Fixed 0 case in get_finalization_offset [Snippets] Updated ValidateExpandedLoopInfo + Updated SnippetsUnitTest [Snippets] Supported InnerSplittedLoopInfo in SplitLoops as Unified [Snippets] DISABLED EXTRACT LOOP INVARIANTS [Snippets] Fixed ExtractLoopInvariants [Snippets] Fixed FuseLoops::can_be_fused [Snippets] Fixes

IvanNovoselov

Good job 👍

### Details: - *The PR enables dynamic FP32 MHA tokenization on x64 platforms 🎉* - *`std::vector.resize()` which was used for buffer scratchpad allocation is very expensive operation due to default constructor of elements. This PR replace `std::vector.resize()` with CPU Node Scratchpad memory which can be shared between nodes. Also since each thread must have the own scratchpad memory, we allocated `size * threads_max` - however, in execution thread count can be less (depends on parallel work amount). Now we allocate only `size * n_threads` where `nthreads` is real count of working threads.* - *Fixed dimension K validation in `BrgemmBlocking` pass: one of inputs can have dynamic value of this dimension* - *Fixed `utils::broadcast_merge_dim()` and supported broadcasting of integer values in IterHandlers. Added unit tests for `utils::broadcast_merge_dim()`* ### Tickets: - *149900* ### Prerequisites: - [x] #25326 - [x] #25378 - [x] #25623 - [x] #25638 - [x] #25745 - [x] #25957 - [x] #25733

github-actions bot added category: IE Tests OpenVINO Test: plugins and common category: CPU OpenVINO CPU plugin labels Aug 7, 2024

a-sidorova force-pushed the feature/snippets/dynamism/split_loops branch from 25d2119 to 34257fd Compare August 9, 2024 13:35

github-actions bot removed the category: IE Tests OpenVINO Test: plugins and common label Aug 9, 2024

a-sidorova force-pushed the feature/snippets/dynamism/split_loops branch from 34257fd to e339bae Compare August 12, 2024 06:24

a-sidorova added this to the 2024.4 milestone Aug 12, 2024

a-sidorova commented Aug 12, 2024

View reviewed changes

a-sidorova force-pushed the feature/snippets/dynamism/split_loops branch from e339bae to c0f4282 Compare August 12, 2024 07:20

a-sidorova marked this pull request as ready for review August 12, 2024 07:20

a-sidorova requested review from a team as code owners August 12, 2024 07:20

a-sidorova mentioned this pull request Aug 12, 2024

[Snippets][CPU] Enabled dynamic MHA FP32 tokenization on x64 #25500

Merged

7 tasks

IvanNovoselov self-assigned this Aug 14, 2024

IvanNovoselov reviewed Aug 15, 2024

View reviewed changes

a-sidorova force-pushed the feature/snippets/dynamism/split_loops branch from c0f4282 to b0d06e4 Compare August 16, 2024 08:36

a-sidorova requested a review from IvanNovoselov August 16, 2024 08:36

IvanNovoselov reviewed Aug 16, 2024

View reviewed changes

src/common/snippets/src/lowered/loop_info.cpp Show resolved Hide resolved

src/common/snippets/src/lowered/loop_info.cpp Show resolved Hide resolved

a-sidorova requested a review from IvanNovoselov August 19, 2024 10:27

IvanNovoselov reviewed Aug 19, 2024

View reviewed changes

src/common/snippets/src/lowered/loop_info.cpp Outdated Show resolved Hide resolved

src/common/snippets/src/runtime_configurator.cpp Outdated Show resolved Hide resolved

src/common/snippets/src/runtime_configurator.cpp Outdated Show resolved Hide resolved

a-sidorova added 4 commits August 19, 2024 17:58

[Snippets] Applied Ivan comments

038e978

[Snippets] Applied Ivan comments 2

bc7010f

[Snippets] Applied Ivan comments 3

f1275e9

a-sidorova requested a review from IvanNovoselov August 19, 2024 13:59

a-sidorova force-pushed the feature/snippets/dynamism/split_loops branch from 2336757 to f1275e9 Compare August 19, 2024 13:59

IvanNovoselov approved these changes Aug 19, 2024

View reviewed changes

IvanNovoselov added this pull request to the merge queue Aug 19, 2024

Merged via the queue into openvinotoolkit:master with commit 00d9800 Aug 19, 2024
136 checks passed

IvanNovoselov deleted the feature/snippets/dynamism/split_loops branch August 19, 2024 19:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Snippets] Added SplitLoops support for dynamic loops #25957

[Snippets] Added SplitLoops support for dynamic loops #25957

a-sidorova commented Aug 7, 2024 •

edited

Loading

a-sidorova Aug 12, 2024 •

edited

Loading

IvanNovoselov Aug 15, 2024

a-sidorova Aug 16, 2024

IvanNovoselov Aug 16, 2024

IvanNovoselov Aug 16, 2024 •

edited

Loading

IvanNovoselov Aug 15, 2024

a-sidorova Aug 16, 2024

IvanNovoselov Aug 16, 2024

IvanNovoselov Aug 15, 2024

IvanNovoselov Aug 15, 2024

a-sidorova Aug 16, 2024

IvanNovoselov Aug 16, 2024

IvanNovoselov Aug 16, 2024 •

edited

Loading

a-sidorova commented Aug 19, 2024

IvanNovoselov left a comment

[Snippets] Added SplitLoops support for dynamic loops #25957

[Snippets] Added SplitLoops support for dynamic loops #25957

Conversation

a-sidorova commented Aug 7, 2024 • edited Loading

Details:

Tickets:

Prerequisites:

a-sidorova Aug 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

IvanNovoselov Aug 16, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

IvanNovoselov Aug 16, 2024 • edited Loading

Choose a reason for hiding this comment

a-sidorova commented Aug 19, 2024

IvanNovoselov left a comment

Choose a reason for hiding this comment

a-sidorova commented Aug 7, 2024 •

edited

Loading

a-sidorova Aug 12, 2024 •

edited

Loading

IvanNovoselov Aug 16, 2024 •

edited

Loading

IvanNovoselov Aug 16, 2024 •

edited

Loading