Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Snippets] Added SplitLoops support for dynamic loops #25957

Conversation

a-sidorova
Copy link
Contributor

@a-sidorova a-sidorova commented Aug 7, 2024

Details:

  • Introduced the class InnerSplittedUnifiedLoopInfo which describes inner splitted Loop after SplitLoops. Work amount of thus Loop is equal to increment of outer splitted loop. Removed the attribute is_work_amount_const from LoopInfo
  • Enabled SplitLoops optimization for dynamic loops/
  • Fixed finalization offset calculation for dynamic case
  • Fixed increment = 1 for dynamic innermost Loops with only Eltwise ops inside - now all dynamic Last Loop iterations has increment = work_amount (evaluate_once = true) by default. If there is innermost Loop with only Eltiwse ops, the pass SetLoopIncrementOne will be registered in last iteration handler by default
  • Fixed LoopManager cloning: now all LoopInfo are cloned recursively and fully - with UnifiedLoopInfo in ExpandedLoopInfo. Before we didn't clone them.
  • Fixed ExtractLoopInvariants for splitted loops (which may have the same dim_idx) - now we separately create vector of ordered loop IDs by execution and iterate through this vector.
  • Fixed ValidateExpandedLoopInfo to support validation of loops iterations of which are executed not consistently (like outer splitted loop iterations)
  • Fixed buffer scratchpad size calculation in RuntimeConfigurator: no need to calculate allocation size of Buffers which are in Loops with work_amount = 0
  • Added workaround in ComputeBufferAllocationSize::get_allocation_size (the ticket 149219) to handle Buffers in cloned OuterSplittedLoops after InsertSpecificIterations
  • Added debug method to_string() for RuntimeConfig

Tickets:

Prerequisites:

@github-actions github-actions bot added category: IE Tests OpenVINO Test: plugins and common category: CPU OpenVINO CPU plugin labels Aug 7, 2024
@a-sidorova a-sidorova force-pushed the feature/snippets/dynamism/split_loops branch from 25d2119 to 34257fd Compare August 9, 2024 13:35
@github-actions github-actions bot removed the category: IE Tests OpenVINO Test: plugins and common label Aug 9, 2024
@a-sidorova a-sidorova force-pushed the feature/snippets/dynamism/split_loops branch from 34257fd to e339bae Compare August 12, 2024 06:24
@a-sidorova a-sidorova added this to the 2024.4 milestone Aug 12, 2024
Comment on lines +56 to +58
// In dynamic case it should be handled by RuntimeConfigurator
if (!linear_ir->is_dynamic())
result.kernel_executor_table->update_state(linear_ir);
Copy link
Contributor Author

@a-sidorova a-sidorova Aug 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explanation: In dynamic case, some configs of kernel executors should be updated by RuntimeConfigurator firstly. And only then kernels can be compiled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that we don't really need the executor table in the static pipeline, just to conform with the dynamic case. So do you think we can make the RuntimeConfigurator handle the static case as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So do you think we can make the RuntimeConfigurator handle the static case as well?

We already do it but before JIT compilation. It's made so because also RuntimeConfigurator calculates master_shape and data_offsets which are used in code generation.
Moreovoer, I think it's good decision to call RuntimeConfigurator before codegen in static pipeline because in the future we will be able to apply the optimization SplitDimensionM using RuntimeConfigurator -> LoopEnd expressions will be updated and the corresponding emitters will have the updated parameters.
I'm not sure but seems like that in general each jit emitter with kernel executor table can check: if it can compile kernel now, let's generate it. Then we don't need to manually call result.kernel_executor_table->update_state(linear_ir); in static case 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also create a convention that kernel executor emitters should be registered only in constructors.
Then we can we create emitters in the RuntimeConfig for the static case and call update_state there. Then trigger code emission in the generator. What do you think, is it feasible?
My aim here is to restrict kernel executor table management (ideally) to one class - RuntimeConfig.

No changes for this PR, but we need to understand how to create a more universal static/dynamic design. We discussed it earlier, but this discrepancy between static and dynamic code generation still bothering me. We should at least keep all this if (linear_ir->is_dynamic()) in one place.

Copy link
Contributor

@IvanNovoselov IvanNovoselov Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: update 148644
UPD: done

@a-sidorova a-sidorova force-pushed the feature/snippets/dynamism/split_loops branch from e339bae to c0f4282 Compare August 12, 2024 07:20
@a-sidorova a-sidorova marked this pull request as ready for review August 12, 2024 07:20
@a-sidorova a-sidorova requested review from a team as code owners August 12, 2024 07:20
@IvanNovoselov IvanNovoselov self-assigned this Aug 14, 2024
src/common/snippets/src/lowered/loop_info.cpp Show resolved Hide resolved
src/common/snippets/src/lowered/loop_info.cpp Outdated Show resolved Hide resolved
src/common/snippets/src/lowered/loop_manager.cpp Outdated Show resolved Hide resolved
Comment on lines +311 to +295
for (const auto& p : m_map) {
if (const auto inner_splitted_loop_info = ov::as_type_ptr<InnerSplittedUnifiedLoopInfo>(p.second)) {
const auto outer = inner_splitted_loop_info->get_outer_splitted_loop_info();
if (utils::one_of(outer, loop_info_upper, loop_info_lower))
inner_splitted_loop_info->set_outer_splitted_loop_info(m_map[to]);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we can inform LoopInfo about dependent loops to avoid this kind of traversal?
Since we already introduced loops referring loops (inner splitted), it seems that adding a backward reference won't complicate the design significantly (e.g. we need to pass the map when cloning anyways).
What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that. But for me it looked complex, to be honest.
Let's imagine that we want to introduce backward reference and for that we need to create new class OuterSplittedLoopInfo.
Before InsertSpecificIterations pass this class should be derived from UnifiedLoopInfo. After InsertSpecificIteration - ExpandedLoopInfo (because InnerSplittedUnifiedLoopInfo has work_amount equal to increment of ExpandedLoopInfo after loop decompositions).
Or it might be multiple Inheritance: OuterSplittedUnifiedLoopInfo : UnifiedLoopInfo, OuterSplittedLoopInfo and OuterSplittedExpandedLoopInfo : ExpandedLoopInfo, OuterSplittedLoopInfo where the class OuterSplittedLoopInfo just contains pointer to inner splitted loop.
Moreover, looks like OuterSplittedExpandedLoopInfo is not needed for now🤔

I'd say that it's not clear at the moment. And for now we have to just handle this case only in FuseLoops when fuse outer loops and nowhere more.

By the way I hope we will support SplittedLoops via dimension linking feature or similar something. Also we decided to think about new loop handling. So probably we will create something more interesting and powerful 😃

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm talking about a broader context: how ideal loop info should look like.
And since loops can be dependent on each other, it makes sense to reflect these dependencies in the class structure.
We do not necessarily need to create a separate OuterSplittedExpandedLoopInfo class, instead we can store a vector of pointers to connected loops directly in UnifiedLoopInfo. If no loops are connected, the vector will be empty.

Comment on lines +56 to +58
// In dynamic case it should be handled by RuntimeConfigurator
if (!linear_ir->is_dynamic())
result.kernel_executor_table->update_state(linear_ir);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that we don't really need the executor table in the static pipeline, just to conform with the dynamic case. So do you think we can make the RuntimeConfigurator handle the static case as well?

if (!outer_loop_info->get_handlers().get_passes<SpecificLoopIterType::FIRST_ITER>().empty()) {
outer_loop_info->register_pass_to_handler<SpecificLoopIterType::FIRST_ITER, TransformInnerSplitLoop>();
}
outer_loop_info->register_pass_to_handler<SpecificLoopIterType::MAIN_BODY, TransformInnerSplitLoop>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we register the TransformInnerSplitLoop we know exactly what loop id should be transformed (since we just splitted it), but instead of using this information directly in the transform pass, we try to restore it by checking all LoopEnd, casting to InnerSplittedUnifiedLoopInfo etc.
Why can't we just pass the relevant loop id here and check it inside the pass?
Theoretically, we can even pass a mode low-level information, like iterator to LoopEnd (since the list iterators are invalidated only on deletion) or pass ov::Node pointer and use get_expr_by_node which is essentially free. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this loop_id won't be changed during the following passes 🤔
I think it's not safe. At the moment, we don't change loop_id of inner splitted loops.
However if the inner splitted loop is suddenly fused to another loop (L82) or there is another pass in pipeline after SplitLoops and before InsertSpecificIterations (for example NormalizeLoopID - but at the moment it's called after InsertSpecificIterations) which can update the value of loop_id, it will lead to bugs.
The developer MUST know that after SplitLoops all LoopIDs shouldn't be changed. And only I and you will know about it while we remember. We can create some flag in LinearIR config which not allow to change these values. However, we still call FuseLoops after SplitLoops (L82) and LoopID of outer loops will be changed.
Because of that I think it's not good idea to map on LoopID since it's changeable attribute.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see your point.
Still it looks like a design flaw: loops can be linked together (which BTW would also work in this case if the upper splitted loop knew about inner loops), but transformations can't link/reference any loops. It doesn't seem fair 🤔

Copy link
Contributor

@IvanNovoselov IvanNovoselov Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: update LoopEnd-> loop info ticket

UPD: done

@a-sidorova a-sidorova force-pushed the feature/snippets/dynamism/split_loops branch from c0f4282 to b0d06e4 Compare August 16, 2024 08:36
@a-sidorova
Copy link
Contributor Author

@IvanNovoselov after our offline discussion I fixed some comments. Could you please take a look one more time? Than you in advance

[Snippets] Fixed LoopInfo cloning

[Snippets] WA for ComputeBufferAllocationSize::get_allocation_size

[Snippets] Fixed dynamic increment and single increment for Eltwise Loops

[Snippets] Added dynamic SplitLoops support

[Snippets] Added debug prints to RuntimeConfig

[Snippets] Fixed 0 case in get_finalization_offset

[Snippets] Updated ValidateExpandedLoopInfo + Updated SnippetsUnitTest

[Snippets] Supported InnerSplittedLoopInfo in SplitLoops as Unified

[Snippets] DISABLED EXTRACT LOOP INVARIANTS

[Snippets] Fixed ExtractLoopInvariants

[Snippets] Fixed FuseLoops::can_be_fused

[Snippets] Fixes
@a-sidorova a-sidorova force-pushed the feature/snippets/dynamism/split_loops branch from 2336757 to f1275e9 Compare August 19, 2024 13:59
Copy link
Contributor

@IvanNovoselov IvanNovoselov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job 👍

@IvanNovoselov IvanNovoselov added this pull request to the merge queue Aug 19, 2024
Merged via the queue into openvinotoolkit:master with commit 00d9800 Aug 19, 2024
136 checks passed
@IvanNovoselov IvanNovoselov deleted the feature/snippets/dynamism/split_loops branch August 19, 2024 19:43
akladiev pushed a commit that referenced this pull request Aug 21, 2024
### Details:
- *The PR enables dynamic FP32 MHA tokenization on x64 platforms 🎉*
- *`std::vector.resize()` which was used for buffer scratchpad
allocation is very expensive operation due to default constructor of
elements. This PR replace `std::vector.resize()` with CPU Node
Scratchpad memory which can be shared between nodes. Also since each
thread must have the own scratchpad memory, we allocated `size *
threads_max` - however, in execution thread count can be less (depends
on parallel work amount). Now we allocate only `size * n_threads` where
`nthreads` is real count of working threads.*
- *Fixed dimension K validation in `BrgemmBlocking` pass: one of inputs
can have dynamic value of this dimension*
- *Fixed `utils::broadcast_merge_dim()` and supported broadcasting of
integer values in IterHandlers. Added unit tests for
`utils::broadcast_merge_dim()`*

### Tickets:
 - *149900*


### Prerequisites:
- [x] #25326
- [x] #25378
- [x] #25623
- [x] #25638
- [x] #25745
- [x] #25957
- [x] #25733
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: CPU OpenVINO CPU plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants