[Snippets] Dynamic SplitDimensionM via runtime configurator #25733
Conversation
First part
Status: need to validate performance, will do soon
```cpp
/**
 * @brief Tries to split the M dimension in "shape" in accordance with the optimal parallel work amount
 * @param shape Original shape
 * @param optimal_parallelism_work_amount Optimal work amount
 * @param batch_m_dim Reference to the batch's part of the split M
 * @param new_m_dim Reference to the new M dim after the split
 * @return true if the split was successful, otherwise false
 */
static bool split(const ov::Shape& shape, size_t optimal_parallelism_work_amount, size_t& batch_m_dim, size_t& new_m_dim);

/**
 * @brief Splits the M dimension in the order
 * @param order Original order
 * @param m_index M dimension index
 * @return Updated order with the split M dimension
 */
static std::vector<size_t> get_updated_order(const std::vector<size_t>& order, size_t m_index);

/**
 * @brief Reshapes the M dimension in "shape": separates M into two parts, "batch_m_dim" and "new_m_dim"
 * @param shape Shape to split
 * @param m_index M dimension index
 * @param batch_m_dim Batch's part of the split M
 * @param new_m_dim New M dim after the split
 * @return The updated shape
 */
static ov::snippets::VectorDims reshape_m_dim(ov::snippets::VectorDims shape, size_t m_index, size_t batch_m_dim, size_t new_m_dim);

/**
 * @brief Unsqueezes the M dimension in "shape" (inserts "1" before the dimension)
 * @param shape Shape to split
 * @param m_index M dimension index
 * @return The updated shape
 */
static ov::snippets::VectorDims unsqueeze_m_dim(ov::snippets::VectorDims shape, size_t m_index);
```
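The `split` contract above can be illustrated with a small standalone sketch. This is hypothetical logic, not the actual OpenVINO implementation (the real pass also weighs kernel efficiency); `split_m` and its divisor search are illustrative only.

```cpp
#include <cstddef>

// Hypothetical sketch of splitting M into batch_m_dim * new_m_dim so that the
// batch part can be parallelized across the optimal work amount.
// NOT the actual OpenVINO heuristic -- an illustration of the contract only.
static bool split_m(size_t m, size_t optimal_parallelism_work_amount,
                    size_t& batch_m_dim, size_t& new_m_dim) {
    // Look for a divisor of M, preferring one close to the requested
    // parallelism; the co-factor becomes the per-kernel M.
    for (size_t divisor = optimal_parallelism_work_amount; divisor > 1; --divisor) {
        if (m % divisor == 0) {
            batch_m_dim = divisor;    // goes to the batch dimensions
            new_m_dim = m / divisor;  // M processed by one kernel invocation
            return true;
        }
    }
    return false;  // M is prime or too small: leave the shape untouched
}
```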
This is a rather complicated interface for a transformation; it should be simplified in the scope of ticket 148805.
```cpp
                      std::unordered_set<lowered::ExpressionPtr>& visited,
                      std::function<void(lowered::ExpressionPtr)> func,
                      bool visit_parent_path) {
    std::deque<lowered::ExpressionPtr> exprs{expr};
```
Why use a deque when all push/pop operations are performed on the front? Why not use a vector instead?
AFAIK a deque is needed when we need mixed front/back access.
From my perspective, in this use case (all pushes/pops are performed on the front) `std::deque` is still better than a vector: it has the corresponding API, and the complexity of insertion or removal of elements at the beginning is O(1).
Right, no objection from the performance perspective.
My question is rather why we need to use something exotic like a deque, when a simple vector would do just as well (just use push_back and pop_back).
To be clear, I don't suggest changing anything, it's just sincere curiosity 🙂
I finally understood your question :) Actually you are right, we can easily use a vector here: the logic doesn't require elements to be popped/pushed exactly from the front.
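For illustration, the same traversal pattern with a `std::vector` worklist might look like the sketch below. The names and the `int`-node graph are purely illustrative stand-ins, not the Snippets expression API.

```cpp
#include <functional>
#include <unordered_set>
#include <vector>

// Illustrative sketch: all pushes/pops happen at one end, so std::vector works
// as the LIFO worklist just as well as std::deque would.
static void visit_all(int start,
                      const std::function<std::vector<int>(int)>& neighbors,
                      const std::function<void(int)>& func) {
    std::unordered_set<int> visited;
    std::vector<int> work{start};
    while (!work.empty()) {
        const int cur = work.back();  // push_back/pop_back instead of front access
        work.pop_back();
        if (!visited.insert(cur).second)
            continue;  // node was already visited
        func(cur);
        for (int next : neighbors(cur))
            work.push_back(next);
    }
}
```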
```diff
  */
-void update_loop_info(const lowered::LinearIRCPtr& linear_ir) const;
+void update_loop_info(const lowered::LinearIRCPtr& linear_ir, LoopInfoRuntimeParamsMap& initializated_info_map) const;
```
Again, with signatures like this it's unclear whether we want to get the `initialized_info_map` as a result of this function call or we want to allow users to specify some loop_infos. In other words, is it valid to pass a non-empty `initialized_info_map`?
Please take a look at the description:

```cpp
/**
 * @brief Updates Loop information in LinearIR: Unified and ExpandedLoopInfo
 * @param linear_ir LinearIR
 * @param initializated_info_map Reference to a map [LoopInfo->RuntimeParams].
 *        Can be used to pass into the method loop infos which were already initialized, e.g. by the parallel domain optimization
 */
void update_loop_info(const lowered::LinearIRCPtr& linear_ir, LoopInfoRuntimeParamsMap& initializated_info_map) const;
```

I tried to point out there that the user can pass a map with already initialized infos, i.e. it is valid to pass a non-empty `initializated_info_map`. Please let me know if the description is not clear enough -- I will try to reformulate it.
After reviewing this one and a few subsequent PRs, I realized that this kind of interface with a map reference as an argument is probably inevitable and a lesser evil at this point. So let's leave it as is.
I think the real problem is how the loop_infos are stored, especially in the case of split loops (but for ExpandedLoopInfo as well). To make these kinds of updates more transparent, we need to reorganize the loop_infos' connectivity at the design level.
Please remind me about this at the meeting, I'll share some more details. It would be interesting to know your opinion.
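The "possibly pre-initialized map passed by reference" contract under discussion can be sketched in isolation. The types and names below are illustrative stand-ins, not the actual Snippets classes.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative stand-in for a LoopInfo -> runtime-params mapping.
using ParamsMap = std::unordered_map<std::string, int>;

// Entries already present in `initialized` (e.g. seeded by a parallel-domain
// optimizer) are reused; everything else is initialized here. Passing a
// non-empty map is therefore a valid and intended use.
static void update_infos(const std::vector<std::string>& loops, ParamsMap& initialized) {
    for (const auto& loop : loops) {
        if (initialized.count(loop))
            continue;  // keep the caller-provided params untouched
        initialized[loop] = static_cast<int>(loop.size());  // placeholder "init"
    }
}
```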
```cpp
auto shapes = extract_shapes();
auto layouts = extract_layouts();
if (m_optimizer.enabled()) {
    m_optimizer.optimize(m_config->master_shape, initialized_info, shapes, layouts, m_in_num);
```
Why do we need to pass `shapes` and `layouts` to the optimizer explicitly here, when we already passed it `m_io_desc` during initialization? The optimizer can extract shapes and layouts internally. It should also be responsible for the `tensor_rank` and `data_offsets` updates. This would allow us to simplify signatures and reduce code coupling.
As a user, I would expect something like:

```cpp
if (!m_optimizer.optimize(...))  // true => the optimizer handled everything correctly
    update_data_offsets(extract_shapes(), extract_layouts());  // false => optimizer disabled, init offsets manually
```
I like your suggestion, but currently `update_data_offsets` is `RuntimeConfigurator`'s protected API, so we can't just call it inside the optimizer. I think it will be easier to implement this idea within 148891, when all the needed logic is performed by lowered passes.
> so we can't just call it inside optimizer

I'm pretty sure we can, because `ParallelWAOptimizer` is an inner class of `RuntimeConfigurator`: it has access even to the configurator's private members. So the only thing we need to do to call `update_data_offsets` from the optimizer is to pass it a pointer to the configurator: `m_optimizer.optimize(this)`.
I applied your suggestion in the follow-up PR. Thanks for the proposal!
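The `m_optimizer.optimize(this)` trick relies on a C++ rule: a nested class is a member of the enclosing class and may therefore access its private members. A minimal sketch (the class and member names below are illustrative, not the real OpenVINO types):

```cpp
// A nested class is a member of its enclosing class, so it can call the
// enclosing class's private member functions through a pointer to an instance.
class Configurator {
public:
    class Optimizer {
    public:
        bool optimize(Configurator* c) {
            c->update_data_offsets();  // OK: nested class sees private members
            return true;
        }
    };

    int offsets_updated = 0;  // public only so the sketch is observable

private:
    void update_data_offsets() { ++offsets_updated; }
};
```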
```diff
 if (linear_ir->is_dynamic()) {
-    update_loop_info(linear_ir);
+    update_loop_info(linear_ir, initialized_info);
```
Can we make the optimizer responsible for calling `update_loop_info` for the affected loops? For example:

- Implement `update_loop_info(loop_id)` that updates one loop
- Modify `update_loop_info(linear_ir)` so that it just calls `update_loop_info(loop_id)` for all loops in the loop map
- Update the loops before the M dim optimization, and make the optimizer update the necessary loops if the dimension was split.

This would enhance modularity and simplify the logic. Otherwise we have to complicate signatures and carry some opaque state from the optimizer downwards (shapes/layouts/loop_info), and we don't even know whether that state was updated or not.
In general, we should think of the optimizer as a pass: it takes the LIR and updates everything that needs to be updated, so all other passes/functions can work completely independently (not semantically, since passes may depend on each other, but functionally, i.e. the only state that is passed between the passes (sorry 🙃) is the LIR itself).
Again, `update_loop_info` is `RuntimeConfigurator`'s API, so it would have to be extracted to apply your suggestion. It seems the best way here is to extract it as a separate lowered pass. I believe this should be done within the work on ticket 148891 (where all `RuntimeConfigurator::update` work should be done by passes) in order to avoid a mess of two approaches in one place.
The optimizer is currently implemented as part of RuntimeConfigurator, so as a quick solution we can pass the configurator to the optimizer to make the necessary updates. The idea is to isolate all the optimization-related logic in one class.
Yes, the pass solution is better, and it's the long-term fix, but it's not clear when it will be addressed. So I would still suggest updating the optimizer a bit now, if it doesn't take too much time.
The refactoring is done in the follow-up PR.
@v-Golubev, please address the remaining comments in a separate PR
### Details:
- *The PR enables dynamic FP32 MHA tokenization on x64 platforms 🎉*
- *`std::vector.resize()`, which was used for buffer scratchpad allocation, is a very expensive operation due to the default constructor of elements. This PR replaces `std::vector.resize()` with CPU Node Scratchpad memory which can be shared between nodes. Also, since each thread must have its own scratchpad memory, we allocated `size * threads_max`; however, the thread count during execution can be less (it depends on the parallel work amount). Now we allocate only `size * n_threads`, where `n_threads` is the real count of working threads.*
- *Fixed dimension K validation in the `BrgemmBlocking` pass: one of the inputs can have a dynamic value of this dimension*
- *Fixed `utils::broadcast_merge_dim()` and supported broadcasting of integer values in IterHandlers. Added unit tests for `utils::broadcast_merge_dim()`*

### Tickets:
- *149900*

### Prerequisites:
- [x] #25326
- [x] #25378
- [x] #25623
- [x] #25638
- [x] #25745
- [x] #25957
- [x] #25733
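The scratchpad-sizing change described above — allocating for the threads that will actually work rather than for `threads_max` — can be sketched as follows. This is an illustration, not the actual CPU plugin code; `scratchpad_bytes` and its parameter names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative sketch: each working thread needs its own scratchpad region,
// but threads beyond the available parallel work stay idle, so we size the
// buffer by the real worker count instead of the maximum thread count.
static size_t scratchpad_bytes(size_t per_thread_size,
                               size_t parallel_work_amount,
                               size_t threads_max) {
    const size_t n_threads = std::min(parallel_work_amount, threads_max);
    return per_thread_size * n_threads;
}
```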
### Details: PR with leftovers of #25733.
### Details:
- *`ParallelWAOptimizer` class, which is used in `RuntimeConfigurator` to split the M dimension of LIR with brgemms in order to optimize the parallel work amount*

### Tickets: