
[Snippets] Dynamic SplitDimensionM via runtime configurator #25733

Merged

Conversation

v-Golubev
Contributor

@v-Golubev v-Golubev commented Jul 25, 2024

Details:

  • Introduced the ParallelWAOptimizer class, which is used in RuntimeConfigurator to split the M dimension of a LIR with brgemms in order to optimize the parallel work amount

Tickets:

  • 147834

@v-Golubev v-Golubev added this to the 2024.4 milestone Jul 25, 2024
@v-Golubev v-Golubev requested review from a team as code owners July 25, 2024 15:28
@github-actions github-actions bot added category: IE Tests OpenVINO Test: plugins and common category: CPU OpenVINO CPU plugin labels Jul 25, 2024
@v-Golubev v-Golubev force-pushed the vg/snippets/split_m_dynamic branch from 0a893e6 to dbd8ad7 Compare August 2, 2024 11:25
@github-actions github-actions bot removed the category: IE Tests OpenVINO Test: plugins and common label Aug 2, 2024
@v-Golubev v-Golubev force-pushed the vg/snippets/split_m_dynamic branch from 5a53caf to 613d0c9 Compare August 2, 2024 15:25
@a-sidorova a-sidorova self-assigned this Aug 4, 2024
@v-Golubev v-Golubev force-pushed the vg/snippets/split_m_dynamic branch from 456760f to dbb12ed Compare August 4, 2024 16:47
Contributor

@a-sidorova a-sidorova left a comment


First part

@v-Golubev v-Golubev force-pushed the vg/snippets/split_m_dynamic branch 2 times, most recently from d931765 to 45d2161 Compare August 5, 2024 15:49
@v-Golubev v-Golubev force-pushed the vg/snippets/split_m_dynamic branch from b1df377 to f2f24e8 Compare August 6, 2024 08:20
Contributor

@a-sidorova a-sidorova left a comment


Status: need to validate performance, will do soon

Comment on lines +34 to +67
/**
 * @brief Tries to split the M dimension in "shape" in accordance with the optimal parallel work amount
 * @param shape Original shape
 * @param optimal_parallelism_work_amount Optimal work amount
 * @param batch_m_dim reference to the batch part of the split M
 * @param new_m_dim reference to the new M dim after the split
 * @return true if the split was successful, otherwise false
*/
static bool split(const ov::Shape& shape, size_t optimal_parallelism_work_amount, size_t& batch_m_dim, size_t& new_m_dim);

/**
* @brief Splits m dimension in order
* @param order Original order
* @param m_index M dimension index
* @return updated order with the split M dimension
*/
static std::vector<size_t> get_updated_order(const std::vector<size_t>& order, size_t m_index);
/**
* @brief Reshapes m dimension in "shape": separates M in two parts: "batch_m_dim" and "new_m_dim"
* @param shape Shape to split
* @param m_index M dimension index
* @param batch_m_dim batch's part of the split M
* @param new_m_dim new M dim after the split
* @return the updated shape
*/
static ov::snippets::VectorDims reshape_m_dim(ov::snippets::VectorDims shape, size_t m_index, size_t batch_m_dim, size_t new_m_dim);
/**
* @brief Unsqueezes m dimension in "shape" (inserts "1" before the dimension)
* @param shape Shape to split
* @param m_index M dimension index
* @return the updated shape
*/
static ov::snippets::VectorDims unsqueeze_m_dim(ov::snippets::VectorDims shape, size_t m_index);
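For illustration, a minimal sketch of what a splitting heuristic behind an interface like `split` could look like. The function name `split_m` and the divisor-search strategy are assumptions for illustration only; the real SplitDimensionM logic may differ:

```cpp
#include <cstddef>

// Hypothetical sketch of the M-splitting heuristic: find a divisor of M that
// provides at least `optimal_parallelism` outer (batch) iterations. The roles
// of batch_m_dim/new_m_dim mirror the interface above; the strategy is illustrative.
static bool split_m(size_t m_dim, size_t optimal_parallelism,
                    size_t& batch_m_dim, size_t& new_m_dim) {
    if (m_dim <= optimal_parallelism)
        return false;  // M is too small to be worth splitting
    for (size_t batch = optimal_parallelism; batch < m_dim; ++batch) {
        if (m_dim % batch == 0) {
            batch_m_dim = batch;        // outer part of M, parallelized across threads
            new_m_dim = m_dim / batch;  // inner part of M, executed per batch item
            return true;
        }
    }
    return false;  // no suitable divisor found
}
```

E.g., with M = 384 and 8 threads such a heuristic would yield a batch part of 8 and an inner M of 48.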

Contributor


This is a rather complicated interface for a transformation, it should be simplified in the scope of 148805.

src/common/snippets/src/utils/utils.cpp
std::unordered_set<lowered::ExpressionPtr>& visited,
std::function<void(lowered::ExpressionPtr)> func,
bool visit_parent_path) {
std::deque<lowered::ExpressionPtr> exprs{expr};
Contributor


Why use a deque when all pushes/pops are performed on the front? Why not use a vector instead?
AFAIK a deque is needed when we need mixed front/back access.

Contributor Author


From my perspective, in this use case (all pushes/pops are performed on the front) std::deque is still better than vector: it has the corresponding API, and the complexity of insertion or removal of elements at the beginning is O(1)

Contributor

@IvanNovoselov IvanNovoselov Sep 3, 2024


Right, no objection from the performance perspective.
My question is rather why we need to use something exotic like deque when a simple vector would do just as well (just use push_back and pop_back).
To be clear, I don't suggest changing anything, it's just sincere curiosity 🙂

Contributor Author


I finally understood your question :) Actually you are right, we can easily use a vector here: the logic doesn't require pushing/popping elements exactly from the front
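The vector-based variant discussed here could look like the following sketch. Names are simplified stand-ins: `Node` replaces lowered::ExpressionPtr, and the neighbor callback replaces the real LIR connectivity:

```cpp
#include <functional>
#include <unordered_set>
#include <vector>

// Sketch of the traversal with std::vector used as a LIFO stack: push_back and
// pop_back give the same O(1) complexity as deque's front operations, without
// pulling in a second container type.
using Node = int;

void visit_path(Node start,
                const std::function<std::vector<Node>(Node)>& neighbors,
                std::unordered_set<Node>& visited,
                const std::function<void(Node)>& func) {
    std::vector<Node> stack{start};
    while (!stack.empty()) {
        const Node expr = stack.back();
        stack.pop_back();
        if (!visited.insert(expr).second)
            continue;  // skip already visited expressions
        func(expr);
        for (Node next : neighbors(expr))
            stack.push_back(next);
    }
}
```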

src/common/snippets/src/utils/utils.cpp
*/
void update_loop_info(const lowered::LinearIRCPtr& linear_ir) const;
void update_loop_info(const lowered::LinearIRCPtr& linear_ir, LoopInfoRuntimeParamsMap& initializated_info_map) const;
Contributor


Again, with signatures like this it's unclear whether we want to get the initialized_info_map as a result of this function call or whether we want to allow users to specify some loop_infos. In other words, is it valid to pass a non-empty initialized_info_map?

Contributor Author


Please take a look at the description:

    /**
     * @brief Update Loop information in LinearIR: Unified and ExpandedLoopInfo
     * @param linear_ir LinearIR
     * @param initializated_info_map Reference on a map [LoopInfo->RuntimeParams].
     * Can be used to pass in the method loop infos which were already initialized, e.g. by parallel domain optimization
     */
    void update_loop_info(const lowered::LinearIRCPtr& linear_ir, LoopInfoRuntimeParamsMap& initializated_info_map) const;

I tried to point out there that the user can pass a map with already initialized infos => it is valid to pass a non-empty initializated_info_map. Please let me know if the description is not clear enough -- I will try to reformulate it

Contributor

@IvanNovoselov IvanNovoselov Sep 3, 2024


After reviewing this one and a few subsequent PRs, I realized that this kind of interface with a map reference as an argument is probably inevitable and a lesser evil at this point. So let's leave it as is.
I think that the real problem is how the loop_infos are stored, especially in the case of split loops (but for Expanded loop info as well). To make these kinds of updates more transparent, we need to reorganize the loop_infos' connectivity at the design level.
Please remind me about this at the meeting, I'll share some more details about it. It would be interesting to know your opinion.

auto shapes = extract_shapes();
auto layouts = extract_layouts();
if (m_optimizer.enabled()) {
m_optimizer.optimize(m_config->master_shape, initialized_info, shapes, layouts, m_in_num);
Contributor


Why do we need to pass shapes and layouts to the optimizer explicitly here, when we already passed it m_io_desc during the initialization?
The optimizer can extract shapes and layouts internally. It should also be responsible for the tensor_rank and data_offsets updates. This would allow us to simplify signatures and reduce code coupling.
As a user, I would expect something like:

if (!m_optimizer.optimize(...)) // if m_optimizer.optimize() is true => then the optimizer handled everything correctly
    update_data_offsets(extract_shapes(), extract_layouts());        // if m_optimizer.optimize() is false => optimizer is disabled, and we need to init offsets manually

Contributor Author

@v-Golubev v-Golubev Sep 2, 2024


I like your suggestion, but currently update_data_offsets is RuntimeConfigurator's protected API, so we can't just call it inside the optimizer. I think it will be easier to implement this idea within 148891, when all the needed logic is performed by lowered passes.

Contributor

@IvanNovoselov IvanNovoselov Sep 3, 2024


so we can't just call it inside optimizer

I'm pretty sure we can, because ParallelWAOptimizer is an inner class of RuntimeConfigurator: it has access even to the configurator's private members.
So the only thing we need to do to call update_data_offsets from the optimizer is to pass it a pointer to the configurator: m_optimizer.optimize(this)
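As a minimal illustration of this point (Configurator, Optimizer, and update_data_offsets are simplified stand-ins for the real RuntimeConfigurator classes): a nested class is a member of the enclosing class and, since C++11, can call the enclosing class's private members once it is handed a pointer to an instance:

```cpp
// A nested class has access to the enclosing class's private members, so an
// inner optimizer can drive the configurator's private update path itself.
class Configurator {
public:
    class Optimizer {
    public:
        bool enabled = false;
        bool optimize(Configurator* configurator) {
            if (!enabled)
                return false;  // caller falls back to the default update path
            configurator->update_data_offsets();  // private, but accessible here
            return true;
        }
    };

    Optimizer m_optimizer;

    bool run() {
        if (!m_optimizer.optimize(this))
            update_data_offsets();  // optimizer disabled: update manually
        return m_offsets_updated;
    }

private:
    bool m_offsets_updated = false;
    void update_data_offsets() { m_offsets_updated = true; }
};
```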

Contributor Author


I applied your suggestion in the follow-up PR. Thanks for the proposal!

if (linear_ir->is_dynamic()) {
update_loop_info(linear_ir);
update_loop_info(linear_ir, initialized_info);
Contributor


Can we make the optimizer responsible for calling update_loop_info for the affected loops?
For example:

  1. Implement update_loop_info(loop_id) that updates one loop
  2. Modify update_loop_info(linear_ir) so that it just calls update_loop_info(loop_id) for all loops in loop_map
  3. Update loops before the M dim optimization, and make the optimizer update the necessary loops if the dimension was split.

This would enhance modularity and simplify the logic. Otherwise we have to complicate signatures and carry some opaque states from the optimizer downwards (shapes/layouts/loop_info), and we don't even know whether the states were updated or not.
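The shape of the proposed refactoring might look like the following sketch. LoopInfo, loop_map, and the trivial "update" rule are simplified stand-ins, not the real snippets API:

```cpp
#include <cstddef>
#include <map>

// Sketch of the proposal: a per-loop entry point that the whole-IR overload
// simply iterates, so an optimizer can re-run it only for the loops it split.
struct LoopInfo {
    size_t work_amount = 0;
    bool updated = false;
};

class Configurator {
public:
    std::map<size_t, LoopInfo> loop_map;  // loop_id -> LoopInfo

    // 1. Update a single loop by id.
    void update_loop_info(size_t loop_id) {
        loop_map.at(loop_id).updated = true;
    }

    // 2. The whole-IR overload just iterates all known loops.
    void update_loop_info() {
        for (auto& entry : loop_map)
            update_loop_info(entry.first);
    }
};
```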

Contributor


In general, we should think of the optimizer as a pass: it takes the LIR and updates everything that needs to be updated, so all other passes/functions can work completely independently (not semantically, since passes may depend on each other, but functionally, i.e. the only state that is passed between the passes (sorry 🙃) is the LIR itself)

Contributor Author


Again, update_loop_info is RuntimeConfigurator's API, so it should be extracted to apply your suggestion. It seems like the best way here is to extract it as a separate lowered pass. I believe this should be done within the work on ticket 148891 (where all RuntimeConfigurator::update work should be done by passes) in order to avoid a mess of two approaches in one place

Contributor


The optimizer is currently implemented as a part of RuntimeConfigurator, so as a quick solution we can pass the configurator to the optimizer to make the necessary updates. The idea is to isolate all the optimization-related logic in one class.
Yes, the pass solution is better and is the long-term fix, but it's not clear when it will be addressed. So I would still suggest updating the optimizer a bit now, if it won't take too much time.

Contributor Author


The refactoring is done in the follow-up PR

@IvanNovoselov
Contributor

@v-Golubev, please address the remaining comments in a separate PR

@IvanNovoselov IvanNovoselov added this pull request to the merge queue Aug 15, 2024
Merged via the queue into openvinotoolkit:master with commit 848b5ca Aug 15, 2024
135 of 136 checks passed
akladiev pushed a commit that referenced this pull request Aug 21, 2024
### Details:
- *The PR enables dynamic FP32 MHA tokenization on x64 platforms 🎉*
- *`std::vector.resize()`, which was used for buffer scratchpad
allocation, is a very expensive operation due to the default construction
of elements. This PR replaces `std::vector.resize()` with CPU Node
Scratchpad memory, which can be shared between nodes. Also, since each
thread must have its own scratchpad memory, we used to allocate `size *
threads_max`; however, the thread count during execution can be lower
(it depends on the parallel work amount). Now we allocate only `size *
n_threads`, where `n_threads` is the real count of working threads.*
- *Fixed the dimension K validation in the `BrgemmBlocking` pass: one of
the inputs can have a dynamic value of this dimension*
- *Fixed `utils::broadcast_merge_dim()` and supported broadcasting of
integer values in IterHandlers. Added unit tests for
`utils::broadcast_merge_dim()`*
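The broadcasting merge mentioned in the last bullet typically follows the rule that two dimensions are compatible when they are equal, one of them is 1, or one of them is dynamic. A hedged sketch in that spirit (the `DYNAMIC` sentinel, the check ordering, and the return convention are assumptions; the real `utils::broadcast_merge_dim()` semantics may differ):

```cpp
#include <cstddef>
#include <limits>

// Illustrative broadcast-merge of two dimensions. A dynamic dimension adopts
// the other side's static value; two mismatched static dims (neither of which
// is 1) are incompatible.
constexpr size_t DYNAMIC = std::numeric_limits<size_t>::max();

bool broadcast_merge_dim(size_t& dst, size_t d1, size_t d2) {
    if (d1 == d2) { dst = d1; return true; }  // equal dims merge trivially
    if (d2 == 1) { dst = d1; return true; }   // broadcast d2 over d1
    if (d1 == 1) { dst = d2; return true; }   // broadcast d1 over d2
    if (d1 == DYNAMIC || d2 == DYNAMIC) {     // dynamic side takes the static value
        dst = (d1 == DYNAMIC) ? d2 : d1;
        return true;
    }
    return false;  // incompatible static dimensions
}
```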

### Tickets:
 - *149900*


### Prerequisites:
- [x] #25326
- [x] #25378
- [x] #25623
- [x] #25638
- [x] #25745
- [x] #25957
- [x] #25733
github-merge-queue bot pushed a commit that referenced this pull request Sep 10, 2024
Labels
category: CPU OpenVINO CPU plugin