[Snippets] Dynamic SplitDimensionM via runtime configurator #25733
Conversation
First part
Status: need to validate performance, will do soon
```cpp
/**
 * @brief Tries to split the M dimension in "shape" in accordance with the optimal parallel work amount
 * @param shape Original shape
 * @param optimal_parallelism_work_amount Optimal work amount
 * @param batch_m_dim Reference to the batch's part of the split M
 * @param new_m_dim Reference to the new M dim after the split
 * @return true if the split was successful, otherwise false
 */
static bool split(const ov::Shape& shape, size_t optimal_parallelism_work_amount, size_t& batch_m_dim, size_t& new_m_dim);

/**
 * @brief Splits the M dimension in the order
 * @param order Original order
 * @param m_index M dimension index
 * @return Updated order with the split M dimension
 */
static std::vector<size_t> get_updated_order(const std::vector<size_t>& order, size_t m_index);

/**
 * @brief Reshapes the M dimension in "shape": separates M into two parts, "batch_m_dim" and "new_m_dim"
 * @param shape Shape to split
 * @param m_index M dimension index
 * @param batch_m_dim Batch's part of the split M
 * @param new_m_dim New M dim after the split
 * @return The updated shape
 */
static ov::snippets::VectorDims reshape_m_dim(ov::snippets::VectorDims shape, size_t m_index, size_t batch_m_dim, size_t new_m_dim);

/**
 * @brief Unsqueezes the M dimension in "shape" (inserts "1" before the dimension)
 * @param shape Shape to split
 * @param m_index M dimension index
 * @return The updated shape
 */
static ov::snippets::VectorDims unsqueeze_m_dim(ov::snippets::VectorDims shape, size_t m_index);
```
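The `split` contract above can be illustrated with a small standalone sketch. This is hypothetical logic, not the actual OpenVINO implementation (the real pass also weighs kernel efficiency); `split_m` and its divisor search are illustrative only.

```cpp
#include <cstddef>

// Hypothetical sketch of splitting M into batch_m_dim * new_m_dim so that the
// batch part can be parallelized across the optimal work amount.
// NOT the actual OpenVINO heuristic -- an illustration of the contract only.
static bool split_m(size_t m, size_t optimal_parallelism_work_amount,
                    size_t& batch_m_dim, size_t& new_m_dim) {
    // Look for a divisor of M, preferring one close to the requested
    // parallelism; the co-factor becomes the per-kernel M.
    for (size_t divisor = optimal_parallelism_work_amount; divisor > 1; --divisor) {
        if (m % divisor == 0) {
            batch_m_dim = divisor;    // goes to the batch dimensions
            new_m_dim = m / divisor;  // M processed by one kernel invocation
            return true;
        }
    }
    return false;  // M is prime or too small: leave the shape untouched
}
```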
This is a rather complicated interface for a transformation; it should be simplified in the scope of ticket 148805.
```cpp
                      std::unordered_set<lowered::ExpressionPtr>& visited,
                      std::function<void(lowered::ExpressionPtr)> func,
                      bool visit_parent_path) {
    std::deque<lowered::ExpressionPtr> exprs{expr};
```
Why use a deque when all push/pop operations are performed on the front? Why not use a vector instead?
AFAIK a deque is needed when we need mixed front/back access.
From my perspective, in this use case (all pushes/pops are performed on the front) `std::deque` is still better than a vector: it has the corresponding API, and the complexity of insertion or removal of elements at the beginning is O(1).
Right, no objection from the performance perspective.
My question is rather why we need to use something exotic like a deque, when a simple vector would do just as well (just use push_back and pop_back).
To be clear, I don't suggest changing anything, it's just sincere curiosity 🙂
I finally understood your question :) Actually you are right, we can easily use a vector here: the logic doesn't require elements to be popped/pushed exactly from the front.
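For illustration, the same traversal pattern with a `std::vector` worklist might look like the sketch below. The names and the `int`-node graph are purely illustrative stand-ins, not the Snippets expression API.

```cpp
#include <functional>
#include <unordered_set>
#include <vector>

// Illustrative sketch: all pushes/pops happen at one end, so std::vector works
// as the LIFO worklist just as well as std::deque would.
static void visit_all(int start,
                      const std::function<std::vector<int>(int)>& neighbors,
                      const std::function<void(int)>& func) {
    std::unordered_set<int> visited;
    std::vector<int> work{start};
    while (!work.empty()) {
        const int cur = work.back();  // push_back/pop_back instead of front access
        work.pop_back();
        if (!visited.insert(cur).second)
            continue;  // node was already visited
        func(cur);
        for (int next : neighbors(cur))
            work.push_back(next);
    }
}
```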
```diff
  */
-void update_loop_info(const lowered::LinearIRCPtr& linear_ir) const;
+void update_loop_info(const lowered::LinearIRCPtr& linear_ir, LoopInfoRuntimeParamsMap& initializated_info_map) const;
```
Again, with signatures like this it's unclear whether we want to get the `initialized_info_map` as a result of this function call or we want to allow users to specify some loop_infos. In other words, is it valid to pass a non-empty `initialized_info_map`?
Please take a look at the description:

```cpp
/**
 * @brief Updates Loop information in LinearIR: Unified and ExpandedLoopInfo
 * @param linear_ir LinearIR
 * @param initializated_info_map Reference to a map [LoopInfo->RuntimeParams].
 *        Can be used to pass into the method loop infos which were already initialized, e.g. by the parallel domain optimization
 */
void update_loop_info(const lowered::LinearIRCPtr& linear_ir, LoopInfoRuntimeParamsMap& initializated_info_map) const;
```

I tried to point out there that the user can pass a map with already initialized infos, i.e. it is valid to pass a non-empty `initializated_info_map`. Please let me know if the description is not clear enough -- I will try to reformulate it.
After reviewing this one and a few subsequent PRs, I realized that this kind of interface with a map reference as an argument is probably inevitable and a lesser evil at this point. So let's leave it as is.
I think the real problem is how the loop_infos are stored, especially in the case of split loops (but for ExpandedLoopInfo as well). To make these kinds of updates more transparent, we need to reorganize the loop_infos' connectivity at the design level.
Please remind me about this at the meeting, I'll share some more details. It would be interesting to know your opinion.
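The "possibly pre-initialized map passed by reference" contract under discussion can be sketched in isolation. The types and names below are illustrative stand-ins, not the actual Snippets classes.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative stand-in for a LoopInfo -> runtime-params mapping.
using ParamsMap = std::unordered_map<std::string, int>;

// Entries already present in `initialized` (e.g. seeded by a parallel-domain
// optimizer) are reused; everything else is initialized here. Passing a
// non-empty map is therefore a valid and intended use.
static void update_infos(const std::vector<std::string>& loops, ParamsMap& initialized) {
    for (const auto& loop : loops) {
        if (initialized.count(loop))
            continue;  // keep the caller-provided params untouched
        initialized[loop] = static_cast<int>(loop.size());  // placeholder "init"
    }
}
```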
```cpp
auto shapes = extract_shapes();
auto layouts = extract_layouts();
if (m_optimizer.enabled()) {
    m_optimizer.optimize(m_config->master_shape, initialized_info, shapes, layouts, m_in_num);
```
Why do we need to pass `shapes` and `layouts` to the optimizer explicitly here, when we already passed it `m_io_desc` during initialization? The optimizer can extract shapes and layouts internally. It should also be responsible for the `tensor_rank` and `data_offsets` updates. This would allow us to simplify signatures and reduce code coupling.
As a user, I would expect something like:

```cpp
if (!m_optimizer.optimize(...))  // true => the optimizer handled everything correctly
    update_data_offsets(extract_shapes(), extract_layouts());  // false => optimizer disabled, init offsets manually
```
I like your suggestion, but currently `update_data_offsets` is `RuntimeConfigurator`'s protected API, so we can't just call it inside the optimizer. I think it will be easier to implement this idea within 148891, when all the needed logic is performed by lowered passes.
> so we can't just call it inside optimizer

I'm pretty sure we can, because `ParallelWAOptimizer` is an inner class of `RuntimeConfigurator`: it has access even to the configurator's private members. So the only thing we need to do to call `update_data_offsets` from the optimizer is to pass it a pointer to the configurator: `m_optimizer.optimize(this)`.
I applied your suggestion in the follow-up PR. Thanks for the proposal!
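The `m_optimizer.optimize(this)` trick relies on a C++ rule: a nested class is a member of the enclosing class and may therefore access its private members. A minimal sketch (the class and member names below are illustrative, not the real OpenVINO types):

```cpp
// A nested class is a member of its enclosing class, so it can call the
// enclosing class's private member functions through a pointer to an instance.
class Configurator {
public:
    class Optimizer {
    public:
        bool optimize(Configurator* c) {
            c->update_data_offsets();  // OK: nested class sees private members
            return true;
        }
    };

    int offsets_updated = 0;  // public only so the sketch is observable

private:
    void update_data_offsets() { ++offsets_updated; }
};
```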
```diff
 if (linear_ir->is_dynamic()) {
-    update_loop_info(linear_ir);
+    update_loop_info(linear_ir, initialized_info);
```
Can we make the optimizer responsible for calling `update_loop_info` for the affected loops? For example:

- Implement `update_loop_info(loop_id)` that updates one loop
- Modify `update_loop_info(linear_ir)` so that it just calls `update_loop_info(loop_id)` for all loops in the loop map
- Update the loops before the M dim optimization, and make the optimizer update the necessary loops if the dimension was split.

This would enhance modularity and simplify the logic. Otherwise we have to complicate signatures and carry some opaque state from the optimizer downwards (shapes/layouts/loop_info), and we don't even know whether that state was updated or not.
In general, we should think of the optimizer as a pass: it takes the LIR and updates everything that needs to be updated, so all other passes/functions can work completely independently (not semantically, since passes may depend on each other, but functionally, i.e. the only state that is passed between the passes (sorry 🙃) is the LIR itself).
Again, `update_loop_info` is `RuntimeConfigurator`'s API, so it would have to be extracted to apply your suggestion. It seems the best way here is to extract it as a separate lowered pass. I believe this should be done within the work on ticket 148891 (where all `RuntimeConfigurator::update` work should be done by passes) in order to avoid a mess of two approaches in one place.
The optimizer is currently implemented as part of RuntimeConfigurator, so as a quick solution we can pass the configurator to the optimizer to make the necessary updates. The idea is to isolate all the optimization-related logic in one class.
Yes, the pass solution is better, and it's the long-term fix, but it's not clear when it will be addressed. So I would still suggest updating the optimizer a bit now, if it doesn't take too much time.
The refactoring is done in the follow-up PR.
@v-Golubev, please address the remaining comments in a separate PR
### Details:
- *The PR enables dynamic FP32 MHA tokenization on x64 platforms 🎉*
- *`std::vector.resize()`, which was used for buffer scratchpad allocation, is a very expensive operation due to the default constructor of elements. This PR replaces `std::vector.resize()` with CPU Node Scratchpad memory which can be shared between nodes. Also, since each thread must have its own scratchpad memory, we allocated `size * threads_max`; however, the thread count during execution can be less (it depends on the parallel work amount). Now we allocate only `size * n_threads`, where `n_threads` is the real count of working threads.*
- *Fixed dimension K validation in the `BrgemmBlocking` pass: one of the inputs can have a dynamic value of this dimension*
- *Fixed `utils::broadcast_merge_dim()` and supported broadcasting of integer values in IterHandlers. Added unit tests for `utils::broadcast_merge_dim()`*

### Tickets:
- *149900*

### Prerequisites:
- [x] #25326
- [x] #25378
- [x] #25623
- [x] #25638
- [x] #25745
- [x] #25957
- [x] #25733
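The scratchpad-sizing change described above — allocating for the threads that will actually work rather than for `threads_max` — can be sketched as follows. This is an illustration, not the actual CPU plugin code; `scratchpad_bytes` and its parameter names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative sketch: each working thread needs its own scratchpad region,
// but threads beyond the available parallel work stay idle, so we size the
// buffer by the real worker count instead of the maximum thread count.
static size_t scratchpad_bytes(size_t per_thread_size,
                               size_t parallel_work_amount,
                               size_t threads_max) {
    const size_t n_threads = std::min(parallel_work_amount, threads_max);
    return per_thread_size * n_threads;
}
```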
### Details: PR with leftovers of #25733.
### Details:
- *`ParallelWAOptimizer` class, which is used in `RuntimeConfigurator` to split the M dimension of LIR with brgemms in order to optimize the parallel work amount*

### Tickets: