[Snippets] Applied Ivan comments
a-sidorova committed Feb 9, 2024
1 parent 217adb0 commit bf1875e
Showing 1 changed file with 21 additions and 19 deletions: src/common/snippets/docs/snippets_design_guide.md
@@ -567,17 +567,16 @@ Note that if a `PortDescriptor` is required, you can obtain it directly (without

Concluding this section, it's worth mentioning that the `LinearIR` currently provides several debug features: `debug_print()`, serialization, performance counters and a segfault detector.
The first one prints input and output `PortConnectors` and `PortDescriptors` for every `Expression` to stderr.
The second one allows users to serialize the `LIR` in two representations: as a control flow graph of `LinearIR` using the pass `SerializeControlFlow` and
as a data flow graph of `LinearIR` using the pass `SerializeDataFlow` (control flow operations such as `LoopBegin`/`LoopEnd` are not serialized).
Both serializations are saved in the `xml` OpenVINO graph format, where a lot of useful parameters are displayed for every `Expression`.
Please see [perf_count.md](./debug_capabilities/perf_count.md) and [snippets_segfault_detector.md](./debug_capabilities/snippets_segfault_detector.md) for more info regarding the performance counters and segfault detector.
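
As an illustration, the two serialization views could be dumped from a debugging session roughly as follows. This is a minimal sketch: the header paths, the `ov::snippets::lowered::pass` namespace and the assumption that the serialization passes take an output path in their constructors and expose the common lowered-pass `run(LinearIR&)` method should be verified against the actual sources.

```cpp
// Minimal debugging sketch (assumed constructor and run() signatures):
// dump both LinearIR views as OpenVINO xml graphs.
#include "snippets/lowered/linear_ir.hpp"
#include "snippets/lowered/pass/serialize_control_flow.hpp"
#include "snippets/lowered/pass/serialize_data_flow.hpp"

void dump_linear_ir(ov::snippets::lowered::LinearIR& linear_ir) {
    using namespace ov::snippets::lowered::pass;
    // Control flow view: every Expression is kept, including LoopBegin/LoopEnd.
    SerializeControlFlow("lir_control_flow.xml").run(linear_ir);
    // Data flow view: control flow operations are omitted.
    SerializeDataFlow("lir_data_flow.xml").run(linear_ir);
}
```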

The `LinearIR` also provides a convenient interface for working with new `Expressions`.
This interface includes the `insert_node(...)` method that creates a new `Expression` based on the passed `ov::Node` and inserts it in the specified place in the `LinearIR`.
The `insert_node(...)` method automatically performs all the necessary actions for full integration into the `LinearIR` (connection with parents and consumers and updates of the corresponding `LoopInfo`).
This helper is used in several control flow transformations that are usually named `InsertSomething`; consider the `InsertLoadStore` pass in [insert_load_store.cpp](../src/lowered/pass/insert_load_store.cpp) as an example.
The method `replace_with_node(...)` replaces existing `Expressions` in the `LinearIR` with a new `Expression` created from the passed `ov::Node`.
An example of using this helper can be found in the `LoadMoveBroadcastToBroadcastLoad` pass inside [load_movebroadcast_to_broadcastload.cpp](../src/lowered/pass/load_movebroadcast_to_broadcastload.cpp).
The method `replace_with_expr(...)` replaces the existing `Expressions` in the `LinearIR` with the passed `Expression`.
For more details regarding these helpers, please refer to the relevant descriptions in the `LinearIR` interface inside [linear_ir.cpp](../src/lowered/linear_ir.cpp).
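
For illustration only, a use of `insert_node(...)` might look roughly like the sketch below. The parameter names, their order and the `op::Load` construction are assumptions made for the example; the actual overloads are documented in [linear_ir.cpp](../src/lowered/linear_ir.cpp).

```cpp
// Illustrative sketch only: the insert_node() argument list is an assumption,
// not the actual LinearIR interface (see linear_ir.cpp for the real overloads).
#include <memory>
#include <vector>

#include "snippets/lowered/linear_ir.hpp"
#include "snippets/op/load.hpp"

namespace lowered = ov::snippets::lowered;

lowered::ExpressionPtr insert_load_example(lowered::LinearIR& linear_ir,
                                           const ov::Output<ov::Node>& parent_output,
                                           const lowered::PortConnectorPtr& parent_connector,
                                           const std::vector<size_t>& loop_ids,
                                           const lowered::LinearIR::constExprIt& place) {
    const auto load = std::make_shared<ov::snippets::op::Load>(parent_output);
    // insert_node() creates an Expression for `load`, connects it to the parent and
    // consumer PortConnectors and updates the corresponding LoopInfo, so no manual
    // rewiring of the LinearIR is required.
    return linear_ir.insert_node(load, std::vector<lowered::PortConnectorPtr>{parent_connector},
                                 loop_ids, /*update_loop_ports=*/true, place);
}
```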
@@ -597,23 +596,26 @@ These are needed, so appropriate instructions will be emitted during the code ge
5. `InitLoops` - initializes data pointer shift parameters (pointer increments and finalization offsets for each loop port) in `LoopInfo`.
6. `InsertLoops` - inserts explicit `LoopBegin` and `LoopEnd` (`snippets::op`) operations based on the acquired `LoopInfo`.
Again, the explicit operations are needed to emit appropriate instructions later.
7. `InsertBroadcastMove` inserts `MoveBroadcast` before `Expressions` with several inputs, since a special broadcasting instruction needs to be generated to broadcast a single value to fill the whole vector register (a generic illustration of such broadcasting is sketched after this list).
8. `LoadMoveBroadcastToBroadcastLoad` fuses `Load->MoveBroadcast` sequences into single `BroadcastLoad` expressions.
9. `AllocateBuffers` is responsible for safe `Buffer` data pointer increments and common memory size calculation. For more details, please refer to the end of this section.
10. `CleanRepeatedDataPointerShifts` - eliminates redundant pointer increments from the loops.
11. `PropagateLayout` - propagates data layouts to `Parameters` and `Results` (actually to corresponding `Expressions`), so `Kernel` will be able to calculate appropriate data offsets for every iteration of an external parallel loop.
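
To build intuition for step 7 above, broadcasting a single value so that it fills a whole vector register maps to dedicated SIMD instructions on the target. The sketch below is a generic x86 intrinsics illustration of that effect; it is not the code emitted by the Snippets generator.

```cpp
// Generic illustration of value broadcasting (not Snippets-generated code):
// one scalar is replicated across all lanes of a vector register, which is the
// effect the BroadcastMove/BroadcastLoad emitters achieve with dedicated instructions.
#include <immintrin.h>

__m256 broadcast_scalar(float value) {
    // Fill all 8 float lanes of a 256-bit register with `value`.
    return _mm256_set1_ps(value);
}

__m256 broadcast_from_memory(const float* ptr) {
    // Load a single float from memory and replicate it, analogous to BroadcastLoad.
    return _mm256_broadcast_ss(ptr);
}
```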

As mentioned above, the `op::Buffer` operations are managed by the `AllocateBuffers` pass.
Before describing the algorithm, it is necessary to briefly consider the structure of `Buffer`:
* All `Buffers` together represent the `Buffer scratchpad` (a common memory that is needed for storing intermediate results).
* Each `Buffer` has an `offset` relative to the common data pointer (pointer of `Buffer scratchpad`) and `ID` (the `Buffers` with the same `ID` have the same assigned register).

The algorithm supports two modes: optimized and non-optimized.
The optimized one calculates minimal memory size and minimal unique `ID` count required to handle all the buffers.
The non-optimized version assigns each buffer a unique `ID` and `offset`.
The first mode is the default one, while the second one might be used for debugging the optimized version.
The optimized `AllocateBuffers` algorithm consists of the following main steps:
1. `IdentifyBuffers` - analyzes `Buffer` access patterns and assigns `IDs` to the `Buffers` to avoid redundant pointer increments. A graph coloring algorithm is utilized for this purpose.
2. `DefineBufferClusters` - creates sets of `Buffer` ops - `BufferClusters`.
`Buffers` from one `BufferCluster` refer to the same memory area (they have the same `offset` relative to the `Buffer scratchpad` data pointer).
For example, consider a loop with `Buffer` ops on its input and output: if the loop body can write data to the memory from which it was read, these `Buffers` belong to one `BufferCluster`.
3. `SolveBufferMemory` - calculates the optimal memory size of the `Buffer scratchpad` based on `BufferClusters` and the lifetimes of `Buffers` (see the simplified sketch below).
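
To illustrate the idea behind the optimized mode, the following simplified and generic sketch (not the actual `AllocateBuffers` implementation) groups buffers whose lifetimes never overlap so that they can share one region of the `Buffer scratchpad`; the resulting total size can be much smaller than the sum of all buffer sizes.

```cpp
// Simplified, generic illustration of scratchpad size solving: buffers whose
// lifetimes never overlap may share one memory region. This is only a sketch of
// the underlying idea, not the actual AllocateBuffers implementation.
#include <algorithm>
#include <cstddef>
#include <vector>

struct BufferInfo {
    size_t size;        // required memory in bytes
    size_t first_use;   // index of the first expression that accesses the buffer
    size_t last_use;    // index of the last expression that accesses the buffer
    size_t offset = 0;  // assigned offset inside the scratchpad
};

size_t solve_scratchpad(std::vector<BufferInfo>& buffers) {
    // Each cluster keeps buffers whose lifetimes never overlap, so its members
    // can safely share one region sized by the largest member.
    std::vector<std::vector<BufferInfo*>> clusters;
    for (auto& buf : buffers) {
        auto fits = [&buf](const std::vector<BufferInfo*>& cluster) {
            return std::none_of(cluster.begin(), cluster.end(), [&buf](const BufferInfo* other) {
                return buf.first_use <= other->last_use && other->first_use <= buf.last_use;
            });
        };
        const auto it = std::find_if(clusters.begin(), clusters.end(), fits);
        if (it != clusters.end())
            it->push_back(&buf);
        else
            clusters.push_back({&buf});
    }
    // Lay the clusters out one after another inside the scratchpad.
    size_t total_size = 0;
    for (auto& cluster : clusters) {
        size_t cluster_size = 0;
        for (auto* buf : cluster) {
            buf->offset = total_size;
            cluster_size = std::max(cluster_size, buf->size);
        }
        total_size += cluster_size;
    }
    return total_size;  // total Buffer scratchpad size
}
```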

More details on control flow optimization passes can be found in the `control_flow_transformations(...)` method inside [subgraph.cpp](../src/op/subgraph.cpp).
@@ -642,8 +644,8 @@ The `increment` defines how many data entries are processed on every loop iterat
So if a loop's `work_amount` is not evenly divisible by its `increment`, tail processing is required (a scalar sketch of this split is given after this list).
`InsertTailLoop` duplicates the body of such a loop, rescales pointer increments and load/store masks appropriately, and injects these `Ops` immediately after the processed loop.
3. `CleanupLoopOffsets` "fuses" a loop's finalization offsets with an outer loop's pointer increments and zeroes the offsets before `Result` operations.
4. `OptimizeLoopSingleEvaluation` moves all pointer arithmetic to finalization offsets in `LoopEnd`, and marks the loops that will be executed only once.
This information will be used during code emission to eliminate redundant instructions.
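
For intuition about the `work_amount`/`increment` split mentioned in step 2, the scalar sketch below uses illustrative numbers (`work_amount = 17`, `increment = 8`): the vector body runs twice and a tail body processes the single remaining element.

```cpp
// Scalar sketch of the work split performed for a loop whose work_amount is not
// divisible by its increment (illustrative only, not Snippets-generated code).
#include <cstddef>
#include <cstdio>

void run_loop(size_t work_amount, size_t increment) {
    const size_t vector_part = work_amount - work_amount % increment;

    // Main (vector) body: processes `increment` elements per iteration.
    for (size_t i = 0; i < vector_part; i += increment)
        std::printf("vector body: elements [%zu, %zu)\n", i, i + increment);

    // Tail body: a rescaled copy of the loop body that handles the remainder,
    // e.g. with adjusted pointer increments and masked load/store operations.
    if (vector_part != work_amount)
        std::printf("tail body:   elements [%zu, %zu)\n", vector_part, work_amount);
}

int main() {
    run_loop(/*work_amount=*/17, /*increment=*/8);  // two vector iterations + a tail of one element
    return 0;
}
```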

Please see [assign_registers.cpp](../src/lowered/pass/assign_registers.cpp) and [insert_tail_loop.cpp](../src/lowered/pass/insert_tail_loop.cpp) for more info regarding the main passes in the `Preparation` stage.
When the `Preparation` is finished, the `Generator` constructs target-specific emitters by calling the `init_emitter(target)` method for every `Expression` in the `LinearIR`, where the `target` is a `TargetMachine` instance.
