[Snippets] Moved infrastructure to Linear Intermediate Representation #16402

Merged
28 commits:
- `1377883` Introduce linear IR and disable obsolete tests (IvanNovoselov, Jan 4, 2023)
- `d5f8fb5` [Snippets] Added Loop markup and Loop Fusion on Linear IR Level (a-sidorova, Mar 17, 2023)
- `b675bef` Added support of custom Plugin ops in Linear IR (a-sidorova, Mar 29, 2023)
- `feb7bfc` [Snippets] Added Buffer identification (a-sidorova, Mar 30, 2023)
- `7d4ce5c` [Snippets] Refactoring (a-sidorova, Apr 17, 2023)
- `1fada21` Fixes after rebasing (a-sidorova, Apr 17, 2023)
- `c530927` Removed work around for StoreEmitter (a-sidorova, Apr 17, 2023)
- `467c7aa` [Snippets] Refactoring of transformations (a-sidorova, Apr 17, 2023)
- `508a34b` [Snippets] Rebased on the latest master (a-sidorova, Apr 19, 2023)
- `7440994` [Snippets] Added support of Port Descriptor (#106) (a-sidorova, May 11, 2023)
- `43936a3` Applied comments by Ivan #1 (a-sidorova, May 11, 2023)
- `e7ee0d5` Fixed Loads with the same Parent: CleanRepeatedPtrShifts (a-sidorova, May 11, 2023)
- `5ef7227` Updated Buffer Identification logic (a-sidorova, May 11, 2023)
- `2ea1bf3` Cleaned cmake lists (a-sidorova, May 11, 2023)
- `9b45cfb` fixes after rebase (a-sidorova, May 11, 2023)
- `979b673` fixed lin build (a-sidorova, May 11, 2023)
- `ef6717e` fixed build 2 (a-sidorova, May 11, 2023)
- `dd0a4e1` added missed file (a-sidorova, May 11, 2023)
- `f5d59ce` fixed snippets test build (a-sidorova, May 12, 2023)
- `1eb736a` Applied comments by Ivan #2 (a-sidorova, May 12, 2023)
- `89f99e5` [Snippets] Moved reg_info from Expression to PortDescriptor (a-sidorova, May 15, 2023)
- `f71b552` Moved Linear IR transformations from generator to Subgraph (a-sidorova, May 16, 2023)
- `ec5920b` Fixed InsertStore for Buffer wo inputs (a-sidorova, May 17, 2023)
- `14b8709` Removed incorrect extra copy rt_info which break PortDescriptors (a-sidorova, May 17, 2023)
- `13d956f` [Snippets] Moved namespace from ngraph to ov (a-sidorova, May 18, 2023)
- `d81287e` Applied comments by Dmitry (a-sidorova, May 19, 2023)
- `dbfe69a` [Snippets] Tensor -> PortConnector (a-sidorova, May 19, 2023)
- `0e04ae1` [Snippets] Added link to doc (a-sidorova, May 19, 2023)
16 changes: 8 additions & 8 deletions src/common/snippets/docs/snippets_design_guide.md
@@ -5,7 +5,7 @@ This document describes the design and rationale for a snippets code generator.

Core **CNN operators (convolution, gemm, fully connected) are limited by compute, while the rest are memory bound**. Math approximations (like transcendental functions) are rare in emerging workloads and could be treated with the same machinery. **Snippets are designed to optimize topology for memory**, while leaving compute-intensive kernels to backend developers.

-The **potential speedup is proportional to the shrink in memory-walked bytes**. Therefore, the problem can be transformed into a task of optimizing for memory walks, whatever pattern a snippet has and whatever operations it contains. The number of memory walks should be less than or equal to that of handcrafted optimizations. This guarantees performance improvements over the previous approach (excluding corner cases caused by cache effects). *The shrinkage factor might be encoded into some cost function in a future evolution of the code generator*. The snippets generator provides diagnostics to estimate this shrinkage factor with the `ngraph::snippets::op::Subgraph::print_statistics(bool verbose)` member.
+The **potential speedup is proportional to the shrink in memory-walked bytes**. Therefore, the problem can be transformed into a task of optimizing for memory walks, whatever pattern a snippet has and whatever operations it contains. The number of memory walks should be less than or equal to that of handcrafted optimizations. This guarantees performance improvements over the previous approach (excluding corner cases caused by cache effects). *The shrinkage factor might be encoded into some cost function in a future evolution of the code generator*. The snippets generator provides diagnostics to estimate this shrinkage factor with the `ov::snippets::op::Subgraph::print_statistics(bool verbose)` member.

The SnippetS generator is designed for back-end developers. The main purpose of the snippets code generator is to decompose **operator fusion**, **register allocation**, and **target kernel generation**. This allows modifications (like new fusion support) and feature extensions (like new operation support) to be made at a single point and avoids a combinatorial explosion across fusions/types/architectures.

@@ -28,7 +28,7 @@ Code generation is split into 2 phases, **tokenization** and **lowering**.

### Tokenization

-Tokenization runs on a full topology nGraph function inside a specific plugin at the common transformations stage. The input of tokenization is a topology graph; the output is a modified topology graph with `ngraph::snippets::op::Subgraph` operations installed. Each subgraph contains an nGraph function (called the **body**) which holds a part of the original topology that is legal for snippet generation (it can be scheduled with a single schedule).
+Tokenization runs on a full topology nGraph function inside a specific plugin at the common transformations stage. The input of tokenization is a topology graph; the output is a modified topology graph with `ov::snippets::op::Subgraph` operations installed. Each subgraph contains an nGraph function (called the **body**) which holds a part of the original topology that is legal for snippet generation (it can be scheduled with a single schedule).

A procedure of finding subgraphs suitable for code generation is called **tokenization**. During tokenization the topology tree is split into subgraphs with the same greedy approach that is used for parsing an input stream of characters into tokens. It may also be seen as, and reduced to, a basic block construction problem, since there is a leader and potentially terminators. See the example of implementation [here](https://github.com/openvinotoolkit/openvino/blob/master/src/common/snippets/src/pass/collapse_subgraph.cpp).
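The greedy grouping described here can be sketched with a toy model. The string-based nodes and the `is_supported` predicate below are assumptions made for illustration; the real pass works on nGraph nodes and handles data dependencies, but the token-like accumulation is the same idea.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a topology node: just an operation name.
using Node = std::string;

// Assumed predicate: which ops are legal inside a snippet body.
bool is_supported(const Node& n) {
    return n == "Add" || n == "Multiply" || n == "Relu";
}

// Greedy tokenization: walk ops in topological order and collapse
// maximal runs of supported ops into subgraphs, the way a lexer
// groups characters into tokens.
std::vector<std::vector<Node>> tokenize(const std::vector<Node>& ops) {
    std::vector<std::vector<Node>> subgraphs;
    std::vector<Node> current;
    for (const auto& op : ops) {
        if (is_supported(op)) {
            current.push_back(op);          // extend the current subgraph
        } else if (!current.empty()) {
            subgraphs.push_back(current);   // an unsupported op terminates it
            current.clear();
        }
    }
    if (!current.empty())
        subgraphs.push_back(current);
    return subgraphs;
}
```

For example, `tokenize({"Add", "Relu", "Convolution", "Multiply", "Add"})` splits around the unsupported `Convolution` and yields two subgraphs.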

@@ -94,7 +94,7 @@ The goal of this step is to apply target-independent and schedule-related optimi

All input and output shapes are normalized to 6D for future schedule generation. If shape propagation fails or leads to inconsistent output shapes, an exception is raised.
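Rank normalization amounts to padding the shape with leading ones; a minimal sketch using `std::vector<size_t>` as a stand-in for `ov::Shape` (the real canonicalization also reconciles layouts and broadcast semantics):

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Normalize a shape to rank 6 by padding with leading 1s,
// e.g. {42, 17} -> {1, 1, 1, 1, 42, 17}.
std::vector<size_t> normalize_to_6d(const std::vector<size_t>& shape) {
    if (shape.size() > 6)
        throw std::runtime_error("shape propagation failed: rank > 6");
    std::vector<size_t> result(6 - shape.size(), 1);  // leading ones
    result.insert(result.end(), shape.begin(), shape.end());
    return result;
}
```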

-The layout assigned by user code and passed to a `generate` function is propagated through a subgraph at this step as well. The layout is passed to the `generate` function as a `BlockedShapeVector`, which is a `std::vector<BlockedShape>`, while `BlockedShape` is `std::tuple<ngraph::Shape, ngraph::AxisVector, ngraph::element::Type>`. For example, if a backend supports the `NCHW16c` layout and a tensor has a size of `<1, 42, 17, 31>` and holds single-precision floating point, this structure should be `std::make_tuple(ngraph::Shape {1, 3, 17, 31, 16}, ngraph::AxisVector {0, 1, 2, 3, 1}, ngraph::element::f32);`. This allows a generic layout representation.
+The layout assigned by user code and passed to a `generate` function is propagated through a subgraph at this step as well. The layout is passed to the `generate` function as a `BlockedShapeVector`, which is a `std::vector<BlockedShape>`, while `BlockedShape` is `std::tuple<ov::Shape, ov::AxisVector, ov::element::Type>`. For example, if a backend supports the `NCHW16c` layout and a tensor has a size of `<1, 42, 17, 31>` and holds single-precision floating point, this structure should be `std::make_tuple(ov::Shape {1, 3, 17, 31, 16}, ov::AxisVector {0, 1, 2, 3, 1}, ov::element::f32);`. This allows a generic layout representation.

##### Dialect conversion

@@ -191,17 +191,17 @@ Broadcast and regular streaming vector load is possible from the same pointer. B

#### Target-specific optimizations

-Target developers can plug specific optimizations into the code generation pipeline by passing an `ngraph::pass::Manager` into the `generate` function of the `subgraph`. **Passes are executed on the subgraph in canonical form converted to a snippets dialect**.
+Target developers can plug specific optimizations into the code generation pipeline by passing an `ov::pass::Manager` into the `generate` function of the `subgraph`. **Passes are executed on the subgraph in canonical form converted to a snippets dialect**.

*It might also be extended to provide an interface for target-independent optimizations in the future.*

#### Register allocation

-A canonicalized subgraph in a snippets dialect forms a basic block or region inside a snippet (kernel). Registers are allocated globally for the whole subgraph. Since all operations in a subgraph are assumed to be vector ones, only vector registers are allocated for the first generation of SnippetS. A linear scan register allocation algorithm is used. The register allocator is implemented as the `ngraph::snippets::pass::AssignRegisters` function pass and stores the allocated registers for each node in `rt_info`. `rt_info` for a node holds a register for the node's output. *However, this part should be refactored, either to become target-independent or to use a target-specific abstraction to acquire a new register.*
+A canonicalized subgraph in a snippets dialect forms a basic block or region inside a snippet (kernel). Registers are allocated globally for the whole subgraph. Since all operations in a subgraph are assumed to be vector ones, only vector registers are allocated for the first generation of SnippetS. A linear scan register allocation algorithm is used. The register allocator is implemented as the `ov::snippets::pass::AssignRegisters` function pass and stores the allocated registers for each node in `rt_info`. `rt_info` for a node holds a register for the node's output. *However, this part should be refactored, either to become target-independent or to use a target-specific abstraction to acquire a new register.*

#### Schedule generation

-The goal of this step is to transform subgraphs in scalar notation into kernel functions callable from user code. The `Kernel` and `Tile` operations are introduced for this purpose. Each of these operations has a constructor from a code region described as a collection of operation and operand pairs: `Kernel(const std::vector<std::pair<std::shared_ptr<ngraph::snippets::Emitter>, ngraph::snippets::RegInfo>>& region);`.
+The goal of this step is to transform subgraphs in scalar notation into kernel functions callable from user code. The `Kernel` and `Tile` operations are introduced for this purpose. Each of these operations has a constructor from a code region described as a collection of operation and operand pairs: `Kernel(const std::vector<std::pair<std::shared_ptr<ov::snippets::Emitter>, ov::snippets::RegInfo>>& region);`.

The example above can be used for the following hierarchical IR. If the scope is limited to layout-oblivious operations with broadcasting support, `Tile` could be generated as a single loop over the most varying dimension. The second `Tile` is generated to handle tails and can be omitted if not needed. A special pass replaces vector memory operations with scalar versions for the tail subgraph.

@@ -253,7 +253,7 @@ Where
A target code emission is table based. A target is responsible for filling `jitters` table field in `Generator` class.

```
-std::map<const ngraph::DiscreteTypeInfo, std::function<std::shared_ptr<Emitter>(std::shared_ptr<ngraph::Node>)>> jitters;
+std::map<const ov::DiscreteTypeInfo, std::function<std::shared_ptr<Emitter>(std::shared_ptr<ov::Node>)>> jitters;
```
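Table-based emission is a factory-map pattern: look the node's type up and invoke the stored callback to build an emitter. A self-contained sketch, with `std::string` standing in for `DiscreteTypeInfo` and a trivial `Emitter`:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

struct Emitter {  // trivial stand-in for the real emitter hierarchy
    explicit Emitter(std::string op) : op_name(std::move(op)) {}
    std::string op_name;
};

// Type info -> factory callback, mirroring the `jitters` table.
using JitterTable =
    std::map<std::string, std::function<std::shared_ptr<Emitter>(const std::string&)>>;

std::shared_ptr<Emitter> create_emitter(const JitterTable& jitters, const std::string& type) {
    auto it = jitters.find(type);
    if (it == jitters.end())
        throw std::runtime_error("Target code emitter is not available for " + type);
    return it->second(type);  // invoke the registered factory
}
```

A target registers its emitters once, for example `jitters["Add"] = [](const std::string& op) { return std::make_shared<Emitter>(op); };`, and the generator only ever performs lookups.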

##### Interface with a target
@@ -279,7 +279,7 @@ Once a schedule is generated, a target code is emitted from a kernel in `Generat

A target can potentially extend the snippets dialect with a target-specific operation for code emission. It should implement:

-* nGraph operation (for example, `class FMA : public ngraph::op::Op`)
+* nGraph operation (for example, `class FMA : public ov::op::Op`)
* Emitter for the operation (for example, `class FmaEmitter : public Emitter` )
* register the pair in `jitters` map

19 changes: 9 additions & 10 deletions src/common/snippets/include/snippets/emitter.hpp
@@ -6,9 +6,10 @@

#include <vector>
#include <cstdint>
-#include "ngraph/node.hpp"
+#include "openvino/core/node.hpp"

-namespace ngraph {
+namespace ov {
namespace snippets {

using code = const uint8_t *;
@@ -24,11 +25,9 @@ class Emitter {
/**
* @brief Default constructor
*/
-    Emitter(const std::shared_ptr<ngraph::Node>& n) {
-    }
+    Emitter(const std::shared_ptr<ov::Node>& n) {}

-    Emitter(std::vector<std::pair<std::shared_ptr<Emitter>, RegInfo>>& region) {
-    }
+    Emitter(std::vector<std::pair<std::shared_ptr<Emitter>, RegInfo>>& region) {}

/**
* @brief called by the generator to produce target code for a specific operation
@@ -47,12 +46,12 @@
* @brief called by generator to generate data section, if needed for a specific operation
* @return void
*/
-    virtual void emit_data() const {
-    }
+    virtual void emit_data() const {}

virtual ~Emitter() = default;
};

-using AllocatedEmitter = std::pair<std::shared_ptr<Emitter>, ngraph::snippets::RegInfo>;
+using AllocatedEmitter = std::pair<std::shared_ptr<Emitter>, ov::snippets::RegInfo>;

} // namespace snippets
-} // namespace ngraph
+} // namespace ov
94 changes: 12 additions & 82 deletions src/common/snippets/include/snippets/generator.hpp
@@ -9,73 +9,12 @@
#pragma once

#include "snippets_isa.hpp"
-#include "emitter.hpp"
-
-namespace ngraph {
-namespace snippets {
-
-auto getRegisters(std::shared_ptr<ngraph::Node>& n) -> ngraph::snippets::RegInfo;
-
-typedef std::pair<std::function<std::shared_ptr<Emitter>(const std::shared_ptr<ngraph::Node>&)>,
-                  std::function<std::set<std::vector<element::Type>>(const std::shared_ptr<ngraph::Node>&)>> jitters_value;
-/**
- * @interface TargetMachine
- * @brief Base class Target machine representation. Target derives from this class to provide generator information about supported emitters
- * @ingroup snippets
- */
-class TargetMachine {
-public:
-    /**
-     * @brief checks if target is natively supported
-     * @return true, if supported
-     */
-    virtual bool is_supported() const = 0;
-
-    /**
-     * @brief finalizes code generation
-     * @return generated kernel binary
-     */
-    virtual code get_snippet() const = 0;
-
-    /**
-     * @brief gets number of lanes supported by target's vector ISA
-     * @return number of lanes
-     */
-    virtual size_t get_lanes() const = 0;
-
-    /**
-     * @brief called by the generator to get the emitter factory for a target machine
-     * @return a map by node's type info with callbacks to create an instance of emitter for corresponding operation type
-     */
-    std::function<std::shared_ptr<Emitter>(std::shared_ptr<ngraph::Node>)> get(const ngraph::DiscreteTypeInfo type) const {
-        auto jitter = jitters.find(type);
-        if (jitter == jitters.end()) {
-            OPENVINO_THROW(std::string("Target code emitter is not available for ") + type.name + " operation.");
-        }
-        return jitter->second.first;
-    }
-
-    std::function<std::set<std::vector<element::Type>>(const std::shared_ptr<ngraph::Node>&)>
-    get_supported_precisions(const ngraph::DiscreteTypeInfo type) const {
-        auto jitter = jitters.find(type);
-        if (jitter == jitters.end()) {
-            OPENVINO_THROW(std::string("Target code emitter is not available for ") + type.name + " operation.");
-        }
-        return jitter->second.second;
-    }
+#include "snippets/lowered/linear_ir.hpp"
+#include "snippets/lowered/pass/pass.hpp"

-    /**
-     * @brief checks if emitter for a specific operation is supported
-     * @return true, if supported
-     */
-    bool has(const ngraph::DiscreteTypeInfo type) const {
-        return jitters.find(type) != jitters.end();
-    }
-    virtual ~TargetMachine() = default;
-
-protected:
-    std::map<const ngraph::DiscreteTypeInfo, jitters_value> jitters;
-};
+namespace ov {
+namespace snippets {

/**
* @interface Schedule
@@ -117,7 +56,7 @@ class Generator {
/**
* @brief Default constructor
*/
-    Generator(const std::shared_ptr<TargetMachine>& t) : target(t) {}
+    Generator(const std::shared_ptr<TargetMachine>& t) : target(t), lowered_saved{} {}
/**
* @brief Default destructor
*/
@@ -126,27 +65,18 @@
* @interface GeneratorConfig
* @brief Allows to tweak the lowering process.
*/
-    class GeneratorConfig {
-    public:
-        // True if the lowered Emitters need to be accessed during runtime. Normally they're destroyed after code emission.
-        bool m_save_lowered_code = false;
-        // True if we can optimize tails for single evaluation during code generation
-        // More details with optimization examples you can see in generate() method
-        // For example, tails with Buffer ops doesn't support single evaluation optimizations
-        // because of that we should always reset memory pointer using finalization offsets
-        // after data storing to Buffer
-        bool m_optimize_single_evaluation = true;
-        // True if we should check runtime info for nodes to call specific needed transformations
-        bool m_need_fill_tail_register = false;
-    };
    /**
     * @brief virtual method any specific implementation should implement
     * @param m model in canonical form for table-based code generation
     * @param config config with transformation and optimization parameters
     * @param compile_params parameters for generated code
     * @return pointer to generated code
     */
-    code generate(std::shared_ptr<ov::Model>& m, const GeneratorConfig& config, const void* compile_params = nullptr);
+    struct LoweringResult {
+        LoweringResult(code c) : binary_code(c) {}
+        code binary_code = nullptr;
+    };
+    LoweringResult generate(lowered::LinearIR& linear_ir, const lowered::Config& config, const void* compile_params = nullptr);

/**
* @brief gets target machine
@@ -180,8 +110,8 @@ class Generator {
std::shared_ptr<TargetMachine> target;
// todo: we need to save lowered code to access compiled brgemm kernels on execution time (normally lowered is destructed by then).
// This is temporary solution, remove this when kernel caching is implemented. Don't forget to make generate const method.
-    std::vector<AllocatedEmitter> lowered_saved;
+    lowered::LinearIR lowered_saved;
};

} // namespace snippets
-} // namespace ngraph
+} // namespace ov
6 changes: 3 additions & 3 deletions src/common/snippets/include/snippets/itt.hpp
@@ -11,15 +11,15 @@

#include <openvino/cc/ngraph/itt.hpp>

-namespace ngraph {
+namespace ov {
namespace pass {
namespace itt {
namespace domains {
OV_ITT_DOMAIN(SnippetsTransform);
OV_ITT_DOMAIN(SnippetsTransform);
} // namespace domains
} // namespace itt
} // namespace pass
-} // namespace ngraph
+} // namespace ov

OV_CC_DOMAINS(internal_op);

99 changes: 99 additions & 0 deletions src/common/snippets/include/snippets/lowered/expression.hpp
@@ -0,0 +1,99 @@
// Copyright (C) 2023 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#pragma once

#include <openvino/core/node.hpp>
#include <openvino/opsets/opset1.hpp>

#include "snippets/emitter.hpp"
#include "snippets/target_machine.hpp"
#include "snippets/lowered/port_connector.hpp"
#include "snippets/lowered/expression_port.hpp"


namespace ov {
namespace snippets {
namespace lowered {

class LinearIR;

class Expression : public std::enable_shared_from_this<Expression> {
friend class LinearIR;
friend class ExpressionPort;

public:
static size_t LOOP_NULL_ID;

Expression() = default;
virtual ~Expression() = default;

std::shared_ptr<Node> get_node() const;
std::shared_ptr<Emitter> get_emitter() const;

RegInfo get_reg_info() const;
void set_reg_info(RegInfo rinfo);

const PortConnectorPtr& get_input_port_connector(size_t i) const;
const PortConnectorPtr& get_output_port_connector(size_t i) const;
std::vector<PortConnectorPtr> get_input_port_connectors() const { return m_input_port_connectors; }
std::vector<PortConnectorPtr> get_output_port_connectors() const { return m_output_port_connectors; }

const PortDescriptorPtr& get_input_port_descriptor(size_t i) const;
const PortDescriptorPtr& get_output_port_descriptor(size_t i) const;
std::vector<PortDescriptorPtr> get_input_port_descriptors() const { return m_input_port_descriptors; }
std::vector<PortDescriptorPtr> get_output_port_descriptors() const { return m_output_port_descriptors; }

size_t get_input_count() const { return m_input_port_connectors.size(); }
size_t get_output_count() const { return m_output_port_connectors.size(); }

std::vector<size_t> get_loop_ids() const { return m_loop_ids; }
void set_loop_ids(const std::vector<size_t>& loops) { m_loop_ids = loops; }
void set_loop_id(size_t id, size_t idx);
void remove_loop_id(size_t id);

void validate() const;
void init_emitter(const std::shared_ptr<const TargetMachine>& target);

ExpressionPort get_input_port(size_t i);
ExpressionPort get_output_port(size_t i);

protected:
// Note: The constructor is protected since an expression can be created only by Linear IR.
// It must be used only by the Linear IR builder of expressions!
explicit Expression(const std::shared_ptr<Node>& n);

void replace_input(size_t port, PortConnectorPtr to);

std::shared_ptr<Node> m_source_node{nullptr};
std::shared_ptr<Emitter> m_emitter{nullptr};
std::vector<PortConnectorPtr> m_input_port_connectors{};
std::vector<PortConnectorPtr> m_output_port_connectors{};
std::vector<PortDescriptorPtr> m_input_port_descriptors{};
std::vector<PortDescriptorPtr> m_output_port_descriptors{};
// The order of Loop identifiers: Outer ---> Inner
std::vector<size_t> m_loop_ids;
};
using ExpressionPtr = std::shared_ptr<Expression>;

class IOExpression : public Expression {
friend class LinearIR;

public:
enum class io_type {INPUT, OUTPUT, UNDEFINED};

int64_t get_index() const { return m_index; }
io_type get_type() const { return m_type; }

private:
explicit IOExpression(const std::shared_ptr<ov::opset1::Parameter>& n, int64_t index);
explicit IOExpression(const std::shared_ptr<ov::opset1::Result>& n, int64_t index);

int64_t m_index = -1;
io_type m_type = io_type::UNDEFINED;
};

} // namespace lowered
} // namespace snippets
} // namespace ov