[Snippets] Moved infrastructure to Linear Intermediate Representation (
a-sidorova authored May 19, 2023
1 parent 41de4ba commit 9fafcab
Showing 247 changed files with 7,712 additions and 5,923 deletions.
16 changes: 8 additions & 8 deletions src/common/snippets/docs/snippets_design_guide.md
@@ -5,7 +5,7 @@ This document describes the design and rationale for a snippets code generator.

Core **CNN operators (convolution, gemm, fully connected) are limited by compute, the rest is memory bound**. Math approximations (like transcendental functions) are rare in emerging workloads and could be treated with the same machinery. **Snippets are designed to optimize topology for memory**, while leaving compute intensive kernels for backend developers.

- The **potential speedup is proportional to shrink in memory-walked bytes**. Therefore, you can transform the problem to a task to optimize for memory walks, whatever pattern snippet has and operations it contains. The number of memory walks should be less or equal to handcrafted optimizations. This guarantees performance improvements over the previous approach (excluding corner cases caused by cache effects). *Shrinkage factor might be encoded to some cost function in future evolution of code generator*. Snippets generator provides diagnostics to estimate this shrinkage factor with `ngraph::snippets::op::Subgraph::print_statistics(bool verbose)` member.
+ The **potential speedup is proportional to shrink in memory-walked bytes**. Therefore, you can transform the problem to a task to optimize for memory walks, whatever pattern snippet has and operations it contains. The number of memory walks should be less or equal to handcrafted optimizations. This guarantees performance improvements over the previous approach (excluding corner cases caused by cache effects). *Shrinkage factor might be encoded to some cost function in future evolution of code generator*. Snippets generator provides diagnostics to estimate this shrinkage factor with `ov::snippets::op::Subgraph::print_statistics(bool verbose)` member.
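As a rough intuition for the shrinkage factor the guide mentions, the sketch below is a toy byte-counting model (an assumption of this note, not the actual `print_statistics` logic): for a chain of element-wise operations, every unfused op reads and writes the full tensor through memory, while a fused snippet touches memory only at the chain's boundary.

```cpp
#include <cstddef>

// Hypothetical model: bytes walked through memory by a chain of `ops`
// element-wise operations over a tensor of `tensor_bytes` bytes.
size_t walked_unfused(size_t tensor_bytes, size_t ops) {
    // each op reads its input and writes its output through memory
    return 2 * tensor_bytes * ops;
}

size_t walked_fused(size_t tensor_bytes) {
    // a fused snippet reads the input once and writes the output once;
    // intermediate values stay in registers
    return 2 * tensor_bytes;
}

double shrinkage_factor(size_t tensor_bytes, size_t ops) {
    return static_cast<double>(walked_unfused(tensor_bytes, ops)) /
           static_cast<double>(walked_fused(tensor_bytes));
}
```

Under this simplified model the shrinkage factor equals the chain length, which is why speedup grows with the number of fused memory-bound ops.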

The SnippetS generator is designed for back-end developers. The main purpose of inventing the snippets code generator is an **operator fusion**, **register allocation** and **target kernel generation** decomposition. This allows modifications (like new fusion support) and feature extensions (like new operation support) to be done in a single point of modification and avoid combinatorial explosion for fusions/types/architectures etc.

@@ -28,7 +28,7 @@ Code generation is split into 2 phases, **tokenization** and **lowering**.

### Tokenization

- Tokenization runs on full topology nGraph function inside a specific plugin in a stage of common transformations. Input of tokenization is a topology graph. Output is a modified topology graph with `ngraph::snippets::op::Subgraph` operations installed. Each subgraph contains nGraph function (called **body**) which holds a part of original topology legal for snippet generation (can be scheduled with a single schedule).
+ Tokenization runs on full topology nGraph function inside a specific plugin in a stage of common transformations. Input of tokenization is a topology graph. Output is a modified topology graph with `ov::snippets::op::Subgraph` operations installed. Each subgraph contains nGraph function (called **body**) which holds a part of original topology legal for snippet generation (can be scheduled with a single schedule).

A procedure of finding subgraphs suitable for code generation is called **tokenization**. During tokenization the topology tree is split into subgraphs with the same greedy approach that is used for parsing an input stream of characters into tokens. It may also be seen as, and modified into, a basic block construction problem, since there is a leader and potentially terminators. See the example of implementation [here](https://github.com/openvinotoolkit/openvino/blob/master/src/common/snippets/src/pass/collapse_subgraph.cpp).
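The lexer analogy above can be sketched on a toy, flattened model (the real `CollapseSubgraph` pass works on a graph, not a flat sequence; `tokenize` and its inputs are illustrative only): scan topologically ordered ops and greedily collapse maximal runs of snippet-legal ops into one "subgraph" token, closing the token at the first terminator.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy greedy tokenization: each returned vector is one collapsed subgraph.
std::vector<std::vector<std::string>> tokenize(
        const std::vector<std::string>& ops,
        const std::set<std::string>& supported) {
    std::vector<std::vector<std::string>> subgraphs;
    std::vector<std::string> current;
    for (const auto& op : ops) {
        if (supported.count(op)) {
            current.push_back(op);          // extend the current token
        } else if (!current.empty()) {
            subgraphs.push_back(current);   // terminator: close the token
            current.clear();
        }
    }
    if (!current.empty())
        subgraphs.push_back(current);
    return subgraphs;
}
```

For a sequence `Conv, Add, Relu, Conv, Multiply` with only the element-wise ops supported, this yields two subgraphs: `{Add, Relu}` and `{Multiply}`, each of which could be scheduled with a single schedule.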

@@ -94,7 +94,7 @@ The goal of this step is to apply target-independent and schedule-related optimi

All input and output shapes are normalized to 6D for future schedule generation. If shape propagation fails or leads to inconsistent output shapes an exception is raised.

- The layout assigned by a user code and passed to a `generate` function is propagated through a subgraph on this step as well. The layout is passed to a `generate` function as a `BlockedShapeVector` which is a `std::vector<BlockedShape>` , while `BlockedShape` is `std::tuple<ngraph::Shape, ngraph::AxisVector, ngraph::element::Type>`. For example, if backend supports `NCHW16c` layout and a tensor has a size of `<1, 42, 17, 31>` and holds single precision floating point, this structure should be `std::make_tuple(ngraph::Shape {1, 3, 17, 31, 16}, ngraph::AxisVector {0, 1, 2, 3, 1}, ngraph::element::f32);`. This allows generic layout representation.
+ The layout assigned by a user code and passed to a `generate` function is propagated through a subgraph on this step as well. The layout is passed to a `generate` function as a `BlockedShapeVector` which is a `std::vector<BlockedShape>` , while `BlockedShape` is `std::tuple<ov::Shape, ov::AxisVector, ov::element::Type>`. For example, if backend supports `NCHW16c` layout and a tensor has a size of `<1, 42, 17, 31>` and holds single precision floating point, this structure should be `std::make_tuple(ov::Shape {1, 3, 17, 31, 16}, ov::AxisVector {0, 1, 2, 3, 1}, ov::element::f32);`. This allows generic layout representation.
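The `NCHW16c` example above can be reproduced with a small self-contained mock (the real type is the `std::tuple<ov::Shape, ov::AxisVector, ov::element::Type>` described in the text; `MockBlockedShape` and `to_nchw16c` are names invented for this sketch): the channel dimension is split into `ceil(C / 16)` blocks of 16, and the inner `16c` dimension maps back to original axis 1.

```cpp
#include <cstddef>
#include <vector>

// Mock of the BlockedShape idea: blocked dims plus the original axis
// each blocked dim came from.
struct MockBlockedShape {
    std::vector<size_t> shape;  // blocked dimensions
    std::vector<size_t> axes;   // source axis of each blocked dimension
};

MockBlockedShape to_nchw16c(const std::vector<size_t>& nchw, size_t block = 16) {
    const size_t c_blocks = (nchw[1] + block - 1) / block;  // ceil(C / block)
    return MockBlockedShape{
        {nchw[0], c_blocks, nchw[2], nchw[3], block},
        {0, 1, 2, 3, 1}  // the inner 16c dimension also comes from axis 1 (C)
    };
}
```

For `<1, 42, 17, 31>` this gives `{1, 3, 17, 31, 16}` with axis vector `{0, 1, 2, 3, 1}`, matching the document's example (42 channels round up to 3 blocks of 16).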

##### Dialect conversion

@@ -191,17 +191,17 @@ Broadcast and regular streaming vector load is possible from the same pointer. B

#### Target-specific optimizations

- Target developers can plug in to the code generation pipeline some specific optimizations with passing `ngraph::pass::Manager` into `generate` function of `subgraph`. **Passes are executed on subgraph in canonical form converted to a snippet dialect**.
+ Target developers can plug in to the code generation pipeline some specific optimizations with passing `ov::pass::Manager` into `generate` function of `subgraph`. **Passes are executed on subgraph in canonical form converted to a snippet dialect**.

*It might be also extended to provide an interface for target independent optimizations in future*

#### Register allocation

- Canonicalized subgraph in a snippets dialect forms a basic block or region inside a snippet (kernel). Registers are allocated globally for the whole subgraph. Since all operations for a subgraph are assumed to be vector, only vector registers are allocated for the first generation of SnippetS. Linear scan register allocation algorithm is used. Register allocator is implemented as the `ngraph::snippets::pass::AssignRegisters` function pass and store allocated registers for each node into `rt_info`. `rt_info` for a node holds a register for Node's output. *However, this part should be refactored better, either to become target independent or to use target-specific abstraction to acquire a new register*
+ Canonicalized subgraph in a snippets dialect forms a basic block or region inside a snippet (kernel). Registers are allocated globally for the whole subgraph. Since all operations for a subgraph are assumed to be vector, only vector registers are allocated for the first generation of SnippetS. Linear scan register allocation algorithm is used. Register allocator is implemented as the `ov::snippets::pass::AssignRegisters` function pass and store allocated registers for each node into `rt_info`. `rt_info` for a node holds a register for Node's output. *However, this part should be refactored better, either to become target independent or to use target-specific abstraction to acquire a new register*
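The linear scan algorithm named above can be sketched in a few lines (a toy version, not the `AssignRegisters` pass itself: it works on precomputed live intervals and has no spilling path): walk intervals in start order, retire intervals that have ended, and hand their registers to newcomers.

```cpp
#include <cstddef>
#include <vector>

struct Interval { size_t start, end, reg; };

// Minimal linear scan: assigns a register index to each live interval,
// reusing registers whose interval has already ended.
// Precondition: `intervals` is sorted by start point; enough registers exist.
void linear_scan(std::vector<Interval>& intervals, size_t num_regs) {
    std::vector<size_t> free_regs;
    for (size_t r = num_regs; r-- > 0;) free_regs.push_back(r);  // back() == reg 0
    std::vector<Interval*> active;
    for (auto& iv : intervals) {
        // expire intervals that ended before this one starts
        for (auto it = active.begin(); it != active.end();) {
            if ((*it)->end < iv.start) {
                free_regs.push_back((*it)->reg);
                it = active.erase(it);
            } else {
                ++it;
            }
        }
        iv.reg = free_regs.back();  // no spilling in this sketch
        free_regs.pop_back();
        active.push_back(&iv);
    }
}
```

With intervals `[0,2]`, `[1,3]`, `[3,4]` the third interval reuses register 0 freed by the first one, which is the whole point of scanning linearly instead of coloring a full interference graph.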

#### Schedule generation

- The goal of this step is to transform subgraphs in a scalar notation into kernel functions callable from user code. The `Kernel` and `Tile` operations are introduced for this purpose. Each of these operations has a constructor from code region described as a collection of operation and operand pairs `Kernel(const std::vector<std::pair<std::shared_ptr<ngraph::snippets::Emitter>, ngraph::snippets::RegInfo>>& region);`.
+ The goal of this step is to transform subgraphs in a scalar notation into kernel functions callable from user code. The `Kernel` and `Tile` operations are introduced for this purpose. Each of these operations has a constructor from code region described as a collection of operation and operand pairs `Kernel(const std::vector<std::pair<std::shared_ptr<ov::snippets::Emitter>, ov::snippets::RegInfo>>& region);`.

The example above can be used for the following hierarchical IR. If the scope is limited to layout-oblivious operations with broadcasting support, `Tile` could be generated as a single loop over the most varying dimension. The second `Tile` is generated to handle tails and can be omitted if not needed. A special pass replaces vector memory operations with scalar versions for the tail subgraph.

@@ -253,7 +253,7 @@ Where
A target code emission is table based. A target is responsible for filling `jitters` table field in `Generator` class.

```
-std::map<const ngraph::DiscreteTypeInfo, std::function<std::shared_ptr<Emitter>(std::shared_ptr<ngraph::Node>)>> jitters;
+std::map<const ov::DiscreteTypeInfo, std::function<std::shared_ptr<Emitter>(std::shared_ptr<ov::Node>)>> jitters;
```
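The table-based lookup can be mocked in a self-contained form (the real keys are `ov::DiscreteTypeInfo` and the factories produce `Emitter` instances; strings and `MockEmitter` here are stand-ins for illustration):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

struct MockEmitter { std::string op; };

using EmitterFactory = std::function<std::shared_ptr<MockEmitter>()>;

// Table filled by the "target": op type -> emitter factory.
std::map<std::string, EmitterFactory> jitters = {
    {"Add",  [] { return std::make_shared<MockEmitter>(MockEmitter{"Add"}); }},
    {"Relu", [] { return std::make_shared<MockEmitter>(MockEmitter{"Relu"}); }},
};

std::shared_ptr<MockEmitter> create_emitter(const std::string& type) {
    auto it = jitters.find(type);
    if (it == jitters.end())  // mirrors the "emitter is not available" throw
        throw std::runtime_error("Target code emitter is not available for " + type);
    return it->second();
}
```

Lookup of an unsupported type fails loudly, which is the same contract the generator relies on when walking the schedule.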

##### Interface with a target
@@ -279,7 +279,7 @@ Once a schedule is generated, a target code is emitted from a kernel in `Generat

A target can potentially extend the snippets dialect with a target-specific operation for code emission. It should implement:

-* nGraph operation (for example, `class FMA : public ngraph::op::Op`)
+* nGraph operation (for example, `class FMA : public ov::op::Op`)
* Emitter for the operation (for example, `class FmaEmitter : public Emitter` )
* register the pair in `jitters` map
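The three extension steps above can be sketched together in one self-contained mock (the real classes derive from `ov::op::Op` and the snippets `Emitter`; every name below except `FMA` and `FmaEmitter` is invented for this sketch):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Stand-ins for ov::op::Op and snippets::Emitter.
struct MockOp { virtual std::string type() const { return "Op"; } virtual ~MockOp() = default; };
struct MockEmitterBase { virtual void emit_code() const {} virtual ~MockEmitterBase() = default; };

// 1. the target-specific operation
struct FMA : MockOp { std::string type() const override { return "FMA"; } };

// 2. the emitter for that operation
struct FmaEmitter : MockEmitterBase {
    explicit FmaEmitter(const std::shared_ptr<MockOp>&) {}
};

using Factory = std::function<std::shared_ptr<MockEmitterBase>(const std::shared_ptr<MockOp>&)>;
std::map<std::string, Factory> mock_jitters;

// 3. register the (operation, emitter) pair in the jitters map
void register_fma() {
    mock_jitters["FMA"] = [](const std::shared_ptr<MockOp>& n) {
        return std::make_shared<FmaEmitter>(n);
    };
}
```

Once registered, code emission for an `FMA` node is just a table lookup plus a factory call, so no other part of the generator needs to know the new operation exists.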

19 changes: 9 additions & 10 deletions src/common/snippets/include/snippets/emitter.hpp
@@ -6,9 +6,10 @@

#include <vector>
#include <cstdint>
-#include "ngraph/node.hpp"
+#include "openvino/core/node.hpp"

-namespace ngraph {
+namespace ov {
namespace snippets {

using code = const uint8_t *;
@@ -24,11 +25,9 @@ class Emitter {
/**
* @brief Default constructor
*/
-    Emitter(const std::shared_ptr<ngraph::Node>& n) {
-    }
+    Emitter(const std::shared_ptr<ov::Node>& n) {}

-    Emitter(std::vector<std::pair<std::shared_ptr<Emitter>, RegInfo>>& region) {
-    }
+    Emitter(std::vector<std::pair<std::shared_ptr<Emitter>, RegInfo>>& region) {}

/**
* @brief called by generator to generate code to produce target code for a specific operation
@@ -47,12 +46,12 @@
* @brief called by generator to generate data section, if needed for a specific operation
* @return void
*/
-    virtual void emit_data() const {
-    }
+    virtual void emit_data() const {}

virtual ~Emitter() = default;
};

-using AllocatedEmitter = std::pair<std::shared_ptr<Emitter>, ngraph::snippets::RegInfo>;
+using AllocatedEmitter = std::pair<std::shared_ptr<Emitter>, ov::snippets::RegInfo>;

} // namespace snippets
-} // namespace ngraph
+} // namespace ov
94 changes: 12 additions & 82 deletions src/common/snippets/include/snippets/generator.hpp
@@ -9,73 +9,12 @@
#pragma once

#include "snippets_isa.hpp"
-#include "emitter.hpp"
+#include "snippets/lowered/linear_ir.hpp"
+#include "snippets/lowered/pass/pass.hpp"

-namespace ngraph {
-namespace snippets {
+namespace ov {
+namespace snippets {

-auto getRegisters(std::shared_ptr<ngraph::Node>& n) -> ngraph::snippets::RegInfo;
-
-typedef std::pair<std::function<std::shared_ptr<Emitter>(const std::shared_ptr<ngraph::Node>&)>,
-        std::function<std::set<std::vector<element::Type>>(const std::shared_ptr<ngraph::Node>&)>> jitters_value;
-/**
- * @interface TargetMachine
- * @brief Base class Target machine representation. Target derives from this class to provide generator information about supported emitters
- * @ingroup snippets
- */
-class TargetMachine {
-public:
-    /**
-     * @brief checks if target is natively supported
-     * @return true, if supported
-     */
-    virtual bool is_supported() const = 0;
-
-    /**
-     * @brief finalizes code generation
-     * @return generated kernel binary
-     */
-    virtual code get_snippet() const = 0;
-
-    /**
-     * @brief gets number of lanes supported by target's vector ISA
-     * @return number of lanes
-     */
-    virtual size_t get_lanes() const = 0;
-
-    /**
-     * @brief called by generator to all the emitter for a target machine
-     * @return a map by node's type info with callbacks to create an instance of emitter for corresponding operation type
-     */
-    std::function<std::shared_ptr<Emitter>(std::shared_ptr<ngraph::Node>)> get(const ngraph::DiscreteTypeInfo type) const {
-        auto jitter = jitters.find(type);
-        if (jitter == jitters.end()) {
-            OPENVINO_THROW(std::string("Target code emitter is not available for ") + type.name + " operation.");
-        }
-        return jitter->second.first;
-    }
-
-    std::function<std::set<std::vector<element::Type>>(const std::shared_ptr<ngraph::Node>&)>
-    get_supported_precisions(const ngraph::DiscreteTypeInfo type) const {
-        auto jitter = jitters.find(type);
-        if (jitter == jitters.end()) {
-            OPENVINO_THROW(std::string("Target code emitter is not available for ") + type.name + " operation.");
-        }
-        return jitter->second.second;
-    }
-
-    /**
-     * @brief checks if emitter for a specific operation is supported
-     * @return true, if supported
-     */
-    bool has(const ngraph::DiscreteTypeInfo type) const {
-        return jitters.find(type) != jitters.end();
-    }
-    virtual ~TargetMachine() = default;
-
-protected:
-    std::map<const ngraph::DiscreteTypeInfo, jitters_value> jitters;
-};

/**
* @interface Schedule
@@ -117,7 +56,7 @@ class Generator {
/**
* @brief Default constructor
*/
-    Generator(const std::shared_ptr<TargetMachine>& t) : target(t) {}
+    Generator(const std::shared_ptr<TargetMachine>& t) : target(t), lowered_saved{} {}
/**
* @brief Default destructor
*/
@@ -126,27 +65,18 @@
* @interface GeneratorConfig
* @brief Allows to tweak the lowering process.
*/
-    class GeneratorConfig {
-    public:
-        // True if the lowered Emitters need to be accessed during runtime. Normally they're destroyed after code emission.
-        bool m_save_lowered_code = false;
-        // True if we can optimize tails for single evaluation during code generation
-        // More details with optimization examples you can see in generate() method
-        // For example, tails with Buffer ops doesn't support single evaluation optimizations
-        // because of that we should always reset memory pointer using finalization offsets
-        // after data storing to Buffer
-        bool m_optimize_single_evaluation = true;
-        // True if we should check runtime info for nodes to call specific needed transformations
-        bool m_need_fill_tail_register = false;
-    };
    /**
     * @brief virtual method any specific implementation should implement
     * @param m model in canonical for for table-based code generation
     * @param config config with transformation and optimization parameters
     * @param compile_params parameters for generated code
     * @return pointer to generated code
     */
-    code generate(std::shared_ptr<ov::Model>& m, const GeneratorConfig& config, const void* compile_params = nullptr);
+    struct LoweringResult {
+        LoweringResult(code c) : binary_code(c) {}
+        code binary_code = nullptr;
+    };
+    LoweringResult generate(lowered::LinearIR& linear_ir, const lowered::Config& config, const void* compile_params = nullptr);

/**
* @brief gets target machine
@@ -180,8 +110,8 @@ class Generator {
std::shared_ptr<TargetMachine> target;
// todo: we need to save lowered code to access compiled brgemm kernels on execution time (normally lowered is destructed by then).
// This is temporary solution, remove this when kernel caching is implemented. Don't forget to make generate const method.
-    std::vector<AllocatedEmitter> lowered_saved;
+    lowered::LinearIR lowered_saved;
};

} // namespace snippets
-} // namespace ngraph
+} // namespace ov
6 changes: 3 additions & 3 deletions src/common/snippets/include/snippets/itt.hpp
@@ -11,15 +11,15 @@

#include <openvino/cc/ngraph/itt.hpp>

-namespace ngraph {
+namespace ov {
namespace pass {
namespace itt {
namespace domains {
-OV_ITT_DOMAIN(SnippetsTransform);
+    OV_ITT_DOMAIN(SnippetsTransform);
} // namespace domains
} // namespace itt
} // namespace pass
-} // namespace ngraph
+} // namespace ov

OV_CC_DOMAINS(internal_op);

99 changes: 99 additions & 0 deletions src/common/snippets/include/snippets/lowered/expression.hpp
@@ -0,0 +1,99 @@
// Copyright (C) 2023 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#pragma once

#include <openvino/core/node.hpp>
#include <openvino/opsets/opset1.hpp>

#include "snippets/emitter.hpp"
#include "snippets/target_machine.hpp"
#include "snippets/lowered/port_connector.hpp"
#include "snippets/lowered/expression_port.hpp"


namespace ov {
namespace snippets {
namespace lowered {

class LinearIR;

class Expression : public std::enable_shared_from_this<Expression> {
friend class LinearIR;
friend class ExpressionPort;

public:
static size_t LOOP_NULL_ID;

Expression() = default;
virtual ~Expression() = default;

std::shared_ptr<Node> get_node() const;
std::shared_ptr<Emitter> get_emitter() const;

RegInfo get_reg_info() const;
void set_reg_info(RegInfo rinfo);

const PortConnectorPtr& get_input_port_connector(size_t i) const;
const PortConnectorPtr& get_output_port_connector(size_t i) const;
std::vector<PortConnectorPtr> get_input_port_connectors() const { return m_input_port_connectors; }
std::vector<PortConnectorPtr> get_output_port_connectors() const { return m_output_port_connectors; }

const PortDescriptorPtr& get_input_port_descriptor(size_t i) const;
const PortDescriptorPtr& get_output_port_descriptor(size_t i) const;
std::vector<PortDescriptorPtr> get_input_port_descriptors() const { return m_input_port_descriptors; }
std::vector<PortDescriptorPtr> get_output_port_descriptors() const { return m_output_port_descriptors; }

size_t get_input_count() const { return m_input_port_connectors.size(); }
size_t get_output_count() const { return m_output_port_connectors.size(); }

std::vector<size_t> get_loop_ids() const { return m_loop_ids; }
void set_loop_ids(const std::vector<size_t>& loops) { m_loop_ids = loops; }
void set_loop_id(size_t id, size_t idx);
void remove_loop_id(size_t id);

void validate() const;
void init_emitter(const std::shared_ptr<const TargetMachine>& target);

ExpressionPort get_input_port(size_t i);
ExpressionPort get_output_port(size_t i);

protected:
// Note: The constructor initialization is private since an expression can be created only by Linear IR.
// The method must be used only by Linear IR builder of expressions!
explicit Expression(const std::shared_ptr<Node>& n);

void replace_input(size_t port, PortConnectorPtr to);

std::shared_ptr<Node> m_source_node{nullptr};
std::shared_ptr<Emitter> m_emitter{nullptr};
std::vector<PortConnectorPtr> m_input_port_connectors{};
std::vector<PortConnectorPtr> m_output_port_connectors{};
std::vector<PortDescriptorPtr> m_input_port_descriptors{};
std::vector<PortDescriptorPtr> m_output_port_descriptors{};
// The order Loops identifies: Outer ---> Inner
std::vector<size_t> m_loop_ids;
};
using ExpressionPtr = std::shared_ptr<Expression>;

class IOExpression : public Expression {
friend class LinearIR;

public:
enum class io_type {INPUT, OUTPUT, UNDEFINED};

int64_t get_index() const { return m_index; }
io_type get_type() const { return m_type; }

private:
explicit IOExpression(const std::shared_ptr<ov::opset1::Parameter>& n, int64_t index);
explicit IOExpression(const std::shared_ptr<ov::opset1::Result>& n, int64_t index);

int64_t m_index = -1;
io_type m_type = io_type::UNDEFINED;
};

} // namespace lowered
} // namespace snippets
} // namespace ov
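The header above is why the new IR is called *linear*: expressions live in an ordered sequence rather than an ngraph-style DAG, so lowered passes can iterate in execution order and splice new expressions in O(1). The toy below only mirrors a small slice of the real API (`MiniExpression`, `MiniLinearIR`, and `insert_before` are invented for this sketch; the real `LinearIR` owns `Expression` objects with port connectors and loop ids):

```cpp
#include <list>
#include <memory>
#include <string>
#include <vector>

struct MiniExpression {
    std::string op;
    std::vector<size_t> loop_ids;  // Outer ---> Inner, as in m_loop_ids
};
using MiniExpressionPtr = std::shared_ptr<MiniExpression>;

// Ordered sequence of expressions: the "linear" in Linear IR.
using MiniLinearIR = std::list<MiniExpressionPtr>;

// Splice a new expression right before `pos`, as a lowered pass would.
MiniLinearIR::iterator insert_before(MiniLinearIR& ir, MiniLinearIR::iterator pos,
                                     const std::string& op) {
    return ir.insert(pos, std::make_shared<MiniExpression>(MiniExpression{op, {}}));
}
```

Inserting an `Add` between a `Load` and a `Store` neither invalidates other iterators nor requires rewiring a graph, which is what makes list-based passes cheap.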
