From 96595eb516b1ec9008b9d626b0b8d50bae76bc09 Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Mon, 21 Oct 2024 07:14:37 +0000
Subject: [PATCH 01/24] Bump actions/upload-artifact from 4.4.0 to 4.4.3 (#27151)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.4.0 to 4.4.3.
Release notes

*Sourced from [actions/upload-artifact's releases](https://github.com/actions/upload-artifact/releases).*

v4.4.3

What's Changed

Full Changelog: https://github.com/actions/upload-artifact/compare/v4.4.2...v4.4.3

v4.4.2

What's Changed

Full Changelog: https://github.com/actions/upload-artifact/compare/v4.4.1...v4.4.2

v4.4.1

What's Changed

New Contributors

Full Changelog: https://github.com/actions/upload-artifact/compare/v4.4.0...v4.4.1

Commits

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=actions/upload-artifact&package-manager=github_actions&previous-version=4.4.0&new-version=4.4.3)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
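For reference, the workflow step produced by the diff that follows reads roughly as below. The values are taken from `.github/workflows/linux_sanitizers.yml` in this patch, and the pinned commit SHA corresponds to the v4.4.3 release tag noted in the trailing comment:

```yaml
# Sketch of the updated upload step: the action is pinned to an immutable
# commit SHA, with the matching release tag kept as a trailing comment.
- name: Upload sccache log
  if: ${{ always() }}
  uses: actions/upload-artifact@b4b15b8c7c6ac21ea08fcf65892d2ee8f75cf882  # v4.4.3
  with:
    name: sccache_log_${{ matrix.SANITIZER }}
    path: ${{ env.SCCACHE_ERROR_LOG }}
```

Pinning to a full commit SHA rather than a mutable tag ensures the exact action code that was reviewed is what runs in CI.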
Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- .github/workflows/linux_sanitizers.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/linux_sanitizers.yml b/.github/workflows/linux_sanitizers.yml index f13f3765d4f353..e098b637150834 100644 --- a/.github/workflows/linux_sanitizers.yml +++ b/.github/workflows/linux_sanitizers.yml @@ -206,7 +206,7 @@ jobs: # - name: Upload sccache log if: ${{ always() }} - uses: actions/upload-artifact@50769540e7f4bd5e21e526ee35c689e35e0d6874 # v4.4.0 + uses: actions/upload-artifact@b4b15b8c7c6ac21ea08fcf65892d2ee8f75cf882 # v4.4.3 with: name: sccache_log_${{ matrix.SANITIZER }} path: ${{ env.SCCACHE_ERROR_LOG }} From 9ea3beae8bc6660fc948546195c8d68e315a5639 Mon Sep 17 00:00:00 2001 From: Maxim Vafin Date: Mon, 21 Oct 2024 09:23:23 +0200 Subject: [PATCH 02/24] [TESTS] Fix retry mechanism to raise if retry didn't help (#27139) ### Details: - *item1* - *...* ### Tickets: - *ticket-id* --- .../py_frontend_tests/test_torchvision_preprocessor.py | 4 +++- tests/layer_tests/pytorch_tests/pytorch_layer_test_class.py | 6 ++++-- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/tests/layer_tests/py_frontend_tests/test_torchvision_preprocessor.py b/tests/layer_tests/py_frontend_tests/test_torchvision_preprocessor.py index 94060bf982ad96..1ec25f6c07f500 100644 --- a/tests/layer_tests/py_frontend_tests/test_torchvision_preprocessor.py +++ b/tests/layer_tests/py_frontend_tests/test_torchvision_preprocessor.py @@ -32,15 +32,17 @@ def forward(self, data): def _infer_pipelines(test_input, preprocess_pipeline, input_channels=3): retries = 0 max_retries = 3 + last_e = None while retries < max_retries: try: return _infer_pipelines_impl(test_input, preprocess_pipeline, input_channels) except RuntimeError as e: # This is a potentially sporadic issue print(f"An error occurred: {e}. Retrying...") + last_e = e retries += 1 else: - print("Max retries reached. Function execution failed.") + raise RuntimeError("Max retries reached. Function execution failed.") from last_e def _infer_pipelines_impl(test_input, preprocess_pipeline, input_channels=3): diff --git a/tests/layer_tests/pytorch_tests/pytorch_layer_test_class.py b/tests/layer_tests/pytorch_tests/pytorch_layer_test_class.py index 4d1582f0061d59..a44ca8c0117a4b 100644 --- a/tests/layer_tests/pytorch_tests/pytorch_layer_test_class.py +++ b/tests/layer_tests/pytorch_tests/pytorch_layer_test_class.py @@ -73,19 +73,21 @@ def _test(self, model, ref_net, kind, ie_device, precision, ir_version, infer_ti **kwargs): retries = 0 max_retries = 3 + last_e = None while retries < max_retries: try: return self._test_impl(model, ref_net, kind, ie_device, precision, ir_version, infer_timeout, dynamic_shapes, **kwargs) except RuntimeError as e: # This is a potentially sporadic issue print(f"An error occurred: {e}. Retrying...") + last_e = e retries += 1 else: - print("Max retries reached. Function execution failed.") + raise RuntimeError("Max retries reached. Function execution failed.") from last_e def _test_impl(self, model, ref_net, kind, ie_device, precision, ir_version, infer_timeout=60, dynamic_shapes=True, - **kwargs): + **kwargs): """ :param enabled_transforms/disabled_transforms: string with idxs of transforms that should be enabled/disabled. 
Example: "transform_1,transform_2" From ebdf1fc088c02de34d3ed4fd9d411e877fa604e0 Mon Sep 17 00:00:00 2001 From: Xiping Yan Date: Mon, 21 Oct 2024 15:28:23 +0800 Subject: [PATCH 03/24] [CPU] Fuse SDPA before/after Reshape+Transpose Node to SDPA (#26819) ### Details: - *Pattern: QKV_Reshape -> QKV_Transpose -> SDPA->OUT_Transpse->OUT_Reshape* - *Fuse this pattern to: SDPA* - *This hotspot can be observed after https://github.com/openvinotoolkit/openvino/pull/26130, this PR's implementation doesn't depend on it.* ### Tickets: - *153616* --------- Signed-off-by: xipingya --- src/plugins/intel_cpu/src/cpu_types.cpp | 1 + src/plugins/intel_cpu/src/extension.cpp | 1 + .../intel_cpu/src/nodes/scaled_attn.cpp | 49 +++- .../cpu_opset/common/op/sdpa.cpp | 42 +++ .../cpu_opset/common/op/sdpa.hpp | 39 ++- .../x64/pass/sdpa_fuse_transpose_reshape.cpp | 188 ++++++++++++++ .../x64/pass/sdpa_fuse_transpose_reshape.hpp | 18 ++ .../transformation_pipeline.cpp | 2 + .../x64/fuse_reshape_transpose_to_sdpa.cpp | 245 ++++++++++++++++++ 9 files changed, 569 insertions(+), 16 deletions(-) create mode 100644 src/plugins/intel_cpu/src/transformations/cpu_opset/x64/pass/sdpa_fuse_transpose_reshape.cpp create mode 100644 src/plugins/intel_cpu/src/transformations/cpu_opset/x64/pass/sdpa_fuse_transpose_reshape.hpp create mode 100644 src/plugins/intel_cpu/tests/functional/custom/subgraph_tests/src/x64/fuse_reshape_transpose_to_sdpa.cpp diff --git a/src/plugins/intel_cpu/src/cpu_types.cpp b/src/plugins/intel_cpu/src/cpu_types.cpp index 8b4ffaefcabfd3..fad6613f36b6cb 100644 --- a/src/plugins/intel_cpu/src/cpu_types.cpp +++ b/src/plugins/intel_cpu/src/cpu_types.cpp @@ -245,6 +245,7 @@ static const TypeToNameMap& get_type_to_name_tbl() { {"Ngram", Type::Ngram}, {"ScaledDotProductAttention", Type::ScaledDotProductAttention}, {"ScaledDotProductAttentionWithKVCache", Type::ScaledDotProductAttention}, + {"SDPAWithTransposeReshape", Type::ScaledDotProductAttention}, {"PagedAttentionExtension", Type::PagedAttention}, {"RoPE", Type::RoPE}, {"GatherCompressed", Type::Gather}, diff --git a/src/plugins/intel_cpu/src/extension.cpp b/src/plugins/intel_cpu/src/extension.cpp index f2256d9d03df15..a29282d4af3101 100644 --- a/src/plugins/intel_cpu/src/extension.cpp +++ b/src/plugins/intel_cpu/src/extension.cpp @@ -75,6 +75,7 @@ class TypeRelaxedExtension : public ov::OpExtension> { OP_EXTENSION(ov::intel_cpu::PowerStaticNode) \ OP_EXTENSION(ov::intel_cpu::CausalMaskPreprocessNode) \ OP_EXTENSION(ov::intel_cpu::SwishNode) \ + OP_EXTENSION(ov::intel_cpu::SDPAWithTransposeReshape) \ OP_EXTENSION(ov::intel_cpu::NgramNode) \ OP_EXTENSION(ov::op::internal::GatherCompressed) \ OP_EXTENSION(ov::op::internal::NonMaxSuppressionIEInternal) \ diff --git a/src/plugins/intel_cpu/src/nodes/scaled_attn.cpp b/src/plugins/intel_cpu/src/nodes/scaled_attn.cpp index e70a3932b11b1e..e229ff4bb72c57 100644 --- a/src/plugins/intel_cpu/src/nodes/scaled_attn.cpp +++ b/src/plugins/intel_cpu/src/nodes/scaled_attn.cpp @@ -866,6 +866,7 @@ struct ScaledDotProductAttention::AttentionExecutor : public ScaledDotProductAtt void execute(dnnl::stream strm, const Config& config, const std::vector& inputs, const MemoryPtr output, const MemoryPtr presentk_input, const MemoryPtr presentv_input, const MemoryPtr beam_input, const PlainTensor& k_scale_zp, const PlainTensor& v_scale_zp) override { + bool has_in_reshape = config.config.input_BLHxS; bool has_out_transpose = config.config.output_BLHxS; bool fuse_causal_attn = config.config.fuse_causal_attn; bool is_causal = 
config.config.is_causal; @@ -881,11 +882,28 @@ struct ScaledDotProductAttention::AttentionExecutor : public ScaledDotProductAtt float scale_input = 0.0f; size_t B, L1, L0, S, SV; + // B,L,H*S->B,L,H,S + auto get_reshape_shape = [&config](const PlainTensor& input) { + // [B,L,H*S] + auto inp_shape = input.shape(); + // [B,L,H,S] + return VectorDims{inp_shape[0], inp_shape[1], config.config.order_HS[0], config.config.order_HS[1]}; + }; + q_input.reset(inputs[0]); k_input.reset(inputs[1]); v_input.reset(inputs[2]); present_key.reset(presentk_input); present_value.reset(presentv_input); + if (has_in_reshape) { + q_input = q_input.reshape(get_reshape_shape(q_input)); + auto kv_shape = get_reshape_shape(k_input); + k_input = k_input.reshape(kv_shape); + v_input = v_input.reshape(kv_shape); + present_key = present_key.reshape(kv_shape); + present_value = present_value.reshape(kv_shape); + } + if (beam_input) beam_table.reset(beam_input); if (input_num > 3) { @@ -985,11 +1003,11 @@ ScaledDotProductAttention::ScaledDotProductAttention(const std::shared_ptr(op); - if (node) { + if (const auto node = std::dynamic_pointer_cast(op)) { m_config.config.is_causal = node->get_causal(); - } else { - const auto node = std::dynamic_pointer_cast(op); + } else if (const auto node = std::dynamic_pointer_cast(op)) { + m_config.config = node->get_config(); + } else if (const auto node = std::dynamic_pointer_cast(op)) { m_config.config = node->get_config(); } } @@ -1142,17 +1160,28 @@ void ScaledDotProductAttention::execute(dnnl::stream strm) { bool ScaledDotProductAttention::isSupportedOperation(const std::shared_ptr& op, std::string& errorMessage) noexcept { try { + auto sdpaWithTransposeReshapeOp = std::dynamic_pointer_cast(op); if (!std::dynamic_pointer_cast(op) && - !std::dynamic_pointer_cast(op)) { - errorMessage = "Only ScaledDotProductAttention or ScaledDotProductAttentionWithKVCache operation are supported"; + !std::dynamic_pointer_cast(op) && !sdpaWithTransposeReshapeOp) { + errorMessage = "Only ScaledDotProductAttention, ScaledDotProductAttentionWithKVCache or " + "SDPAWithTransposeReshape operation are supported"; return false; } - // expect shape of q: [B, H, L, S] auto inRank = op->get_input_partial_shape(0).size(); - if (inRank != 4u) { - errorMessage = "Doesn't support 'data' input with rank: " + std::to_string(inRank); - return false; + if (sdpaWithTransposeReshapeOp) { + // inRank expect shape of q: [B, L, H*S] + if (inRank != 3u) { + errorMessage = "Doesn't support 'data' input with rank: " + std::to_string(inRank); + return false; + } + } else { + // inRank expect shape of q: [B, H, L, S] + if (inRank != 4u) { + errorMessage = "Doesn't support 'data' input with rank: " + std::to_string(inRank); + return false; + } } + int orgSDPAInput = static_cast(op->get_input_size()); const auto node = std::dynamic_pointer_cast(op); if (node) { diff --git a/src/plugins/intel_cpu/src/transformations/cpu_opset/common/op/sdpa.cpp b/src/plugins/intel_cpu/src/transformations/cpu_opset/common/op/sdpa.cpp index 4421499d10204d..bea56e2b8c833f 100644 --- a/src/plugins/intel_cpu/src/transformations/cpu_opset/common/op/sdpa.cpp +++ b/src/plugins/intel_cpu/src/transformations/cpu_opset/common/op/sdpa.cpp @@ -99,4 +99,46 @@ bool ov::intel_cpu::ScaledDotProductAttentionWithKVCache::visit_attributes(ov::A visitor.on_attribute("permute_axes", m_config.permute_axes); visitor.finish_structure(); return true; +} + +ov::intel_cpu::SDPAWithTransposeReshape::SDPAWithTransposeReshape(const OutputVector& args, const Config& cfg) + : 
Op(args), + m_config(cfg) {} + +std::shared_ptr ov::intel_cpu::SDPAWithTransposeReshape::clone_with_new_inputs( + const ov::OutputVector& new_args) const { + INTERNAL_OP_SCOPE(SDPAWithTransposeReshape_with_new_inputs); + check_new_args_count(this, new_args); + return std::make_shared(new_args, m_config); +} + +void ov::intel_cpu::SDPAWithTransposeReshape::validate_and_infer_types() { + INTERNAL_OP_SCOPE(SDPAWithTransposeReshape_validate_and_infer_types); + // [B,L,H*S] + auto q_ps = get_input_partial_shape(0); + auto output_ps = q_ps; + NODE_VALIDATION_CHECK(this, m_config.output_BLHxS == true); + NODE_VALIDATION_CHECK(this, m_config.input_BLHxS == true); + NODE_VALIDATION_CHECK(this, q_ps.size() == 3u); + + // permute_axes should be [B, H, L, S] + const auto& permute_axes = this->m_config.permute_axes; + NODE_VALIDATION_CHECK(this, permute_axes.size() == 4u); + + // order_HS should be [H,S] + const auto& order_HS = this->m_config.order_HS; + NODE_VALIDATION_CHECK(this, order_HS.size() == 2u); + + set_output_type(0, get_input_element_type(0), output_ps); +} + +bool ov::intel_cpu::SDPAWithTransposeReshape::visit_attributes(ov::AttributeVisitor& visitor) { + INTERNAL_OP_SCOPE(SDPAWithTransposeReshape_visit_attributes); + visitor.start_structure("config"); + visitor.on_attribute("input_BLHxS", m_config.input_BLHxS); + visitor.on_attribute("output_BLHxS", m_config.output_BLHxS); + visitor.on_attribute("permute_axes", m_config.permute_axes); + visitor.on_attribute("order_HS", m_config.order_HS); + visitor.finish_structure(); + return true; } \ No newline at end of file diff --git a/src/plugins/intel_cpu/src/transformations/cpu_opset/common/op/sdpa.hpp b/src/plugins/intel_cpu/src/transformations/cpu_opset/common/op/sdpa.hpp index 8fe1c9ce4ffa19..8c811f16262734 100644 --- a/src/plugins/intel_cpu/src/transformations/cpu_opset/common/op/sdpa.hpp +++ b/src/plugins/intel_cpu/src/transformations/cpu_opset/common/op/sdpa.hpp @@ -21,13 +21,15 @@ class ScaledDotProductAttentionWithKVCache : public ov::op::Op { ScaledDotProductAttentionWithKVCache() = default; struct Config { - bool output_BLHxS = false; // true implies that output is [B,L,H*S] + bool input_BLHxS = false; // true implies that input is [B,L,H*S] + bool output_BLHxS = false; // true implies that output is [B,L,H*S] - bool fuse_causal_attn = false; // fuse causal mask and attn mask into attn_mask - bool is_causal = false; // apply causal mask internally - bool fuse_concat = false; // fuse (concat->sdp) ==> sdp - std::vector permute_axes; // not empty means input has transpose. output of permutation is [B,H,L,S] - // e.g. [L,B,H,S] -> permute[1, 2, 0, 3] ->[B, H, L, S] + bool fuse_causal_attn = false; // fuse causal mask and attn mask into attn_mask + bool is_causal = false; // apply causal mask internally + bool fuse_concat = false; // fuse (concat->sdp) ==> sdp + std::vector permute_axes; // not empty means input has transpose. output of permutation is [B,H,L,S] + // e.g. [L,B,H,S] -> permute[1, 2, 0, 3] ->[B, H, L, S] + std::vector order_HS; // Reshape[B,L,H*S]->B,L,H,S], H,S are fixed value, when input_BLHxS is true. 
}; ScaledDotProductAttentionWithKVCache(const OutputVector& args, const Config& cfg); @@ -48,5 +50,30 @@ class ScaledDotProductAttentionWithKVCache : public ov::op::Op { Config m_config; }; +class SDPAWithTransposeReshape : public ov::op::Op { +public: + OPENVINO_OP("SDPAWithTransposeReshape", "cpu_plugin_opset"); + using Config = ScaledDotProductAttentionWithKVCache::Config; + + SDPAWithTransposeReshape() = default; + + SDPAWithTransposeReshape(const OutputVector& args, const Config& cfg); + + std::shared_ptr clone_with_new_inputs(const OutputVector& new_args) const override; + bool visit_attributes(AttributeVisitor& visitor) override; + void validate_and_infer_types() override; + + const Config& get_config() const { + return m_config; + } + + Config& get_config() { + return m_config; + } + +private: + Config m_config; +}; + } // namespace intel_cpu } // namespace ov \ No newline at end of file diff --git a/src/plugins/intel_cpu/src/transformations/cpu_opset/x64/pass/sdpa_fuse_transpose_reshape.cpp b/src/plugins/intel_cpu/src/transformations/cpu_opset/x64/pass/sdpa_fuse_transpose_reshape.cpp new file mode 100644 index 00000000000000..3aa0fd0d08e69b --- /dev/null +++ b/src/plugins/intel_cpu/src/transformations/cpu_opset/x64/pass/sdpa_fuse_transpose_reshape.cpp @@ -0,0 +1,188 @@ +// Copyright (C) 2018-2024 Intel Corporation +// SPDX-License-Identifier: Apache-2.0 +// + +#include "sdpa_fuse_transpose_reshape.hpp" + +#include + +#include "itt.hpp" +#include "openvino/core/rt_info.hpp" +#include "openvino/op/reshape.hpp" +#include "openvino/op/scaled_dot_product_attention.hpp" +#include "openvino/op/transpose.hpp" +#include "openvino/pass/pattern/op/wrap_type.hpp" +#include "transformations/cpu_opset/common/op/sdpa.hpp" + +/* + * Description: SDPA fuse transpose and reshape. 
+ * Original pattern Fused pattern + * + * input1 input2 input3 + * | | | + * q_reshape k_reshape v_reshap + * | | | (qkv transpose and reshape's orders) + * q_transpose k_transpose v_transpose | + * \ | / input1 input2 input3 | + * \ | / \ | / / + * ScaledDotProductAttention ---------> SDPAWithTransposeReshape + * | | + * out_transpose | + * | output + * out_reshpae + * | + * output + */ + +using namespace ov; +using namespace ov::pass::pattern; + +intel_cpu::SDPAFuseTransposeReshape::SDPAFuseTransposeReshape() { + MATCHER_SCOPE(SDPAFuseTransposeReshape); + + auto q_reshape_node = wrap_type({any_input(), any_input()}); + auto k_reshape_node = wrap_type({any_input(), any_input()}); + auto v_reshape_node = wrap_type({any_input(), any_input()}); + + auto q_transpose_order_node = wrap_type(); + auto k_transpose_order_node = wrap_type(); + auto v_transpose_order_node = wrap_type(); + auto q_transpose_node = wrap_type({q_reshape_node, q_transpose_order_node}); + auto k_transpose_node = wrap_type({k_reshape_node, k_transpose_order_node}); + auto v_transpose_node = wrap_type({v_reshape_node, v_transpose_order_node}); + + auto sdpa_node = + wrap_type({q_transpose_node, k_transpose_node, v_transpose_node}); + + auto out_transpose_order_node = wrap_type(); + auto out_transpose_node = wrap_type({sdpa_node, out_transpose_order_node}); + auto out_reshape_node = wrap_type({out_transpose_node, wrap_type()}); + + matcher_pass_callback callback = [OV_CAPTURE_CPY_AND_THIS](pass::pattern::Matcher& m) { + auto& pattern_map = m.get_pattern_value_map(); + auto sdpa = as_type_ptr(pattern_map.at(sdpa_node).get_node_shared_ptr()); + if (sdpa == nullptr || transformation_callback(sdpa)) { + return false; + } + + // Order=[0, 2, 1, 3] + auto is_expected_transpose = [&](std::shared_ptr& transpose) { + if (transpose) { + const auto orders = as_type_ptr(transpose->get_input_node_shared_ptr(1)); + return orders && (std::vector({0, 2, 1, 3}) == orders->cast_vector()); + } + return false; + }; + + // Reshape [B,L,H*S] -> [B,L,H,S] + auto is_expected_reshape = [&](std::shared_ptr& reshape_node, bool reverse = false) { + if (reshape_node) { + auto inp_shape = reshape_node->get_input_partial_shape(0); + auto outp_shape = reshape_node->get_output_partial_shape(0); + // Expect shape: [?, ?, val] + auto check_dim_3 = [](ov::PartialShape shape) { + return shape.rank().is_static() && shape.rank() == 3 && shape[2].is_static(); + }; + // Expect shape: [?, ?, val, val] + auto check_dim_4 = [](ov::PartialShape shape) { + return shape.rank().is_static() && shape.rank() == 4 && shape[2].is_static() && + shape[3].is_static(); + }; + + if (reverse) { + return check_dim_4(inp_shape) && check_dim_3(outp_shape) && + (outp_shape[2] == inp_shape[2] * inp_shape[3]); + } else { + return check_dim_3(inp_shape) && check_dim_4(outp_shape) && + (inp_shape[2] == outp_shape[2] * outp_shape[3]); + } + } + return false; + }; + + // Pattern: Reshape->Transpose->SDPA + auto q_reshape = as_type_ptr(pattern_map.at(q_reshape_node).get_node_shared_ptr()); + auto k_reshape = as_type_ptr(pattern_map.at(k_reshape_node).get_node_shared_ptr()); + auto v_reshape = as_type_ptr(pattern_map.at(v_reshape_node).get_node_shared_ptr()); + + if (!(is_expected_reshape(q_reshape) && is_expected_reshape(k_reshape) && is_expected_reshape(v_reshape))) { + return false; + } + // K,V Reshape's order should be same node. 
+ auto k_reshape_order = as_type_ptr(k_reshape->get_input_node_shared_ptr(1)); + auto v_reshape_order = as_type_ptr(v_reshape->get_input_node_shared_ptr(1)); + if (k_reshape_order && v_reshape_order) { + if (k_reshape_order->cast_vector() != v_reshape_order->cast_vector()) { + return false; + } + } else if (k_reshape->get_input_node_shared_ptr(1) != v_reshape->get_input_node_shared_ptr(1)) { + return false; + } + + std::shared_ptr qkv_transpose[3] = {}; + std::shared_ptr qkv_transpose_order[3] = {}; + qkv_transpose[0] = as_type_ptr(pattern_map.at(q_transpose_node).get_node_shared_ptr()); + qkv_transpose[1] = as_type_ptr(pattern_map.at(k_transpose_node).get_node_shared_ptr()); + qkv_transpose[2] = as_type_ptr(pattern_map.at(v_transpose_node).get_node_shared_ptr()); + qkv_transpose_order[0] = as_type_ptr(pattern_map.at(q_transpose_order_node).get_node_shared_ptr()); + qkv_transpose_order[1] = as_type_ptr(pattern_map.at(k_transpose_order_node).get_node_shared_ptr()); + qkv_transpose_order[2] = as_type_ptr(pattern_map.at(v_transpose_order_node).get_node_shared_ptr()); + auto out_tranpose = as_type_ptr(pattern_map.at(out_transpose_node).get_node_shared_ptr()); + auto out_transpose_order = as_type_ptr(pattern_map.at(out_transpose_order_node).get_node_shared_ptr()); + + if (!(is_expected_transpose(qkv_transpose[0]) && is_expected_transpose(qkv_transpose[1]) && + is_expected_transpose(qkv_transpose[2]))) { + return false; + } + if (!is_expected_transpose(out_tranpose)) { + return false; + } + + auto out_reshape = as_type_ptr(pattern_map.at(out_reshape_node).get_node_shared_ptr()); + if (!is_expected_reshape(out_reshape, true)) { + return false; + } + + OutputVector args = {q_reshape->get_input_node_shared_ptr(0), + k_reshape->get_input_node_shared_ptr(0), + v_reshape->get_input_node_shared_ptr(0)}; + + // Config + intel_cpu::SDPAWithTransposeReshape::Config config; + config.is_causal = sdpa->get_causal(); + config.fuse_concat = false; + config.output_BLHxS = true; + + // Config::permute_axes + const auto& permute_q = qkv_transpose_order[0]->cast_vector(); + config.permute_axes.resize(permute_q.size()); + for (size_t i = 0; i < permute_q.size(); i++) { + config.permute_axes[i] = static_cast(permute_q[i]); + } + + // Config::order_HS + config.order_HS.resize(2); + auto reshape_out_shape = q_reshape->get_output_partial_shape(0).get_min_shape(); // [?,?,H,S] + config.order_HS[0] = reshape_out_shape[2]; + config.order_HS[1] = reshape_out_shape[3]; + config.input_BLHxS = true; + + auto new_sdpa = std::make_shared(args, config); + new_sdpa->set_friendly_name(sdpa->get_friendly_name() + "/fused_reshape_transpose"); + NodeVector replaced_nodes = {q_reshape, + k_reshape, + v_reshape, + qkv_transpose[0], + qkv_transpose[1], + qkv_transpose[2], + sdpa, + out_tranpose, + out_reshape}; + copy_runtime_info(replaced_nodes, new_sdpa); + ov::replace_node(out_reshape, new_sdpa); + return true; + }; + + auto m = std::make_shared(out_reshape_node, matcher_name); + register_matcher(m, callback); +} diff --git a/src/plugins/intel_cpu/src/transformations/cpu_opset/x64/pass/sdpa_fuse_transpose_reshape.hpp b/src/plugins/intel_cpu/src/transformations/cpu_opset/x64/pass/sdpa_fuse_transpose_reshape.hpp new file mode 100644 index 00000000000000..74ba6ec6221d1e --- /dev/null +++ b/src/plugins/intel_cpu/src/transformations/cpu_opset/x64/pass/sdpa_fuse_transpose_reshape.hpp @@ -0,0 +1,18 @@ +// Copyright (C) 2018-2024 Intel Corporation +// SPDX-License-Identifier: Apache-2.0 +// + +#pragma once + +#include + +namespace ov { 
+namespace intel_cpu { +class SDPAFuseTransposeReshape : public ov::pass::MatcherPass { +public: + OPENVINO_RTTI("SDPAFuseTransposeReshape", "0"); + SDPAFuseTransposeReshape(); +}; + +} // namespace intel_cpu +} // namespace ov diff --git a/src/plugins/intel_cpu/src/transformations/transformation_pipeline.cpp b/src/plugins/intel_cpu/src/transformations/transformation_pipeline.cpp index 04808baaebec54..e45b6379d1e968 100644 --- a/src/plugins/intel_cpu/src/transformations/transformation_pipeline.cpp +++ b/src/plugins/intel_cpu/src/transformations/transformation_pipeline.cpp @@ -139,6 +139,7 @@ #include "transformations/cpu_opset/common/pass/swap_convert_transpose.hpp" #include "transformations/cpu_opset/common/pass/causal_mask_preprocess_fusion.hpp" #include "transformations/cpu_opset/common/pass/stateful_sdpa_fusion.hpp" +#include "transformations/cpu_opset/x64/pass/sdpa_fuse_transpose_reshape.hpp" // Snippets #include "snippets/pass/tokenization.hpp" @@ -864,6 +865,7 @@ void Transformations::PostLpt() { CPU_REGISTER_PASS_COMMON(postLPTPassManager, ov::pass::transpose_sinking::TSShapeOfForward); CPU_REGISTER_PASS_COMMON(postLPTPassManager, StatefulSDPAFusion); + CPU_REGISTER_PASS_X64(postLPTPassManager, ov::intel_cpu::SDPAFuseTransposeReshape); CPU_REGISTER_PASS_X64(postLPTPassManager, ov::pass::RMSFusion, false); CPU_REGISTER_PASS_X64(postLPTPassManager, ov::intel_cpu::DecomposeRMSNorm); CPU_SET_CALLBACK_X64(postLPTPassManager, diff --git a/src/plugins/intel_cpu/tests/functional/custom/subgraph_tests/src/x64/fuse_reshape_transpose_to_sdpa.cpp b/src/plugins/intel_cpu/tests/functional/custom/subgraph_tests/src/x64/fuse_reshape_transpose_to_sdpa.cpp new file mode 100644 index 00000000000000..a75156c0f69fcb --- /dev/null +++ b/src/plugins/intel_cpu/tests/functional/custom/subgraph_tests/src/x64/fuse_reshape_transpose_to_sdpa.cpp @@ -0,0 +1,245 @@ +// Copyright (C) 2024 Intel Corporation +// SPDX-License-Identifier: Apache-2.0 +// + +#include "common_test_utils/include/common_test_utils/ov_tensor_utils.hpp" +#include "openvino/pass/manager.hpp" +#include "shared_test_classes/base/ov_subgraph.hpp" +#include "transformations/op_conversions/scaled_dot_product_attention_decomposition.hpp" +#include "utils/cpu_test_utils.hpp" + +using namespace ov::test; +using namespace CPUTestUtils; + +namespace ov { +namespace test { + +// Subgraph: +/* + * Parameter Parameter + * | | + * Parameter ReadValue ReadValue + * | | \ | \ + * Reshape Reshape Assign Reshape Assign + * | | | + * Transpose Transpoe Transpose + * \ | / + * ScaledDotProductAttention + * | + * Tranpose + * | + * Reshape + * | + * Result + */ + +// +using InputShapeAndReshapeOrder = std::pair, std::vector>; +using FuseSDPAReshapeTransposeTestParams = std::tuple; +class FuseSDPAReshapeTransposeTest : virtual public ov::test::SubgraphBaseTest, + public testing::WithParamInterface, + public CPUTestsBase { +public: + static std::string getTestCaseName(const testing::TestParamInfo& obj) { + ElementType inType; + InputShapeAndReshapeOrder inputShapeAndOrders; + std::tie(inType, inputShapeAndOrders) = obj.param; + std::ostringstream result; + std::vector& inputShapes = inputShapeAndOrders.first; + auto& reshapeOrderHS = inputShapeAndOrders.second; + result << "IS="; + for (const auto& shape : inputShapes) { + result << ov::test::utils::partialShape2str({shape.first}) << "_"; + } + result << "TS="; + for (const auto& shape : inputShapes) { + result << "("; + if (!shape.second.empty()) { + for (const auto& itr : shape.second) { + result << 
ov::test::utils::vec2str(itr); + } + } + result << ")_"; + } + result << "Prc=" << inType << "_"; + result << "ReshapeOrderHS="; + result << "("; + for (const auto& itr : reshapeOrderHS) { + result << itr << ","; + } + result << ")"; + + return result.str(); + } + + void SetUp() override { + ElementType inType; + InputShapeAndReshapeOrder inputShapeAndOrders; + std::tie(inType, inputShapeAndOrders) = this->GetParam(); + std::vector& inputShapes = inputShapeAndOrders.first; + auto& reshapeOrderHS = inputShapeAndOrders.second; + targetDevice = ov::test::utils::DEVICE_CPU; + rel_threshold = 1e-2f; + configuration[ov::hint::inference_precision.name()] = ov::element::f32; + if (inType == ElementType::bf16) { + configuration[ov::hint::inference_precision.name()] = ov::element::bf16; + rel_threshold = 0.01f; + } + init_input_shapes(inputShapes); + + // pre SDPA reshape->transpose + ov::ParameterVector inputParams(3); + ov::SinkVector sinkNodes; + OutputVector transposes(3); + for (size_t i = 0; i < 3u; i++) { + inputParams[i] = std::make_shared(inType, inputDynamicShapes[0]); + + auto reshape_axis = + ov::op::v0::Constant::create(ov::element::i64, {4}, {0, 0, reshapeOrderHS[0], reshapeOrderHS[1]}); + + std::shared_ptr reshape_input_1 = inputParams[i]; + if (i > 0) { + auto var = std::make_shared( + ov::op::util::VariableInfo{inputDynamicShapes[0], inType, "var_" + std::to_string(i)}); + auto readvalue = std::make_shared(inputParams[i], var); + auto assign = std::make_shared(readvalue, var); + sinkNodes.emplace_back(assign); + reshape_input_1 = readvalue; + } + + auto reshape = std::make_shared(reshape_input_1, reshape_axis, true); + auto transposeOrder = ov::op::v0::Constant::create(ov::element::i64, {4}, {0, 2, 1, 3}); + transposes[i] = std::make_shared(reshape, transposeOrder); + } + + auto sdpa = std::make_shared(transposes, false); + sdpa->set_friendly_name("mha"); + + // post SDPA transpose + reshape + auto postOrder = + ov::op::v0::Constant::create(ov::element::i64, {4}, std::vector{0, 2, 1, 3}); // BHLS -> BLHS + auto transposeSDPA = std::make_shared(sdpa, postOrder); + + auto constReshape = + ov::op::v0::Constant::create(ov::element::i64, {3}, {0, 0, reshapeOrderHS[0] * reshapeOrderHS[1]}); + auto reshapeSDPA = std::make_shared(transposeSDPA, constReshape, true); // BLHS -> B,L,HxS + + function = std::make_shared(ov::OutputVector{reshapeSDPA}, + sinkNodes, + inputParams, + "FuseSDPAReshapeTranspose"); + targetDevice = ov::test::utils::DEVICE_CPU; + functionRefs = function->clone(); + pass::Manager manager; + // decompose ScaledDotProductAttention + manager.register_pass(); + manager.run_passes(functionRefs); + } + + template + static void strided_iota(IT first, size_t n, T value, T stride) { + for (size_t i = 0; i < n; i++) { + *first++ = value; + value += stride; + } + } + void generate(int idx, const std::vector& targetInputStaticShapes) { + inputs.clear(); + auto create_input = [this] (std::shared_ptr param, ov::Shape shape, float val) { + if (param->get_element_type() == ov::element::i32) { + ov::Tensor t{ov::element::i32, shape}; + auto size = ov::shape_size(shape); + auto* p = static_cast(t.data()); + auto start = static_cast(val); + for (size_t i = 0; i < size; i++) { + p[i] = (start + i) % size; + } + inputs.insert({param, t}); + } else if (param->get_element_type() == ov::element::f32) { + ov::Tensor t{ov::element::f32, shape}; + strided_iota(static_cast(t.data()), t.get_size(), val, 0.1f); + inputs.insert({param, t}); + } else { + ASSERT_TRUE(param->get_element_type() == 
ov::element::bf16); + ov::Tensor t{ov::element::bf16, shape}; + strided_iota(static_cast(t.data()), t.get_size(), val, 0.1f); + inputs.insert({param, t}); + } + }; + // q, k, v + create_input(function->get_parameters()[0], targetInputStaticShapes[0], idx + 1.0f); + create_input(function->get_parameters()[1], targetInputStaticShapes[0], idx + 2.0f); + create_input(function->get_parameters()[2], targetInputStaticShapes[0], idx + 3.0f); + } + void prepare() { + compile_model(); + inferRequest = compiledModel.create_infer_request(); + ASSERT_TRUE(inferRequest); + } + void reset() { + for (auto&& state : inferRequest.query_state()) { + state.reset(); + } + } + + std::vector run_test(std::shared_ptr model) { + function = model; + prepare(); + std::vector outputs; + int idx = 0; + for (auto&& shapes : targetStaticShapes) { + generate(idx++, shapes); + for (const auto& input : inputs) { + inferRequest.set_tensor(input.first, input.second); + } + inferRequest.infer(); + auto outputTensor = inferRequest.get_output_tensor(0); + ov::Tensor copy{outputTensor.get_element_type(), outputTensor.get_shape()}; + outputTensor.copy_to(copy); + outputs.push_back(copy); + reset(); + } + return outputs; + } +}; + +TEST_P(FuseSDPAReshapeTransposeTest, CompareWithRefs) { + SKIP_IF_CURRENT_TEST_IS_DISABLED(); + bool reshape_transpose_fused = false; + auto actualOutputs = run_test(function); + CheckNumberOfNodesWithType(compiledModel, "ScaledDotProductAttention", 1); + CheckNumberOfNodesWithType(compiledModel, "Reshape", 0); + CheckNumberOfNodesWithType(compiledModel, "Transpose", 0); + for (const auto& n : compiledModel.get_runtime_model()->get_ordered_ops()) { + if (n->get_friendly_name() == "mha/fused_reshape_transpose") { + reshape_transpose_fused = true; + } + } + ASSERT_TRUE(reshape_transpose_fused); + + auto expectedOutputs = run_test(functionRefs); + for (size_t i = 0; i < actualOutputs.size(); i++) { + ov::test::utils::compare(expectedOutputs[i], actualOutputs[i], abs_threshold, rel_threshold); + } +} + +namespace { +const std::vector inputShapeAndReshapeOrders = { + // + { + {{ + // Q,K,V:[B, L, H*S] + {{-1, -1, 4 * 16}, {{1, 1, 4 * 16}, {1, 2, 4 * 16}, {2, 2, 4 * 16}}}, + }, + // reshapeOrderHS + {4, 16}}, + }}; + +INSTANTIATE_TEST_SUITE_P(smoke_FuseSDPAReshapeTransposeTest, + FuseSDPAReshapeTransposeTest, + ::testing::Combine(::testing::Values(ElementType::f32), + ::testing::ValuesIn(inputShapeAndReshapeOrders)), + FuseSDPAReshapeTransposeTest::getTestCaseName); +} // namespace +} // namespace test +} // namespace ov From 4043e15cc2520f3fec9f0f9d497f6457d8367224 Mon Sep 17 00:00:00 2001 From: Katarzyna Mitrus Date: Mon, 21 Oct 2024 09:38:47 +0200 Subject: [PATCH 04/24] [PyOV] Extend Python API with STFT-15 (#27142) ### Details: - Extend Python API with STFT-15 ### Tickets: - 147160 --- .../src/openvino/runtime/opset15/__init__.py | 1 + .../src/openvino/runtime/opset15/ops.py | 24 +++++++++++++++++++ .../python/tests/test_graph/test_create_op.py | 16 +++++++++++++ 3 files changed, 41 insertions(+) diff --git a/src/bindings/python/src/openvino/runtime/opset15/__init__.py b/src/bindings/python/src/openvino/runtime/opset15/__init__.py index 96643a7e93d596..58fd90e7fd1051 100644 --- a/src/bindings/python/src/openvino/runtime/opset15/__init__.py +++ b/src/bindings/python/src/openvino/runtime/opset15/__init__.py @@ -16,3 +16,4 @@ from openvino.runtime.opset15.ops import bitwise_left_shift from openvino.runtime.opset15.ops import bitwise_right_shift from openvino.runtime.opset15.ops import slice_scatter +from 
openvino.runtime.opset15.ops import stft diff --git a/src/bindings/python/src/openvino/runtime/opset15/ops.py b/src/bindings/python/src/openvino/runtime/opset15/ops.py index 116f63726bfeb6..c278120dab7432 100644 --- a/src/bindings/python/src/openvino/runtime/opset15/ops.py +++ b/src/bindings/python/src/openvino/runtime/opset15/ops.py @@ -303,3 +303,27 @@ def slice_scatter( inputs = as_nodes(data, updates, start, stop, step, axes, name=name) return _get_node_factory_opset15().create("SliceScatter", inputs) + + +@nameable_op +def stft( + data: NodeInput, + window: NodeInput, + frame_size: NodeInput, + frame_step: NodeInput, + transpose_frames: bool, + name: Optional[str] = None, +) -> Node: + """Return a node which generates STFT operation. + + :param data: The node providing input data. + :param window: The node providing window data. + :param frame_size: The node with scalar value representing the size of Fourier Transform. + :param frame_step: The distance (number of samples) between successive window frames. + :param transpose_frames: Flag to set output shape layout. If true the `frames` dimension is at out_shape[2], + otherwise it is at out_shape[1]. + :param name: The optional name for the created output node. + :return: The new node performing STFT operation. + """ + inputs = as_nodes(data, window, frame_size, frame_step, name=name) + return _get_node_factory_opset15().create("STFT", inputs) diff --git a/src/bindings/python/tests/test_graph/test_create_op.py b/src/bindings/python/tests/test_graph/test_create_op.py index c5023588f5d55b..940f8244f427b8 100644 --- a/src/bindings/python/tests/test_graph/test_create_op.py +++ b/src/bindings/python/tests/test_graph/test_create_op.py @@ -2486,6 +2486,22 @@ def test_slice_scatter(): assert node_default_axes.get_output_shape(0) == data_shape +def test_stft(): + data_shape = [4, 48] + data = ov.parameter(data_shape, name="input", dtype=np.float32) + window = ov.parameter([7], name="window", dtype=np.float32) + frame_size = ov.constant(np.array(11, dtype=np.int32)) + frame_step = ov.constant(np.array(3, dtype=np.int32)) + transpose_frames = True + + op = ov_opset15.stft(data, window, frame_size, frame_step, transpose_frames) + + assert op.get_type_name() == "STFT" + assert op.get_output_size() == 1 + assert op.get_output_element_type(0) == Type.f32 + assert op.get_output_shape(0) == [4, 13, 6, 2] + + def test_parameter_get_attributes(): parameter = ov.parameter([2, 2], dtype=np.float32, name="InputData") parameter_attributes = parameter.get_attributes() From 3f953f4ae4c6d22e6a57f9a51217133b7ce8a529 Mon Sep 17 00:00:00 2001 From: Karol Blaszczak Date: Mon, 21 Oct 2024 09:44:12 +0200 Subject: [PATCH 05/24] [DOCS] benchmark content restructuring (#26918) --- .../about-openvino/performance-benchmarks.rst | 111 +++---- .../generative-ai-performance.rst | 28 +- .../getting-performance-numbers.rst | 273 ++++++++++++------ .../model-accuracy-int8-fp32.rst | 7 +- .../_static/benchmarks_files/llm_models.csv | 22 ++ .../_static/download/llm_models.csv | 22 -- .../_static/download/llm_models_ovms.csv | 100 ------- 7 files changed, 273 insertions(+), 290 deletions(-) create mode 100644 docs/sphinx_setup/_static/benchmarks_files/llm_models.csv delete mode 100644 docs/sphinx_setup/_static/download/llm_models.csv delete mode 100644 docs/sphinx_setup/_static/download/llm_models_ovms.csv diff --git a/docs/articles_en/about-openvino/performance-benchmarks.rst b/docs/articles_en/about-openvino/performance-benchmarks.rst index 40b94210f6c43d..ed9d39aaf8b9e6 100644 
--- a/docs/articles_en/about-openvino/performance-benchmarks.rst +++ b/docs/articles_en/about-openvino/performance-benchmarks.rst @@ -16,14 +16,12 @@ Performance Benchmarks Getting Performance Numbers -This page presents benchmark results for +This page presents benchmark results for the `Intel® Distribution of OpenVINO™ toolkit `__ and :doc:`OpenVINO Model Server <../openvino-workflow/model-server/ovms_what_is_openvino_model_server>`, for a representative selection of public neural networks and Intel® devices. The results may help you decide which hardware to use in your applications or plan AI workload for the hardware you have already implemented in your solutions. Click the buttons below to see the chosen benchmark data. -For a more detailed view of performance numbers for generative AI models, check the -:doc:`Generative AI Benchmark Results <./performance-benchmarks/generative-ai-performance>` .. grid:: 1 1 2 2 :gutter: 4 @@ -36,7 +34,7 @@ For a more detailed view of performance numbers for generative AI models, check :outline: :expand: - :material-regular:`bar_chart;1.4em` OpenVINO Benchmark Graphs + :material-regular:`bar_chart;1.4em` OpenVINO Benchmark Graphs (general) .. grid-item:: @@ -46,10 +44,35 @@ For a more detailed view of performance numbers for generative AI models, check :outline: :expand: - :material-regular:`bar_chart;1.4em` OVMS Benchmark Graphs + :material-regular:`bar_chart;1.4em` OVMS Benchmark Graphs (general) + + .. grid-item:: + + .. button-link:: ./performance-benchmarks/generative-ai-performance.html + :class: ov-toolkit-benchmark-genai + :color: primary + :outline: + :expand: + + :material-regular:`table_view;1.4em` LLM performance for AI PC + + .. grid-item:: + + .. button-link:: # + :class: ovms-toolkit-benchmark-llm + :color: primary + :outline: + :expand: + + :material-regular:`bar_chart;1.4em` OVMS for GenAI (coming soon) + + + + -Key performance indicators and workload parameters. + +**Key performance indicators and workload parameters** .. tab-set:: @@ -65,13 +88,13 @@ Key performance indicators and workload parameters. .. tab-item:: Latency :sync: latency - For Vision and NLP models this mhis measures the synchronous execution of inference requests and is reported in - milliseconds. Each inference request (for example: preprocess, infer, postprocess) is - allowed to complete before the next is started. This performance metric is relevant in - usage scenarios where a single image input needs to be acted upon as soon as possible. An - example would be the healthcare sector where medical personnel only request analysis of a - single ultra sound scanning image or in real-time or near real-time applications for - example an industrial robot's response to actions in its environment or obstacle avoidance + For Vision and NLP models this measures the synchronous execution of inference requests and + is reported in milliseconds. Each inference request (for example: preprocess, infer, + postprocess) is allowed to complete before the next one starts. This performance metric is + relevant in usage scenarios where a single image input needs to be acted upon as soon as + possible. An example would be the healthcare sector where medical personnel only request + analysis of a single ultra sound scanning image or in real-time or near real-time applications + such as an industrial robot's response to actions in its environment or obstacle avoidance for autonomous vehicles. 
For Transformer models like Stable-Diffusion this measures the time it takes to convert the prompt or input text into a finished image. It is presented in seconds. @@ -97,9 +120,10 @@ Key performance indicators and workload parameters. * input token length: 1024 (the tokens for GenAI models are in English). -.. raw:: html +**Platforms, Configurations, Methodology** -

Platforms, Configurations, Methodology

+To see the methodology used to obtain the numbers and learn how to test performance yourself, +see the guide on :doc:`getting performance numbers `. For a listing of all platforms and configurations used for testing, refer to the following: @@ -130,59 +154,10 @@ For a listing of all platforms and configurations used for testing, refer to the :material-regular:`download;1.5em` Click for Performance Data [XLSX] -The OpenVINO benchmark setup includes a single system with OpenVINO™, as well as the benchmark -application installed. It measures the time spent on actual inference (excluding any pre or post -processing) and then reports on the inferences per second (or Frames Per Second). - -OpenVINO™ Model Server (OVMS) employs the Intel® Distribution of OpenVINO™ toolkit runtime -libraries and exposes a set of models via a convenient inference API over gRPC or HTTP/REST. -Its benchmark results are measured with the configuration of multiple-clients-single-server, -using two hardware platforms connected by ethernet. Network bandwidth depends on both platforms -and models used. It is set not to be a bottleneck for workload intensity. The connection is -dedicated only to measuring performance. - -.. dropdown:: See more details about OVMS benchmark setup - - The benchmark setup for OVMS consists of four main parts: - .. image:: ../assets/images/performance_benchmarks_ovms_02.png - :alt: OVMS Benchmark Setup Diagram - * **OpenVINO™ Model Server** is launched as a docker container on the server platform and it - listens to (and answers) requests from clients. OpenVINO™ Model Server is run on the same - system as the OpenVINO™ toolkit benchmark application in corresponding benchmarking. Models - served by OpenVINO™ Model Server are located in a local file system mounted into the docker - container. The OpenVINO™ Model Server instance communicates with other components via ports - over a dedicated docker network. - * **Clients** are run in separated physical machine referred to as client platform. Clients - are implemented in Python3 programming language based on TensorFlow* API and they work as - parallel processes. Each client waits for a response from OpenVINO™ Model Server before it - will send a new next request. The role played by the clients is also verification of - responses. - - * **Load balancer** works on the client platform in a docker container. HAProxy is used for - this purpose. Its main role is counting of requests forwarded from clients to OpenVINO™ - Model Server, estimating its latency, and sharing this information by Prometheus service. - The reason of locating the load balancer on the client site is to simulate real life - scenario that includes impact of physical network on reported metrics. - - * **Execution Controller** is launched on the client platform. It is responsible for - synchronization of the whole measurement process, downloading metrics from the load - balancer, and presenting the final report of the execution. - - - -.. raw:: html - -

Test performance yourself

- -You can also test performance for your system yourself, following the guide on -:doc:`getting performance numbers `. - -.. raw:: html - -

Disclaimers

+**Disclaimers** * Intel® Distribution of OpenVINO™ toolkit performance results are based on release 2024.3, as of July 31, 2024. @@ -192,12 +167,11 @@ You can also test performance for your system yourself, following the guide on The results may not reflect all publicly available updates. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service -activation. Learn more at intel.com, or from the OEM or retailer. +activation. Learn more at intel.com, the OEM, or retailer. See configuration disclosure for details. No product can be absolutely secure. Performance varies by use, configuration and other factors. Learn more at `www.intel.com/PerformanceIndex `__. -Your costs and results may vary. Intel optimizations, for Intel compilers or other products, may not optimize to the same degree for non-Intel products. @@ -205,9 +179,6 @@ for non-Intel products. - - - .. raw:: html diff --git a/docs/articles_en/about-openvino/performance-benchmarks/generative-ai-performance.rst b/docs/articles_en/about-openvino/performance-benchmarks/generative-ai-performance.rst index 35e09f91f72b9c..39b27d12c970fd 100644 --- a/docs/articles_en/about-openvino/performance-benchmarks/generative-ai-performance.rst +++ b/docs/articles_en/about-openvino/performance-benchmarks/generative-ai-performance.rst @@ -4,7 +4,7 @@ Most Efficient Large Language Models for AI PC This page is regularly updated to help you identify the best-performing LLMs on the Intel® Core™ Ultra processor family and AI PCs. -The tables below list the key performance indicators for a selection of Large Language Models, +The tables below list key performance indicators for a selection of Large Language Models, running on an Intel® Core™ Ultra 7-165H based system, on built-in GPUs. @@ -23,24 +23,34 @@ running on an Intel® Core™ Ultra 7-165H based system, on built-in GPUs. :class: modeldata stripe :name: supportedModelsTableOv :header-rows: 1 - :file: ../../_static/download/llm_models.csv + :file: ../../_static/benchmarks_files/llm_models.csv -For complete information on the system config, see: -`Hardware Platforms [PDF] `__ - -To view the data in an editable form, you can download the .csv file here: - .. grid:: 1 1 2 2 :gutter: 4 .. grid-item:: - .. button-link:: ../../_static/download/llm_models.csv + All models listed here were tested with the following parameters: + + * Framework: PyTorch + * Model precision: INT4 + * Beam: 1 + * Batch size: 1 + + .. grid-item:: + + .. button-link:: https://docs.openvino.ai/2024/_static/benchmarks_files/OV-2024.4-platform_list.pdf :color: primary :outline: :expand: - :material-regular:`download;1.5em` Click for OpenVINO LLM results [CSV] + :material-regular:`download;1.5em` Get full system info [PDF] + + .. button-link:: ../../_static/benchmarks_files/llm_models.csv + :color: primary + :outline: + :expand: + :material-regular:`download;1.5em` Get the data in .csv [CSV] diff --git a/docs/articles_en/about-openvino/performance-benchmarks/getting-performance-numbers.rst b/docs/articles_en/about-openvino/performance-benchmarks/getting-performance-numbers.rst index 069c940063cf14..e35d42a6a02abc 100644 --- a/docs/articles_en/about-openvino/performance-benchmarks/getting-performance-numbers.rst +++ b/docs/articles_en/about-openvino/performance-benchmarks/getting-performance-numbers.rst @@ -1,124 +1,201 @@ Getting Performance Numbers =========================== +1. `Benchmarking methodology for OpenVINO <#benchmarking-methodology-for-openvino>`__ + a. 
`OpenVINO benchmarking (general) <#openvino-benchmarking--general->`__ + b. `OpenVINO Model Server benchmarking (general) <#openvino-model-server-benchmarking--general->`__ + c. `OpenVINO Model Server benchmarking (LLM) <#openvino-model-server-benchmarking--llm->`__ -This guide explains how to use the benchmark_app to get performance numbers. It also explains how the performance -numbers are reflected through internal inference performance counters and execution graphs. It also includes -information on using ITT and Intel® VTune™ Profiler to get performance insights. +2. `How to obtain benchmark results <#how-to-obtain-benchmark-results>`__ + a. `General considerations <#general-considerations>`__ + b. `OpenVINO benchmarking (general) <#openvino-benchmarking--general->`__ + c. `OpenVINO benchmarking (LLM) <#openvino-benchmarking--llm->`__ -.. raw:: html -

Test performance with the benchmark_app

+Benchmarking methodology for OpenVINO +############################################################################################### -You can run OpenVINO benchmarks in both C++ and Python APIs, yet the experience differs in each case. -The Python one is part of OpenVINO Runtime installation, while C++ is available as a code sample. -For a detailed description, see: :doc:`benchmark_app <../../learn-openvino/openvino-samples/benchmark-tool>`. +OpenVINO benchmarking (general) +++++++++++++++++++++++++++++++++++++++++++++ -Make sure to install the latest release package with support for frameworks of the models you want to test. -For the most reliable performance benchmarks, :doc:`prepare the model for use with OpenVINO <../../openvino-workflow/model-preparation>`. +The OpenVINO benchmark setup includes a single system with OpenVINO™, as well as the benchmark +application installed. It measures the time spent on actual inference (excluding any pre or post +processing) and then reports on the inferences per second (or Frames Per Second). +OpenVINO Model Server benchmarking (general) +++++++++++++++++++++++++++++++++++++++++++++ -.. raw:: html +OpenVINO™ Model Server (OVMS) employs the Intel® Distribution of OpenVINO™ toolkit runtime +libraries and exposes a set of models via a convenient inference API over gRPC or HTTP/REST. +Its benchmark results are measured with the configuration of multiple-clients-single-server, +using two hardware platforms connected by ethernet. Network bandwidth depends on both platforms +and models used. It is set not to be a bottleneck for workload intensity. The connection is +dedicated only to measuring performance. -

Running the benchmark application

+.. dropdown:: See more details about OVMS benchmark setup + The benchmark setup for OVMS consists of four main parts: -The benchmark_app includes a lot of device-specific options, but the primary usage is as simple as: + .. image:: ../assets/images/performance_benchmarks_ovms_02.png + :alt: OVMS Benchmark Setup Diagram -.. code-block:: sh + * **OpenVINO™ Model Server** is launched as a docker container on the server platform and it + listens to (and answers) requests from clients. OpenVINO™ Model Server is run on the same + system as the OpenVINO™ toolkit benchmark application in corresponding benchmarking. Models + served by OpenVINO™ Model Server are located in a local file system mounted into the docker + container. The OpenVINO™ Model Server instance communicates with other components via ports + over a dedicated docker network. - benchmark_app -m -d -i + * **Clients** are run in separated physical machine referred to as client platform. Clients + are implemented in Python3 programming language based on TensorFlow* API and they work as + parallel processes. Each client waits for a response from OpenVINO™ Model Server before it + will send a new next request. The role played by the clients is also verification of + responses. + * **Load balancer** works on the client platform in a docker container. HAProxy is used for + this purpose. Its main role is counting of requests forwarded from clients to OpenVINO™ + Model Server, estimating its latency, and sharing this information by Prometheus service. + The reason of locating the load balancer on the client site is to simulate real life + scenario that includes impact of physical network on reported metrics. -Each of the :doc:`OpenVINO supported devices <../compatibility-and-support/supported-devices>` offers -performance settings that contain command-line equivalents in the Benchmark app. + * **Execution Controller** is launched on the client platform. It is responsible for + synchronization of the whole measurement process, downloading metrics from the load + balancer, and presenting the final report of the execution. -While these settings provide really low-level control for the optimal model performance on the *specific* device, -it is recommended to always start performance evaluation with the :doc:`OpenVINO High-Level Performance Hints <../../openvino-workflow/running-inference/optimize-inference/high-level-performance-hints>` first, like so: -.. code-block:: sh +OpenVINO Model Server benchmarking (LLM) +++++++++++++++++++++++++++++++++++++++++ - # for throughput prioritization - benchmark_app -hint tput -m -d - # for latency prioritization - benchmark_app -hint latency -m -d +In the benchmarking results presented here, the load from clients is simulated using the +benchmark_serving.py script from vLLM and the ShareGPT dataset. It represents real life usage +scenarios. Both OpenVINO Model Server and vLLM expose OpenAI-compatible REST endpoints so the +methodology is identical. +In the experiments, we change the average request rate to identify the tradeoff between total +throughput and the TPOT latency. +Note that in the benchmarking, the feature of prefix_caching is not used. -.. raw:: html -

Additional benchmarking considerations

-.. raw:: html +How to obtain benchmark results +############################################################################################### -

1 - Select a Proper Set of Operations to Measure

+General considerations +++++++++++++++++++++++ +.. dropdown:: Select a proper set of operations to measure -When evaluating performance of a model with OpenVINO Runtime, it is required to measure a proper set of operations. + When evaluating performance of a model with OpenVINO Runtime, it is required to measure a + proper set of operations. -- Avoid including one-time costs such as model loading. -- Track operations that occur outside OpenVINO Runtime (such as video decoding) separately. + * Avoid including one-time costs such as model loading. + * Track operations that occur outside OpenVINO Runtime, such as video decoding, separately. + .. note:: -.. note:: + Some image pre-processing can be baked into OpenVINO IR and accelerated accordingly. + For more information, refer to + :doc:`Embedding Pre-processing <../../documentation/legacy-features/transition-legacy-conversion-api/legacy-conversion-api/[legacy]-embedding-preprocessing-computation>` + and + :doc:`General Runtime Optimizations <../../openvino-workflow/running-inference/optimize-inference/general-optimizations>`. - Some image pre-processing can be baked into OpenVINO IR and accelerated accordingly. For more information, - refer to :doc:`Embedding Pre-processing <../../documentation/legacy-features/transition-legacy-conversion-api/legacy-conversion-api/[legacy]-embedding-preprocessing-computation>` and - :doc:`General Runtime Optimizations <../../openvino-workflow/running-inference/optimize-inference/general-optimizations>`. +.. dropdown:: Maximize the chance to obtain credible data + Performance conclusions should be build on reproducible data. As for the performance + measurements, they should be done with a large number of invocations of the same routine. + Since the first iteration is almost always significantly slower than the subsequent ones, + an aggregated value can be used for the execution time for final projections: + * If the warm-up run does not help or execution times still vary, you can try running a + large number of iterations and then use the mean value of the results. + * If time values differ too much, consider using a geomean. + * Be aware of potential power-related irregularities, such as throttling. A device may assume + one of several different power states, so it is advisable to fix its frequency when + optimizing, for better performance data reproducibility. + * Note that end-to-end application benchmarking should also be performed under real + operational conditions. -.. raw:: html +.. dropdown:: Compare performance with native/framework code -

2 - Try to Get Credible Data

+ When comparing OpenVINO Runtime performance with the framework or reference code, + make sure that both versions are as similar as possible: -Performance conclusions should be build upon reproducible data. As for the performance measurements, they should -be done with a large number of invocations of the same routine. Since the first iteration is almost always significantly -slower than the subsequent ones, an aggregated value can be used for the execution time for final projections: + * Wrap the exact inference execution (for examples, see :doc:`Benchmark app <../../learn-openvino/openvino-samples/benchmark-tool>`). + * Do not include model loading time. + * Ensure that the inputs are identical for OpenVINO Runtime and the framework. For example, watch out for random values that can be used to populate the inputs. + * In situations when any user-side pre-processing should be tracked separately, consider :doc:`image pre-processing and conversion <../../openvino-workflow/running-inference/optimize-inference/optimize-preprocessing>`. + * When applicable, leverage the :doc:`Dynamic Shapes support <../../openvino-workflow/running-inference/dynamic-shapes>`. + * If possible, demand the same accuracy. For example, TensorFlow allows ``FP16`` execution, so when comparing to that, make sure to test the OpenVINO Runtime with the ``FP16`` as well. -- If the warm-up run does not help or execution time still varies, you can try running a large number of iterations - and then average or find a mean of the results. -- If the time values range too much, consider geomean. -- Be aware of the throttling and other power oddities. A device can exist in one of several different power states. - When optimizing your model, consider fixing the device frequency for better performance data reproducibility. - However, the end-to-end (application) benchmarking should also be performed under real operational conditions. +.. dropdown:: Make sure the benchmarking setup is proper for the selected scenario + * Install the latest release package supporting the frameworks of the tested models. + * For the most reliable performance benchmarks, + :doc:`prepare the model for use with OpenVINO <../../openvino-workflow/model-preparation>`. + * For testing generative AI models, make sure you select the method that best suits your case, + Optimum-Intel or the OpenVINO GenAI package. -.. raw:: html -
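One practical way to follow the "identical inputs" advice above is to generate the test input once, from a fixed seed, and feed the very same array to both the framework and OpenVINO Runtime. The shape below is an arbitrary assumption, and the two model objects in the comments are placeholders rather than real API calls.

.. code-block:: py

   import numpy as np

   rng = np.random.default_rng(seed=0)                           # fixed seed for reproducibility
   test_input = rng.random((1, 3, 224, 224), dtype=np.float32)   # assumed NCHW input

   # reference_output = framework_model(test_input)        # framework / native code run
   # openvino_output = openvino_compiled_model(test_input)  # OpenVINO run on the same array
   # np.testing.assert_allclose(openvino_output, reference_output, rtol=1e-3, atol=1e-3)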

3 - Compare Performance with Native/Framework Code

+OpenVINO benchmarking (general)
++++++++++++++++++++++++++++++++

+The default way of measuring OpenVINO performance is running a piece of code, referred to as
+:doc:`the benchmark tool <../../learn-openvino/openvino-samples/benchmark-tool>`.
+For Python, it is part of the OpenVINO Runtime installation, while for C++, it is available as
+a code sample.

-When comparing the OpenVINO Runtime performance with the framework or another reference code, make sure that both versions are as similar as possible:

-- Wrap the exact inference execution (for examples, see :doc:`Benchmark app <../../learn-openvino/openvino-samples/benchmark-tool>`).
-- Do not include model loading time.
-- Ensure that the inputs are identical for OpenVINO Runtime and the framework. For example, watch out for random values that can be used to populate the inputs.
-- In situations when any user-side pre-processing should be tracked separately, consider :doc:`image pre-processing and conversion <../../openvino-workflow/running-inference/optimize-inference/optimize-preprocessing>`.
-- When applicable, leverage the :doc:`Dynamic Shapes support <../../openvino-workflow/running-inference/dynamic-shapes>`.
-- If possible, demand the same accuracy. For example, TensorFlow allows ``FP16`` execution, so when comparing to that, make sure to test the OpenVINO Runtime with the ``FP16`` as well.

+Running the benchmark application
+---------------------------------
+
+The benchmark_app includes many device-specific options, but the primary usage is as simple
+as:
+
+.. code-block:: sh
+
+ benchmark_app -m <model> -d <device> -i <input>

-.. raw:: html
-
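For quick experiments it can also be convenient to reproduce roughly what benchmark_app does directly from the OpenVINO Python API. The snippet below is only a sketch: the model path, device name, and input shape are placeholder assumptions, and a high-level performance hint is passed instead of low-level, device-specific options.

.. code-block:: py

   import numpy as np
   import openvino as ov

   core = ov.Core()
   compiled = core.compile_model(
       "model.xml",                           # assumed path to an OpenVINO IR
       "CPU",                                 # assumed target device
       {"PERFORMANCE_HINT": "THROUGHPUT"},    # high-level hint, as recommended below
   )
   request = compiled.create_infer_request()
   dummy = np.random.rand(*compiled.input(0).shape).astype(np.float32)
   request.infer({0: dummy})                  # one inference; wrap in a loop to measure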

Internal Inference Performance Counters and Execution Graphs

+Each of the :doc:`OpenVINO supported devices <../compatibility-and-support/supported-devices>`
+offers performance settings that contain command-line equivalents in the Benchmark app.

-More detailed insights into inference performance breakdown can be achieved with device-specific performance counters and/or execution graphs.

+While these settings provide really low-level control for the optimal model performance on a
+*specific* device, it is recommended to always start performance evaluation with the
+:doc:`OpenVINO High-Level Performance Hints <../../openvino-workflow/running-inference/optimize-inference/high-level-performance-hints>`
+first, like so:
+
+.. code-block:: sh
+
+ # for throughput prioritization
+ benchmark_app -hint tput -m <model> -d <device>
+ # for latency prioritization
+ benchmark_app -hint latency -m <model> -d <device>
+
+
+Internal Inference Performance Counters and Execution Graphs
+-------------------------------------------------------------
+
+More detailed insights into inference performance breakdown can be achieved with device-specific
+performance counters and/or execution graphs.
Both :doc:`C++ and Python <../../learn-openvino/openvino-samples/benchmark-tool>`
-versions of the *benchmark_app* support a ``-pc`` command-line parameter that outputs internal execution breakdown.
+versions of the benchmark_app support a ``-pc`` command-line parameter that outputs an internal
+execution breakdown.

-For example, the table shown below is part of performance counters for quantized
-`TensorFlow implementation of ResNet-50 `__
-model inference on :doc:`CPU Plugin <../../openvino-workflow/running-inference/inference-devices-and-modes/cpu-device>`.
-Keep in mind that since the device is CPU, the ``realTime`` wall clock and the ``cpu`` time layers are the same.
-Information about layer precision is also stored in the performance counters.
+For example, the table below is part of the performance counters for
+:doc:`CPU inference <../../openvino-workflow/running-inference/inference-devices-and-modes/cpu-device>`
+of a `TensorFlow implementation of ResNet-50 `__ model.
+Keep in mind that since the device is CPU, the ``realTime`` wall clock and the ``cpu`` time
+layers are the same. Information about layer precision is also stored in the performance
+counters.


=========================================================== ============= ============== ===================== ================= ==============

@@ -136,39 +213,63 @@ Information about layer precision is also stored in the performance counters.

| The ``execStatus`` column of the table includes the following possible values:
| - ``EXECUTED`` - the layer was executed by standalone primitive.
-| - ``NOT_RUN`` - the layer was not executed by standalone primitive or was fused with another operation and executed in another layer primitive.
+| - ``NOT_RUN`` - the layer was not executed by standalone primitive or was fused with
+ another operation and executed in another layer primitive.
| -
-| The ``execType`` column of the table includes inference primitives with specific suffixes. The layers could have the following marks:
-| - The ``I8`` suffix is for layers that had 8-bit data type input and were computed in 8-bit precision.
+| The ``execType`` column of the table includes inference primitives with specific suffixes.
+ The layers could have the following marks:
+| - The ``I8`` suffix is for layers that had 8-bit data type input and were computed in
+ 8-bit precision.
| - The ``FP32`` suffix is for layers computed in 32-bit precision.
| -
-| All ``Convolution`` layers are executed in ``int8`` precision. The rest of the layers are fused into Convolutions using post-operation optimization,
- as described in :doc:`CPU Device <../../openvino-workflow/running-inference/inference-devices-and-modes/cpu-device>`. This contains layer names
- (as seen in OpenVINO IR), type of the layer, and execution statistics.
+| All ``Convolution`` layers are executed in ``int8`` precision. The rest of the layers are
+ fused into Convolutions using post-operation optimization, as described in
+ :doc:`CPU Device <../../openvino-workflow/running-inference/inference-devices-and-modes/cpu-device>`.
+ The table contains layer names (as seen in OpenVINO IR), the type of each layer, and execution
+ statistics.

-Both *benchmark_app* versions also support the ``exec_graph_path`` command-line option. It requires OpenVINO to output the same execution
-statistics per layer, but in the form of plugin-specific `Netron-viewable `__ graph to the specified file.
+Both *benchmark_app* versions also support the ``exec_graph_path`` command-line option.
+It requires OpenVINO to output the same execution statistics per layer, but in the form of a
+plugin-specific `Netron-viewable `__ graph written to the specified file.
+
+Especially when performance-debugging
+:doc:`latency <../../openvino-workflow/running-inference/optimize-inference/optimizing-latency>`,
+note that the counters do not reflect the time spent in the ``plugin/device/driver/etc`` queues.
+If the sum of the counters is too different from the latency of an inference request, consider
+testing with fewer inference requests. For example, running a single
+:doc:`OpenVINO stream <../../openvino-workflow/running-inference/optimize-inference/optimizing-throughput>`
+with multiple requests would produce nearly identical counters to running a single inference
+request, while the actual latency can be quite different.
+
+Lastly, the performance statistics with both performance counters and execution graphs are
+averaged, so such data for the
+:doc:`inputs of dynamic shapes <../../openvino-workflow/running-inference/dynamic-shapes>`
+should be measured carefully, preferably by isolating the specific shape and executing multiple
+times in a loop, to gather reliable data.
+
+Use ITT to Get Performance Insights
+--------------------------------------
+
+In general, OpenVINO and its individual plugins are heavily instrumented with Intel®
+Instrumentation and Tracing Technology (ITT). Therefore, you can also compile OpenVINO from the
+source code with ITT enabled and use tools like
+`Intel® VTune™ Profiler `__
+to get a detailed inference performance breakdown and additional insights into the
+application-level performance on the timeline view.
+
+
+OpenVINO benchmarking (LLM)
++++++++++++++++++++++++++++++++
+
+Large Language Models require a different benchmarking approach from that used for static
+models. A detailed description will be added soon.

-Especially when performance-debugging the :doc:`latency <../../openvino-workflow/running-inference/optimize-inference/optimizing-latency>`, note that the counters
-do not reflect the time spent in the ``plugin/device/driver/etc`` queues. If the sum of the counters is too different from the latency
-of an inference request, consider testing with less inference requests. 
For example, running single -:doc:`OpenVINO stream <../../openvino-workflow/running-inference/optimize-inference/optimizing-throughput>` with multiple requests would produce nearly identical -counters as running a single inference request, while the actual latency can be quite different. -Lastly, the performance statistics with both performance counters and execution graphs are averaged, -so such data for the :doc:`inputs of dynamic shapes <../../openvino-workflow/running-inference/dynamic-shapes>` should be measured carefully, -preferably by isolating the specific shape and executing multiple times in a loop, to gather reliable data. -.. raw:: html -
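The same per-layer data that ``-pc`` prints can also be collected programmatically, which is handy when correlating counters with application-level measurements. The sketch below is only an assumption-laden example: the model path and input are placeholders, ``PERF_COUNT`` is used as the string form of the profiling property, and the attribute names on the profiling entries reflect the Python API to the best of my understanding.

.. code-block:: py

   import numpy as np
   import openvino as ov

   core = ov.Core()
   compiled = core.compile_model("model.xml", "CPU", {"PERF_COUNT": "YES"})  # enable counters
   request = compiled.create_infer_request()
   request.infer({0: np.random.rand(*compiled.input(0).shape).astype(np.float32)})

   for info in request.profiling_info:
       # layer name, execution status, primitive type, and measured times
       print(info.node_name, info.status, info.exec_type, info.real_time)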

Use ITT to Get Performance Insights

-In general, OpenVINO and its individual plugins are heavily instrumented with Intel® Instrumentation and Tracing Technology (ITT). -Therefore, you can also compile OpenVINO from the source code with ITT enabled and use tools like -`Intel® VTune™ Profiler `__ to get detailed inference performance breakdown and additional -insights in the application-level performance on the timeline view. diff --git a/docs/articles_en/about-openvino/performance-benchmarks/model-accuracy-int8-fp32.rst b/docs/articles_en/about-openvino/performance-benchmarks/model-accuracy-int8-fp32.rst index 8b93e6a1aebe7b..3162bae7254704 100644 --- a/docs/articles_en/about-openvino/performance-benchmarks/model-accuracy-int8-fp32.rst +++ b/docs/articles_en/about-openvino/performance-benchmarks/model-accuracy-int8-fp32.rst @@ -4,9 +4,10 @@ Model Accuracy The following two tables present the absolute accuracy drop calculated as the accuracy difference -between OV-accuracy and the original frame work accuracy for FP32, and the same for INT8, BF16 and -FP16 representations of a model on three platform architectures. The third table presents the GenAI model accuracies as absolute accuracy values. Please also refer to notes below -the table for more information. +between OV-accuracy and the original framework accuracy for FP32, and the same for INT8, BF16, +and FP16 representations of a model on three platform architectures. The third table presents +the GenAI model accuracies as absolute accuracy values. Refer to notes below the table for more +information. * A - Intel® Core™ i9-9000K (AVX2), INT8 and FP32 * B - Intel® Xeon® 6338, (VNNI), INT8 and FP32 diff --git a/docs/sphinx_setup/_static/benchmarks_files/llm_models.csv b/docs/sphinx_setup/_static/benchmarks_files/llm_models.csv new file mode 100644 index 00000000000000..dee8e72a9578fd --- /dev/null +++ b/docs/sphinx_setup/_static/benchmarks_files/llm_models.csv @@ -0,0 +1,22 @@ +Model name,"Throughput: (tokens/sec. 2nd token)",1st token latency (msec),Max RSS memory used. (MB),Input tokens,Output tokens +OPT-2.7b,"20.2",2757,7084,937,128 +Phi-3-mini-4k-instruct,"19.9",2776,7028,1062,128 +Orca-mini-3b,"19.2",2966,7032,1024,128 +Phi-2,"17.8",2162,7032,1024,128 +Stable-Zephyr-3b-dpo,"17.0",1791,7007,946,128 +ChatGLM3-6b,"16.5",3569,6741,1024,128 +Dolly-v2-3b,"15.8",6891,6731,1024,128 +Stablelm-3b-4e1t,"15.7",2051,7018,1024,128 +Red-Pajama-Incite-Chat-3b-V1,"14.8",6582,7028,1020,128 +Falcon-7b-instruct,"14.5",4552,7033,1049,128 +Codegen25-7b,"13.3",3982,6732,1024,128 +GPT-j-6b,"13.2",7213,6882,1024,128 +Stablelm-7b,"12.8",6339,7013,1020,128 +Llama-3-8b,"12.8",4356,6953,1024,128 +Llama-2-7b-chat,"12.3",4205,6906,1024,128 +Llama-7b,"11.7",4315,6927,1024,128 +Mistral-7b-v0.1,"10.5",4462,7242,1007,128 +Zephyr-7b-beta,"10.5",4500,7039,1024,128 +Qwen1.5-7b-chat,"9.9",4318,7034,1024,128 +Baichuan2-7b-chat,"9.8",4668,6724,1024,128 +Qwen-7b-chat,"9.0",5141,6996,1024,128 \ No newline at end of file diff --git a/docs/sphinx_setup/_static/download/llm_models.csv b/docs/sphinx_setup/_static/download/llm_models.csv deleted file mode 100644 index 2ff93f503a6d3b..00000000000000 --- a/docs/sphinx_setup/_static/download/llm_models.csv +++ /dev/null @@ -1,22 +0,0 @@ -Model name,"Throughput: (tokens/sec. 2nd token)",1st token latency (msec),Max RSS memory used. 
(MB),Input tokens,Output tokens,Model Precision,Beam,Batch size,Framework -OPT-2.7b,20.2,2757,7084,937,128,INT4,1,1,PT -Phi-3-mini-4k-instruct,19.9,2776,7028,1062,128,INT4,1,1,PT -Orca-mini-3b,19.2,2966,7032,1024,128,INT4,1,1,PT -Phi-2,17.8,2162,7032,1024,128,INT4,1,1,PT -Stable-Zephyr-3b-dpo,17.0,1791,7007,946,128,INT4,1,1,PT -ChatGLM3-6b,16.5,3569,6741,1024,128,INT4,1,1,PT -Dolly-v2-3b,15.8,6891,6731,1024,128,INT4,1,1,PT -Stablelm-3b-4e1t,15.7,2051,7018,1024,128,INT4,1,1,PT -Red-Pajama-Incite-Chat-3b-V1,14.8,6582,7028,1020,128,INT4,1,1,PT -Falcon-7b-instruct,14.5,4552,7033,1049,128,INT4,1,1,PT -Codegen25-7b,13.3,3982,6732,1024,128,INT4,1,1,PT -GPT-j-6b,13.2,7213,6882,1024,128,INT4,1,1,PT -Stablelm-7b,12.8,6339,7013,1020,128,INT4,1,1,PT -Llama-3-8b,12.8,4356,6953,1024,128,INT4,1,1,PT -Llama-2-7b-chat,12.3,4205,6906,1024,128,INT4,1,1,PT -Llama-7b,11.7,4315,6927,1024,128,INT4,1,1,PT -Mistral-7b-v0.1,10.5,4462,7242,1007,128,INT4,1,1,PT -Zephyr-7b-beta,10.5,4500,7039,1024,128,INT4,1,1,PT -Qwen1.5-7b-chat,9.9,4318,7034,1024,128,INT4,1,1,PT -Baichuan2-7b-chat,9.8,4668,6724,1024,128,INT4,1,1,PT -Qwen-7b-chat,9.0,5141,6996,1024,128,INT4,1,1,PT \ No newline at end of file diff --git a/docs/sphinx_setup/_static/download/llm_models_ovms.csv b/docs/sphinx_setup/_static/download/llm_models_ovms.csv deleted file mode 100644 index d481fd3b6a56e8..00000000000000 --- a/docs/sphinx_setup/_static/download/llm_models_ovms.csv +++ /dev/null @@ -1,100 +0,0 @@ -Product,Model,Framework,Precision,Node,Request Rate,Throughput [tok/s],TPOT Mean Latency -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,0.2,92.75,75.75 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,0.3,137.89,98.6 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,0.4,182.68,144.36 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,0.5,227.02,238.54 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,0.6,259.06,679.07 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,0.7,267.24,785.75 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,0.8,267.77,815.11 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,0.9,270.01,827.09 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,1.0,268.92,840.1 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,2.0,269.6,847.81 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,inf,270.55,839.37 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,0.2,92.63,63.23 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,0.4,183.51,105.0 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,0.6,272.59,95.34 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,0.8,359.28,126.61 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,1.0,442.69,169.24 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,1.2,521.61,195.94 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,1.4,589.34,267.43 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,1.6,650.25,291.68 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,1.8,655.39,308.64 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,2.0,680.45,302.09 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,inf,702.42,307.82 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,0.2,92.89,54.69 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 
8580,0.4,184.37,77.0 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,0.6,273.06,101.81 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,0.8,360.22,135.38 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,1.0,442.46,170.65 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,1.2,519.5,208.44 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,1.4,590.11,252.86 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,1.6,651.09,286.93 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,1.8,670.74,298.02 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,2.0,684.4,299.41 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,inf,701.91,305.9 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,0.2,79.24,73.06 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,0.3,118.42,90.31 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,0.4,157.04,113.23 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,0.5,193.85,203.97 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,0.6,232.36,253.17 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,0.7,260.56,581.45 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,0.8,271.97,761.05 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,0.9,273.36,787.74 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,1.0,272.54,811.37 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,2.0,278.07,809.3 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,inf,275.71,810.89 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,0.2,78.3,60.37 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,0.4,156.42,69.27 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,0.6,232.27,77.79 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,0.8,307.37,90.07 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,1.0,380.61,104.71 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,1.2,452.18,127.36 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,1.4,519.44,156.18 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,1.6,587.62,169.44 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,1.8,649.94,198.44 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,2.0,707.46,234.44 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,inf,799.46,265.5 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,0.2,78.61,54.12 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,0.4,156.19,70.38 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,0.6,232.36,81.83 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,0.8,307.01,101.66 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,1.0,376.36,139.62 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,1.2,447.75,158.53 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,1.4,519.74,160.26 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,1.6,582.37,190.22 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 
8580,1.8,635.46,231.31 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,2.0,698.38,247.77 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,inf,843.51,252.12 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,0.2,87.18,74.96 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,0.3,130.74,92.67 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,0.4,172.94,117.03 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,0.5,214.71,172.69 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,0.6,255.45,282.74 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,0.7,280.38,629.68 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,0.8,280.55,765.16 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,0.9,289.65,765.65 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,1.0,290.67,783.47 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,2.0,284.14,815.09 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,inf,290.39,793.52 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,0.2,88.9,60.04 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,0.4,176.5,70.24 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,0.6,262.04,77.01 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,0.8,346.01,95.29 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,1.0,427.37,114.16 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,1.2,507.86,138.56 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,1.4,582.58,150.72 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,1.6,655.61,166.64 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,1.8,717.9,216.76 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,2.0,774.3,233.49 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,inf,873.93,245.31 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,0.2,88.92,56.33 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,0.4,175.99,72.72 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,0.6,261.96,84.24 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,0.8,346.78,101.67 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,1.0,427.85,128.33 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,1.2,506.17,150.01 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,1.4,581.72,167.61 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,1.6,651.97,190.91 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,1.8,713.2,222.56 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,2.0,771.17,232.08 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,inf,839.74,253.74 From dd16602824c66c53935a2d084ab4d7ace36a6414 Mon Sep 17 00:00:00 2001 From: Tomasz Krupa Date: Mon, 21 Oct 2024 07:44:28 +0000 Subject: [PATCH 06/24] [GPU] Weightless caching (#25731) Co-authored-by: Pavel Durandin --- .../openvino/runtime/properties/__init__.py | 1 + .../pyopenvino/core/properties/properties.cpp | 1 + .../tests/test_runtime/test_properties.py | 5 + .../rt_info/weightless_caching_attributes.hpp | 38 ++++ .../op/util/weightless_caching_attributes.cpp | 9 + .../pass/serialization/deterministicity.cpp | 1 + src/frontends/ir/src/ir_deserializer.cpp | 8 + .../include/openvino/runtime/properties.hpp | 6 + .../include/intel_gpu/primitives/data.hpp | 84 +++++-- 
src/plugins/intel_gpu/src/graph/program.cpp | 11 + .../intel_gpu/src/plugin/compiled_model.cpp | 26 +-- src/plugins/intel_gpu/src/plugin/plugin.cpp | 7 +- .../intel_gpu/src/plugin/program_builder.cpp | 10 + .../src/runtime/execution_config.cpp | 1 + .../tests/functional/behavior/model_cache.cpp | 210 ++++++++++++++++++ 15 files changed, 388 insertions(+), 30 deletions(-) create mode 100644 src/core/dev_api/openvino/core/rt_info/weightless_caching_attributes.hpp create mode 100644 src/core/src/op/util/weightless_caching_attributes.cpp create mode 100644 src/plugins/intel_gpu/tests/functional/behavior/model_cache.cpp diff --git a/src/bindings/python/src/openvino/runtime/properties/__init__.py b/src/bindings/python/src/openvino/runtime/properties/__init__.py index caaa93f37223b0..3269ea42e32ac2 100644 --- a/src/bindings/python/src/openvino/runtime/properties/__init__.py +++ b/src/bindings/python/src/openvino/runtime/properties/__init__.py @@ -29,6 +29,7 @@ from openvino._pyopenvino.properties import execution_devices from openvino._pyopenvino.properties import loaded_from_cache from openvino._pyopenvino.properties import cache_encryption_callbacks +from openvino._pyopenvino.properties import weights_path # Submodules from openvino.runtime.properties import hint diff --git a/src/bindings/python/src/pyopenvino/core/properties/properties.cpp b/src/bindings/python/src/pyopenvino/core/properties/properties.cpp index 470161d9779558..a6b30bd773001f 100644 --- a/src/bindings/python/src/pyopenvino/core/properties/properties.cpp +++ b/src/bindings/python/src/pyopenvino/core/properties/properties.cpp @@ -43,6 +43,7 @@ void regmodule_properties(py::module m) { OPENVINO_SUPPRESS_DEPRECATED_END wrap_property_RW(m_properties, ov::force_tbb_terminate, "force_tbb_terminate"); wrap_property_RW(m_properties, ov::enable_mmap, "enable_mmap"); + wrap_property_RW(m_properties, ov::weights_path, "weights_path"); wrap_property_RO(m_properties, ov::supported_properties, "supported_properties"); wrap_property_RO(m_properties, ov::available_devices, "available_devices"); diff --git a/src/bindings/python/tests/test_runtime/test_properties.py b/src/bindings/python/tests/test_runtime/test_properties.py index e8d3162c362f4f..32eb48f6765f41 100644 --- a/src/bindings/python/tests/test_runtime/test_properties.py +++ b/src/bindings/python/tests/test_runtime/test_properties.py @@ -266,6 +266,11 @@ def test_properties_ro(ov_property_ro, expected_value): ), (props.force_tbb_terminate, "FORCE_TBB_TERMINATE", ((True, True), (False, False))), (props.enable_mmap, "ENABLE_MMAP", ((True, True), (False, False))), + ( + props.weights_path, + "WEIGHTS_PATH", + (("./model.bin", "./model.bin"),), + ), (hints.inference_precision, "INFERENCE_PRECISION_HINT", ((Type.f32, Type.f32),)), ( hints.model_priority, diff --git a/src/core/dev_api/openvino/core/rt_info/weightless_caching_attributes.hpp b/src/core/dev_api/openvino/core/rt_info/weightless_caching_attributes.hpp new file mode 100644 index 00000000000000..fedcb030fb52cf --- /dev/null +++ b/src/core/dev_api/openvino/core/rt_info/weightless_caching_attributes.hpp @@ -0,0 +1,38 @@ +// Copyright (C) 2024 Intel Corporation +// SPDX-License-Identifier: Apache-2.0 +// + +#pragma once + +#include "openvino/core/core_visibility.hpp" +#include "openvino/core/runtime_attribute.hpp" + +namespace ov { + +/** + * @brief Holds weightless caching attributes of a single constant. 
+ * + * WeightlessCacheAttribute class represents runtime info attribute that holds + * the values of original size of the constant in bytes and the binary offset of the + * constant's data in the weights file used by the weightless caching mechanism. It's + * not copyable in case the data was changed (the original node was replaced by a new + * one produced during the tranformation pipeline) - in that case weightless caching + * can't be used for that constant. + */ +class OPENVINO_API WeightlessCacheAttribute : public RuntimeAttribute { +public: + OPENVINO_RTTI("WeightlessCacheAttribute"); + + WeightlessCacheAttribute() = delete; + + WeightlessCacheAttribute(size_t original_size, size_t bin_offset) + : original_size(original_size), + bin_offset(bin_offset) {} + + bool is_copyable() const override; + + size_t original_size; + size_t bin_offset; +}; + +} // namespace ov diff --git a/src/core/src/op/util/weightless_caching_attributes.cpp b/src/core/src/op/util/weightless_caching_attributes.cpp new file mode 100644 index 00000000000000..7c540f8a3bef02 --- /dev/null +++ b/src/core/src/op/util/weightless_caching_attributes.cpp @@ -0,0 +1,9 @@ +// Copyright (C) 2024 Intel Corporation +// SPDX-License-Identifier: Apache-2.0 +// + +#include "openvino/core/rt_info/weightless_caching_attributes.hpp" + +bool ov::WeightlessCacheAttribute::is_copyable() const { + return false; +} diff --git a/src/core/tests/pass/serialization/deterministicity.cpp b/src/core/tests/pass/serialization/deterministicity.cpp index 5bcfbf97b77890..8441da501eb9bf 100644 --- a/src/core/tests/pass/serialization/deterministicity.cpp +++ b/src/core/tests/pass/serialization/deterministicity.cpp @@ -193,6 +193,7 @@ TEST_P(SerializationDeterministicityInputOutputTest, FromOvModel) { auto& expected1 = modelRef; ov::pass::Serialize(m_out_xml_path_1, m_out_bin_path_1, irVersion).run_on_model(modelRef); auto expected2 = ov::test::readModel(m_out_xml_path_1, m_out_bin_path_1); + ov::pass::Serialize(m_out_xml_path_2, m_out_bin_path_2, irVersion).run_on_model(expected2); EXPECT_EQ(input0Name, expected1->input(0).get_node()->get_friendly_name()); diff --git a/src/frontends/ir/src/ir_deserializer.cpp b/src/frontends/ir/src/ir_deserializer.cpp index 68900b150514bc..f9ddcf1e8c14a6 100644 --- a/src/frontends/ir/src/ir_deserializer.cpp +++ b/src/frontends/ir/src/ir_deserializer.cpp @@ -9,6 +9,7 @@ #include "openvino/core/except.hpp" #include "openvino/core/meta_data.hpp" +#include "openvino/core/rt_info/weightless_caching_attributes.hpp" #include "openvino/core/type/element_type.hpp" #include "openvino/op/constant.hpp" #include "openvino/op/loop.hpp" @@ -944,6 +945,13 @@ std::shared_ptr ov::XmlDeserializer::create_node(const std::vector(pugixml::get_uint64_attr(dn, "size")), + static_cast(pugixml::get_uint64_attr(dn, "offset"))); + } } ovNode->set_friendly_name(params.name); diff --git a/src/inference/include/openvino/runtime/properties.hpp b/src/inference/include/openvino/runtime/properties.hpp index 621c0074fc9d1e..627314748bbe9c 100644 --- a/src/inference/include/openvino/runtime/properties.hpp +++ b/src/inference/include/openvino/runtime/properties.hpp @@ -1345,4 +1345,10 @@ static constexpr Property affinity{"AFFINITY"}; */ static constexpr Property, PropertyMutability::RO> execution_devices{"EXECUTION_DEVICES"}; +/** + * @brief Path to the file with model's weights. + * + * @note This property is used for weightless caching. Only used when ov::CacheMode Property is set to "OPTIMIZE_SIZE". 
+ */ +static constexpr Property weights_path{"WEIGHTS_PATH"}; } // namespace ov diff --git a/src/plugins/intel_gpu/include/intel_gpu/primitives/data.hpp b/src/plugins/intel_gpu/include/intel_gpu/primitives/data.hpp index 7bc020c2529a88..461f063ec26bc5 100644 --- a/src/plugins/intel_gpu/include/intel_gpu/primitives/data.hpp +++ b/src/plugins/intel_gpu/include/intel_gpu/primitives/data.hpp @@ -3,9 +3,13 @@ // #pragma once -#include "primitive.hpp" -#include "intel_gpu/runtime/memory.hpp" +#include + #include "intel_gpu/runtime/engine.hpp" +#include "intel_gpu/runtime/memory.hpp" +#include "openvino/runtime/shared_buffer.hpp" +#include "openvino/util/mmap_object.hpp" +#include "primitive.hpp" namespace cldnn { @@ -29,6 +33,9 @@ struct data : public primitive_base { /// @note If memory is attached by memory::attach(), the attached buffer should be valid till network build. memory::ptr mem; + size_t original_size = SIZE_MAX; + size_t bin_offset = SIZE_MAX; + size_t hash() const override { size_t seed = primitive::hash(); seed = hash_combine(seed, id); @@ -46,20 +53,30 @@ struct data : public primitive_base { size_t data_size = mem->size(); ob << make_data(&data_size, sizeof(size_t)); - if (_allocation_type == allocation_type::usm_host || _allocation_type == allocation_type::usm_shared) { - ob << make_data(mem->buffer_ptr(), data_size); + bool is_cache_without_weights = bin_offset != SIZE_MAX && data_size == original_size; + + if (is_cache_without_weights) { + ob << true; + ob << bin_offset; } else { - std::vector _buf; - _buf.resize(data_size); - stream* strm = reinterpret_cast(ob.get_stream()); - mem->copy_to(*strm, _buf.data()); - ob << make_data(_buf.data(), data_size); + ob << false; + if (_allocation_type == allocation_type::usm_host || _allocation_type == allocation_type::usm_shared) { + ob << make_data(mem->buffer_ptr(), data_size); + } else { + std::vector _buf; + _buf.resize(data_size); + stream* strm = reinterpret_cast(ob.get_stream()); + mem->copy_to(*strm, _buf.data()); + ob << make_data(_buf.data(), data_size); + } } } void load(BinaryInputBuffer& ib) override { primitive_base::load(ib); + } + void load_weights(BinaryInputBuffer& ib, std::shared_ptr mapped_weights) { layout output_layout = layout(); ib >> output_layout; @@ -71,14 +88,39 @@ struct data : public primitive_base { mem = ib.get_engine().allocate_memory(output_layout, _allocation_type, false); + bool is_cache_without_weights; + ib >> is_cache_without_weights; + if (is_cache_without_weights && mapped_weights == nullptr) { + OPENVINO_THROW("mmap object is null"); + } + + std::shared_ptr>> shared_buf; + if (is_cache_without_weights) { + ib >> bin_offset; + original_size = data_size; + + shared_buf = std::make_shared>>( + mapped_weights->data() + bin_offset, + data_size, + mapped_weights); + } + if (_allocation_type == allocation_type::usm_host || _allocation_type == allocation_type::usm_shared) { - ib >> make_data(mem->buffer_ptr(), data_size); + if (is_cache_without_weights) { + std::memcpy(reinterpret_cast(mem->buffer_ptr()), shared_buf->get_ptr(), data_size); + } else { + ib >> make_data(mem->buffer_ptr(), data_size); + } } else { const size_t DATA_BLOCK_SIZE = 2 * 1024 * 1024; auto& strm = ib.get_engine().get_service_stream(); if (data_size < DATA_BLOCK_SIZE || output_layout.format.is_image_2d()) { std::vector _buf(data_size); - ib >> make_data(_buf.data(), data_size); + if (is_cache_without_weights) { + std::memcpy(reinterpret_cast(_buf.data()), shared_buf->get_ptr(), data_size); + } else { + ib >> 
make_data(_buf.data(), data_size); + } mem->copy_from(strm, _buf.data()); } else { std::vector _buf1(DATA_BLOCK_SIZE); @@ -86,21 +128,33 @@ struct data : public primitive_base { bool buf_flag = true; event::ptr ev1, ev2; ev1 = ev2 = nullptr; - size_t dst_offset = 0; while (dst_offset < data_size) { const bool is_blocking = false; const size_t src_offset = 0; - size_t copy_size = (data_size > (dst_offset + DATA_BLOCK_SIZE)) ? DATA_BLOCK_SIZE : (data_size - dst_offset); + size_t copy_size = + (data_size > (dst_offset + DATA_BLOCK_SIZE)) ? DATA_BLOCK_SIZE : (data_size - dst_offset); if (buf_flag) { - ib >> make_data(_buf1.data(), copy_size); + if (is_cache_without_weights) { + std::memcpy(reinterpret_cast(_buf1.data()), + shared_buf->get_ptr() + dst_offset, + copy_size); + } else { + ib >> make_data(_buf1.data(), copy_size); + } if (ev2 != nullptr) { ev2->wait(); ev2 = nullptr; } ev1 = mem->copy_from(strm, _buf1.data(), src_offset, dst_offset, copy_size, is_blocking); } else { - ib >> make_data(_buf2.data(), copy_size); + if (is_cache_without_weights) { + std::memcpy(reinterpret_cast(_buf2.data()), + shared_buf->get_ptr() + dst_offset, + copy_size); + } else { + ib >> make_data(_buf2.data(), copy_size); + } if (ev1 != nullptr) { ev1->wait(); ev1 = nullptr; diff --git a/src/plugins/intel_gpu/src/graph/program.cpp b/src/plugins/intel_gpu/src/graph/program.cpp index d4461b8aad9107..1e2e84043dc82b 100644 --- a/src/plugins/intel_gpu/src/graph/program.cpp +++ b/src/plugins/intel_gpu/src/graph/program.cpp @@ -1720,6 +1720,7 @@ void program::cancel_compilation_context() { void program::save(cldnn::BinaryOutputBuffer& ob) const { std::map> mutable_datas_ptrs; ob << nodes_map.size(); + for (auto& node : nodes_map) { ob.setKernelImplParams(node.second->get_kernel_impl_params().get()); @@ -1732,6 +1733,7 @@ void program::save(cldnn::BinaryOutputBuffer& ob) const { node.second->as().typed_desc()->mem = data_node.get_attached_memory_ptr(); } } + ob << true; ob << node.second->desc; @@ -1835,6 +1837,12 @@ void program::save(cldnn::BinaryOutputBuffer& ob) const { void program::load(cldnn::BinaryInputBuffer& ib) { init_program(); + std::shared_ptr mapped_memory = nullptr; + std::string weights_path = _config.get_property(ov::weights_path); + if (!weights_path.empty()) { + mapped_memory = ov::load_mmap_object(weights_path); + } + size_t num_nodes; ib >> num_nodes; bool is_valid_data_node; @@ -1845,6 +1853,9 @@ void program::load(cldnn::BinaryInputBuffer& ib) { std::shared_ptr prim; ib >> prim; + if (auto data_prim = dynamic_cast(prim.get())) { + data_prim->load_weights(ib, mapped_memory); + } get_or_create(prim); } diff --git a/src/plugins/intel_gpu/src/plugin/compiled_model.cpp b/src/plugins/intel_gpu/src/plugin/compiled_model.cpp index b9729ca7bf0f20..15ff4447b4bafe 100644 --- a/src/plugins/intel_gpu/src/plugin/compiled_model.cpp +++ b/src/plugins/intel_gpu/src/plugin/compiled_model.cpp @@ -42,18 +42,15 @@ CompiledModel::CompiledModel(std::shared_ptr model, const std::shared_ptr& plugin, RemoteContextImpl::Ptr context, const ExecutionConfig& config) - : ov::ICompiledModel(model, - plugin, - context, - create_task_executor(plugin, config), - nullptr) - , m_context(context) - , m_config(config) - , m_wait_executor(std::make_shared(ov::threading::IStreamsExecutor::Config{"Intel GPU plugin wait executor"})) - , m_model_name(model->get_friendly_name()) - , m_inputs(ov::ICompiledModel::inputs()) - , m_outputs(ov::ICompiledModel::outputs()) - , m_loaded_from_cache(false) { + : ov::ICompiledModel(model, plugin, 
context, create_task_executor(plugin, config), nullptr), + m_context(context), + m_config(config), + m_wait_executor(std::make_shared( + ov::threading::IStreamsExecutor::Config{"Intel GPU plugin wait executor"})), + m_model_name(model->get_friendly_name()), + m_inputs(ov::ICompiledModel::inputs()), + m_outputs(ov::ICompiledModel::outputs()), + m_loaded_from_cache(false) { auto graph_base = std::make_shared(model, m_context, m_config, 0); for (uint16_t n = 0; n < m_config.get_property(ov::num_streams); n++) { auto graph = n == 0 ? graph_base : std::make_shared(graph_base, n); @@ -170,7 +167,10 @@ std::shared_ptr CompiledModel::create_infer_request() co // [ ov::Node::Input/ ov::Node::Output ] // [ ov::intel_gpu::Graph ] void CompiledModel::export_model(std::ostream& model) const { - if (m_config.get_property(ov::cache_mode) == ov::CacheMode::OPTIMIZE_SIZE) + // If ov::CacheMode::OPTIMIZE_SIZE is set, do the export iff it's possible to do weightless caching + // which requires the weights_path. + if (m_config.get_property(ov::cache_mode) == ov::CacheMode::OPTIMIZE_SIZE && + m_config.get_property(ov::weights_path).empty()) return; OV_ITT_SCOPED_TASK(itt::domains::intel_gpu_plugin, "CompiledModel::export_model"); diff --git a/src/plugins/intel_gpu/src/plugin/plugin.cpp b/src/plugins/intel_gpu/src/plugin/plugin.cpp index 4ea7851b3f8c58..2d29601ef0b69d 100644 --- a/src/plugins/intel_gpu/src/plugin/plugin.cpp +++ b/src/plugins/intel_gpu/src/plugin/plugin.cpp @@ -308,10 +308,13 @@ std::shared_ptr Plugin::import_model(std::istream& model, config.set_user_property(_orig_config); config.apply_user_properties(context_impl->get_engine().get_device_info()); - if (config.get_property(ov::cache_mode) == ov::CacheMode::OPTIMIZE_SIZE) + cldnn::BinaryInputBuffer ib(model, context_impl->get_engine()); + + if (config.get_property(ov::cache_mode) == ov::CacheMode::OPTIMIZE_SIZE && + config.get_property(ov::weights_path).empty()) { return nullptr; + } - cldnn::BinaryInputBuffer ib(model, context_impl->get_engine()); return std::make_shared(ib, shared_from_this(), context_impl, config, loaded_from_cache); } diff --git a/src/plugins/intel_gpu/src/plugin/program_builder.cpp b/src/plugins/intel_gpu/src/plugin/program_builder.cpp index aae9b163b4f6bf..510d715e7ac805 100644 --- a/src/plugins/intel_gpu/src/plugin/program_builder.cpp +++ b/src/plugins/intel_gpu/src/plugin/program_builder.cpp @@ -2,6 +2,7 @@ // SPDX-License-Identifier: Apache-2.0 // +#include "openvino/core/rt_info/weightless_caching_attributes.hpp" #include "openvino/op/constant.hpp" #include "openvino/op/split.hpp" #include "openvino/op/variadic_split.hpp" @@ -304,6 +305,15 @@ void ProgramBuilder::add_primitive(const ov::Node& op, std::shared_ptrorigin_op_name = op.get_friendly_name(); prim->origin_op_type_name = op.get_type_name(); + if (auto data_prim = dynamic_cast(prim.get())) { + auto rt_info = op.get_rt_info(); + auto weightless_cache_attr = rt_info.find(ov::WeightlessCacheAttribute::get_type_info_static()); + if (weightless_cache_attr != rt_info.end()) { + data_prim->bin_offset = weightless_cache_attr->second.as().bin_offset; + data_prim->original_size = weightless_cache_attr->second.as().original_size; + } + } + bool should_profile = prim->type != cldnn::mutable_data::type_id() && prim->type != cldnn::data::type_id(); diff --git a/src/plugins/intel_gpu/src/runtime/execution_config.cpp b/src/plugins/intel_gpu/src/runtime/execution_config.cpp index a498dad24aa2f5..9c24fae1d6729a 100644 --- a/src/plugins/intel_gpu/src/runtime/execution_config.cpp 
+++ b/src/plugins/intel_gpu/src/runtime/execution_config.cpp @@ -60,6 +60,7 @@ void ExecutionConfig::set_default() { std::make_tuple(ov::cache_encryption_callbacks, EncryptionCallbacks{}), std::make_tuple(ov::hint::dynamic_quantization_group_size, 0), std::make_tuple(ov::intel_gpu::hint::enable_kernels_reuse, false), + std::make_tuple(ov::weights_path, ""), // Legacy API properties std::make_tuple(ov::intel_gpu::nv12_two_inputs, false), diff --git a/src/plugins/intel_gpu/tests/functional/behavior/model_cache.cpp b/src/plugins/intel_gpu/tests/functional/behavior/model_cache.cpp new file mode 100644 index 00000000000000..573d275da84e51 --- /dev/null +++ b/src/plugins/intel_gpu/tests/functional/behavior/model_cache.cpp @@ -0,0 +1,210 @@ +// Copyright (C) 2024 Intel Corporation +// SPDX-License-Identifier: Apache-2.0 +// + +#include + +#include "base/ov_behavior_test_utils.hpp" +#include "common_test_utils/common_utils.hpp" +#include "common_test_utils/file_utils.hpp" +#include "common_test_utils/ov_tensor_utils.hpp" +#include "common_test_utils/subgraph_builders/2_input_subtract.hpp" +#include "common_test_utils/subgraph_builders/concat_with_params.hpp" +#include "common_test_utils/subgraph_builders/conv_bias.hpp" +#include "common_test_utils/subgraph_builders/conv_pool_relu.hpp" +#include "common_test_utils/subgraph_builders/conv_pool_relu_no_reshapes.hpp" +#include "common_test_utils/subgraph_builders/conv_pool_relu_non_zero.hpp" +#include "common_test_utils/subgraph_builders/convert_transpose.hpp" +#include "common_test_utils/subgraph_builders/detection_output.hpp" +#include "common_test_utils/subgraph_builders/kso_func.hpp" +#include "common_test_utils/subgraph_builders/matmul_bias.hpp" +#include "common_test_utils/subgraph_builders/multi_single_conv.hpp" +#include "common_test_utils/subgraph_builders/multiple_input_outpput_double_concat.hpp" +#include "common_test_utils/subgraph_builders/nested_branch_conv_concat.hpp" +#include "common_test_utils/subgraph_builders/nested_split_conv_concat.hpp" +#include "common_test_utils/subgraph_builders/read_concat_split_assign.hpp" +#include "common_test_utils/subgraph_builders/single_concat_with_constant.hpp" +#include "common_test_utils/subgraph_builders/single_conv.hpp" +#include "common_test_utils/subgraph_builders/single_split.hpp" +#include "common_test_utils/subgraph_builders/split_concat.hpp" +#include "common_test_utils/subgraph_builders/split_conv_concat.hpp" +#include "common_test_utils/subgraph_builders/split_multi_conv_concat.hpp" +#include "common_test_utils/subgraph_builders/ti_with_lstm_cell.hpp" +#include "common_test_utils/test_common.hpp" +#include "openvino/pass/serialize.hpp" + +namespace { +class CheckWeightlessCacheAccuracy : public ::testing::Test { +protected: + std::shared_ptr model; + std::string xml_path; + std::string bin_path; + std::string cache_path; + + void SetUp() override; + void TearDown() override; + void run(); +}; + +void CheckWeightlessCacheAccuracy::SetUp() { + std::string filePrefix = ov::test::utils::generateTestFilePrefix(); + xml_path = filePrefix + ".xml"; + bin_path = filePrefix + ".bin"; + cache_path = filePrefix + ".blob"; +} + +void CheckWeightlessCacheAccuracy::TearDown() { + std::remove(xml_path.c_str()); + std::remove(bin_path.c_str()); + std::remove(cache_path.c_str()); +} + +void CheckWeightlessCacheAccuracy::run() { + ov::AnyMap config = { ov::cache_mode(ov::CacheMode::OPTIMIZE_SIZE), ov::weights_path(bin_path) }; + auto core = ov::test::utils::PluginCache::get().core(); + 
ov::pass::Serialize(xml_path, bin_path).run_on_model(model); + + ov::CompiledModel compiled_model; + OV_ASSERT_NO_THROW(compiled_model = core->compile_model(xml_path, ov::test::utils::DEVICE_GPU, config)); + + auto ofstr = std::ofstream(cache_path, std::ofstream::binary); + OV_ASSERT_NO_THROW(compiled_model.export_model(ofstr)); + ofstr.close(); + + auto ifstr = std::ifstream(cache_path, std::ifstream::binary); + ov::CompiledModel imported_model; + OV_ASSERT_NO_THROW(imported_model = core->import_model(ifstr, ov::test::utils::DEVICE_GPU, config)); + ifstr.close(); + + auto orig_req = compiled_model.create_infer_request(); + auto new_req = imported_model.create_infer_request(); + + for (size_t param_idx = 0; param_idx < model->get_parameters().size(); ++param_idx) { + auto input = model->get_parameters().at(param_idx); + auto tensor = ov::test::utils::create_and_fill_tensor(input->get_element_type(), input->get_shape()); + orig_req.set_tensor(input, tensor); + new_req.set_tensor(input, tensor); + } + + OV_ASSERT_NO_THROW(orig_req.infer()); + OV_ASSERT_NO_THROW(new_req.infer()); + + auto result_vector = model->get_results(); + for (auto& res : result_vector) { + auto orig_out = orig_req.get_tensor(res); + auto new_out = new_req.get_tensor(res); + ov::test::utils::compare(orig_out, new_out); + } +} + +TEST_F(CheckWeightlessCacheAccuracy, 2InputSubtract) { + model = ov::test::utils::make_2_input_subtract(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, ConcatWithParams) { + model = ov::test::utils::make_concat_with_params(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, ConvBias) { + model = ov::test::utils::make_conv_bias(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, ConvPoolRelu) { + model = ov::test::utils::make_conv_pool_relu(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, ConvPoolReluNoReshapes) { + model = ov::test::utils::make_conv_pool_relu_no_reshapes(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, ConvPoolReluNonZero) { + model = ov::test::utils::make_conv_pool_relu_non_zero(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, ConvertTranspose) { + model = ov::test::utils::make_convert_transpose(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, DetectionOutput) { + model = ov::test::utils::make_detection_output(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, KsoFunction) { + model = ov::test::utils::make_kso_function(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, MatmulBias) { + model = ov::test::utils::make_matmul_bias(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, MultiSingleConv) { + model = ov::test::utils::make_multi_single_conv(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, MultipleInputOutputDoubleConcat) { + model = ov::test::utils::make_multiple_input_output_double_concat(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, NestedBranchConvConcat) { + model = ov::test::utils::make_nested_branch_conv_concat(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, NestedSplitConvConcat) { + model = ov::test::utils::make_nested_split_conv_concat(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, ReadConcatSplitAssign) { + model = ov::test::utils::make_read_concat_split_assign(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, SingleConcatWithConstant) { + model = ov::test::utils::make_single_concat_with_constant(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, SingleConv) { + model = ov::test::utils::make_single_conv(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, SingleSplit) { + 
model = ov::test::utils::make_single_split(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, SplitConcat) { + model = ov::test::utils::make_split_concat(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, SplitConvConcat) { + model = ov::test::utils::make_split_conv_concat(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, SplitMultiConvConcat) { + model = ov::test::utils::make_split_multi_conv_concat(); + run(); +} + +TEST_F(CheckWeightlessCacheAccuracy, TiWithLstmCell) { + model = ov::test::utils::make_ti_with_lstm_cell(); + run(); +} + +} // namespace From f3911616becfd47e22376b59f0cb0e103231d82e Mon Sep 17 00:00:00 2001 From: Karol Blaszczak Date: Mon, 21 Oct 2024 09:49:52 +0200 Subject: [PATCH 07/24] [DOCS] torch.compile examples (#27107) --- .../openvino-workflow/torch-compile.rst | 180 ++++++++++++++++++ 1 file changed, 180 insertions(+) diff --git a/docs/articles_en/openvino-workflow/torch-compile.rst b/docs/articles_en/openvino-workflow/torch-compile.rst index 6d874ff4d14be3..5bdb51a596d5d8 100644 --- a/docs/articles_en/openvino-workflow/torch-compile.rst +++ b/docs/articles_en/openvino-workflow/torch-compile.rst @@ -20,6 +20,186 @@ By default, Torch code runs in eager-mode, but with the use of ``torch.compile`` How to Use #################### + +.. tab-set:: + + .. tab-item:: Image Generation + + .. tab-set:: + + .. tab-item:: Stable-Diffusion-2 + + .. code-block:: py + :force: + + import torch + from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler + + model_id = "stabilityai/stable-diffusion-2-1" + + # Use the DPMSolverMultistepScheduler (DPM-Solver++) scheduler here instead + pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16) + pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) + + + pipe.text_encoder = torch.compile(pipe.text_encoder, backend="openvino") #Optional + + pipe.unet = torch.compile(pipe.unet, backend=“openvino”) + + pipe.vae.decode = torch.compile(pipe.vae.decode, backend=“openvino”) #Optional + + prompt = "a photo of an astronaut riding a horse on mars" + image = pipe(prompt).images[0] + + image.save("astronaut_rides_horse.png") + + + .. tab-item:: Stable-Diffusion-3 + + .. code-block:: py + + import torch + from diffusers import StableDiffusion3Pipeline + + pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float32) + + + pipe.transformer = torch.compile(pipe.transformer, backend="openvino") + + image = pipe( + "A cat holding a sign that says hello world", + negative_prompt="", + num_inference_steps=28, + guidance_scale=7.0, + ).images[0] + + image.save('out.png') + + .. tab-item:: Stable-Diffusion-XL + + .. 
code-block:: py + + import torch + from diffusers import UNet2DConditionModel, DiffusionPipeline, LCMScheduler + + unet = UNet2DConditionModel.from_pretrained("latent-consistency/lcm-sdxl", torch_dtype=torch.float16, variant="fp16") + pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16, variant="fp16") + pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + + + pipe.text_encoder = torch.compile(pipe.text_encoder, backend="openvino") #Optional + + pipe.unet = torch.compile(pipe.unet, backend="openvino") + + pipe.vae.decode = torch.compile(pipe.vae.decode, backend="openvino") #Optional + + prompt = "a close-up picture of an old man standing in the rain" + image = pipe(prompt, num_inference_steps=5, guidance_scale=8.0).images[0] + image.save("result.png") + + .. tab-item:: Text Generation + + .. tab-set:: + + .. tab-item:: Llama-3.2-1B + + .. code-block:: py + + import torch + from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline + + model_name_or_path = "meta-llama/Llama-3.2-1B-Instruct" + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.float32) + model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + trust_remote_code=True, + device_map='cpu', + torch_dtype=torch.float32 + ) + + prompt = "Tell me about AI" + + + model.forward = torch.compile(model.forward, backend="openvino", options={'aot_autograd': True}) + + pipe = pipeline( + "text-generation", + model=model, + tokenizer=tokenizer, + max_new_tokens=64 + ) + result = pipe(prompt) + print(result[0]['generated_text']) + + + .. tab-item:: Llama-2-7B-GPTQ + + .. code-block:: py + + import torch + from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline + + model_name_or_path = "TheBloke/Llama-2-7B-GPTQ" + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.float32) + model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + trust_remote_code=True, + device_map='cpu', + torch_dtype=torch.float32 + ) + + prompt = "Tell me about AI" + + + model.forward = torch.compile(model.forward, backend="openvino", options={'aot_autograd': True}) + + pipe = pipeline( + "text-generation", + model=model, + tokenizer=tokenizer, + max_new_tokens=64 + ) + result = pipe(prompt) + print(result[0]['generated_text']) + + + .. tab-item:: Chatglm-4-GPTQ + + .. 
code-block:: py + + import torch + from transformers import AutoModelForCausalLM, AutoTokenizer + + query = "tell me about AI“ + + tokenizer = AutoTokenizer.from_pretrained("mcavus/glm-4v-9b-gptq-4bit-dynamo", trust_remote_code=True) + inputs = tokenizer.apply_chat_template([{"role": "user", "content": query}], + add_generation_prompt=True, + tokenize=True, + return_tensors="pt", + return_dict=True + ) + model = AutoModelForCausalLM.from_pretrained( + "mcavus/glm-4v-9b-gptq-4bit-dynamo", + torch_dtype=torch.float32, + low_cpu_mem_usage=True, + trust_remote_code=True + ) + + + model.transformer.encoder.forward = torch.compile(model.transformer.encoder.forward, backend="openvino", options={"aot_autograd":True}) + + gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1} + with torch.no_grad(): + outputs = model.generate(**inputs, **gen_kwargs) + outputs = outputs[:, inputs['input_ids'].shape[1]:] + print(tokenizer.decode(outputs[0], skip_special_tokens=True)) + + + + + + + + + + + + + + + + To use ``torch.compile``, you need to define the ``openvino`` backend in your PyTorch application. This way Torch FX subgraphs will be directly converted to OpenVINO representation without any additional PyTorch-based tracing/scripting. From 308b420dde9216bab4ee2d70d1d2afc7a95b77c6 Mon Sep 17 00:00:00 2001 From: yuanxion <96522341+yuanxion@users.noreply.github.com> Date: Mon, 21 Oct 2024 15:56:17 +0800 Subject: [PATCH 08/24] [GPU] Fix different element types of MatMul dequantization scales issue (#27077) ### Details: - MatMul dequantization Convert both dequantization scale variables (mulConst1 & mulConst2) to f32 instead of just one (mulConst2), to avoid different data type complaint issue (f16 & f32). ### Tickets: - 151988 --------- Signed-off-by: yuan.xiong --- .../src/mat_mul.cpp | 2 +- .../mat_mul_with_constant_transformation.cpp | 16 ++++++++++++++++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/src/common/low_precision_transformations/src/mat_mul.cpp b/src/common/low_precision_transformations/src/mat_mul.cpp index 15afe2408cc459..705f3d400a098c 100644 --- a/src/common/low_precision_transformations/src/mat_mul.cpp +++ b/src/common/low_precision_transformations/src/mat_mul.cpp @@ -160,7 +160,7 @@ bool MatMulTransformation::transform(TransformationContext &context, ov::pass::p } const auto newMulConst = NetworkHelper::toScalarIfPossible(fold( - mulConst1, + foldConvert(mulConst1, element::f32), foldConvert(mulConst2, element::f32))); const auto newMultiply = std::make_shared>( diff --git a/src/common/low_precision_transformations/tests/mat_mul_with_constant_transformation.cpp b/src/common/low_precision_transformations/tests/mat_mul_with_constant_transformation.cpp index 454802c965f945..8425db398085ae 100644 --- a/src/common/low_precision_transformations/tests/mat_mul_with_constant_transformation.cpp +++ b/src/common/low_precision_transformations/tests/mat_mul_with_constant_transformation.cpp @@ -157,6 +157,22 @@ std::vector testValues = { {}, {}}}, + // test: multiply with f16 constant + {LayerTransformation::createParamsU8I8(), + {ov::element::u8, + {ov::element::f32, {}, ov::builder::subgraph::DequantizationOperations::Multiply{0.02f}.setConstantPrecision(ov::element::f16)}, + {std::vector(1024 * 1024, 1.f), ov::element::i8, ov::Shape{1024, 1024}}, + {}, + {ov::element::f32, {}, {0.1f}}, + }, + {ov::element::u8, + {}, + {std::vector(1024 * 1024, 1.f), ov::element::i8, ov::Shape{1024, 1024}}, + ov::element::u8, + {{}, {}, {0.02f * 0.1f}}, + {}, + {}}}, + // supported 3D: U8 & 
I8 with Dq on weights {LayerTransformation::createParamsU8I8(), { From 2cb8222dd7bf443096f80e09bcba8766c223f680 Mon Sep 17 00:00:00 2001 From: Egor Duplenskii Date: Mon, 21 Oct 2024 10:38:52 +0200 Subject: [PATCH 09/24] [CPU] Use actual input shape to init desc for MemoryInputSDPA (#27143) An output shape was previously used to create an input descriptor for some reason --- src/plugins/intel_cpu/src/nodes/memory.cpp | 52 +++------------------- src/plugins/intel_cpu/src/nodes/memory.hpp | 2 - 2 files changed, 7 insertions(+), 47 deletions(-) diff --git a/src/plugins/intel_cpu/src/nodes/memory.cpp b/src/plugins/intel_cpu/src/nodes/memory.cpp index 88693ebfa49fdf..756fbc5b578f61 100644 --- a/src/plugins/intel_cpu/src/nodes/memory.cpp +++ b/src/plugins/intel_cpu/src/nodes/memory.cpp @@ -427,29 +427,20 @@ void MemoryInputBase::initSupportedPrimitiveDescriptors() { if (!supportedPrimitiveDescriptors.empty()) return; - auto&& shape = getOutputShapeAtPort(0); auto precision = getOriginalOutputPrecisionAtPort(0); auto&& descCreators = ov::intel_cpu::BlockedDescCreator::getCommonCreators(); - NodeConfig config; if (!getParentEdges().empty()) { - PortConfig inPortConfig; - - inPortConfig.inPlace(-1); - inPortConfig.constant(false); - inPortConfig.setMemDesc(descCreators.at(LayoutType::ncsp)->createSharedDesc(precision, shape)); - - config.inConfs.push_back(std::move(inPortConfig)); + const auto& inputShape = getInputShapeAtPort(0); + config.inConfs.emplace_back(descCreators.at(LayoutType::ncsp)->createSharedDesc(precision, inputShape)); } - PortConfig outPortConfig; - - outPortConfig.inPlace(0); - outPortConfig.constant(false); - outPortConfig.setMemDesc(descCreators.at(LayoutType::ncsp)->createSharedDesc(precision, shape)); - - config.outConfs.push_back(std::move(outPortConfig)); + const auto& outputShape = getOutputShapeAtPort(0); + config.outConfs.emplace_back( + descCreators.at(LayoutType::ncsp)->createSharedDesc(precision, outputShape), + BlockedMemoryDesc::FULL_MASK, + 0); supportedPrimitiveDescriptors.emplace_back(config, impl_desc_type::unknown); } @@ -759,35 +750,6 @@ void MemoryInputSDPA::createPrimitive() { OPENVINO_ASSERT(m_child_port_idx != -1, getName(), " should be connected to SDPA node."); } -void MemoryInputSDPA::initSupportedPrimitiveDescriptors() { - if (!supportedPrimitiveDescriptors.empty()) - return; - - auto&& shape = getOutputShapeAtPort(0); - auto precision = getOriginalOutputPrecisionAtPort(0); - auto&& descCreators = ov::intel_cpu::BlockedDescCreator::getCommonCreators(); - NodeConfig config; - if (!getParentEdges().empty()) { - PortConfig inPortConfig; - inPortConfig.inPlace(-1); - inPortConfig.constant(false); - inPortConfig.setMemDesc(descCreators.at(LayoutType::ncsp)->createSharedDesc(precision, shape)); - config.inConfs.push_back(std::move(inPortConfig)); - } - - PortConfig outPortConfig; - outPortConfig.inPlace(0); - outPortConfig.constant(false); - // layout for fake memory obj, the child sdpa also does not use it - outPortConfig.setMemDesc(descCreators.at(LayoutType::ncsp)->createSharedDesc(precision, shape)); - config.outConfs.push_back(std::move(outPortConfig)); - supportedPrimitiveDescriptors.emplace_back(config, impl_desc_type::unknown); -} - -void MemoryInputSDPA::initOptimalPrimitiveDescriptor() { - Node::initOptimalPrimitiveDescriptor(); -} - void MemoryInputSDPA::assignStateHook() { auto currentState = getAssignedState(); auto sdpaNode = m_sdpaNode.lock(); diff --git a/src/plugins/intel_cpu/src/nodes/memory.hpp b/src/plugins/intel_cpu/src/nodes/memory.hpp 
index c5a83cfa5cad1a..c158d738a36148 100644 --- a/src/plugins/intel_cpu/src/nodes/memory.hpp +++ b/src/plugins/intel_cpu/src/nodes/memory.hpp @@ -204,8 +204,6 @@ class MemoryInputSDPA : public MemoryInputBase { static bool isSupportedOperation(const std::shared_ptr& op, std::string& errorMessage) noexcept; void createPrimitive() override; - void initSupportedPrimitiveDescriptors() override; - void initOptimalPrimitiveDescriptor() override; void resolveInPlaceEdges(Edge::LOOK look) override; MemStatePtr makeState() const override; From b785e6eec95ad4a3da8a89bff56994522fa66ee4 Mon Sep 17 00:00:00 2001 From: Mingyu Kim Date: Mon, 21 Oct 2024 18:00:46 +0900 Subject: [PATCH 10/24] [GPU] Enable dynamic quantization gs32 as default for non-systolic (#27119) ### Details: - It is applied only to int4 compressed model, non-systolic path - Though it is a global configuration, systolic hardware will ignore it ### Tickets: - 151708 --- .../include/intel_gpu/runtime/debug_configuration.hpp | 1 + .../fully_connected_kernel_bf_tiled.cpp | 2 +- .../intel_gpu/src/plugin/transformations_pipeline.cpp | 2 +- .../intel_gpu/src/runtime/debug_configuration.cpp | 2 +- src/plugins/intel_gpu/src/runtime/execution_config.cpp | 5 ++--- .../tests/unit/fusions/fully_connected_fusion_test.cpp | 2 ++ .../tests/unit/test_cases/fully_connected_gpu_test.cpp | 10 +++++++--- 7 files changed, 15 insertions(+), 9 deletions(-) diff --git a/src/plugins/intel_gpu/include/intel_gpu/runtime/debug_configuration.hpp b/src/plugins/intel_gpu/include/intel_gpu/runtime/debug_configuration.hpp index fbc8ae84c36a29..c65aa3e5894cb8 100644 --- a/src/plugins/intel_gpu/include/intel_gpu/runtime/debug_configuration.hpp +++ b/src/plugins/intel_gpu/include/intel_gpu/runtime/debug_configuration.hpp @@ -175,6 +175,7 @@ class debug_configuration { } dump_prof_data_iter_params; static std::ostream* verbose_stream; + static const int DYNAMIC_QUANTIZE_GROUP_SIZE_NOT_SET = -2; }; } // namespace cldnn diff --git a/src/plugins/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_kernel_bf_tiled.cpp b/src/plugins/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_kernel_bf_tiled.cpp index c4115d74f54a92..b26b11ce97df6a 100644 --- a/src/plugins/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_kernel_bf_tiled.cpp +++ b/src/plugins/intel_gpu/src/kernel_selector/kernels/fully_connected/fully_connected_kernel_bf_tiled.cpp @@ -55,7 +55,7 @@ static size_t get_dynamic_quantize_group_size(const fully_connected_params& para auto dynamic_quantization_group_size = params.dynamic_quantization_group_size; GPU_DEBUG_GET_INSTANCE(debug_config); - GPU_DEBUG_IF(debug_config->dynamic_quantize_group_size) { + GPU_DEBUG_IF(debug_config->dynamic_quantize_group_size != debug_config->DYNAMIC_QUANTIZE_GROUP_SIZE_NOT_SET) { dynamic_quantization_group_size = debug_config->dynamic_quantize_group_size; // Specify which Fully-connected layer would be dynamic-quantized diff --git a/src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp b/src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp index f173e378fca3f9..b75519ac40e678 100644 --- a/src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp +++ b/src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp @@ -872,7 +872,7 @@ void TransformationsPipeline::apply(std::shared_ptr func) { manager.register_pass(); auto dynamic_quantization_group_size = config.get_property(ov::hint::dynamic_quantization_group_size); - if (device_info.supports_immad) { // XXX: 
1048576 is considered per-token + if (device_info.supports_immad) { pass_config->set_callback([=](const_node_ptr& root) -> bool { if (root->get_input_node_shared_ptr(0)->get_element_type() == ov::element::Type_t::f32) { GPU_DEBUG_TRACE << root->get_friendly_name() << " Dynamic quantization is turned off because input type is not supported" << std::endl; diff --git a/src/plugins/intel_gpu/src/runtime/debug_configuration.cpp b/src/plugins/intel_gpu/src/runtime/debug_configuration.cpp index dcbabff548cc5d..5f943564d6f50e 100644 --- a/src/plugins/intel_gpu/src/runtime/debug_configuration.cpp +++ b/src/plugins/intel_gpu/src/runtime/debug_configuration.cpp @@ -253,7 +253,7 @@ debug_configuration::debug_configuration() , disable_runtime_skip_reorder(0) , disable_primitive_fusing(0) , disable_fake_alignment(0) - , dynamic_quantize_group_size(0) + , dynamic_quantize_group_size(DYNAMIC_QUANTIZE_GROUP_SIZE_NOT_SET) , disable_horizontal_fc_fusion(0) { #ifdef GPU_DEBUG_CONFIG get_gpu_debug_env_var("Help", help); diff --git a/src/plugins/intel_gpu/src/runtime/execution_config.cpp b/src/plugins/intel_gpu/src/runtime/execution_config.cpp index 9c24fae1d6729a..7661444cc4fd7b 100644 --- a/src/plugins/intel_gpu/src/runtime/execution_config.cpp +++ b/src/plugins/intel_gpu/src/runtime/execution_config.cpp @@ -46,7 +46,6 @@ void ExecutionConfig::set_default() { std::make_tuple(ov::hint::execution_mode, ov::hint::ExecutionMode::PERFORMANCE), std::make_tuple(ov::hint::num_requests, 0), std::make_tuple(ov::hint::enable_cpu_pinning, false), - std::make_tuple(ov::hint::dynamic_quantization_group_size, 0), std::make_tuple(ov::intel_gpu::hint::host_task_priority, ov::hint::Priority::MEDIUM), std::make_tuple(ov::intel_gpu::hint::queue_throttle, ov::intel_gpu::hint::ThrottleLevel::MEDIUM), @@ -58,7 +57,7 @@ void ExecutionConfig::set_default() { std::make_tuple(ov::internal::query_model_ratio, 1.0f), std::make_tuple(ov::cache_mode, ov::CacheMode::OPTIMIZE_SPEED), std::make_tuple(ov::cache_encryption_callbacks, EncryptionCallbacks{}), - std::make_tuple(ov::hint::dynamic_quantization_group_size, 0), + std::make_tuple(ov::hint::dynamic_quantization_group_size, 32), std::make_tuple(ov::intel_gpu::hint::enable_kernels_reuse, false), std::make_tuple(ov::weights_path, ""), @@ -204,7 +203,7 @@ void ExecutionConfig::apply_debug_options(const cldnn::device_info& info) { set_property(ov::intel_gpu::use_only_static_kernels_for_dynamic_shape(true)); } - GPU_DEBUG_IF(debug_config->dynamic_quantize_group_size) { + GPU_DEBUG_IF(debug_config->dynamic_quantize_group_size != debug_config->DYNAMIC_QUANTIZE_GROUP_SIZE_NOT_SET) { if (debug_config->dynamic_quantize_group_size == -1) set_property(ov::hint::dynamic_quantization_group_size(UINT64_MAX)); else diff --git a/src/plugins/intel_gpu/tests/unit/fusions/fully_connected_fusion_test.cpp b/src/plugins/intel_gpu/tests/unit/fusions/fully_connected_fusion_test.cpp index 3743298a3c981a..5e9b5134fb3802 100644 --- a/src/plugins/intel_gpu/tests/unit/fusions/fully_connected_fusion_test.cpp +++ b/src/plugins/intel_gpu/tests/unit/fusions/fully_connected_fusion_test.cpp @@ -666,6 +666,7 @@ TEST_P(fc_compressed_int8_bias_dynamic_onednn, basic) { bool is_dynamic = true; cfg_not_fused.set_property(ov::intel_gpu::allow_new_shape_infer(is_dynamic)); + cfg_not_fused.set_property(ov::hint::dynamic_quantization_group_size(0)); tolerance = 1.0f; execute(p, false, is_dynamic); } @@ -705,6 +706,7 @@ TEST_P(fc_compressed_int8_bias_prod_unfused_dynamic_onednn, basic) { bool is_dynamic = true; 
cfg_not_fused.set_property(ov::intel_gpu::allow_new_shape_infer(is_dynamic)); + cfg_not_fused.set_property(ov::hint::dynamic_quantization_group_size(0)); tolerance = 1.0f; execute(p, false, is_dynamic); } diff --git a/src/plugins/intel_gpu/tests/unit/test_cases/fully_connected_gpu_test.cpp b/src/plugins/intel_gpu/tests/unit/test_cases/fully_connected_gpu_test.cpp index 0ef7b6a5ca088b..dde1b6215148b3 100644 --- a/src/plugins/intel_gpu/tests/unit/test_cases/fully_connected_gpu_test.cpp +++ b/src/plugins/intel_gpu/tests/unit/test_cases/fully_connected_gpu_test.cpp @@ -1590,6 +1590,7 @@ class fully_connected_gpu_tests: public ::testing::Test { config.set_property(ov::intel_gpu::allow_new_shape_infer(true)); ov::intel_gpu::ImplementationDesc fc_impl_desc = { format::bfyx, "fully_connected_gpu_bfyx_ref", impl_types::ocl }; config.set_property(ov::intel_gpu::force_implementations(ov::intel_gpu::ImplForcingMap{ {"fc_prim", fc_impl_desc} })); + config.set_property(ov::hint::dynamic_quantization_group_size(0)); network network(engine, topology, config); network.set_input_data("input", input_mem); @@ -1615,6 +1616,7 @@ class fully_connected_gpu_tests: public ::testing::Test { auto config = get_test_default_config(engine); config.set_property(ov::intel_gpu::allow_new_shape_infer(true)); config.set_property(ov::intel_gpu::optimize_data(true)); + config.set_property(ov::hint::dynamic_quantization_group_size(0)); network::ptr network = get_network(engine, topology, config, get_test_stream_ptr(), is_caching_test); @@ -1698,9 +1700,7 @@ class fully_connected_gpu_tests: public ::testing::Test { config.set_property(ov::intel_gpu::allow_new_shape_infer(true)); ov::intel_gpu::ImplementationDesc fc_impl_desc = { format::bfyx, "fully_connected_gpu_bfyx_ref", impl_types::ocl }; config.set_property(ov::intel_gpu::force_implementations(ov::intel_gpu::ImplForcingMap{ {"fc_prim", fc_impl_desc} })); - if (is_dyn_quan) { - config.set_property(ov::hint::dynamic_quantization_group_size(0)); - } + config.set_property(ov::hint::dynamic_quantization_group_size(0)); network network(engine, topology, config); network.set_input_data("input", input_mem); @@ -1728,6 +1728,8 @@ class fully_connected_gpu_tests: public ::testing::Test { config.set_property(ov::intel_gpu::optimize_data(true)); if (is_dyn_quan) { config.set_property(ov::hint::dynamic_quantization_group_size(32)); + } else { + config.set_property(ov::hint::dynamic_quantization_group_size(0)); } network::ptr network = get_network(engine, topology, config, get_test_stream_ptr(), is_caching_test); @@ -1868,6 +1870,7 @@ class fully_connected_gpu_tests: public ::testing::Test { config.set_property(ov::intel_gpu::allow_new_shape_infer(true)); ov::intel_gpu::ImplementationDesc fc_impl = { in_layout.format, "", impl_types::ocl }; config.set_property(ov::intel_gpu::force_implementations(ov::intel_gpu::ImplForcingMap{ { "fc_prim1", fc_impl }, { "fc_prim2", fc_impl } })); + config.set_property(ov::hint::dynamic_quantization_group_size(0)); network network(engine, topology, config); network.set_input_data("input", input_mem); @@ -1896,6 +1899,7 @@ class fully_connected_gpu_tests: public ::testing::Test { auto config = get_test_default_config(engine); config.set_property(ov::intel_gpu::allow_new_shape_infer(true)); config.set_property(ov::intel_gpu::optimize_data(true)); + config.set_property(ov::hint::dynamic_quantization_group_size(0)); network::ptr network = get_network(engine, topology, config, get_test_stream_ptr(), is_caching_test); From 
2e25c873f66477c0676354511dfb4c58b13b05c4 Mon Sep 17 00:00:00 2001 From: Wilson Seok Date: Mon, 21 Oct 2024 02:03:06 -0700 Subject: [PATCH 11/24] [GPU] Fix not to check _dynamic_dims_mask when get_from_padded_pool() (#27120) ### Details: - Fix not to check _dynamic_dims_mask when get_from_padded_pool() ### Tickets: - 154329 - 155099 - 154137 --- src/plugins/intel_gpu/include/intel_gpu/runtime/layout.hpp | 2 +- src/plugins/intel_gpu/src/runtime/memory_pool.cpp | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/src/plugins/intel_gpu/include/intel_gpu/runtime/layout.hpp b/src/plugins/intel_gpu/include/intel_gpu/runtime/layout.hpp index 82cf01ab9522b1..62e4c08a90f004 100644 --- a/src/plugins/intel_gpu/include/intel_gpu/runtime/layout.hpp +++ b/src/plugins/intel_gpu/include/intel_gpu/runtime/layout.hpp @@ -183,7 +183,7 @@ struct padding { } friend bool operator<(const padding& lhs, const padding& rhs) { - OPENVINO_ASSERT(!lhs.is_dynamic() && !rhs.is_dynamic(), "[GPU] padding compare is called for dynamic shape"); + // Compare only actual padding size not _dynamic_dims_mask if (lhs._lower_size < rhs._lower_size) return true; else if (lhs._lower_size > rhs._lower_size) return false; if (lhs._upper_size < rhs._upper_size) return true; diff --git a/src/plugins/intel_gpu/src/runtime/memory_pool.cpp b/src/plugins/intel_gpu/src/runtime/memory_pool.cpp index 9dee7c4487002e..1d34cfcde18a63 100644 --- a/src/plugins/intel_gpu/src/runtime/memory_pool.cpp +++ b/src/plugins/intel_gpu/src/runtime/memory_pool.cpp @@ -306,7 +306,7 @@ memory::ptr memory_pool::get_memory(const layout& layout, } if (do_reuse) { // reusable within the same network - if (!layout.format.is_image() && layout.data_padding == padding{{0, 0, 0, 0}, 0}) { + if (!layout.format.is_image() && !layout.data_padding) { // non-padded buffers return get_from_non_padded_pool(layout, prim_id, unique_id, network_id, restrictions, type, reset, is_dynamic); } else if (!layout.format.is_image()) { From 34398738424a9908f10891b80459ce582c71a1e6 Mon Sep 17 00:00:00 2001 From: Vladimir Paramuzov Date: Mon, 21 Oct 2024 13:04:07 +0400 Subject: [PATCH 12/24] [GPU] Disable onednn pool in some cases due to the bug (#27115) ### Tickets: - *CVS-155035* --- .../src/graph/impls/onednn/pooling_onednn.hpp | 2 +- .../src/graph/impls/registry/pooling_impls.cpp | 13 ++++++++++++- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/src/plugins/intel_gpu/src/graph/impls/onednn/pooling_onednn.hpp b/src/plugins/intel_gpu/src/graph/impls/onednn/pooling_onednn.hpp index 26cecbb659e475..343fe66771de25 100644 --- a/src/plugins/intel_gpu/src/graph/impls/onednn/pooling_onednn.hpp +++ b/src/plugins/intel_gpu/src/graph/impls/onednn/pooling_onednn.hpp @@ -14,7 +14,7 @@ namespace onednn { struct PoolingImplementationManager : public ImplementationManager { OV_GPU_PRIMITIVE_IMPL("onednn::pool") - PoolingImplementationManager(shape_types shape_type) : ImplementationManager(impl_types::onednn, shape_type) {} + PoolingImplementationManager(shape_types shape_type, ValidateFunc vf = nullptr) : ImplementationManager(impl_types::onednn, shape_type, vf) {} std::unique_ptr create_impl(const program_node& node, const kernel_impl_params& params) const override; bool validate_impl(const program_node& node) const override { diff --git a/src/plugins/intel_gpu/src/graph/impls/registry/pooling_impls.cpp b/src/plugins/intel_gpu/src/graph/impls/registry/pooling_impls.cpp index 191edc050cd694..9958404b14bfee 100644 --- 
a/src/plugins/intel_gpu/src/graph/impls/registry/pooling_impls.cpp +++ b/src/plugins/intel_gpu/src/graph/impls/registry/pooling_impls.cpp @@ -17,7 +17,18 @@ using namespace cldnn; const std::vector>& Registry::get_implementations() { static const std::vector> impls = { - OV_GPU_CREATE_INSTANCE_ONEDNN(onednn::PoolingImplementationManager, shape_types::static_shape) + OV_GPU_CREATE_INSTANCE_ONEDNN(onednn::PoolingImplementationManager, shape_types::static_shape, [](const program_node& node) { + const auto& in_layout = node.get_input_layout(0); + const auto& out_layout = node.get_output_layout(0); + // Disable this case due to sporadic hang for the following case: + // onednn_verbose,primitive,exec,gpu:0,pooling,jit:ir,forward_inference,src_u8::blocked:acdb::f0 dst_u8::blocked:abcd::f0 + // ws_undef::undef:::,attr-scratchpad:user attr-post-ops:eltwise_linear:1.52456,alg:pooling_avg_include_padding, + // mb1ic96_ih56oh28kh2sh2dh0ph0_iw56ow28kw2sw2dw0pw0,0.0400391 + // issue: 12579 + if (in_layout.format == format::byxf && out_layout.format == format::bfyx && ov::element::Type(in_layout.data_type).is_integral_number()) + return false; + return true; + }) OV_GPU_GET_INSTANCE_OCL(pooling, shape_types::static_shape) }; From 924b311dc3a5228ae17ced7bfe9013344b007f1b Mon Sep 17 00:00:00 2001 From: captainneil Date: Mon, 21 Oct 2024 17:31:22 +0800 Subject: [PATCH 13/24] [Conan Build]Fix Debug Build (#27150) ### Details: - *Fix Conan Debug Build* A failed compilation look like this ``` cmake -G "Visual Studio 17 2022" -DCMAKE_TOOLCHAIN_FILE="generators/conan_toolchain.cmake" -DCMAKE_INSTALL_PREFIX="F:/.conan2/p/b/openvdff378fa94719/p" -DENABLE_INTEL_CPU="ON" -DENABLE_INTEL_GPU="ON" -DENABLE_ONEDNN_FOR_GPU="ON" -DENABLE_INTEL_GNA="OFF" -DENABLE_AUTO="ON" -DENABLE_MULTI="ON" -DENABLE_AUTO_BATCH="ON" -DENABLE_HETERO="ON" -DENABLE_OV_IR_FRONTEND="ON" -DENABLE_OV_PADDLE_FRONTEND="ON" -DENABLE_OV_TF_FRONTEND="ON" -DENABLE_OV_TF_LITE_FRONTEND="ON" -DENABLE_OV_ONNX_FRONTEND="ON" -DENABLE_OV_PYTORCH_FRONTEND="ON" -DENABLE_SYSTEM_TBB="ON" -DENABLE_TBBBIND_2_5="OFF" -DENABLE_SYSTEM_PUGIXML="ON" -DENABLE_SYSTEM_PROTOBUF="ON" -DENABLE_SYSTEM_SNAPPY="ON" -DENABLE_SYSTEM_FLATBUFFERS="ON" -DENABLE_SYSTEM_OPENCL="ON" -DENABLE_GAPI_PREPROCESSING="ON" -DBUILD_SHARED_LIBS="ON" -DCPACK_GENERATOR="CONAN" -DENABLE_PROFILING_ITT="OFF" -DENABLE_PYTHON="OFF" -DENABLE_PROXY="OFF" -DENABLE_WHEEL="OFF" -DENABLE_CPPLINT="OFF" -DENABLE_NCC_STYLE="OFF" -DENABLE_SAMPLES="OFF" -DENABLE_TEMPLATE="OFF" -DCMAKE_POLICY_DEFAULT_CMP0091="NEW" "F:/.conan2/p/openvac7fc2c3b20db/s/src" --fresh -- Using Conan toolchain: F:/.conan2/p/b/openvdff378fa94719/b/build/generators/conan_toolchain.cmake -- Conan toolchain: Including user_toolchain: F:/.conan2/profiles/disable_vcpkg.cmake -- Conan toolchain: Including user_toolchain: F:/.conan2/profiles/limit_sdkver.cmake -- Conan user toolchain: CMAKE_VS_WINDOWS_TARGET_PLATFORM_VERSION_MAXIMUM=10.0.22621.0 -- Conan toolchain: CMAKE_GENERATOR_TOOLSET=v142 -- Conan toolchain: Setting CMAKE_MSVC_RUNTIME_LIBRARY=$<$:MultiThreadedDebugDLL> -- Conan toolchain: C++ Standard 17 with extensions OFF -- Conan toolchain: Setting BUILD_SHARED_LIBS = ON -- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.22631. 
-- The C compiler identification is MSVC 19.29.30154.0 -- The CXX compiler identification is MSVC 19.29.30154.0 -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Professional/VC/Tools/MSVC/14.29.30133/bin/HostX64/x64/cl.exe - skipped -- Detecting C compile features -- Detecting C compile features - done -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Check for working CXX compiler: C:/Program Files/Microsoft Visual Studio/2022/Professional/VC/Tools/MSVC/14.29.30133/bin/HostX64/x64/cl.exe - skipped -- Detecting CXX compile features -- Detecting CXX compile features - done -- OpenVINO version is 2023.2.0 (Build 000) -- Performing Test CMAKE_HAVE_LIBC_PTHREAD -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed -- Looking for pthread_create in pthreads -- Looking for pthread_create in pthreads - not found -- Looking for pthread_create in pthread -- Looking for pthread_create in pthread - not found -- Found Threads: TRUE -- Performing Test SUGGEST_OVERRIDE_SUPPORTED -- Performing Test SUGGEST_OVERRIDE_SUPPORTED - Failed -- Performing Test UNUSED_BUT_SET_VARIABLE_SUPPORTED -- Performing Test UNUSED_BUT_SET_VARIABLE_SUPPORTED - Failed -- OpenVINO Runtime enabled features: -- -- CI_BUILD_NUMBER: 2023.2.0-000-- -- CPACK_GENERATOR = CONAN -- ENABLE_LTO = OFF -- OS_FOLDER = OFF -- USE_BUILD_TYPE_SUBFOLDER = OFF -- CMAKE_COMPILE_WARNING_AS_ERROR = OFF -- ENABLE_QSPECTRE = OFF -- ENABLE_INTEGRITYCHECK = OFF -- ENABLE_SANITIZER = OFF -- ENABLE_UB_SANITIZER = OFF -- ENABLE_THREAD_SANITIZER = OFF -- ENABLE_COVERAGE = OFF -- ENABLE_SSE42 = ON -- ENABLE_AVX2 = ON -- ENABLE_AVX512F = ON -- BUILD_SHARED_LIBS = ON -- ENABLE_LIBRARY_VERSIONING = OFF -- ENABLE_FASTER_BUILD = OFF -- ENABLE_CPPLINT = OFF -- ENABLE_CPPLINT_REPORT = OFF -- ENABLE_CLANG_FORMAT = OFF -- ENABLE_NCC_STYLE = OFF -- ENABLE_UNSAFE_LOCATIONS = OFF -- ENABLE_FUZZING = OFF -- ENABLE_PROXY = OFF -- ENABLE_INTEL_CPU = ON -- ENABLE_ARM_COMPUTE_CMAKE = OFF -- ENABLE_TESTS = OFF -- ENABLE_INTEL_GPU = ON -- ENABLE_ONEDNN_FOR_GPU = ON -- ENABLE_DEBUG_CAPS = OFF -- ENABLE_GPU_DEBUG_CAPS = OFF -- ENABLE_CPU_DEBUG_CAPS = OFF -- ENABLE_PROFILING_ITT = OFF -- ENABLE_PROFILING_FILTER = ALL -- ENABLE_PROFILING_FIRST_INFERENCE = ON -- SELECTIVE_BUILD = OFF -- ENABLE_DOCS = OFF -- ENABLE_PKGCONFIG_GEN = OFF -- THREADING = TBB -- ENABLE_TBBBIND_2_5 = OFF -- ENABLE_TBB_RELEASE_ONLY = OFF -- ENABLE_INTEL_GNA = OFF -- ENABLE_INTEL_GNA_DEBUG = OFF -- ENABLE_V7_SERIALIZE = OFF -- ENABLE_IR_V7_READER = OFF -- ENABLE_GAPI_PREPROCESSING = ON -- ENABLE_MULTI = ON -- ENABLE_AUTO = ON -- ENABLE_AUTO_BATCH = ON -- ENABLE_HETERO = ON -- ENABLE_TEMPLATE = OFF -- ENABLE_PLUGINS_XML = OFF -- GAPI_TEST_PERF = OFF -- ENABLE_FUNCTIONAL_TESTS = OFF -- ENABLE_SAMPLES = OFF -- ENABLE_OV_ONNX_FRONTEND = ON -- ENABLE_OV_PADDLE_FRONTEND = ON -- ENABLE_OV_IR_FRONTEND = ON -- ENABLE_OV_PYTORCH_FRONTEND = ON -- ENABLE_OV_IR_FRONTEND = ON -- ENABLE_OV_TF_FRONTEND = ON -- ENABLE_OV_TF_LITE_FRONTEND = ON -- ENABLE_SNAPPY_COMPRESSION = ON -- ENABLE_STRICT_DEPENDENCIES = OFF -- ENABLE_SYSTEM_TBB = ON -- ENABLE_SYSTEM_PUGIXML = ON -- ENABLE_SYSTEM_FLATBUFFERS = ON -- ENABLE_SYSTEM_OPENCL = ON -- ENABLE_SYSTEM_PROTOBUF = ON -- ENABLE_SYSTEM_SNAPPY = ON -- ENABLE_PYTHON_PACKAGING = OFF -- ENABLE_OPENVINO_DEBUG = OFF -- -- CMAKE_VERSION ......................... 3.29.8 -- OpenVINO_SOURCE_DIR ................... 
F:/.conan2/p/openvac7fc2c3b20db/s/src -- OpenVINO_BINARY_DIR ................... F:/.conan2/p/b/openvdff378fa94719/b/build -- CMAKE_GENERATOR ....................... Visual Studio 17 2022 -- CPACK_GENERATOR ....................... CONAN -- CMAKE_C_COMPILER_ID ................... MSVC -- CMAKE_CXX_COMPILER_ID ................. MSVC -- CMAKE_CXX_STANDARD .................... 17 -- CMAKE_CONFIGURATION_TYPES ............. Debug Release MinSizeRel RelWithDebInfo -- CMAKE_GENERATOR_PLATFORM .............. x64 -- CMAKE_GENERATOR_PLATFORM .............. x64 -- CMAKE_GENERATOR_PLATFORM .............. x64 -- CMAKE_GENERATOR_TOOLSET ............... v142 -- CMAKE_TOOLCHAIN_FILE .................. F:/.conan2/p/b/openvdff378fa94719/b/build/generators/conan_toolchain.cmake -- Conan: Target declared 'pugixml::pugixml' -- Conan: Component target declared 'protobuf::libprotobuf' -- Conan: Component target declared 'protobuf::libprotoc' -- Conan: Target declared 'protobuf::protobuf' -- Conan: Target declared 'ZLIB::ZLIB' -- Conan: Including build module from 'F:/.conan2/p/b/protoa6c757f4d3132/p/lib/cmake/protobuf/protobuf-generate.cmake' -- Conan: Including build module from 'F:/.conan2/p/b/protoa6c757f4d3132/p/lib/cmake/protobuf/protobuf-module.cmake' -- Conan: Including build module from 'F:/.conan2/p/b/protoa6c757f4d3132/p/lib/cmake/protobuf/protobuf-options.cmake' -- Conan: Including build module from 'F:/.conan2/p/b/protoa6c757f4d3132/p/lib/cmake/protobuf/protobuf-conan-protoc-target.cmake' -- Conan: Component target declared 'flatbuffers::libflatbuffers' -- Conan: Target declared 'flatbuffers::flatbuffers' -- Conan: Including build module from 'F:/.conan2/p/b/flatb71a17782f7317/p/lib/cmake/FlatcTargets.cmake' -- Conan: Including build module from 'F:/.conan2/p/b/flatb71a17782f7317/p/lib/cmake/BuildFlatBuffers.cmake' -- Conan: Component target declared 'Snappy::snappy' -- Cannot locate shared library: tbb_debug -- Cannot locate shared library: tbb_debug -- TBB (2021.10.0) is found at F:/.conan2/p/b/openvdff378fa94719/b/build/generators CMake Error at src/cmake/ov_parallel.cmake:75 (message): Failed to detect TBB library location Call Stack (most recent call first): src/cmake/install_tbb.cmake:18 (_ov_get_tbb_location) src/cmake/install_tbb.cmake:37 (_ov_detect_dynamic_tbbbind_2_5) src/CMakeLists.txt:11 (include) ``` --- src/cmake/ov_parallel.cmake | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/cmake/ov_parallel.cmake b/src/cmake/ov_parallel.cmake index 1c10f1c121d8bc..110e7fe185f63f 100644 --- a/src/cmake/ov_parallel.cmake +++ b/src/cmake/ov_parallel.cmake @@ -23,7 +23,7 @@ function(_ov_get_tbb_location tbb_target _tbb_lib_location_var) get_target_property(_imported_configs ${target} IMPORTED_CONFIGURATIONS) if(NOT _imported_configs) # if IMPORTED_CONFIGURATIONS property is not set, then set a common list - set(_imported_configs RELEASE NONE) + set(_imported_configs RELEASE DEBUG NONE) if(NOT OV_GENERATOR_MULTI_CONFIG) string(TOUPPER ${CMAKE_BUILD_TYPE} _build_type) list(APPEND _imported_configs ${_build_type}) From c9deb2128ddad68d7a7abea64643527b708f75ad Mon Sep 17 00:00:00 2001 From: Tomasz Jankowski Date: Mon, 21 Oct 2024 11:36:47 +0200 Subject: [PATCH 14/24] [IR FE] Ignore unrecognized xml rt_info entries (#27118) ### Details: - Ignores unrecognized `` entries instead of throwing ### Tickets: - CVS-155326 --- src/frontends/ir/src/ir_deserializer.cpp | 13 +++---------- src/frontends/ir/tests/rt_info_deserialization.cpp | 4 ++++ 2 files changed, 7 insertions(+), 10 deletions(-) 
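Editor's note (not part of the patch): the user-visible effect of this IR frontend change is that reading an IR whose rt_info section carries entries the runtime does not recognize no longer raises. A minimal sketch from the Python API is below; the file name is hypothetical and only illustrates the scenario exercised by the test change that follows.

```python
import openvino as ov

core = ov.Core()
# With this patch, <attribute> entries under <rt_info> that lack a "name" or
# "version" field (or are otherwise unrecognized) are skipped during
# deserialization instead of raising an exception.
model = core.read_model("model_with_custom_rt_info.xml")  # hypothetical path
print(model.get_rt_info())  # recognized attributes remain available
```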
diff --git a/src/frontends/ir/src/ir_deserializer.cpp b/src/frontends/ir/src/ir_deserializer.cpp index f9ddcf1e8c14a6..7c8b6e9d4b97ab 100644 --- a/src/frontends/ir/src/ir_deserializer.cpp +++ b/src/frontends/ir/src/ir_deserializer.cpp @@ -968,16 +968,9 @@ std::shared_ptr ov::XmlDeserializer::create_node(const std::vector - if (!getStrAttribute(item, "name", attribute_name)) { - std::stringstream ss; - item.print(ss); - OPENVINO_THROW("rt_info attribute has no \"name\" field: ", ss.str()); - } - if (!getStrAttribute(item, "version", attribute_version)) { - std::stringstream ss; - item.print(ss); - OPENVINO_THROW("rt_info attribute: ", attribute_name, " has no \"version\" field: ", ss.str()); - } + if (!getStrAttribute(item, "name", attribute_name) || !getStrAttribute(item, "version", attribute_version)) + continue; + const auto& type_info = ov::DiscreteTypeInfo(attribute_name.c_str(), attribute_version.c_str()); auto attr = attrs_factory.create_by_type_info(type_info); if (!attr.empty()) { diff --git a/src/frontends/ir/tests/rt_info_deserialization.cpp b/src/frontends/ir/tests/rt_info_deserialization.cpp index 4313b4d19be515..466db1291e674a 100644 --- a/src/frontends/ir/tests/rt_info_deserialization.cpp +++ b/src/frontends/ir/tests/rt_info_deserialization.cpp @@ -405,11 +405,15 @@ TEST_F(RTInfoDeserialization, node_v11) { + + + + 1 22 From a3c07d582de7cae5e901184b5492f42f84a905a6 Mon Sep 17 00:00:00 2001 From: Roman Lyamin Date: Mon, 21 Oct 2024 14:02:09 +0400 Subject: [PATCH 15/24] [GPU] Added empty LoRA adapters support for onednn case (#27111) ### Tickets: - *[152852](https://jira.devtools.intel.com/browse/CVS-152852)* --- .../src/graph/impls/onednn/gemm_onednn.cpp | 30 +++++++++++++++---- .../impls/onednn/primitive_onednn_base.h | 6 +++- 2 files changed, 30 insertions(+), 6 deletions(-) diff --git a/src/plugins/intel_gpu/src/graph/impls/onednn/gemm_onednn.cpp b/src/plugins/intel_gpu/src/graph/impls/onednn/gemm_onednn.cpp index 637a391b7f9e65..767128a5be2950 100644 --- a/src/plugins/intel_gpu/src/graph/impls/onednn/gemm_onednn.cpp +++ b/src/plugins/intel_gpu/src/graph/impls/onednn/gemm_onednn.cpp @@ -31,9 +31,13 @@ struct gemm_onednn : typed_primitive_onednn_impl { auto dnnl_engine = engine.get_onednn_engine(); { + dnnl::memory input1_mem; auto& weights = instance.input_memory(1); auto offset = onednn::get_offset(instance.get_input_layout(1), _pd.dnnl::primitive_desc_base::weights_desc(0)); - args.insert({DNNL_ARG_WEIGHTS, weights.get_onednn_memory(_pd.weights_desc(0), offset)}); + if (instance.get_input_layout(1).count() != 0) { + input1_mem = weights.get_onednn_memory(_pd.weights_desc(0), offset); + } + args.insert({DNNL_ARG_WEIGHTS, input1_mem}); } if (instance.inputs_memory_count() == 3) { @@ -86,11 +90,16 @@ struct gemm_onednn : typed_primitive_onednn_impl { const auto& in0_l = in_layouts[0]; const auto& in1_l = in_layouts[1]; - size_t in0_batched_size = in0_l.count() / (in0_l.spatial(0) * in0_l.spatial(1)); - size_t in1_batched_size = in1_l.count() / (in1_l.spatial(0) * in1_l.spatial(1)); - size_t out_batched_size = out_l.count() / (out_l.spatial(0) * out_l.spatial(1)); + bool batched_dims_can_be_removed = false; + + if (in0_l.count() != 0 && in1_l.count() != 0) { + size_t in0_batched_size = in0_l.count() / (in0_l.spatial(0) * in0_l.spatial(1)); + size_t in1_batched_size = in1_l.count() / (in1_l.spatial(0) * in1_l.spatial(1)); + size_t out_batched_size = out_l.count() / (out_l.spatial(0) * out_l.spatial(1)); + + batched_dims_can_be_removed = in0_batched_size == 1 && 
in1_batched_size == 1 && out_batched_size == 1; + } - auto batched_dims_can_be_removed = in0_batched_size == 1 && in1_batched_size == 1 && out_batched_size == 1; if (gemm_with_bias) { const auto& bias_l = in_layouts[2]; size_t bias_batched_size = bias_l.count() / (bias_l.spatial(0) * bias_l.spatial(1)); @@ -434,6 +443,17 @@ struct gemm_onednn : typed_primitive_onednn_impl { return cldnn::make_unique(engine, config, attr, *prim_desc); } + + event::ptr execute_impl(const std::vector& events, typed_primitive_inst& instance) override { + if (instance.get_input_layout(0).count() == 0 || + instance.get_input_layout(1).count() == 0) { + stream& stream = instance.get_network().get_stream(); + stream.enqueue_barrier(); + return instance.output_memory_ptr()->fill(stream, false); + } + + return parent::execute_impl(events, instance); + } }; std::unique_ptr GemmImplementationManager::create_impl(const program_node& node, const kernel_impl_params& params) const { diff --git a/src/plugins/intel_gpu/src/graph/impls/onednn/primitive_onednn_base.h b/src/plugins/intel_gpu/src/graph/impls/onednn/primitive_onednn_base.h index 96834b6a03c35e..6a8f2cb57d275b 100644 --- a/src/plugins/intel_gpu/src/graph/impls/onednn/primitive_onednn_base.h +++ b/src/plugins/intel_gpu/src/graph/impls/onednn/primitive_onednn_base.h @@ -455,9 +455,13 @@ struct typed_primitive_onednn_impl : public typed_primitive_impl { auto dnnl_engine = engine.get_onednn_engine(); { + dnnl::memory input_mem; auto& input = instance.input_memory(0); auto offset = onednn::get_offset(instance.get_input_layout(0), _pd.dnnl::primitive_desc_base::src_desc(0)); - args.insert({DNNL_ARG_SRC, input.get_onednn_memory(_pd.dnnl::primitive_desc_base::src_desc(0), offset)}); + if (instance.get_input_layout(0).count() != 0) { + input_mem = input.get_onednn_memory(_pd.dnnl::primitive_desc_base::src_desc(0), offset); + } + args.insert({DNNL_ARG_SRC, input_mem}); } { From 85253c4f6717d3b512821b784bd575fa92a777fd Mon Sep 17 00:00:00 2001 From: Georgy Krivoruchko Date: Mon, 21 Oct 2024 14:18:48 +0400 Subject: [PATCH 16/24] [ONNX] Update ONNX version for vcpkg (#27155) ### Details: - Delayed update ONNX version for vcpkg due to delay in the original repository ### Tickets: - N/A --- vcpkg.json | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vcpkg.json b/vcpkg.json index 4956cee14cae9d..7214195df49506 100644 --- a/vcpkg.json +++ b/vcpkg.json @@ -80,7 +80,7 @@ "dependencies": [ { "name": "onnx", - "version>=": "1.15.0" + "version>=": "1.16.2" }, { "name": "protobuf", From 1f41cbae5d7c4a12da3e23dd1f0a33db44c9f900 Mon Sep 17 00:00:00 2001 From: Liubov Talamanova Date: Mon, 21 Oct 2024 13:25:37 +0100 Subject: [PATCH 17/24] Update NNCF WC documentation (#27101) Co-authored-by: Alexander Kozlov Co-authored-by: Tatiana Savina --- .../weight-compression.rst | 46 ++++++++++++++++--- 1 file changed, 40 insertions(+), 6 deletions(-) diff --git a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst index 6348ca897c5ea5..47cfed977dc3df 100644 --- a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst +++ b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst @@ -161,15 +161,16 @@ trade-offs after optimization: `Larger Group Size`: Results in faster inference and a smaller model, but might compromise accuracy. 
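For illustration only (a sketch, not part of the documentation patch): a minimal data-free compression call combining the ``group_size`` and ``ratio`` options discussed here could look like the following; the model path and the chosen values are assumptions.

```python
import openvino as ov
import nncf

core = ov.Core()
model = core.read_model("model.xml")  # hypothetical IR of an LLM

# Data-free 4-bit weight compression: 90% of eligible layers are compressed to
# INT4_ASYM with a group size of 128; the remaining layers stay in the default
# INT8_ASYM backup precision.
compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    ratio=0.9,
    group_size=128,
)
ov.save_model(compressed_model, "model_int4.xml")
```

Data-aware options such as ``awq``, ``scale_estimation``, or ``gptq`` described below would additionally take a ``dataset=nncf.Dataset(...)`` argument.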
-* ``ratio`` controls the ratio between INT4 and INT8_ASYM compressed layers in the model. +* ``ratio`` controls the ratio between the layers compressed to the precision defined + by ``mode`` and the rest of the layers that will be kept in the ``backup_mode`` in the optimized model. Ratio is a decimal between 0 and 1. For example, 0.8 means that 80% of layers will be - compressed to INT4, while the rest will be compressed to INT8_ASYM precision. The default - value for ratio is 1. + compressed to the precision defined by ``mode``, while the rest will be compressed to + ``backup_mode`` precision. The default value for ratio is 1. - `Higher Ratio (more INT4)`: Reduces the model size and increase inference speed but + `Higher Ratio (more layers set to mode precision)`: Reduces the model size and increase inference speed but might lead to higher accuracy degradation. - `Lower Ratio (more INT8_ASYM)`: Maintains better accuracy but results in a larger model size + `Lower Ratio (more layers set to backup_mode precision)`: Maintains better accuracy but results in a larger model size and potentially slower inference. In this example, 90% of the model's layers are quantized to INT4 asymmetrically with @@ -196,8 +197,11 @@ trade-offs after optimization: 4 bits. The method can sometimes result in reduced accuracy when used with Dynamic Quantization of activations. Requires dataset. +* ``gptq`` - boolean parameter that enables the GPTQ method for more accurate INT4 weight + quantization. Requires dataset. + * ``dataset`` - calibration dataset for data-aware weight compression. It is required - for some compression options, for example, ``scale_estimation`` or ``awq``. Some types + for some compression options, for example, ``scale_estimation``, ``gptq`` or ``awq``. Some types of ``sensitivity_metric`` can use data for precision selection. * ``sensitivity_metric`` - controls the metric to estimate the sensitivity of compressing @@ -226,6 +230,36 @@ trade-offs after optimization: * ``all_layers`` - boolean parameter that enables INT4 weight quantization of all Fully-Connected and Embedding layers, including the first and last layers in the model. +* ``lora_correction`` - boolean parameter that enables the LoRA Correction Algorithm + to further improve the accuracy of INT4 compressed models on top of other + algorithms - AWQ and Scale Estimation. + +* ``backup_mode`` - defines a backup precision for mixed-precision weight compression. + There are three modes: INT8_ASYM, INT8_SYM, and NONE, which retains + the original floating-point precision of the model weights (``INT8_ASYM`` is default value). + + +**Use synthetic data for LLM weight compression** + +It is possible to generate a synthetic dataset using the `nncf.data.generate_text_data` method for +data-aware weight compression. The method takes a language model (e.g. from `optimum.intel.openvino`) +and a tokenizer (e.g. from `transformers`) as input and returns the list of strings generated by the model. +Note that dataset generation takes time and depends on various conditions, like the model size, +requested dataset length or environment setup. Also, since the dataset is generated by the model output, +it does not guarantee significant accuracy improvement after compression. This method is recommended +only when a better dataset is not available. Refer to the +`example `__ +for details of the usage. + +.. 
code-block:: python + + from nncf import Dataset + from nncf.data import generate_text_data + + # Example: Generating synthetic dataset + synthetic_data = generate_text_data(model, tokenizer) + nncf_dataset = nncf.Dataset(synthetic_data, transform_fn) + For data-aware weight compression refer to the following `example `__. From d34cddac9676c943f9351ad31dadf20b06e5c812 Mon Sep 17 00:00:00 2001 From: Alexey Smirnov Date: Mon, 21 Oct 2024 16:30:05 +0100 Subject: [PATCH 18/24] [NPUW] Support mixed precision models (#27130) Separated change from https://github.com/openvinotoolkit/openvino/pull/26263 --- .../al/include/intel_npu/al/config/npuw.hpp | 2 +- .../al/include/npuw_private_properties.hpp | 2 +- .../npuw/partitioning/online/compiler.cpp | 2 - .../plugin/npuw/partitioning/online/group.cpp | 12 ++++ .../plugin/npuw/partitioning/online/group.hpp | 6 ++ .../npuw/partitioning/online/repeated.hpp | 15 ++++- .../npuw/partitioning/online/snapshot.cpp | 64 +++++++++++++++++++ .../npuw/partitioning/online/snapshot.hpp | 6 +- .../plugin/npuw/partitioning/partitioning.cpp | 5 +- 9 files changed, 105 insertions(+), 9 deletions(-) diff --git a/src/plugins/intel_npu/src/al/include/intel_npu/al/config/npuw.hpp b/src/plugins/intel_npu/src/al/include/intel_npu/al/config/npuw.hpp index b0ecf3cd45d152..f315d333d67ae4 100644 --- a/src/plugins/intel_npu/src/al/include/intel_npu/al/config/npuw.hpp +++ b/src/plugins/intel_npu/src/al/include/intel_npu/al/config/npuw.hpp @@ -35,7 +35,7 @@ DEFINE_OPT(NPUW_ONLINE_AVOID, std::string, "", npuw::partitioning::online::avoid DEFINE_OPT(NPUW_ONLINE_ISOLATE, std::string, "", npuw::partitioning::online::isolate, CompileTime); DEFINE_OPT(NPUW_ONLINE_NO_FOLD, std::string, "", npuw::partitioning::online::nofold, CompileTime); DEFINE_OPT(NPUW_ONLINE_MIN_SIZE, std::size_t, 10, npuw::partitioning::online::min_size, CompileTime); -DEFINE_OPT(NPUW_ONLINE_KEEP_BLOCKS, std::size_t, 10, npuw::partitioning::online::keep_blocks, CompileTime); +DEFINE_OPT(NPUW_ONLINE_KEEP_BLOCKS, std::size_t, 5, npuw::partitioning::online::keep_blocks, CompileTime); DEFINE_OPT(NPUW_ONLINE_KEEP_BLOCK_SIZE, std::size_t, 10, npuw::partitioning::online::keep_block_size, CompileTime); DEFINE_OPT(NPUW_ONLINE_DUMP_PLAN, std::string, "", npuw::partitioning::online::dump_plan, CompileTime); DEFINE_OPT(NPUW_PLAN, std::string, "", npuw::partitioning::plan, CompileTime); diff --git a/src/plugins/intel_npu/src/al/include/npuw_private_properties.hpp b/src/plugins/intel_npu/src/al/include/npuw_private_properties.hpp index 834f90db9cf9ef..a3eb4ecfa8cb63 100644 --- a/src/plugins/intel_npu/src/al/include/npuw_private_properties.hpp +++ b/src/plugins/intel_npu/src/al/include/npuw_private_properties.hpp @@ -123,7 +123,7 @@ static constexpr ov::Property min_size{"NPUW_ONLINE_MIN_SIZE"}; * Used to control fusion term criteria in online partitioning. * Only compatible with online partitioning. * Possible values: Integer > 0. - * Default value: 10. + * Default value: 5. 
*/ static constexpr ov::Property keep_blocks{"NPUW_ONLINE_KEEP_BLOCKS"}; diff --git a/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/compiler.cpp b/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/compiler.cpp index a06a6f3bd1ced5..173091011d38fe 100644 --- a/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/compiler.cpp +++ b/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/compiler.cpp @@ -73,8 +73,6 @@ std::vector getAvoids(::intel_npu::Config& cfg) { std::string avoids_opt = cfg.getString<::intel_npu::NPUW_ONLINE_AVOID>(); if (avoids_opt.empty()) { - LOG_VERB(::intel_npu::NPUW_ONLINE_AVOID().key() - << " property is not set. NPU device will be prioritized for every subgraph."); return {}; } diff --git a/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/group.cpp b/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/group.cpp index cfa9e451ffb149..2b2878481f1330 100644 --- a/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/group.cpp +++ b/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/group.cpp @@ -292,6 +292,10 @@ void Group::takeFlags(const Group::GPtr& gptr_other) { m_reptrack[layer].push_back(rep); } } + // Update weights precisions + for (const auto& wp : gptr_other->m_consts_precision) { + m_consts_precision.push_back(wp); + } // Update avoids for (const auto& device : gptr_other->avoidedTargets()) { avoid(device); @@ -417,6 +421,14 @@ std::unordered_set Group::interconnect(const Group::GPtr& gptr_pro return ics; } +void Group::addWeightsPrecision(const std::vector& prec) { + m_consts_precision.insert(m_consts_precision.end(), prec.begin(), prec.end()); +} + +const std::vector& Group::getConstsPrecision() const { + return m_consts_precision; +} + std::string Group::specialTags() const { std::string tags = ""; diff --git a/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/group.hpp b/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/group.hpp index 538eeb03bc851c..17527033173a82 100644 --- a/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/group.hpp +++ b/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/group.hpp @@ -81,6 +81,8 @@ class Group : public std::enable_shared_from_this { const std::set& avoidedTargets() const; const std::string& isolatedTag() const; std::string specialTags() const; + void addWeightsPrecision(const std::vector& prec); + const std::vector& getConstsPrecision() const; private: void includeExtraLayers(detail::OVNodeSet& input_layers, @@ -105,6 +107,10 @@ class Group : public std::enable_shared_from_this { std::set m_avoided_devices; std::string m_isol_tag = ""; + // Structure to keep track of mixed precision within initial model + // Note: partitioning is stable so keep it in a single vector + std::vector m_consts_precision; + // Unique repeated tag std::shared_ptr m_repeated = nullptr; // For each layer inside group, store it's history of repeated groups diff --git a/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/repeated.hpp b/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/repeated.hpp index fe34063fda211d..43eebc5f17ddb0 100644 --- a/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/repeated.hpp +++ b/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/repeated.hpp @@ -66,13 +66,24 @@ struct hash +struct hash> { + inline size_t operator()(const std::vector& vec) const { + std::size_t seed = vec.size(); + for (const auto& s : vec) { + seed ^= s.hash() + 0x9e3779b9 + (seed << 6) + (seed >> 2); + } + return seed; + } 
+}; + template <> struct hash, std::string>> { inline size_t operator()(const std::tuple, std::string>& t) const { std::size_t seed = std::hash()(std::get<0>(t)) + 0x9e3779b9; - seed ^= std::hash()(std::get<2>(t)) + 0x9e3779b9; + seed ^= std::hash()(std::get<2>(t)) + 0x9e3779b9 + (seed << 6) + (seed >> 2); for (const auto& s : std::get<1>(t)) { - seed ^= std::hash()(s) + 0x9e3779b9; + seed ^= std::hash()(s) + 0x9e3779b9 + (seed << 6) + (seed >> 2); } return seed; } diff --git a/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/snapshot.cpp b/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/snapshot.cpp index 4cdc92ffc92d25..c8a27c47665021 100644 --- a/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/snapshot.cpp +++ b/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/snapshot.cpp @@ -45,11 +45,35 @@ bool isOp(const std::shared_ptr& node) { } return true; } + +std::vector getConstsPrecision(const std::shared_ptr& node) { + NPUW_ASSERT(!ov::op::util::is_constant(node) && !ov::op::util::is_parameter(node) && + !ov::op::util::is_output(node)); + + std::vector precisions; + + for (size_t i = 0; i < node->inputs().size(); ++i) { + auto target_input = node->get_input_source_output(i); + auto ov_node_parent = target_input.get_node()->shared_from_this(); + + if (ov::is_type(ov_node_parent)) { + auto target_op_input = ov_node_parent->get_input_source_output(0); + auto parent_op_node = target_op_input.get_node()->shared_from_this(); + + if (ov::op::util::is_constant(parent_op_node)) { + precisions.push_back(parent_op_node->get_element_type()); + } + } + } + + return precisions; +} } // namespace detail } // namespace online } // namespace npuw } // namespace ov +using ov::npuw::online::detail::getConstsPrecision; using ov::npuw::online::detail::isOp; void Snapshot::buildGraph() { @@ -68,6 +92,7 @@ void Snapshot::buildGraph() { auto nh = m_graph->create(); auto group = std::make_shared(ov_node, gid, nh, m_graph, shared_from_this()); + group->addWeightsPrecision(getConstsPrecision(ov_node)); m_graph->meta(nh).set(group); m_node_to_gr->emplace(std::make_pair(ov_node, group)); ++gid; @@ -126,6 +151,44 @@ void Snapshot::buildGraph() { LOG_INFO("DONE."); } +void Snapshot::splitMixedPrecision() { + LOG_INFO("Online partitioning: executing splitMixedPrecision pass..."); + LOG_BLOCK(); + + auto reptag_to_gset = repeating(); + // Iterate over repeated blocks + for (const auto& elem : reptag_to_gset) { + auto reptag = elem.first; + auto gset = elem.second; + + // Fill a map of ordered consts precisions to a Group + std::unordered_map, GPtrSet> prec_to_new_gset; + for (const auto& gptr : gset) { + prec_to_new_gset[gptr->getConstsPrecision()].insert(gptr); + } + + // In case all precisions match - skip + if (prec_to_new_gset.size() == 1) { + continue; + } + + // Otherwise need to split repeated block based on consts precisions + for (const auto& elem : prec_to_new_gset) { + // Assign new reptags - basically create a new repeated block + std::shared_ptr rep = std::make_shared(); + + LOG_VERB("Identified mixed precision, splitting a new repeated block of " << elem.second.size() + << " groups."); + + for (const auto& gptr : elem.second) { + gptr->setRepeated(rep); + } + } + } + + LOG_INFO("DONE"); +} + void Snapshot::singleGroup() { LOG_INFO("Online partitioning: executing singleGroup pass..."); LOG_BLOCK(); @@ -458,6 +521,7 @@ void Snapshot::repeatedBlocks(Snapshot::CB&& on_done) { return; // FROM top-level repeat! 
} }); + splitMixedPrecision(); cleanUpUniques(); LOG_INFO("Number of groups after compiler pass: " << graphSize()); diff --git a/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/snapshot.hpp b/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/snapshot.hpp index 6da1a6d98939bb..0ce6766d45850f 100644 --- a/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/snapshot.hpp +++ b/src/plugins/intel_npu/src/plugin/npuw/partitioning/online/snapshot.hpp @@ -18,8 +18,11 @@ namespace online { namespace detail { // At partitioning level we exclude some "non-Ops" to not interfere with the passes. -// We include some of them back to properly link everything at plugin level +// We include some of them back to properly link everything at plugin level. bool isOp(const std::shared_ptr& node); +// Find Const->Convert->Node if any and return Const precisions. +// Used for mixed-precision models to properly identify repeated blocks. +std::vector getConstsPrecision(const std::shared_ptr& node); } // namespace detail // Core part of the partitioning algorithm which implements a list of graph passes. @@ -69,6 +72,7 @@ class Snapshot : public std::enable_shared_from_this { void identifyUniques(); void mergeUniques(); void mergeTriangles(); + void splitMixedPrecision(); void cleanUpUniques(); void afterUniques(); void markInternalCompute(); diff --git a/src/plugins/intel_npu/src/plugin/npuw/partitioning/partitioning.cpp b/src/plugins/intel_npu/src/plugin/npuw/partitioning/partitioning.cpp index f12350e8952eaa..6c7f996acca22f 100644 --- a/src/plugins/intel_npu/src/plugin/npuw/partitioning/partitioning.cpp +++ b/src/plugins/intel_npu/src/plugin/npuw/partitioning/partitioning.cpp @@ -111,7 +111,7 @@ ov::npuw::Ensemble load_groups(const std::shared_ptr& model, const st std::ifstream ifs(path_to_plan); if (!ifs) { - LOG_ERROR("Couldn't open " << ::intel_npu::NPUW_PLAN().key() << "pointing to " << path_to_plan << "!"); + LOG_ERROR("Couldn't open " << ::intel_npu::NPUW_PLAN().key() << " pointing to " << path_to_plan << "!"); return {}; } @@ -276,6 +276,7 @@ class Partitioner { if (!ov::is_type(node_ptr)) { OPENVINO_THROW("NPUW: trying to get a unique name of a non-Constant node"); } + // FIXME: cache this return node_ptr->get_friendly_name() + " with meta " + ov::npuw::online::util::getMetaDesc(node_ptr) + " with output " + (*node_ptr->output(0).get_target_inputs().begin()).get_node()->description(); } @@ -2160,7 +2161,7 @@ ov::npuw::Partitioning ov::npuw::getPartitioning(const std::shared_ptr(); if (file_path.empty()) { - LOG_WARN("No " << ::intel_npu::NPUW_PLAN().key() << " property is provided! Using online partitioning."); + LOG_INFO("No " << ::intel_npu::NPUW_PLAN().key() << " property is provided! 
Using online partitioning."); ens = ov::npuw::online::buildPartitioning(model, cfg); } else { ens = load_groups(model, file_path); From 5f4a445c8d588b37f978cd812661b439da2d63e8 Mon Sep 17 00:00:00 2001 From: Nikolay Shchegolev Date: Mon, 21 Oct 2024 19:41:24 +0400 Subject: [PATCH 19/24] [CPU] CACHE_DIR hash optimization (#25624) ### Details: - *JIT implementation of the hash function in the ConstantWriter* ### Tickets: - *127331* --- src/core/CMakeLists.txt | 5 +- .../dev_api/openvino/runtime/compute_hash.hpp | 20 + src/core/reference/CMakeLists.txt | 3 - .../reference/utils}/jit_generator.hpp | 70 +- .../reference/utils/registers_pool.hpp | 247 +++++ src/core/reference/src/op/convert.cpp | 6 +- .../src/{op => utils}/jit_generator.cpp | 19 +- .../reference/src/utils/registers_pool.cpp | 106 ++ src/core/src/pass/serialize.cpp | 105 +- src/core/src/runtime/compute_hash.cpp | 918 ++++++++++++++++++ 10 files changed, 1410 insertions(+), 89 deletions(-) create mode 100644 src/core/dev_api/openvino/runtime/compute_hash.hpp rename src/core/reference/{src/op => include/openvino/reference/utils}/jit_generator.hpp (59%) create mode 100644 src/core/reference/include/openvino/reference/utils/registers_pool.hpp rename src/core/reference/src/{op => utils}/jit_generator.cpp (91%) create mode 100644 src/core/reference/src/utils/registers_pool.cpp create mode 100644 src/core/src/runtime/compute_hash.cpp diff --git a/src/core/CMakeLists.txt b/src/core/CMakeLists.txt index bc42ffca8a3cf6..5ea4a21b705489 100644 --- a/src/core/CMakeLists.txt +++ b/src/core/CMakeLists.txt @@ -49,6 +49,9 @@ target_include_directories(openvino_core_dev INTERFACE $ $) +target_include_directories(openvino_core_dev SYSTEM INTERFACE + $:$>>) + target_link_libraries(openvino_core_dev INTERFACE openvino::itt openvino::util) set_target_properties(openvino_core_dev PROPERTIES EXPORT_NAME core::dev) @@ -81,7 +84,7 @@ if(ENABLE_SYSTEM_PUGIXML) set_target_properties(openvino_core_obj PROPERTIES NO_SYSTEM_FROM_IMPORTED ON) endif() -target_compile_definitions(openvino_core_obj PRIVATE IMPLEMENT_OPENVINO_API) +target_compile_definitions(openvino_core_obj PRIVATE IMPLEMENT_OPENVINO_API XBYAK_NO_OP_NAMES XBYAK64) ov_build_target_faster(openvino_core_obj UNITY diff --git a/src/core/dev_api/openvino/runtime/compute_hash.hpp b/src/core/dev_api/openvino/runtime/compute_hash.hpp new file mode 100644 index 00000000000000..47a90d589be4ee --- /dev/null +++ b/src/core/dev_api/openvino/runtime/compute_hash.hpp @@ -0,0 +1,20 @@ +// Copyright (C) 2018-2024 Intel Corporation +// SPDX-License-Identifier: Apache-2.0 +// + +#pragma once + +#include + +namespace ov { +namespace runtime { + +/** + * @brief Computes the hash value for the input data + * @param src A pointer to the input data + * @param size The length of the input data in bytes + */ +size_t compute_hash(const void* src, size_t size); + +} // namespace runtime +} // namespace ov diff --git a/src/core/reference/CMakeLists.txt b/src/core/reference/CMakeLists.txt index f7874964233cf5..b62cf02f23f4f1 100644 --- a/src/core/reference/CMakeLists.txt +++ b/src/core/reference/CMakeLists.txt @@ -50,9 +50,6 @@ target_include_directories(${TARGET_NAME} PUBLIC $ $) -target_include_directories(${TARGET_NAME} SYSTEM PRIVATE - $:$>>) - find_package(Threads REQUIRED) target_link_libraries(${TARGET_NAME} PRIVATE Threads::Threads openvino::core::dev) diff --git a/src/core/reference/src/op/jit_generator.hpp b/src/core/reference/include/openvino/reference/utils/jit_generator.hpp similarity index 59% rename from 
src/core/reference/src/op/jit_generator.hpp rename to src/core/reference/include/openvino/reference/utils/jit_generator.hpp index b4b9cd7a60c23f..539f686020049c 100644 --- a/src/core/reference/src/op/jit_generator.hpp +++ b/src/core/reference/include/openvino/reference/utils/jit_generator.hpp @@ -15,7 +15,6 @@ namespace ov { namespace reference { namespace jit { -#ifdef XBYAK64 static const Xbyak::Operand::Code abi_save_gpr_regs[] = { Xbyak::Operand::RBX, Xbyak::Operand::RBP, @@ -23,28 +22,42 @@ static const Xbyak::Operand::Code abi_save_gpr_regs[] = { Xbyak::Operand::R13, Xbyak::Operand::R14, Xbyak::Operand::R15, -# ifdef _WIN32 +#ifdef _WIN32 Xbyak::Operand::RDI, Xbyak::Operand::RSI, -# endif +#endif }; -# ifdef _WIN32 -# define abi_param1 Xbyak::Reg64(Xbyak::Operand::RCX) // RCX -# else -# define abi_param1 Xbyak::Reg64(Xbyak::Operand::RDI) // RDI -# endif -#endif // XBYAK64 +#ifdef _WIN32 +# define abi_param1 Xbyak::Reg64(Xbyak::Operand::RCX) // RCX +#else +# define abi_param1 Xbyak::Reg64(Xbyak::Operand::RDI) // RDI +#endif -class Generator : public Xbyak::CodeGenerator { - static constexpr size_t xmm_len = 16; +typedef enum { + isa_any, + sse42, + avx, + avx2, + avx512_common, + avx512_core, + avx512_core_vnni, + avx512_mic, + avx512_mic_4ops, + avx512_core_bf16, + avx512_vpopcnt, + fp16, + pclmulqdq, + vpclmulqdq +} cpu_isa_t; +class Generator : public Xbyak::CodeGenerator { #ifdef _WIN32 - static constexpr size_t xmm_to_preserve_start = 6; - static constexpr size_t xmm_to_preserve = 10; + static constexpr size_t xmm_to_preserve_start = 6llu; + static constexpr size_t xmm_to_preserve = 10llu; #else - static constexpr size_t xmm_to_preserve_start = 0; - static constexpr size_t xmm_to_preserve = 0; + static constexpr size_t xmm_to_preserve_start = 0lu; + static constexpr size_t xmm_to_preserve = 0lu; #endif static const size_t num_abi_save_gpr_regs = sizeof(abi_save_gpr_regs) / sizeof(abi_save_gpr_regs[0]); @@ -52,29 +65,19 @@ class Generator : public Xbyak::CodeGenerator { const Xbyak::Reg64 reg_EVEX_max_8b_offt; static constexpr int EVEX_max_8b_offt = 0x200; + size_t m_vlen = ymm_len; public: - const Xbyak::Reg64 param = abi_param1; + static constexpr size_t xmm_len = 16lu; + static constexpr size_t ymm_len = 32lu; + static constexpr size_t zmm_len = 64lu; - typedef enum { - isa_any, - sse42, - avx, - avx2, - avx512_common, - avx512_core, - avx512_core_vnni, - avx512_mic, - avx512_mic_4ops, - avx512_core_bf16, - avx512_vpopcnt, - fp16 - } cpu_isa_t; + const Xbyak::Reg64 param = abi_param1; static bool mayiuse(const cpu_isa_t cpu_isa); static bool is_x64(); - Generator(void* code_ptr = nullptr, size_t code_size = 16 * 1024); + Generator(cpu_isa_t isa = avx2, void* code_ptr = nullptr, size_t code_size = 16lu * 1024lu); void preamble(); void postamble(); @@ -85,7 +88,12 @@ class Generator : public Xbyak::CodeGenerator { template void copy(const Xbyak::Reg64& dst, const Xbyak::Reg64& src, const Xbyak::Reg64& size); + + size_t get_vlen() { + return m_vlen; + } }; + } // namespace jit } // namespace reference } // namespace ov diff --git a/src/core/reference/include/openvino/reference/utils/registers_pool.hpp b/src/core/reference/include/openvino/reference/utils/registers_pool.hpp new file mode 100644 index 00000000000000..62dfe01ec4ef1d --- /dev/null +++ b/src/core/reference/include/openvino/reference/utils/registers_pool.hpp @@ -0,0 +1,247 @@ +// Copyright (C) 2018-2024 Intel Corporation +// SPDX-License-Identifier: Apache-2.0 +// + +#pragma once + +#include +#include +#include + 
+#include "openvino/core/except.hpp" +#include "openvino/reference/utils/jit_generator.hpp" +namespace ov { +namespace reference { +namespace jit { + +class RegistersPool { +public: + using Ptr = std::shared_ptr; + using WeakPtr = std::weak_ptr; + static constexpr int any_idx = -1; + + template + class Reg { + friend class RegistersPool; + + public: + Reg() {} + Reg(const RegistersPool::Ptr& regPool) { + initialize(regPool); + } + Reg(const RegistersPool::Ptr& regPool, int requested_idx) { + initialize(regPool, requested_idx); + } + ~Reg() { + release(); + } + Reg& operator=(Reg&& other) noexcept { + release(); + reg = other.reg; + regPool = std::move(other.regPool); + return *this; + } + Reg(Reg&& other) noexcept : reg(other.reg), regPool(std::move(other.regPool)) {} + operator TReg&() { + ensure_valid(); + return reg; + } + operator const TReg&() const { + ensure_valid(); + return reg; + } + operator Xbyak::RegExp() const { + ensure_valid(); + return reg; + } + int getIdx() const { + ensure_valid(); + return reg.getIdx(); + } + friend Xbyak::RegExp operator+(const Reg& lhs, const Xbyak::RegExp& rhs) { + lhs.ensure_valid(); + return lhs.operator Xbyak::RegExp() + rhs; + } + void release() { + if (auto pool = regPool.lock()) { + pool->return_to_pool(reg); + regPool.reset(); + } + } + bool is_initialized() const { + return !regPool.expired(); + } + + private: + void ensure_valid() const { + if (!is_initialized()) { + OPENVINO_THROW("RegistersPool::Reg is either not initialized or released"); + } + } + + void initialize(const RegistersPool::Ptr& pool, int requested_idx = any_idx) { + release(); + reg = TReg(pool->template get_free(requested_idx)); + regPool = pool; + } + + private: + TReg reg; + RegistersPool::WeakPtr regPool; + }; + + virtual ~RegistersPool() { + check_unique_and_update(false); + } + + template + static Ptr create(std::initializer_list regsToExclude); + + static Ptr create(cpu_isa_t isa, std::initializer_list regsToExclude); + + template + size_t count_free() const { + if (std::is_base_of::value) { + return m_simd_set.count_unused(); + } else if (std::is_same::value || std::is_same::value || + std::is_same::value || std::is_same::value) { + return m_general_set.count_unused(); + } else if (std::is_same::value) { + return count_unused_opmask(); + } + } + +protected: + class PhysicalSet { + public: + PhysicalSet(int size) : m_is_free_index_vector(size, true) {} + + void set_as_used(size_t reg_idx); + + void set_as_unused(size_t reg_idx); + + size_t get_unused(size_t requested_idx); + + void exclude(Xbyak::Reg reg) { + m_is_free_index_vector.at(reg.getIdx()) = false; + } + + size_t count_unused() const; + + private: + size_t get_first_free_index(); + + private: + std::vector m_is_free_index_vector; + }; + + virtual int get_free_opmask(int requested_idx) { + OPENVINO_THROW("get_free_opmask: The Opmask is not supported in current instruction set"); + } + virtual void return_opmask_to_pool(int idx) { + OPENVINO_THROW("return_opmask_to_pool: The Opmask is not supported in current instruction set"); + } + virtual size_t count_unused_opmask() const { + OPENVINO_THROW("count_unused_opmask: The Opmask is not supported in current instruction set"); + } + + RegistersPool(int simd_registers_number); + + RegistersPool(std::initializer_list regsToExclude, int simd_registers_number); + +private: + template + int get_free(int requested_idx) { + if (std::is_base_of::value) { + auto idx = m_simd_set.get_unused(requested_idx); + m_simd_set.set_as_used(idx); + return static_cast(idx); + } else 
if (std::is_same::value || std::is_same::value || + std::is_same::value || std::is_same::value) { + auto idx = m_general_set.get_unused(requested_idx); + m_general_set.set_as_used(idx); + return static_cast(idx); + } else if (std::is_same::value) { + return get_free_opmask(requested_idx); + } + } + + template + void return_to_pool(const TReg& reg) { + if (std::is_base_of::value) { + m_simd_set.set_as_unused(reg.getIdx()); + } else if (std::is_same::value || std::is_same::value || + std::is_same::value || std::is_same::value) { + m_general_set.set_as_unused(reg.getIdx()); + } else if (std::is_same::value) { + return_opmask_to_pool(reg.getIdx()); + } + } + + void check_unique_and_update(bool isCtor = true); + + PhysicalSet m_general_set; + PhysicalSet m_simd_set; +}; + +template +class IsaRegistersPool : public RegistersPool { +public: + IsaRegistersPool(std::initializer_list regsToExclude) : RegistersPool(regsToExclude, 32) {} +}; + +template <> +class IsaRegistersPool : public RegistersPool { +public: + IsaRegistersPool() : RegistersPool(32) { + m_opmask_set.exclude( + Xbyak::Opmask(0)); // the Opmask(0) has special meaning for some instructions, like gather instruction + } + + IsaRegistersPool(std::initializer_list regsToExclude) : RegistersPool(regsToExclude, 32) { + for (auto& reg : regsToExclude) { + if (reg.isOPMASK()) { + m_opmask_set.exclude(reg); + } + } + } + + int get_free_opmask(int requested_idx) override { + auto idx = static_cast(m_opmask_set.get_unused(requested_idx)); + m_opmask_set.set_as_used(idx); + return idx; + } + + void return_opmask_to_pool(int idx) override { + m_opmask_set.set_as_unused(idx); + } + + size_t count_unused_opmask() const override { + return m_opmask_set.count_unused(); + } + +protected: + PhysicalSet m_opmask_set{8}; +}; + +template +RegistersPool::Ptr RegistersPool::create(std::initializer_list regsToExclude) { + return std::make_shared>(regsToExclude); +} + +inline RegistersPool::Ptr RegistersPool::create(cpu_isa_t isa, std::initializer_list regsToExclude) { +#define ISA_SWITCH_CASE(isa) \ + case isa: \ + return std::make_shared>(regsToExclude); + switch (isa) { + ISA_SWITCH_CASE(avx2) + ISA_SWITCH_CASE(avx512_core) + default: + OPENVINO_THROW("Invalid isa argument in RegistersPool::create(): ", isa); + } +#undef ISA_SWITCH_CASE +} + +} // namespace jit +} // namespace reference +} // namespace ov diff --git a/src/core/reference/src/op/convert.cpp b/src/core/reference/src/op/convert.cpp index 5054121b5615c0..034734afd8fd2a 100644 --- a/src/core/reference/src/op/convert.cpp +++ b/src/core/reference/src/op/convert.cpp @@ -7,7 +7,7 @@ #include "openvino/reference/utils/convert_util.hpp" #ifdef OV_CORE_USE_XBYAK_JIT -# include "jit_generator.hpp" +# include "openvino/reference/utils/jit_generator.hpp" #endif #ifdef OV_CORE_USE_INTRINSICS @@ -256,7 +256,7 @@ class jit_convert_array : public jit::Generator { template static fn_t get() { - if (is_x64() && mayiuse(avx) && mayiuse(avx2) && mayiuse(fp16)) { + if (is_x64() && mayiuse(jit::avx) && mayiuse(jit::avx2) && mayiuse(jit::fp16)) { static const jit_convert_array::context_t context{{sizeof(src_t), &jit::Generator::copy}, {sizeof(dst_t), &jit::Generator::copy}, jit_convert_vec, @@ -460,7 +460,7 @@ class jit_count_out_of_range : public jit::Generator { template static fn_t get() { - if (is_x64() && mayiuse(avx2)) { + if (is_x64() && mayiuse(jit::avx2)) { static const jit_count_out_of_range::context_t context{ {sizeof(data_t), &jit::Generator::copy}, jit_count_out_of_range_vec_prepare, diff --git 
a/src/core/reference/src/op/jit_generator.cpp b/src/core/reference/src/utils/jit_generator.cpp similarity index 91% rename from src/core/reference/src/op/jit_generator.cpp rename to src/core/reference/src/utils/jit_generator.cpp index 7d7da06d5da8d5..39dc31c0033f9f 100644 --- a/src/core/reference/src/op/jit_generator.cpp +++ b/src/core/reference/src/utils/jit_generator.cpp @@ -11,9 +11,10 @@ # endif # include -# include "jit_generator.hpp" +# include "openvino/core/except.hpp" # include "openvino/core/type/bfloat16.hpp" # include "openvino/core/type/float16.hpp" +# include "openvino/reference/utils/jit_generator.hpp" namespace ov { namespace reference { @@ -51,6 +52,10 @@ bool Generator::mayiuse(const cpu_isa_t cpu_isa) { return true && cpu.has(Cpu::tAVX512_VPOPCNTDQ); case fp16: return cpu.has(Cpu::tF16C); + case cpu_isa_t::pclmulqdq: + return cpu.has(Cpu::tPCLMULQDQ); + case cpu_isa_t::vpclmulqdq: + return cpu.has(Cpu::tVPCLMULQDQ); case isa_any: return true; } @@ -60,10 +65,18 @@ bool Generator::mayiuse(const cpu_isa_t cpu_isa) { bool Generator::is_x64() { return sizeof(void*) == 8; } -Generator::Generator(void* code_ptr, size_t code_size) +Generator::Generator(cpu_isa_t isa, void* code_ptr, size_t code_size) : Xbyak::CodeGenerator(code_size, code_ptr), size_of_abi_save_regs(num_abi_save_gpr_regs * rax.getBit() / 8 + xmm_to_preserve * xmm_len), - reg_EVEX_max_8b_offt(rbp) {} + reg_EVEX_max_8b_offt(rbp) { + if (isa == avx512_core) { + m_vlen = zmm_len; + } else if (isa == avx2) { + m_vlen = ymm_len; + } else { + OPENVINO_THROW("Unsupported isa: ", isa); + } +} void Generator::preamble() { if (xmm_to_preserve) { diff --git a/src/core/reference/src/utils/registers_pool.cpp b/src/core/reference/src/utils/registers_pool.cpp new file mode 100644 index 00000000000000..413fdcc3ed83cf --- /dev/null +++ b/src/core/reference/src/utils/registers_pool.cpp @@ -0,0 +1,106 @@ +// Copyright (C) 2018-2024 Intel Corporation +// SPDX-License-Identifier: Apache-2.0 +// + +#include "openvino/core/visibility.hpp" + +#if defined(OPENVINO_ARCH_X86) || defined(OPENVINO_ARCH_X86_64) +# include "openvino/reference/utils/registers_pool.hpp" + +namespace ov { +namespace reference { +namespace jit { + +RegistersPool::RegistersPool(int simd_registers_number) : m_general_set(16), m_simd_set(simd_registers_number) { + check_unique_and_update(); + m_general_set.exclude(Xbyak::Reg64(Xbyak::Operand::RSP)); + m_general_set.exclude(Xbyak::Reg64(Xbyak::Operand::RAX)); + m_general_set.exclude(Xbyak::Reg64(Xbyak::Operand::RCX)); + m_general_set.exclude(Xbyak::Reg64(Xbyak::Operand::RDI)); + m_general_set.exclude(Xbyak::Reg64(Xbyak::Operand::RBP)); +} + +RegistersPool::RegistersPool(std::initializer_list regsToExclude, int simd_registers_number) + : m_general_set(16), + m_simd_set(simd_registers_number) { + check_unique_and_update(); + for (auto& reg : regsToExclude) { + if (reg.isXMM() || reg.isYMM() || reg.isZMM()) { + m_simd_set.exclude(reg); + } else if (reg.isREG()) { + m_general_set.exclude(reg); + } + } + m_general_set.exclude(Xbyak::Reg64(Xbyak::Operand::RSP)); +} + +void RegistersPool::check_unique_and_update(bool is_ctor) { + static thread_local bool is_created = false; + if (is_ctor) { + if (is_created) { + OPENVINO_THROW("There should be only one instance of RegistersPool per thread"); + } + is_created = true; + } else { + is_created = false; + } +} + +void RegistersPool::PhysicalSet::set_as_used(size_t reg_idx) { + if (reg_idx >= m_is_free_index_vector.size()) { + OPENVINO_THROW("reg_idx is out of bounds in 
RegistersPool::PhysicalSet::set_as_used()"); + } + if (!m_is_free_index_vector[reg_idx]) { + OPENVINO_THROW("Inconsistency in RegistersPool::PhysicalSet::set_as_used()"); + } + m_is_free_index_vector[reg_idx] = false; +} + +void RegistersPool::PhysicalSet::set_as_unused(size_t reg_idx) { + if (reg_idx >= m_is_free_index_vector.size()) { + OPENVINO_THROW("reg_idx is out of bounds in RegistersPool::PhysicalSet::set_as_used()"); + } + if (m_is_free_index_vector[reg_idx]) { + OPENVINO_THROW("Inconsistency in RegistersPool::PhysicalSet::set_as_unused()"); + } + m_is_free_index_vector[reg_idx] = true; +} + +size_t RegistersPool::PhysicalSet::get_unused(size_t requested_idx) { + if (requested_idx == static_cast(any_idx)) { + return get_first_free_index(); + } else { + if (requested_idx >= m_is_free_index_vector.size()) { + OPENVINO_THROW("requested_idx is out of bounds in RegistersPool::PhysicalSet::get_unused()"); + } + if (!m_is_free_index_vector[requested_idx]) { + OPENVINO_THROW("The register with index #", requested_idx, " already used in the RegistersPool"); + } + return requested_idx; + } +} + +size_t RegistersPool::PhysicalSet::count_unused() const { + size_t count = 0; + for (const auto& isFree : m_is_free_index_vector) { + if (isFree) { + ++count; + } + } + return count; +} + +size_t RegistersPool::PhysicalSet::get_first_free_index() { + for (size_t c = 0; c < m_is_free_index_vector.size(); ++c) { + if (m_is_free_index_vector[c]) { + return c; + } + } + OPENVINO_THROW("Not enough registers in the RegistersPool"); +} + +} // namespace jit +} // namespace reference +} // namespace ov + +#endif // OPENVINO_ARCH_X86 || OPENVINO_ARCH_X86_64 diff --git a/src/core/src/pass/serialize.cpp b/src/core/src/pass/serialize.cpp index 409dcad066d7a6..3af6d2c4b5313f 100644 --- a/src/core/src/pass/serialize.cpp +++ b/src/core/src/pass/serialize.cpp @@ -23,6 +23,7 @@ #include "openvino/pass/constant_folding.hpp" #include "openvino/reference/convert.hpp" #include "openvino/runtime/aligned_buffer.hpp" +#include "openvino/runtime/compute_hash.hpp" #include "openvino/runtime/string_aligned_buffer.hpp" #include "openvino/util/file_util.hpp" #include "pugixml.hpp" @@ -30,6 +31,18 @@ #include "transformations/rt_info/disable_fp16_compression.hpp" #include "transformations/rt_info/primitives_priority_attribute.hpp" +namespace ov { +class OstreamHashWrapperBin final : public std::streambuf { + uint64_t m_res = 0lu; + +public: + uint64_t getResult() const { + return m_res; + } + std::streamsize xsputn(const char* s, std::streamsize n) override; +}; +} // namespace ov + namespace { // helpers template std::string join(const Container& c, const char* glue = ", ") { @@ -69,23 +82,6 @@ std::string translate_type_name(const std::string& name) { return name; } -size_t hash_combine(const void* v, int64_t size) { - constexpr auto cel_size = sizeof(size_t); - auto seed = static_cast(size); - const auto data = static_cast(v); - const auto d_end = std::next(data, size / cel_size); - // The constant value used as a magic number has been - // traditionally used e.g. in boost library's hash_combine. - // It happens to be derived from the golden ratio. 
- for (auto d = data; d != d_end; ++d) { - seed ^= *d + 0x9e3779b9 + (seed << 6) + (seed >> 2); - } - size_t last_bytes{0}; - std::memcpy(&last_bytes, d_end, size % cel_size); - seed ^= last_bytes + 0x9e3779b9 + (seed << 6) + (seed >> 2); - return seed; -} - class ConstantWriter { public: using FilePosition = int64_t; @@ -95,16 +91,18 @@ class ConstantWriter { ConstantWriter(std::ostream& bin_data, bool enable_compression = true) : m_binary_output(bin_data), m_enable_compression(enable_compression), - m_blob_offset(bin_data.tellp()) {} + m_blob_offset(bin_data.tellp()) { + m_write_hash_value = (dynamic_cast(bin_data.rdbuf())) ? true : false; + } FilePosition write(const char* ptr, size_t size, - size_t* new_size, + size_t& new_size, bool compress_to_fp16 = false, ov::element::Type src_type = ov::element::dynamic) { const FilePosition write_pos = m_binary_output.tellp(); const auto offset = write_pos - m_blob_offset; - *new_size = size; + new_size = size; if (!m_enable_compression) { if (!compress_to_fp16) { @@ -112,7 +110,7 @@ class ConstantWriter { } else { OPENVINO_ASSERT(size % src_type.size() == 0); auto fp16_buffer = compress_data_to_fp16(ptr, size, src_type, new_size); - m_binary_output.write(fp16_buffer.get(), *new_size); + m_binary_output.write(fp16_buffer.get(), new_size); } return offset; } else { @@ -132,18 +130,24 @@ class ConstantWriter { // the same hash for {2, 2} and {0, 128} arrays. // But even strong hashing algorithms sometimes give collisions. // Therefore we always have to compare values when finding a match in the hash multimap. - const HashValue hash = hash_combine(ptr_to_write, *new_size); + const HashValue hash = ov::runtime::compute_hash(ptr_to_write, new_size); + auto found = m_hash_to_file_positions.find(hash); // iterate over all matches of the key in the multimap while (found != m_hash_to_file_positions.end()) { - if (memcmp(ptr, found->second.second, size) == 0) + if (memcmp(ptr, found->second.second, size) == 0) { return found->second.first; + } found++; } // Since fp16_compressed data will be disposed at exit point and since we cannot reread it from the ostream, // we store pointer to the original uncompressed blob. 
m_hash_to_file_positions.insert({hash, {offset, static_cast(ptr)}}); - m_binary_output.write(ptr_to_write, *new_size); + if (m_write_hash_value) { + m_binary_output.write(reinterpret_cast(&hash), sizeof(uint64_t)); + } else { + m_binary_output.write(ptr_to_write, new_size); + } } return offset; } @@ -152,17 +156,17 @@ class ConstantWriter { static std::unique_ptr compress_data_to_fp16(const char* ptr, size_t size, ov::element::Type src_type, - size_t* compressed_size) { + size_t& compressed_size) { auto num_src_elements = size / src_type.size(); - *compressed_size = num_src_elements * ov::element::f16.size(); + compressed_size = num_src_elements * ov::element::f16.size(); if (src_type == ov::element::f32) { - auto new_ptr = std::unique_ptr(new char[*compressed_size]); + auto new_ptr = std::unique_ptr(new char[compressed_size]); auto dst_data = reinterpret_cast(new_ptr.get()); auto src_data = reinterpret_cast(ptr); ov::reference::convert_from_f32_to_f16_with_clamp(src_data, dst_data, num_src_elements); return new_ptr; } else if (src_type == ov::element::f64) { - auto new_ptr = std::unique_ptr(new char[*compressed_size]); + auto new_ptr = std::unique_ptr(new char[compressed_size]); auto dst_data = reinterpret_cast(new_ptr.get()); auto src_data = reinterpret_cast(ptr); @@ -188,6 +192,7 @@ class ConstantWriter { ConstWritePositions m_hash_to_file_positions; std::ostream& m_binary_output; bool m_enable_compression; + bool m_write_hash_value = false; FilePosition m_blob_offset; // blob offset inside output stream }; @@ -531,7 +536,7 @@ class XmlSerializer : public ov::AttributeVisitor { int64_t offset = m_constant_write_handler.write(reinterpret_cast(header_ptr.get()), header_size, - &inter_size, + inter_size, m_compress_to_fp16, m_output_element_type); new_size += inter_size; @@ -554,7 +559,7 @@ class XmlSerializer : public ov::AttributeVisitor { m_constant_write_handler.write(raw_string_ptr, raw_string_size, - &inter_size, + inter_size, m_compress_to_fp16, m_output_element_type); new_size += inter_size; @@ -568,7 +573,7 @@ class XmlSerializer : public ov::AttributeVisitor { size_t new_size; int64_t offset = m_constant_write_handler.write(static_cast(a->get()->get_ptr()), size, - &new_size, + new_size, m_compress_to_fp16, m_output_element_type); @@ -1393,10 +1398,19 @@ bool pass::StreamSerialize::run_on_model(const std::shared_ptr& model /// -------- Hash calculation pass ------------- namespace { -template -static uint64_t hash_combine(uint64_t seed, const T& a) { - // Hash combine formula from boost - return seed ^ (std::hash()(a) + 0x9e3779b9 + (seed << 6) + (seed >> 2)); +// Hash combine formula from boost for uint64_t. +inline uint64_t hash_combine(uint64_t h, uint64_t k) { + constexpr uint64_t m = 0xc6a4a7935bd1e995; + constexpr int r = 47; + + k *= m; + k ^= k >> r; + k *= m; + + h ^= k; + h *= m; + + return h + 0xe6546b64; } class OstreamHashWrapper final : public std::streambuf { @@ -1408,28 +1422,23 @@ class OstreamHashWrapper final : public std::streambuf { } std::streamsize xsputn(const char* s, std::streamsize n) override { - // Reinterpret data as uint32_t and accumulate in uint64_t to avoid overflow fluctuations in parallel_sum. 
- auto* int_sum = reinterpret_cast(s); - const uint64_t n32 = n / sizeof(uint32_t); - - m_res += parallel_sum(n32, uint64_t(0lu), [&](size_t k) -> uint32_t { - return int_sum[k]; - }); - - const uint64_t rest = n % sizeof(uint32_t); - for (uint64_t i = 0lu; i < rest; i++) { - m_res += s[n - rest + i]; - } + uint64_t h = ov::runtime::compute_hash(s, n); + m_res = hash_combine(m_res, h); return n; } }; } // namespace +std::streamsize OstreamHashWrapperBin::xsputn(const char* s, std::streamsize n) { + m_res = hash_combine(m_res, *reinterpret_cast(s)); + return n; +} + bool pass::Hash::run_on_model(const std::shared_ptr& model) { RUN_ON_MODEL_SCOPE(Hash); OstreamHashWrapper xmlHash; - OstreamHashWrapper binHash; + OstreamHashWrapperBin binHash; std::ostream xml(&xmlHash); std::ostream bin(&binHash); diff --git a/src/core/src/runtime/compute_hash.cpp b/src/core/src/runtime/compute_hash.cpp new file mode 100644 index 00000000000000..c1a5a40c8638de --- /dev/null +++ b/src/core/src/runtime/compute_hash.cpp @@ -0,0 +1,918 @@ +// Copyright (C) 2018-2024 Intel Corporation +// SPDX-License-Identifier: Apache-2.0 +// + +// The CRC computation is used for x86. +// The calculations were taken from the article +// "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction - Intel (December, 2009)". + +#include "openvino/runtime/compute_hash.hpp" + +#include +#include +#include + +#include "openvino/core/visibility.hpp" + +#if !defined(OS_CHROMEOS) && (defined(OPENVINO_ARCH_X86) || defined(OPENVINO_ARCH_X86_64)) +# define OV_CORE_USE_XBYAK_JIT +#endif + +#ifdef OV_CORE_USE_XBYAK_JIT +# include "openvino/core/parallel.hpp" +# include "openvino/reference/utils/registers_pool.hpp" +#endif // OV_CORE_USE_XBYAK_JIT + +namespace ov { +namespace runtime { + +#ifdef OV_CORE_USE_XBYAK_JIT + +using namespace ov::reference::jit; + +namespace jit { + +# define GET_OFF(field) offsetof(ComputeHashCallArgs, field) +# define getReg64() RegistersPool::Reg(m_registers_pool) +# define getVmm() RegistersPool::Reg(m_registers_pool) +# define getXmm() RegistersPool::Reg(m_registers_pool) + +enum KernelType { SINGLE_THREAD = 0, FIRST_THREAD, N_THREAD, FINAL_FOLD }; + +struct ComputeHashCompileParams { + KernelType type; +}; + +struct ComputeHashCallArgs { + const void* src_ptr = nullptr; + void* dst_ptr = nullptr; + const void* k_ptr = nullptr; + void* intermediate_ptr = nullptr; + uint64_t work_amount = 0lu; + uint64_t size = 0lu; + uint64_t threads_num = 1lu; +}; + +typedef void (*hash_kernel)(const ComputeHashCallArgs*); + +static const uint8_t SHUF_MASK[16] = {0b00001111, + 0b00001110, + 0b00001101, + 0b00001100, + 0b00001011, + 0b00001010, + 0b00001001, + 0b00001000, + 0b00000111, + 0b00000110, + 0b00000101, + 0b00000100, + 0b00000011, + 0b00000010, + 0b00000001, + 0b00000000}; + +constexpr uint64_t CRC_VAL = 0xffffffffffffffff; + +// POLYNOM(x) = 0x42F0E1EBA9EA3693 +constexpr uint64_t K_2 = 0x05f5c3c7eb52fab6; // x^(64*2) +constexpr uint64_t P_1 = 0x578d29d06cc4f872; // floor(x^128/P(x))-x^64 +constexpr uint64_t P_2 = 0x42f0e1eba9ea3693; // P(x)-x^64 +static const uint64_t K_PULL[] = { + K_2, // x^(64*2) + 0x4eb938a7d257740e, // x^(64*3) + 0x571bee0a227ef92b, // x^(64*4) + 0x44bef2a201b5200c, // x^(64*5) + 0x54819d8713758b2c, // x^(64*6) + 0x4a6b90073eb0af5a, // x^(64*7) + 0x5f6843ca540df020, // x^(64*8) + 0xddf4b6981205b83f, // x^(64*9) + 0x097c516e98bd2e73, // x^(64*10) + 0x0b76477b31e22e7b, // x^(64*11) + 0x9af04e1eff82d0dd, // x^(64*12) + 0x6e82e609297f8fe8, // x^(64*13) + 0xe464f4df5fb60ac1, // 
x^(64*14) + 0xb649c5b35a759cf2, // x^(64*15) + 0x05cf79dea9ac37d6, // x^(64*16) + 0x001067e571d7d5c2 // x^(64*17) +}; + +constexpr uint64_t K_2_3_OFF = 0lu * 2lu * sizeof(uint64_t); +constexpr uint64_t K_4_5_OFF = 1lu * 2lu * sizeof(uint64_t); +constexpr uint64_t K_6_7_OFF = 2lu * 2lu * sizeof(uint64_t); +constexpr uint64_t K_8_9_OFF = 3lu * 2lu * sizeof(uint64_t); +constexpr uint64_t K_10_11_OFF = 4lu * 2lu * sizeof(uint64_t); +constexpr uint64_t K_12_13_OFF = 5lu * 2lu * sizeof(uint64_t); +constexpr uint64_t K_14_15_OFF = 6lu * 2lu * sizeof(uint64_t); +constexpr uint64_t K_16_17_OFF = 7lu * 2lu * sizeof(uint64_t); + +class HashBase : public Generator { +protected: + void (*ker_fn)(const ComputeHashCallArgs*); + +public: + HashBase(cpu_isa_t isa) : Generator(isa) {} + + virtual void generate() = 0; + + void operator()(const ComputeHashCallArgs* args) { + ker_fn(args); + } + + virtual void create_kernel() { + generate(); + ker_fn = (decltype(ker_fn))getCode(); + OPENVINO_ASSERT(ker_fn, "[ CORE ] Could not generate kernel code."); + } +}; + +template +class ComputeHash : public HashBase { +public: + explicit ComputeHash(const ComputeHashCompileParams& jcp) : HashBase(isa), m_jcp(jcp) { + if (!mayiuse(cpu_isa_t::pclmulqdq)) { + OPENVINO_THROW( + "The current CPU does not support pclmulqdq instruction, which is required for the CRC algorithm."); + } + if (mayiuse(cpu_isa_t::vpclmulqdq)) { + is_vpclmulqdq = true; + } + } + + void generate() override { + m_registers_pool = RegistersPool::create(isa, {rax, rcx, rsp, rdi, k0}); + + r64_src_ptr = getReg64(); + r64_dst_ptr = getReg64(); + r64_work_amount = getReg64(); + r64_k_ptr = getReg64(); + r64_aux = getReg64(); + v_k_2_3 = getVmm(); + v_shuf_mask = getVmm(); + auto v_dst = getVmm(); + + this->preamble(); + + initialize(v_dst); + bulk_fold(v_dst); + join(v_dst); + fold_to_128(v_dst); + fold_to_64(v_dst); + + this->postamble(); + m_registers_pool.reset(); + } + + static std::shared_ptr create(const ComputeHashCompileParams& params) { + auto kernel = std::make_shared(params); + OPENVINO_ASSERT(kernel, "[ CORE ] Could not create ComputeHash kernel."); + kernel->create_kernel(); + + return kernel; + } + +private: + using Vmm = typename std::conditional::type; + bool is_vpclmulqdq = false; + + ComputeHashCompileParams m_jcp; + RegistersPool::Ptr m_registers_pool; + + const Xbyak::Reg64 r64_params = abi_param1; + + RegistersPool::Reg r64_src_ptr; + RegistersPool::Reg r64_dst_ptr; + RegistersPool::Reg r64_work_amount; + RegistersPool::Reg r64_k_ptr; + RegistersPool::Reg r64_aux; + + // Vector registers + RegistersPool::Reg v_k_2_3; + RegistersPool::Reg v_shuf_mask; + + void initialize(const Vmm& v_dst); + + void bulk_fold(const Vmm& v_dst); + + void join(const Vmm& v_dst); + + void fold_to_128(const Vmm& v_dst); + + void fold_to_64(const Vmm& v_dst); + + void uni_vpxorq(const Xbyak::Xmm& v_dst, const Xbyak::Xmm& v_src_0, const Xbyak::Xmm& v_src_1); + + void uni_vmovdqu64(const Xbyak::Xmm& v_dst, const Xbyak::Operand& v_src_0); + + void uni_vmovdqu64(const Xbyak::Address& v_dst, const Xbyak::Xmm& v_src_0); + + void uni_vbroadcasti64x2(const Xbyak::Ymm& v_dst, const Xbyak::Address& v_src_0); + + void partial_load(const Xbyak::Xmm& xmm_dst, const Xbyak::Address& src_addr, const Xbyak::Reg64& r64_load_num); + + void partial_load(const Xbyak::Ymm& ymm_dst, const Xbyak::Address& src_addr, const Xbyak::Reg64& r64_load_num); +}; + +template <> +void ComputeHash::uni_vpxorq(const Xbyak::Xmm& v_dst, + const Xbyak::Xmm& v_src_0, + const Xbyak::Xmm& v_src_1) { + 
vpxorq(v_dst, v_src_0, v_src_1); +} +template +void ComputeHash::uni_vpxorq(const Xbyak::Xmm& v_dst, const Xbyak::Xmm& v_src_0, const Xbyak::Xmm& v_src_1) { + vpxor(v_dst, v_src_0, v_src_1); +} +template <> +void ComputeHash::uni_vmovdqu64(const Xbyak::Xmm& v_dst, const Xbyak::Operand& v_src_0) { + vmovdqu64(v_dst, v_src_0); +} +template +void ComputeHash::uni_vmovdqu64(const Xbyak::Xmm& v_dst, const Xbyak::Operand& v_src_0) { + vmovdqu(v_dst, v_src_0); +} +template <> +void ComputeHash::uni_vmovdqu64(const Xbyak::Address& v_dst, const Xbyak::Xmm& v_src_0) { + vmovdqu64(v_dst, v_src_0); +} +template +void ComputeHash::uni_vmovdqu64(const Xbyak::Address& v_dst, const Xbyak::Xmm& v_src_0) { + vmovdqu(v_dst, v_src_0); +} +template <> +void ComputeHash::uni_vbroadcasti64x2(const Xbyak::Ymm& v_dst, const Xbyak::Address& v_src_0) { + vbroadcasti64x2(v_dst, v_src_0); +} +template +void ComputeHash::uni_vbroadcasti64x2(const Xbyak::Ymm& v_dst, const Xbyak::Address& v_src_0) { + vbroadcasti128(v_dst, v_src_0); +} +template <> +void ComputeHash::partial_load(const Xbyak::Xmm& xmm_dst, + const Xbyak::Address& src_addr, + const Xbyak::Reg64& r64_load_num) { + Xbyak::Label l_mv_mask; + auto rOnes = getReg64(); + auto k_load_mask = RegistersPool::Reg(m_registers_pool); + + mov(rOnes, 0xFFFFFFFFFFFFFFFF); + cmp(r64_load_num, 0x3f); + jg(l_mv_mask); + + shlx(rOnes, rOnes, r64_load_num); + not_(rOnes); + + L(l_mv_mask); + kmovq(k_load_mask, rOnes); + + vmovdqu8(Vmm(xmm_dst.getIdx()) | k_load_mask | T_z, ptr[r64_src_ptr]); +} +template +void ComputeHash::partial_load(const Xbyak::Xmm& xmm_dst, + const Xbyak::Address& src_addr, + const Xbyak::Reg64& r64_load_num) { + Xbyak::Label l_partial, l_end; + + cmp(r64_load_num, xmm_len); + jl(l_partial, T_NEAR); + uni_vmovdqu64(xmm_dst, ptr[src_addr.getRegExp()]); + jmp(l_end, T_NEAR); + + L(l_partial); + { + uni_vpxorq(xmm_dst, xmm_dst, xmm_dst); + for (size_t j = 0lu; j < xmm_len - 1; j++) { + cmp(r64_load_num, static_cast(j)); + jle(l_end, T_NEAR); + pinsrb(xmm_dst, ptr[src_addr.getRegExp() + j], static_cast(j)); + } + } + + L(l_end); +} +template <> +void ComputeHash::partial_load(const Xbyak::Ymm& xmm_dst, + const Xbyak::Address& src_addr, + const Xbyak::Reg64& r64_load_num) { + partial_load(Xbyak::Xmm(xmm_dst.getIdx()), src_addr, r64_load_num); +} +template +void ComputeHash::partial_load(const Xbyak::Ymm& ymm_dst, + const Xbyak::Address& src_addr, + const Xbyak::Reg64& r64_load_num) { + Xbyak::Label l_xmm, l_partial, l_end; + auto xmm_dst = Xbyak::Xmm(ymm_dst.getIdx()); + + cmp(r64_load_num, ymm_len); + jl(l_xmm, T_NEAR); + uni_vmovdqu64(ymm_dst, ptr[src_addr.getRegExp()]); + jmp(l_end, T_NEAR); + + L(l_xmm); + uni_vpxorq(ymm_dst, ymm_dst, ymm_dst); + cmp(r64_load_num, xmm_len); + jl(l_partial, T_NEAR); + uni_vmovdqu64(xmm_dst, ptr[src_addr.getRegExp()]); + je(l_end, T_NEAR); + + { + Xbyak::Label l_rest_loop, l_perm; + + vperm2i128(ymm_dst, ymm_dst, ymm_dst, 0x1); + for (size_t j = 0lu; j < xmm_len - 1lu; j++) { + cmp(r64_load_num, static_cast(xmm_len + j)); + jle(l_perm, T_NEAR); + pinsrb(xmm_dst, ptr[src_addr.getRegExp() + xmm_len + j], static_cast(j)); + } + L(l_perm); + vperm2i128(ymm_dst, ymm_dst, ymm_dst, 0x1); + } + jmp(l_end, T_NEAR); + + L(l_partial); + { + for (size_t j = 0lu; j < xmm_len - 1; j++) { + cmp(r64_load_num, static_cast(j)); + jle(l_end, T_NEAR); + pinsrb(xmm_dst, ptr[src_addr.getRegExp() + j], static_cast(j)); + } + } + + L(l_end); +} + +template +void ComputeHash::initialize(const Vmm& v_dst) { + mov(r64_src_ptr, ptr[r64_params + 
GET_OFF(src_ptr)]); + mov(r64_dst_ptr, ptr[r64_params + GET_OFF(dst_ptr)]); + mov(r64_k_ptr, ptr[r64_params + GET_OFF(k_ptr)]); + mov(r64_work_amount, ptr[r64_params + GET_OFF(work_amount)]); + + uni_vbroadcasti64x2(v_k_2_3, ptr[r64_k_ptr + K_2_3_OFF]); + + mov(r64_aux, reinterpret_cast(SHUF_MASK)); + uni_vbroadcasti64x2(v_shuf_mask, ptr[r64_aux]); + + if (m_jcp.type == SINGLE_THREAD || m_jcp.type == FIRST_THREAD) { + auto xmm_dst = Xbyak::Xmm(v_dst.getIdx()); + auto xmm_aux = getXmm(); + + // Initial CRC + mov(r64_aux, ptr[r64_params + GET_OFF(size)]); + vpinsrq(xmm_aux, xmm_aux, r64_aux, 0x0); + mov(r64_aux, CRC_VAL); + vpinsrq(xmm_aux, xmm_aux, r64_aux, 0x1); + + // First xor with source. + partial_load(v_dst, ptr[r64_src_ptr], r64_work_amount); + vpshufb(v_dst, v_dst, v_shuf_mask); + pxor(xmm_dst, xmm_aux); // The SSE version is used to avoid zeroing out the rest of the Vmm. + if (m_jcp.type == SINGLE_THREAD) { + add(r64_src_ptr, xmm_len); + } + } else if (m_jcp.type == N_THREAD) { + uni_vmovdqu64(v_dst, ptr[r64_src_ptr]); + vpshufb(v_dst, v_dst, v_shuf_mask); + } + if (m_jcp.type == SINGLE_THREAD || m_jcp.type == FIRST_THREAD || m_jcp.type == N_THREAD) { + sub(r64_work_amount, xmm_len); + } +} + +template <> +void ComputeHash::bulk_fold(const Vmm& v_dst) { + if (m_jcp.type != SINGLE_THREAD && m_jcp.type != FIRST_THREAD && m_jcp.type != N_THREAD) { + return; + } + Xbyak::Label l_fold_loop, l_end; + cmp(r64_work_amount, static_cast(get_vlen() * 2lu - xmm_len)); + jl(l_end, T_NEAR); + + auto v_src_0 = getVmm(); + auto v_dst_0 = getVmm(); + auto v_dst_1 = getVmm(); + auto v_dst_2 = getVmm(); + auto& v_dst_3 = v_dst; + auto v_k_loop = getVmm(); + auto v_aux_0 = getVmm(); + + auto xmm_src_0 = Xbyak::Xmm(v_src_0.getIdx()); + auto xmm_src_1 = getXmm(); + auto xmm_dst_0 = Xbyak::Xmm(v_dst_0.getIdx()); + auto xmm_dst_1 = Xbyak::Xmm(v_dst_1.getIdx()); + auto xmm_dst_2 = Xbyak::Xmm(v_dst_2.getIdx()); + auto xmm_dst_3 = Xbyak::Xmm(v_dst_3.getIdx()); + auto xmm_k_loop = Xbyak::Xmm(v_k_loop.getIdx()); + auto xmm_k_2_3 = Xbyak::Xmm(v_k_2_3.getIdx()); + auto xmm_aux_0 = Xbyak::Xmm(v_aux_0.getIdx()); + + RegistersPool::Reg r64_bulk_step; + if (m_jcp.type == FIRST_THREAD || m_jcp.type == N_THREAD) { + r64_bulk_step = getReg64(); + mov(r64_bulk_step, ptr[r64_params + GET_OFF(threads_num)]); + sal(r64_bulk_step, static_cast(std::log2(get_vlen()))); // * vlen + } + + if (m_jcp.type == SINGLE_THREAD) { + uni_vbroadcasti64x2(v_k_loop, ptr[r64_k_ptr + K_8_9_OFF]); + } else { + uni_vbroadcasti64x2(v_k_loop, ptr[r64_k_ptr + K_16_17_OFF]); + } + + uni_vmovdqu64(v_dst_0, v_dst); + + if (!is_vpclmulqdq) { + vextracti64x2(xmm_dst_1, v_dst_0, 0x1); + vextracti64x2(xmm_dst_2, v_dst_0, 0x2); + vextracti64x2(xmm_dst_3, v_dst_0, 0x3); + } + + if (m_jcp.type == FIRST_THREAD || m_jcp.type == N_THREAD) { + add(r64_src_ptr, r64_bulk_step); + prefetcht2(ptr[r64_src_ptr + 16384]); + } else { + add(r64_src_ptr, static_cast(get_vlen() - xmm_len)); + prefetcht2(ptr[r64_src_ptr + 4096]); + } + prefetcht1(ptr[r64_src_ptr + 1024]); + prefetcht0(ptr[r64_src_ptr + 64]); + + sub(r64_work_amount, static_cast(get_vlen() * 2lu - xmm_len)); + + L(l_fold_loop); + { + uni_vmovdqu64(v_src_0, ptr[r64_src_ptr]); + vpshufb(v_src_0, v_src_0, v_shuf_mask); + + if (m_jcp.type == FIRST_THREAD || m_jcp.type == N_THREAD) { + add(r64_src_ptr, r64_bulk_step); + prefetcht2(ptr[r64_src_ptr + 16384]); + } else { + add(r64_src_ptr, static_cast(get_vlen())); + prefetcht2(ptr[r64_src_ptr + 4096]); + } + prefetcht1(ptr[r64_src_ptr + 1024]); + 
prefetcht0(ptr[r64_src_ptr + 64]); + + if (is_vpclmulqdq) { + vpclmulqdq(v_aux_0, v_dst_0, v_k_loop, 0b00000000); + vpclmulqdq(v_dst_0, v_dst_0, v_k_loop, 0b00010001); + uni_vpxorq(v_aux_0, v_aux_0, v_src_0); + uni_vpxorq(v_dst_0, v_dst_0, v_aux_0); + } else { + // 0 + vpclmulqdq(xmm_aux_0, xmm_dst_0, xmm_k_loop, 0b00000000); + vpclmulqdq(xmm_dst_0, xmm_dst_0, xmm_k_loop, 0b00010001); + uni_vpxorq(xmm_aux_0, xmm_aux_0, xmm_src_0); + uni_vpxorq(xmm_dst_0, xmm_dst_0, xmm_aux_0); + + // 1 + vextracti64x2(xmm_src_1, v_src_0, 0x1); + vpclmulqdq(xmm_aux_0, xmm_dst_1, xmm_k_loop, 0b00000000); + vpclmulqdq(xmm_dst_1, xmm_dst_1, xmm_k_loop, 0b00010001); + uni_vpxorq(xmm_aux_0, xmm_aux_0, xmm_src_1); + uni_vpxorq(xmm_dst_1, xmm_dst_1, xmm_aux_0); + + // 2 + vextracti64x2(xmm_src_1, v_src_0, 0x2); + vpclmulqdq(xmm_aux_0, xmm_dst_2, xmm_k_loop, 0b00000000); + vpclmulqdq(xmm_dst_2, xmm_dst_2, xmm_k_loop, 0b00010001); + uni_vpxorq(xmm_aux_0, xmm_aux_0, xmm_src_1); + uni_vpxorq(xmm_dst_2, xmm_dst_2, xmm_aux_0); + + // 3 + vextracti64x2(xmm_src_1, v_src_0, 0x3); + vpclmulqdq(xmm_aux_0, xmm_dst_3, xmm_k_loop, 0b00000000); + vpclmulqdq(xmm_dst_3, xmm_dst_3, xmm_k_loop, 0b00010001); + uni_vpxorq(xmm_aux_0, xmm_aux_0, xmm_src_1); + uni_vpxorq(xmm_dst_3, xmm_dst_3, xmm_aux_0); + } + + sub(r64_work_amount, static_cast(get_vlen())); + jge(l_fold_loop, T_NEAR); + } + add(r64_work_amount, static_cast(get_vlen())); + + if (m_jcp.type == SINGLE_THREAD) { + if (is_vpclmulqdq) { + vextracti64x2(xmm_dst_1, v_dst_0, 0x1); + vextracti64x2(xmm_dst_2, v_dst_0, 0x2); + vextracti64x2(xmm_dst_3, v_dst_0, 0x3); + } + + vpclmulqdq(xmm_aux_0, xmm_dst_0, ptr[r64_k_ptr + K_6_7_OFF], 0b00000000); + vpclmulqdq(xmm_dst_0, xmm_dst_0, ptr[r64_k_ptr + K_6_7_OFF], 0b00010001); + uni_vpxorq(xmm_dst_3, xmm_dst_3, xmm_aux_0); + uni_vpxorq(xmm_dst_3, xmm_dst_3, xmm_dst_0); + + vpclmulqdq(xmm_aux_0, xmm_dst_1, ptr[r64_k_ptr + K_4_5_OFF], 0b00000000); + vpclmulqdq(xmm_dst_1, xmm_dst_1, ptr[r64_k_ptr + K_4_5_OFF], 0b00010001); + uni_vpxorq(xmm_dst_3, xmm_dst_3, xmm_aux_0); + uni_vpxorq(xmm_dst_3, xmm_dst_3, xmm_dst_1); + + vpclmulqdq(xmm_aux_0, xmm_dst_2, xmm_k_2_3, 0b00000000); + vpclmulqdq(xmm_dst_2, xmm_dst_2, xmm_k_2_3, 0b00010001); + uni_vpxorq(xmm_dst_3, xmm_dst_3, xmm_aux_0); + uni_vpxorq(xmm_dst_3, xmm_dst_3, xmm_dst_2); + } else { + if (is_vpclmulqdq) { + uni_vmovdqu64(ptr[r64_dst_ptr], v_dst_0); + } else { + uni_vmovdqu64(ptr[r64_dst_ptr + xmm_len * 0lu], xmm_dst_0); + uni_vmovdqu64(ptr[r64_dst_ptr + xmm_len * 1lu], xmm_dst_1); + uni_vmovdqu64(ptr[r64_dst_ptr + xmm_len * 2lu], xmm_dst_2); + uni_vmovdqu64(ptr[r64_dst_ptr + xmm_len * 3lu], xmm_dst_3); + } + } + + L(l_end); +} + +template +void ComputeHash::bulk_fold(const Vmm& v_dst) { + if (m_jcp.type != SINGLE_THREAD && m_jcp.type != FIRST_THREAD && m_jcp.type != N_THREAD) { + return; + } + Xbyak::Label l_fold_loop, l_end; + cmp(r64_work_amount, static_cast(get_vlen() * 2lu - xmm_len)); + jl(l_end, T_NEAR); + + auto v_src_0 = getVmm(); + auto v_dst_0 = getVmm(); + auto& v_dst_1 = v_dst; + auto v_aux_0 = getVmm(); + auto v_k_loop = getVmm(); + + auto xmm_src_0 = Xbyak::Xmm(v_src_0.getIdx()); + auto xmm_src_1 = getXmm(); + auto xmm_dst_0 = Xbyak::Xmm(v_dst_0.getIdx()); + auto xmm_dst_1 = Xbyak::Xmm(v_dst_1.getIdx()); + auto xmm_k_loop = Xbyak::Xmm(v_k_loop.getIdx()); + auto xmm_k_2_3 = Xbyak::Xmm(v_k_2_3.getIdx()); + auto xmm_aux_0 = Xbyak::Xmm(v_aux_0.getIdx()); + + RegistersPool::Reg r64_bulk_step; + if (m_jcp.type == FIRST_THREAD || m_jcp.type == N_THREAD) { + r64_bulk_step = 
getReg64(); + mov(r64_bulk_step, ptr[r64_params + GET_OFF(threads_num)]); + sal(r64_bulk_step, static_cast(std::log2(get_vlen()))); // * vlen + } + + if (m_jcp.type == SINGLE_THREAD) { + uni_vbroadcasti64x2(v_k_loop, ptr[r64_k_ptr + K_4_5_OFF]); + } else { + uni_vbroadcasti64x2(v_k_loop, ptr[r64_k_ptr + K_8_9_OFF]); + } + + uni_vmovdqu64(v_dst_0, v_dst); + + if (!is_vpclmulqdq) { + vextracti128(xmm_dst_1, v_dst_0, 0x1); + } + + if (m_jcp.type == SINGLE_THREAD) { + add(r64_src_ptr, static_cast(get_vlen() - xmm_len)); + } else { + add(r64_src_ptr, r64_bulk_step); + } + prefetcht2(ptr[r64_src_ptr + 4096]); + prefetcht1(ptr[r64_src_ptr + 1024]); + prefetcht0(ptr[r64_src_ptr + 64]); + + sub(r64_work_amount, static_cast(get_vlen() * 2lu - xmm_len)); + + L(l_fold_loop); + { + uni_vmovdqu64(v_src_0, ptr[r64_src_ptr]); + vpshufb(v_src_0, v_src_0, v_shuf_mask); + + if (m_jcp.type == SINGLE_THREAD) { + add(r64_src_ptr, static_cast(get_vlen())); + } else { + add(r64_src_ptr, r64_bulk_step); + } + prefetcht2(ptr[r64_src_ptr + 4096]); + prefetcht1(ptr[r64_src_ptr + 1024]); + prefetcht0(ptr[r64_src_ptr + 64]); + + if (is_vpclmulqdq) { + vpclmulqdq(v_aux_0, v_dst_0, v_k_loop, 0b00000000); + vpclmulqdq(v_dst_0, v_dst_0, v_k_loop, 0b00010001); + uni_vpxorq(v_aux_0, v_aux_0, v_src_0); + uni_vpxorq(v_dst_0, v_dst_0, v_aux_0); + } else { + // 0 + vpclmulqdq(xmm_aux_0, xmm_dst_0, xmm_k_loop, 0b00000000); + vpclmulqdq(xmm_dst_0, xmm_dst_0, xmm_k_loop, 0b00010001); + uni_vpxorq(xmm_aux_0, xmm_aux_0, xmm_src_0); + uni_vpxorq(xmm_dst_0, xmm_dst_0, xmm_aux_0); + // 1 + vextracti128(xmm_src_1, v_src_0, 0x1); + vpclmulqdq(xmm_aux_0, xmm_dst_1, xmm_k_loop, 0b00000000); + vpclmulqdq(xmm_dst_1, xmm_dst_1, xmm_k_loop, 0b00010001); + uni_vpxorq(xmm_aux_0, xmm_aux_0, xmm_src_1); + uni_vpxorq(xmm_dst_1, xmm_dst_1, xmm_aux_0); + } + + sub(r64_work_amount, static_cast(get_vlen())); + jge(l_fold_loop, T_NEAR); + } + add(r64_work_amount, static_cast(get_vlen())); + + if (m_jcp.type == SINGLE_THREAD) { + if (is_vpclmulqdq) { + vextracti128(xmm_dst_1, v_dst_0, 0x1); + } + vpclmulqdq(xmm_aux_0, xmm_dst_0, xmm_k_2_3, 0b00000000); + vpclmulqdq(xmm_dst_0, xmm_dst_0, xmm_k_2_3, 0b00010001); + uni_vpxorq(xmm_dst_1, xmm_dst_1, xmm_aux_0); + uni_vpxorq(xmm_dst_1, xmm_dst_1, xmm_dst_0); + } else { + if (is_vpclmulqdq) { + uni_vmovdqu64(ptr[r64_dst_ptr], v_dst_0); + } else { + uni_vmovdqu64(ptr[r64_dst_ptr + xmm_len * 0lu], xmm_dst_0); + uni_vmovdqu64(ptr[r64_dst_ptr + xmm_len * 1lu], xmm_dst_1); + } + } + + L(l_end); +} + +template <> +void ComputeHash::join(const Vmm& v_dst) { + if (m_jcp.type != FINAL_FOLD) { + return; + } + + mov(r64_aux, ptr[r64_params + GET_OFF(intermediate_ptr)]); + prefetcht0(ptr[r64_aux + 1024]); + + auto xmm_src_0 = getXmm(); + auto xmm_src_last = Xbyak::Xmm(v_dst.getIdx()); + auto xmm_aux_0 = getXmm(); + auto xmm_k_2_3 = Xbyak::Xmm(v_k_2_3.getIdx()); + + uni_vmovdqu64(xmm_src_last, ptr[r64_aux + xmm_len * 7]); + + uni_vmovdqu64(xmm_src_0, ptr[r64_aux]); + vpclmulqdq(xmm_aux_0, xmm_src_0, ptr[r64_k_ptr + K_14_15_OFF], 0b00000000); + vpclmulqdq(xmm_src_0, xmm_src_0, ptr[r64_k_ptr + K_14_15_OFF], 0b00010001); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_aux_0); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_src_0); + + uni_vmovdqu64(xmm_src_0, ptr[r64_aux + xmm_len]); + vpclmulqdq(xmm_aux_0, xmm_src_0, ptr[r64_k_ptr + K_12_13_OFF], 0b00000000); + vpclmulqdq(xmm_src_0, xmm_src_0, ptr[r64_k_ptr + K_12_13_OFF], 0b00010001); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_aux_0); + uni_vpxorq(xmm_src_last, xmm_src_last, 
xmm_src_0); + + uni_vmovdqu64(xmm_src_0, ptr[r64_aux + xmm_len * 2lu]); + vpclmulqdq(xmm_aux_0, xmm_src_0, ptr[r64_k_ptr + K_10_11_OFF], 0b00000000); + vpclmulqdq(xmm_src_0, xmm_src_0, ptr[r64_k_ptr + K_10_11_OFF], 0b00010001); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_aux_0); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_src_0); + + uni_vmovdqu64(xmm_src_0, ptr[r64_aux + xmm_len * 3lu]); + vpclmulqdq(xmm_aux_0, xmm_src_0, ptr[r64_k_ptr + K_8_9_OFF], 0b00000000); + vpclmulqdq(xmm_src_0, xmm_src_0, ptr[r64_k_ptr + K_8_9_OFF], 0b00010001); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_aux_0); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_src_0); + + uni_vmovdqu64(xmm_src_0, ptr[r64_aux + xmm_len * 4lu]); + vpclmulqdq(xmm_aux_0, xmm_src_0, ptr[r64_k_ptr + K_6_7_OFF], 0b00000000); + vpclmulqdq(xmm_src_0, xmm_src_0, ptr[r64_k_ptr + K_6_7_OFF], 0b00010001); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_aux_0); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_src_0); + + uni_vmovdqu64(xmm_src_0, ptr[r64_aux + xmm_len * 5lu]); + vpclmulqdq(xmm_aux_0, xmm_src_0, ptr[r64_k_ptr + K_4_5_OFF], 0b00000000); + vpclmulqdq(xmm_src_0, xmm_src_0, ptr[r64_k_ptr + K_4_5_OFF], 0b00010001); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_aux_0); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_src_0); + + uni_vmovdqu64(xmm_src_0, ptr[r64_aux + xmm_len * 6lu]); + vpclmulqdq(xmm_aux_0, xmm_src_0, xmm_k_2_3, 0b00000000); + vpclmulqdq(xmm_src_0, xmm_src_0, xmm_k_2_3, 0b00010001); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_aux_0); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_src_0); +} + +template +void ComputeHash::join(const Vmm& v_dst) { + if (m_jcp.type != FINAL_FOLD) { + return; + } + + mov(r64_aux, ptr[r64_params + GET_OFF(intermediate_ptr)]); + prefetcht0(ptr[r64_aux + 1024]); + + auto xmm_src_0 = getXmm(); + auto xmm_src_last = Xbyak::Xmm(v_dst.getIdx()); + auto xmm_aux_0 = getXmm(); + auto xmm_k_2_3 = Xbyak::Xmm(v_k_2_3.getIdx()); + + uni_vmovdqu64(xmm_src_last, ptr[r64_aux + xmm_len * 3]); + + uni_vmovdqu64(xmm_src_0, ptr[r64_aux + xmm_len * 0lu]); + vpclmulqdq(xmm_aux_0, xmm_src_0, ptr[r64_k_ptr + K_6_7_OFF], 0b00000000); + vpclmulqdq(xmm_src_0, xmm_src_0, ptr[r64_k_ptr + K_6_7_OFF], 0b00010001); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_aux_0); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_src_0); + + uni_vmovdqu64(xmm_src_0, ptr[r64_aux + xmm_len * 1lu]); + vpclmulqdq(xmm_aux_0, xmm_src_0, ptr[r64_k_ptr + K_4_5_OFF], 0b00000000); + vpclmulqdq(xmm_src_0, xmm_src_0, ptr[r64_k_ptr + K_4_5_OFF], 0b00010001); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_aux_0); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_src_0); + + uni_vmovdqu64(xmm_src_0, ptr[r64_aux + xmm_len * 2lu]); + vpclmulqdq(xmm_aux_0, xmm_src_0, xmm_k_2_3, 0b00000000); + vpclmulqdq(xmm_src_0, xmm_src_0, xmm_k_2_3, 0b00010001); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_aux_0); + uni_vpxorq(xmm_src_last, xmm_src_last, xmm_src_0); +} + +template +void ComputeHash::fold_to_128(const Vmm& v_dst) { + if (m_jcp.type != SINGLE_THREAD && m_jcp.type != FINAL_FOLD) { + return; + } + Xbyak::Label l_fold_loop, l_end; + cmp(r64_work_amount, xmm_len); + jl(l_end, T_NEAR); + + auto xmm_src = getXmm(); + auto xmm_dst = Xbyak::Xmm(v_dst.getIdx()); + auto xmm_k_2_3 = Xbyak::Xmm(v_k_2_3.getIdx()); + auto xmm_shuf_mask = Xbyak::Xmm(v_shuf_mask.getIdx()); + auto xmm_aux = getXmm(); + + L(l_fold_loop); + { + uni_vmovdqu64(xmm_src, ptr[r64_src_ptr]); + vpshufb(xmm_src, xmm_src, xmm_shuf_mask); + + vpclmulqdq(xmm_aux, xmm_dst, xmm_k_2_3, 0b00000000); + vpclmulqdq(xmm_dst, 
xmm_dst, xmm_k_2_3, 0b00010001); + uni_vpxorq(xmm_dst, xmm_dst, xmm_aux); + uni_vpxorq(xmm_dst, xmm_dst, xmm_src); + + add(r64_src_ptr, xmm_len); + sub(r64_work_amount, xmm_len); + cmp(r64_work_amount, xmm_len); + jge(l_fold_loop, T_NEAR); + } + + L(l_end); +} + +template +void ComputeHash::fold_to_64(const Vmm& v_dst) { + if (m_jcp.type != SINGLE_THREAD && m_jcp.type != FINAL_FOLD) { + return; + } + Xbyak::Label l_fold_to_64; + cmp(r64_work_amount, 0); + jle(l_fold_to_64, T_NEAR); + + auto xmm_src = getXmm(); + auto xmm_dst = Xbyak::Xmm(v_dst.getIdx()); + auto xmm_k_2_3 = Xbyak::Xmm(v_k_2_3.getIdx()); + auto xmm_shuf_mask = Xbyak::Xmm(v_shuf_mask.getIdx()); + auto xmm_aux = getXmm(); + auto xmm_aux_1 = getXmm(); + auto xmm_aux_2 = getXmm(); + + partial_load(xmm_src, ptr[r64_src_ptr], r64_work_amount); + vpshufb(xmm_src, xmm_src, xmm_shuf_mask); + + vpclmulqdq(xmm_aux, xmm_dst, xmm_k_2_3, 0b00000000); + vpclmulqdq(xmm_dst, xmm_dst, xmm_k_2_3, 0b00010001); + uni_vpxorq(xmm_aux, xmm_aux, xmm_src); + uni_vpxorq(xmm_dst, xmm_dst, xmm_aux); + + L(l_fold_to_64); + + mov(r64_aux, K_2); + vpinsrq(xmm_aux, xmm_aux, r64_aux, 0x0); + vpclmulqdq(xmm_aux, xmm_dst, xmm_aux, 0b00000001); + vpslldq(xmm_dst, xmm_dst, 0x8); + uni_vpxorq(xmm_dst, xmm_dst, xmm_aux); + + mov(r64_aux, P_1); + vpinsrq(xmm_aux_2, xmm_aux_2, r64_aux, 0x0); + vpclmulqdq(xmm_aux, xmm_dst, xmm_aux_2, 0b00000001); + mov(r64_aux, 0x0); + vpinsrq(xmm_aux_1, xmm_dst, r64_aux, 0x0); + uni_vpxorq(xmm_aux, xmm_aux, xmm_aux_1); + vpinsrq(xmm_aux_1, xmm_aux, r64_aux, 0x0); + + mov(r64_aux, P_2); + vpinsrq(xmm_aux_2, xmm_aux_2, r64_aux, 0x1); + vpclmulqdq(xmm_aux, xmm_aux, xmm_aux_2, 0b00010001); + uni_vpxorq(xmm_aux, xmm_aux, xmm_aux_1); + uni_vpxorq(xmm_dst, xmm_dst, xmm_aux); + + vpextrq(ptr[r64_dst_ptr], xmm_dst, 0x0); +} + +} // namespace jit +#endif // OV_CORE_USE_XBYAK_JIT + +size_t compute_hash(const void* src, size_t size) { +#ifdef OV_CORE_USE_XBYAK_JIT + if (Generator::mayiuse(avx2)) { + uint64_t result = 0lu; + + // Parallel section + constexpr uint64_t min_wa_per_thread = 131072lu; // 2^17 + const uint64_t size_u64 = static_cast(size); + if (size_u64 >= min_wa_per_thread * 2lu) { + static auto first_thr_kernel = Generator::mayiuse(avx512_core) + ? jit::ComputeHash::create({jit::FIRST_THREAD}) + : jit::ComputeHash::create({jit::FIRST_THREAD}); + static auto n_thr_kernel = Generator::mayiuse(avx512_core) + ? jit::ComputeHash::create({jit::N_THREAD}) + : jit::ComputeHash::create({jit::N_THREAD}); + static auto final_fold_kernel = Generator::mayiuse(avx512_core) + ? jit::ComputeHash::create({jit::FINAL_FOLD}) + : jit::ComputeHash::create({jit::FINAL_FOLD}); + + static const uint64_t max_thr_num = 2lu; + uint64_t thr_num = std::min(size_u64 / min_wa_per_thread, max_thr_num); + const uint64_t el_per_thread = + first_thr_kernel->get_vlen() * ((size_u64 / thr_num) / first_thr_kernel->get_vlen()); + std::vector intermediate(thr_num * first_thr_kernel->get_vlen()); + + parallel_nt_static(static_cast(thr_num), [&](const int ithr, const int nthr) { + uint64_t start = el_per_thread * ithr; + if (start >= size_u64) { + return; + } + uint64_t work_amount = (el_per_thread + start > size_u64) ? 
size_u64 - start : el_per_thread; + + jit::ComputeHashCallArgs args; + + args.src_ptr = reinterpret_cast(src) + first_thr_kernel->get_vlen() * ithr; + args.dst_ptr = &(intermediate[first_thr_kernel->get_vlen() * ithr]); + args.k_ptr = jit::K_PULL; + args.work_amount = work_amount; + args.size = size_u64; + args.threads_num = thr_num; + + if (ithr == 0) { + (*first_thr_kernel)(&args); + } else { + (*n_thr_kernel)(&args); + } + }); + + jit::ComputeHashCallArgs args; + args.work_amount = size_u64 - el_per_thread * thr_num; + args.src_ptr = reinterpret_cast(src) + size_u64 - args.work_amount; + args.dst_ptr = &result; + args.k_ptr = jit::K_PULL; + args.size = size_u64; + args.intermediate_ptr = intermediate.data(); + + (*final_fold_kernel)(&args); + } else { + static auto single_thr_kernel = Generator::mayiuse(avx512_core) + ? jit::ComputeHash::create({jit::SINGLE_THREAD}) + : jit::ComputeHash::create({jit::SINGLE_THREAD}); + + jit::ComputeHashCallArgs args; + args.src_ptr = src; + args.dst_ptr = &result; + args.k_ptr = jit::K_PULL; + args.work_amount = size_u64; + args.size = size_u64; + + (*single_thr_kernel)(&args); + } + + return result; + } + +#endif // OV_CORE_USE_XBYAK_JIT + + constexpr auto cel_size = sizeof(size_t); + size_t seed = size; + const auto data = static_cast(src); + const auto d_end = std::next(data, size / cel_size); + // The constant value used as a magic number has been + // traditionally used e.g. in boost library's hash_combine. + // It happens to be derived from the golden ratio. + for (auto d = data; d != d_end; ++d) { + seed ^= *d + 0x9e3779b9 + (seed << 6) + (seed >> 2); + } + size_t last_bytes{0}; + std::memcpy(&last_bytes, d_end, size % cel_size); + seed ^= last_bytes + 0x9e3779b9 + (seed << 6) + (seed >> 2); + + return seed; +} + +} // namespace runtime +} // namespace ov From 8a33df72760f075bb277161f3b5b2ad3768963bf Mon Sep 17 00:00:00 2001 From: "Anastasiya(Asya) Pronina" Date: Mon, 21 Oct 2024 22:34:31 +0100 Subject: [PATCH 20/24] Added i8 for DQMatMulCwi (#27112) ### Details: - *Added i8 for DQMatMulCwi* ### Tickets: - *N/A* --------- Co-authored-by: Dmitry Matveev --- .../intel_npu/src/plugin/npuw/partitioning/patterns/opt.cpp | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/plugins/intel_npu/src/plugin/npuw/partitioning/patterns/opt.cpp b/src/plugins/intel_npu/src/plugin/npuw/partitioning/patterns/opt.cpp index 077fb6d6660132..ddf1449adb9d59 100644 --- a/src/plugins/intel_npu/src/plugin/npuw/partitioning/patterns/opt.cpp +++ b/src/plugins/intel_npu/src/plugin/npuw/partitioning/patterns/opt.cpp @@ -149,8 +149,9 @@ DQMatMulCWi::DQMatMulCWi() { auto qcoeff_shape = matched_qcoeff->output(0).get_shape(); - if (ov::element::i4 == matched_qweight->get_element_type() && qcoeff_shape[1] == 1 && - !matched_matmul->get_transpose_a() && matched_matmul->get_transpose_b()) { + if ((ov::element::i4 == matched_qweight->get_element_type() || + ov::element::i8 == matched_qweight->get_element_type()) && + qcoeff_shape[1] == 1 && !matched_matmul->get_transpose_a() && matched_matmul->get_transpose_b()) { auto matched_node_cvtw = node_to_output.at(qcvtw).get_node_shared_ptr(); auto matched_node_cvtm = node_to_output.at(qcvtm).get_node_shared_ptr(); auto matched_node_muls = node_to_output.at(qmuls).get_node_shared_ptr(); From de0dc7199945e132142cfd7bbd72f0755799eb06 Mon Sep 17 00:00:00 2001 From: Georgy Krivoruchko Date: Tue, 22 Oct 2024 09:02:08 +0400 Subject: [PATCH 21/24] [ONNX] Direct loading from ModelProto object (#27124) ### Details: - Direct 
loading from ModelProto object ### Tickets: - 155265 --- src/frontends/onnx/frontend/src/editor.cpp | 8 +++ src/frontends/onnx/frontend/src/editor.hpp | 9 +++ src/frontends/onnx/frontend/src/frontend.cpp | 31 ++++++++- .../onnx/frontend/src/input_model.cpp | 3 + .../onnx/frontend/src/input_model.hpp | 3 + src/frontends/onnx/tests/CMakeLists.txt | 5 +- src/frontends/onnx/tests/load_from.cpp | 66 +++++++++++++++++++ 7 files changed, 123 insertions(+), 2 deletions(-) diff --git a/src/frontends/onnx/frontend/src/editor.cpp b/src/frontends/onnx/frontend/src/editor.cpp index eaa7b31a61c03f..4ad576cd9d5b96 100644 --- a/src/frontends/onnx/frontend/src/editor.cpp +++ b/src/frontends/onnx/frontend/src/editor.cpp @@ -343,6 +343,14 @@ ONNXModelEditor::ONNXModelEditor(std::istream& model_stream, delete impl; }} {} +ONNXModelEditor::ONNXModelEditor(std::shared_ptr model_proto, frontend::ExtensionHolder extensions) + : m_model_path{""}, + m_mmap_cache{nullptr}, + m_extensions{std::move(extensions)}, + m_pimpl{new ONNXModelEditor::Impl{model_proto}, [](Impl* impl) { + delete impl; + }} {} + const std::string& ONNXModelEditor::model_path() const { return m_model_path; } diff --git a/src/frontends/onnx/frontend/src/editor.hpp b/src/frontends/onnx/frontend/src/editor.hpp index 81d2527c88b9cf..5c7619ed87dbf2 100644 --- a/src/frontends/onnx/frontend/src/editor.hpp +++ b/src/frontends/onnx/frontend/src/editor.hpp @@ -16,6 +16,8 @@ #include "openvino/op/constant.hpp" #include "utils/tensor_external_data.hpp" +using ::ONNX_NAMESPACE::ModelProto; + namespace ov { namespace frontend { namespace onnx { @@ -54,6 +56,13 @@ class ONNXModelEditor final { const bool enable_mmap = false, frontend::ExtensionHolder extensions = {}); + /// \brief Creates an editor from a ModelProto. The model_proto is + /// stored in m_model_proto member variable. + /// + /// \param model_proto A shared pointer on ModelProto object. + /// \param extensions Holder for custom extensions (like custom ops). + ONNXModelEditor(std::shared_ptr model_proto, frontend::ExtensionHolder extensions = {}); + /// \brief Modifies the in-memory representation of the model by setting /// custom input types for all inputs specified in the provided map. /// diff --git a/src/frontends/onnx/frontend/src/frontend.cpp b/src/frontends/onnx/frontend/src/frontend.cpp index d4b83fee20db82..8afc9b661ec28d 100644 --- a/src/frontends/onnx/frontend/src/frontend.cpp +++ b/src/frontends/onnx/frontend/src/frontend.cpp @@ -32,6 +32,8 @@ using namespace ov; using namespace ov::frontend::onnx; using namespace ov::frontend::onnx::common; +using ::ONNX_NAMESPACE::ModelProto; +using ::ONNX_NAMESPACE::Version; ONNX_FRONTEND_C_API ov::frontend::FrontEndVersion get_api_version() { return OV_FRONTEND_API_VERSION; @@ -83,6 +85,17 @@ InputModel::Ptr FrontEnd::load_impl(const std::vector& variants) const #endif return std::make_shared(*stream, enable_mmap, m_extensions); } + // !!! Experimental feature, it may be changed or removed in the future !!! + if (variants[0].is()) { + void* model_proto_addr = reinterpret_cast(variants[0].as()); + FRONT_END_GENERAL_CHECK(model_proto_addr != 0, "Wrong address of a ModelProto object is passed"); + ModelProto* model_proto_ptr = static_cast(model_proto_addr); + FRONT_END_GENERAL_CHECK( + model_proto_ptr->has_ir_version() && model_proto_ptr->ir_version() < Version::IR_VERSION, + "A ModelProto object contains unsupported IR version"); + return std::make_shared(std::make_shared(*model_proto_ptr), m_extensions); + } + // !!! 
End of Experimental feature return nullptr; } @@ -213,7 +226,23 @@ bool FrontEnd::supported_impl(const std::vector& variants) const { StreamRewinder rwd{*stream}; return is_valid_model(*stream); } - + // !!! Experimental feature, it may be changed or removed in the future !!! + if (variants[0].is()) { + void* model_proto_addr = reinterpret_cast(variants[0].as()); + if (model_proto_addr == 0) { + return false; + } + ModelProto* model_proto_ptr = static_cast(model_proto_addr); + try { + if (!model_proto_ptr->has_ir_version() || model_proto_ptr->ir_version() > Version::IR_VERSION) { + return false; + } + } catch (...) { + return false; + } + return true; + } + // !!! End of Experimental feature return false; } diff --git a/src/frontends/onnx/frontend/src/input_model.cpp b/src/frontends/onnx/frontend/src/input_model.cpp index 108690a6d645d9..87f1439eb18b38 100644 --- a/src/frontends/onnx/frontend/src/input_model.cpp +++ b/src/frontends/onnx/frontend/src/input_model.cpp @@ -37,6 +37,9 @@ InputModel::InputModel(std::istream& model_stream, : InputModel(model_stream, ov::util::wstring_to_string(path), enable_mmap, std::move(extensions)) {} #endif +InputModel::InputModel(std::shared_ptr model_proto, frontend::ExtensionHolder extensions) + : m_editor{std::make_shared(model_proto, std::move(extensions))} {} + std::vector InputModel::get_inputs() const { const auto& inputs = m_editor->model_inputs(); std::vector in_places; diff --git a/src/frontends/onnx/frontend/src/input_model.hpp b/src/frontends/onnx/frontend/src/input_model.hpp index 9bf44a5672fb28..246696621f1fd4 100644 --- a/src/frontends/onnx/frontend/src/input_model.hpp +++ b/src/frontends/onnx/frontend/src/input_model.hpp @@ -10,6 +10,8 @@ #include "openvino/frontend/extension/holder.hpp" +using ::ONNX_NAMESPACE::ModelProto; + namespace ov { namespace frontend { namespace onnx { @@ -33,6 +35,7 @@ class InputModel : public ov::frontend::InputModel { const bool enable_mmap = false, ExtensionHolder extensions = {}); #endif + InputModel(std::shared_ptr model_proto, ExtensionHolder extensions = {}); std::vector get_inputs() const override; std::vector get_outputs() const override; diff --git a/src/frontends/onnx/tests/CMakeLists.txt b/src/frontends/onnx/tests/CMakeLists.txt index 599c7c43b05395..9b928773b7d65a 100644 --- a/src/frontends/onnx/tests/CMakeLists.txt +++ b/src/frontends/onnx/tests/CMakeLists.txt @@ -182,8 +182,11 @@ add_custom_command(TARGET ov_onnx_frontend_tests POST_BUILD ${custom_commands} COMMENT "Copy test manifest files to ${TEST_MODEL_ZOO}/onnx") -# process models +# Process models add_dependencies(ov_onnx_frontend_tests test_model_zoo) +# Working with ModelProto +ov_link_system_libraries(ov_onnx_frontend_tests PUBLIC onnx_proto onnx) + add_subdirectory(standalone_build) add_dependencies(ov_onnx_frontend_tests onnx_fe_standalone_build_test) diff --git a/src/frontends/onnx/tests/load_from.cpp b/src/frontends/onnx/tests/load_from.cpp index 617f4a917567d5..547937ac52171f 100644 --- a/src/frontends/onnx/tests/load_from.cpp +++ b/src/frontends/onnx/tests/load_from.cpp @@ -4,6 +4,7 @@ #include "load_from.hpp" #include +#include #include @@ -61,3 +62,68 @@ INSTANTIATE_TEST_SUITE_P(ONNXLoadTest, FrontEndLoadFromTest, ::testing::Values(getTestData()), FrontEndLoadFromTest::getTestCaseName); + +// !!! Experimental feature, it may be changed or removed in the future !!! 
+using ::ONNX_NAMESPACE::ModelProto; +using ::ONNX_NAMESPACE::Version; + +TEST_P(FrontEndLoadFromTest, testLoadFromModelProtoUint64) { + const auto path = + ov::util::path_join({ov::test::utils::getExecutableDirectory(), TEST_ONNX_MODELS_DIRNAME, "abs.onnx"}); + std::ifstream ifs(path, std::ios::in | std::ios::binary); + ASSERT_TRUE(ifs.is_open()) << "Could not open an ifstream for the model path: " << path; + std::vector frontends; + FrontEnd::Ptr fe; + + { + auto model_proto = std::make_shared(); + ASSERT_TRUE(model_proto->ParseFromIstream(&ifs)) << "Could not parse ModelProto from file: " << path; + + uint64_t model_proto_ptr = reinterpret_cast(model_proto.get()); + + ASSERT_NO_THROW(m_frontEnd = m_fem.load_by_model(model_proto_ptr)) + << "Could not create the ONNX FE using a pointer on ModelProto object as uint64_t"; + ASSERT_NE(m_frontEnd, nullptr); + ASSERT_NO_THROW(m_inputModel = m_frontEnd->load(model_proto_ptr)) << "Could not load the model"; + ASSERT_NE(m_inputModel, nullptr); + } + + std::shared_ptr model; + ASSERT_NO_THROW(model = m_frontEnd->convert(m_inputModel)) << "Could not convert the model to OV representation"; + ASSERT_NE(model, nullptr); + + ASSERT_TRUE(model->get_ordered_ops().size() > 0); +} + +TEST_P(FrontEndLoadFromTest, testLoadFromModelProtoUint64_Negative) { + const auto path = + ov::util::path_join({ov::test::utils::getExecutableDirectory(), TEST_ONNX_MODELS_DIRNAME, "abs.onnx"}); + std::ifstream ifs(path, std::ios::in | std::ios::binary); + ASSERT_TRUE(ifs.is_open()) << "Could not open an ifstream for the model path: " << path; + std::vector frontends; + FrontEnd::Ptr fe; + + auto model_proto = std::make_shared(); + ASSERT_TRUE(model_proto->ParseFromIstream(&ifs)) << "Could not parse ModelProto from file: " << path; + + uint64_t model_proto_ptr = reinterpret_cast(model_proto.get()); + + ASSERT_NO_THROW(m_frontEnd = m_fem.load_by_model(model_proto_ptr)) + << "Could not create the ONNX FE using a pointer on ModelProto object as uint64_t"; + ASSERT_NE(m_frontEnd, nullptr); + // Should say unsupported if an address is 0 + ASSERT_FALSE(m_frontEnd->supported(static_cast(0))); + // Should throw an ov::Exception if address is 0 + OV_EXPECT_THROW(m_inputModel = m_frontEnd->load(static_cast(0)), + ov::Exception, + testing::HasSubstr("Wrong address")); + + model_proto->set_ir_version(Version::IR_VERSION + 1); + // Should say unsupported if ModelProto has IR_VERSION higher than supported + ASSERT_FALSE(m_frontEnd->supported(model_proto_ptr)); + // Should throw an ov::Exception if address is 0 + OV_EXPECT_THROW(m_inputModel = m_frontEnd->load(model_proto_ptr), + ov::Exception, + testing::HasSubstr("unsupported IR version")); +} +// !!! End of Experimental feature !!! 
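
For reference, the experimental path introduced above lets an application hand the ONNX frontend an already-parsed, in-memory `ModelProto` by passing its raw address packed into a `uint64_t` to `load_by_model`/`load`, instead of a file path or a stream. A minimal sketch of that flow, mirroring the new tests in `load_from.cpp` (the include paths and the `abs.onnx` file name are illustrative assumptions, not part of the patch):

```cpp
// Minimal sketch (not part of the patch): feeding an in-memory ModelProto to the
// ONNX frontend. Include paths and the model file name are illustrative assumptions.
#include <cstdint>
#include <fstream>
#include <memory>

#include <onnx/onnx_pb.h>                 // ONNX_NAMESPACE::ModelProto
#include <openvino/core/model.hpp>
#include <openvino/frontend/manager.hpp>

int main() {
    auto model_proto = std::make_shared<ONNX_NAMESPACE::ModelProto>();
    std::ifstream ifs("abs.onnx", std::ios::in | std::ios::binary);
    if (!ifs.is_open() || !model_proto->ParseFromIstream(&ifs)) {
        return 1;  // could not read or parse the model
    }

    // The experimental API expects the raw address of the ModelProto packed into a uint64_t.
    const auto model_proto_addr = reinterpret_cast<uint64_t>(model_proto.get());

    ov::frontend::FrontEndManager fem;
    ov::frontend::FrontEnd::Ptr fe = fem.load_by_model(model_proto_addr);  // selects the ONNX frontend
    if (!fe) {
        return 1;
    }
    ov::frontend::InputModel::Ptr input_model = fe->load(model_proto_addr);
    std::shared_ptr<ov::Model> ov_model = fe->convert(input_model);
    return ov_model ? 0 : 1;
}
```

Note that, per the frontend change above, the `InputModel` is built from a copy of the passed `ModelProto`, and the whole path is explicitly marked experimental, so the calling convention may change or be removed.
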
From d6dc4952c9c7f1a53dd2f0f64fd2f24e0b8ceb16 Mon Sep 17 00:00:00 2001 From: Roman Kazantsev Date: Tue, 22 Oct 2024 09:13:11 +0400 Subject: [PATCH 22/24] [PT FE] Fix aten.index.Tensor for indices with None values (#27122) **Details:** Support Stable Diffusion in ExportedProgram format **Ticket:** 149983 --------- Signed-off-by: Kazantsev, Roman --- src/frontends/pytorch/src/op/index.cpp | 200 +----------------- .../src/transforms/aten_index_replacer.cpp | 182 +--------------- src/frontends/pytorch/src/utils.cpp | 193 +++++++++++++++++ src/frontends/pytorch/src/utils.hpp | 9 + .../pytorch_tests/test_index_tensor.py | 49 +++++ .../pytorch_tests/test_upsample.py | 3 + 6 files changed, 272 insertions(+), 364 deletions(-) create mode 100644 tests/layer_tests/pytorch_tests/test_index_tensor.py diff --git a/src/frontends/pytorch/src/op/index.cpp b/src/frontends/pytorch/src/op/index.cpp index a1e286cad93adc..880e0acee0f983 100644 --- a/src/frontends/pytorch/src/op/index.cpp +++ b/src/frontends/pytorch/src/op/index.cpp @@ -26,191 +26,6 @@ namespace op { using namespace ov::op; -namespace { -Output flatten(ov::pass::NodeRegistry& rg, const Output& value, size_t axis) { - // First dimension of output tensor is the product of [d_0, ... d_{axis-1}] dimensions of - // input tensor. The last dimension is the product of the rest of input tensor dimensions: - // [d_{axis}, ..., d_n] - Output output_shape; - if (axis == 0) { - output_shape = v0::Constant::create(element::i32, Shape{2}, {1, -1}); - } else if (axis == 1) { - output_shape = v0::Constant::create(element::i32, Shape{2}, {0, -1}); - } else { - const auto value_shape = rg.make(value, element::i32); - const auto value_rank = rg.make(value_shape, element::i32); - const auto axis_node = v0::Constant::create(element::i32, Shape{1}, {axis}); - auto start = v0::Constant::create(element::i32, Shape{1}, {0}); - auto step = v0::Constant::create(element::i32, Shape{1}, {1}); - const auto first_part_dims = rg.make(value_shape, start, axis_node, step); - auto zero = v0::Constant::create(element::i32, {}, {0}); - auto first_part_dims_length = rg.make(first_part_dims, zero, true); - - auto remaining_part_length = v0::Constant::create(element::i32, {1}, {-1}); - - output_shape = rg.make(OutputVector{first_part_dims_length, remaining_part_length}, 0); - } - return rg.make(value, output_shape, true); -} - -OutputVector index_on_list(ov::pass::NodeRegistry& rg, - const Output& data, - std::deque> ids, - int64_t rank) { - // Multiple tensors as indices. Each tensor could either be - // 1. prim::Constant() - // representing ":" in python indexing. E.g. tensor[:, :] - // 2. prim::Constant[value=...] or tensor output - // representing advanced indexing. E.g. tensor[[0, 1], [2, 0]]. - // For more info on advanced indexing, - // check https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#advanced-indexing - - // Consider a general case of - // t: [x_1, y_1, y_2, ..., x_m, ..., y_n] - // where t is a tensor of rank m+n, {x_i} are axes where tensor index is provided, and {y_i} are axes for - // ":". Same results can be achieved through transposing t into - // t: [x_1, x_2, ..., x_m, y_1, y_2, ..., y_n] - // and use gather - // t: [x_1 * x_2 * ... * x_m, y_1 * y_2 * ... * y_n] - // tensor index = \sum_{i=1}^m (ind_i * \prod_{j=i+1}^m (x_j)) - // After gather, reshape and transpose back. - std::vector advanced_ids; - std::vector is_masked_bool; - OutputVector masked_indicies; - // for case when index is bool e.g. 
x[x>0], replace index with non_zero - for (size_t i = 0; i < ids.size(); i++) { - // skip dimensions where index is None - bool is_none = false; - if (!ids[i].get_node_shared_ptr()) { - is_none = true; - } - if (auto const_input = cast_fw_node(ids[i].get_node_shared_ptr(), "prim::Constant")) { - const auto& attrs = const_input->get_attrs(); - if (attrs.find("none_value") != attrs.end()) { - is_none = true; - } - } - if (is_none) { - masked_indicies.push_back(ids[i]); - is_masked_bool.push_back(false); - continue; - } - auto id_dtype = ids[i].get_element_type(); - if (id_dtype == element::boolean || id_dtype == element::u8) { - auto idx = rg.make(ids[i], element::u8); - auto nonzero = rg.make(idx, element::i32); - auto input_order = v0::Constant::create(element::i32, Shape{2}, {1, 0}); - auto masked_id = rg.make(nonzero, input_order); - masked_indicies.push_back(masked_id); - is_masked_bool.push_back(true); - } else { - masked_indicies.push_back(ids[i]); - is_masked_bool.push_back(false); - } - advanced_ids.push_back(i); - } - - // all indicies prim::Constant(None), return input as is - if (advanced_ids.size() == 0) { - return {data}; - } - // perform gather for single element case - if (advanced_ids.size() == 1) { - auto index = masked_indicies[advanced_ids[0]]; - if (is_masked_bool[advanced_ids[0]]) { - auto gather = rg.make(data, index); - return {gather}; - } - index = rg.make(index, element::i32); - auto dim = v0::Constant::create(element::i32, Shape{}, {advanced_ids[0]}); - auto gather = rg.make(data, index, dim); - return {gather}; - } - auto adv_idx_count = advanced_ids.size(); - auto input_shape = rg.make(data, element::i32); - auto zero = v0::Constant::create(element::i32, Shape{}, {0}); - auto input_dims = rg.make(input_shape, zero, rank); - std::vector non_used_dims; - for (auto i = 0; i < rank; i++) { - if (std::find(advanced_ids.begin(), advanced_ids.end(), i) == advanced_ids.end()) { - non_used_dims.push_back(i); - } - } - std::vector permutation_dims; - permutation_dims.insert(permutation_dims.end(), advanced_ids.begin(), advanced_ids.end()); - permutation_dims.insert(permutation_dims.end(), non_used_dims.begin(), non_used_dims.end()); - auto transpose_dims = v0::Constant::create(element::i32, Shape{permutation_dims.size()}, permutation_dims); - auto transposed_input = rg.make(data, transpose_dims); - auto flatten_input = flatten(rg, transposed_input, adv_idx_count); - auto cum_adv_index = masked_indicies[advanced_ids.back()]; - cum_adv_index = rg.make(cum_adv_index, element::i32); - auto multiplier = input_dims->output(advanced_ids.back()); - for (int i = static_cast(adv_idx_count) - 2; i > -1; i--) { - auto input_id = advanced_ids[i]; - auto m_idx = rg.make(masked_indicies[input_id], element::i32); - auto adv_index = rg.make(m_idx, multiplier); - cum_adv_index = rg.make(cum_adv_index, adv_index); - multiplier = rg.make(multiplier, input_dims->output(input_id)); - } - std::shared_ptr gather = rg.make(flatten_input, cum_adv_index, zero); - OutputVector concat_dims; - // check if all advanced indices are consecutive. 
- std::vector consequence_dims; - auto cum_adv_index_shape_tensor = rg.make(cum_adv_index, element::i32); - for (size_t i = advanced_ids[0]; i <= advanced_ids[advanced_ids.back()]; i++) { - consequence_dims.push_back(i); - } - // unfold regular index axes - if (advanced_ids == consequence_dims) { - OutputVector folded_adv_idx_shape_vector; - auto minus_one = v0::Constant::create(element::i32, Shape{1}, {-1}); - folded_adv_idx_shape_vector.push_back(minus_one); - for (auto i : non_used_dims) { - folded_adv_idx_shape_vector.push_back(input_dims->output(i)); - } - auto folded_adv_idx_shape = rg.make(folded_adv_idx_shape_vector, 0); - gather = rg.make(gather, folded_adv_idx_shape, false); - std::vector adv_idx_permute; - for (size_t i = 1; i < advanced_ids[0] + 1; i++) { - adv_idx_permute.push_back(i); - } - adv_idx_permute.push_back(0); - for (size_t i = advanced_ids[0] + 1; i < (rank - adv_idx_count + 1); i++) { - adv_idx_permute.push_back(i); - } - // Transpose folded advanced indexed axis to its original location. - auto permute_indicies = v0::Constant::create(element::i32, Shape{adv_idx_permute.size()}, adv_idx_permute); - gather = rg.make(gather, permute_indicies); - // unfold advanced index axes - for (size_t i = 0; i < advanced_ids[0]; i++) { - concat_dims.push_back(input_dims->output(i)); - } - concat_dims.push_back(cum_adv_index_shape_tensor); - for (auto i : non_used_dims) { - if (i < advanced_ids[0]) { - continue; - } - concat_dims.push_back(input_dims->output(i)); - } - - } else { - size_t i = 0; - auto one = v0::Constant::create(element::i32, Shape{1}, {1}); - while (i < non_used_dims.size() && non_used_dims[i] < advanced_ids[0]) { - concat_dims.push_back(one); - i++; - } - concat_dims.push_back(cum_adv_index_shape_tensor); - for (; i < non_used_dims.size(); i++) { - concat_dims.push_back(input_dims->output(non_used_dims[i])); - } - } - auto final_shape = rg.make(concat_dims, 0); - gather = rg.make(gather, final_shape, false); - return {gather}; -} -} // namespace - OutputVector translate_index(const NodeContext& context) { num_inputs_check(context, 2, 2); auto x = context.get_input(0); @@ -225,9 +40,12 @@ OutputVector translate_index(const NodeContext& context) { auto rank = x.get_partial_shape().rank(); // index transformation supports only tensors with static rank PYTORCH_OP_CONVERSION_CHECK(rank.is_static(), "Dynamic rank for aten::index input is not supported."); - auto res = index_on_list(rg, x, list_elems, rank.get_length()); + OutputVector ids{list_elems.begin(), list_elems.end()}; + ov::Output res; + bool use_input_as_output = true; + index_tensor_on_list(rg, x, ids, rank.get_length(), res, use_input_as_output); context.mark_nodes(rg.get()); - return res; + return {res}; } auto index_ov_type = indices.get_element_type(); if (index_ov_type.is_dynamic()) { @@ -267,9 +85,13 @@ OutputVector translate_index_fx(const NodeContext& context) { } // index transformation supports only tensors with static rank PYTORCH_OP_CONVERSION_CHECK(rank.is_static(), "Dynamic rank for aten::index input is not supported."); - auto res = index_on_list(rg, x, list_elems, rank.get_length()); + + OutputVector ids{list_elems.begin(), list_elems.end()}; + ov::Output res; + bool use_input_as_output = true; + index_tensor_on_list(rg, x, ids, rank, res, use_input_as_output); context.mark_nodes(rg.get()); - return res; + return {res}; }; } // namespace op diff --git a/src/frontends/pytorch/src/transforms/aten_index_replacer.cpp b/src/frontends/pytorch/src/transforms/aten_index_replacer.cpp index 
39a9bc710ca08d..9294409a565691 100644 --- a/src/frontends/pytorch/src/transforms/aten_index_replacer.cpp +++ b/src/frontends/pytorch/src/transforms/aten_index_replacer.cpp @@ -34,34 +34,6 @@ namespace pass { using namespace ov::op; -namespace { -Output flatten(ov::pass::NodeRegistry& rg, const Output& value, size_t axis) { - // First dimension of output tensor is the product of [d_0, ... d_{axis-1}] dimensions of - // input tensor. The last dimension is the product of the rest of input tensor dimensions: - // [d_{axis}, ..., d_n] - Output output_shape; - if (axis == 0) { - output_shape = v0::Constant::create(element::i32, Shape{2}, {1, -1}); - } else if (axis == 1) { - output_shape = v0::Constant::create(element::i32, Shape{2}, {0, -1}); - } else { - const auto value_shape = rg.make(value, element::i32); - const auto value_rank = rg.make(value_shape, element::i32); - const auto axis_node = v0::Constant::create(element::i32, Shape{1}, {axis}); - auto start = v0::Constant::create(element::i32, Shape{1}, {0}); - auto step = v0::Constant::create(element::i32, Shape{1}, {1}); - const auto first_part_dims = rg.make(value_shape, start, axis_node, step); - auto zero = v0::Constant::create(element::i32, {}, {0}); - auto first_part_dims_length = rg.make(first_part_dims, zero, true); - - auto remaining_part_length = v0::Constant::create(element::i32, {1}, {-1}); - - output_shape = rg.make(OutputVector{first_part_dims_length, remaining_part_length}, 0); - } - return rg.make(value, output_shape, true); -} -}; // namespace - AtenIndexToSelect::AtenIndexToSelect() { auto index_op = ov::pass::pattern::wrap_type(); @@ -75,162 +47,22 @@ AtenIndexToSelect::AtenIndexToSelect() { auto indicies = index_op->input_value(1).get_node_shared_ptr(); auto list_indicies = cast_fw_node(indicies, "prim::ListConstruct"); if (list_indicies) { - // Multiple tensors as indices. Each tensor could either be - // 1. prim::Constant() - // representing ":" in python indexing. E.g. tensor[:, :] - // 2. prim::Constant[value=...] or tensor output - // representing advanced indexing. E.g. tensor[[0, 1], [2, 0]]. - // For more info on advanced indexing, - // check https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#advanced-indexing - - // Consider a general case of - // t: [x_1, y_1, y_2, ..., x_m, ..., y_n] - // where t is a tensor of rank m+n, {x_i} are axes where tensor index is provided, and {y_i} are axes for - // ":". Same results can be achieved through transposing t into - // t: [x_1, x_2, ..., x_m, y_1, y_2, ..., y_n] - // and use gather - // t: [x_1 * x_2 * ... * x_m, y_1 * y_2 * ... * y_n] - // tensor index = \sum_{i=1}^m (ind_i * \prod_{j=i+1}^m (x_j)) - // After gather, reshape and transpose back. auto ids = list_indicies->input_values(); - std::vector advanced_ids; - std::vector is_masked_bool; - OutputVector masked_indicies; - // for case when index is bool e.g. 
x[x>0], replace index with non_zero - for (size_t i = 0; i < ids.size(); i++) { - auto const_input = cast_fw_node(ids[i].get_node_shared_ptr(), "prim::Constant"); - - // skip dimensions where index is None - if (const_input) { - const auto& attrs = const_input->get_attrs(); - if (attrs.find("none_value") != attrs.end()) { - masked_indicies.push_back(ids[i]); - is_masked_bool.push_back(false); - continue; - } - } - auto id_dtype = ids[i].get_element_type(); - if (id_dtype == element::boolean || id_dtype == element::u8) { - auto idx = rg.make(ids[i], element::u8); - auto nonzero = rg.make(idx, element::i32); - auto input_order = v0::Constant::create(element::i32, Shape{2}, {1, 0}); - auto masked_id = rg.make(nonzero, input_order); - masked_indicies.push_back(masked_id); - is_masked_bool.push_back(true); - } else { - masked_indicies.push_back(ids[i]); - is_masked_bool.push_back(false); - } - advanced_ids.push_back(i); - } - - // all indicies prim::Constant(None), return input as is - if (advanced_ids.size() == 0) { - index_op->output(0).replace(index_op->get_input_source_output(0)); - return true; - } - // perform gather for single element case - if (advanced_ids.size() == 1) { - auto index = masked_indicies[advanced_ids[0]]; - if (is_masked_bool[advanced_ids[0]]) { - auto gather = rg.make(input_node, index); - copy_runtime_info_and_name(index_op, rg.get()); - replace_node(index_op, gather); - return true; - } - index = rg.make(index, element::i32); - auto dim = v0::Constant::create(element::i32, Shape{}, {advanced_ids[0]}); - auto gather = rg.make(input_node, index, dim); - copy_runtime_info_and_name(index_op, rg.get()); - replace_node(index_op, gather); - return true; - } - auto adv_idx_count = advanced_ids.size(); auto rank = input_node.get_partial_shape().rank(); // index transformation supports only tensors with static rank - if (rank.is_dynamic()) { + ov::Output new_output; + bool use_input_as_output = true; + if (!index_tensor_on_list(rg, input_node, ids, rank, new_output, use_input_as_output)) { add_exception_to_fw_node(index_op, "aten::index: dynamic rank for aten::index input is not supported."); return false; } - auto input_shape = rg.make(input_node, element::i32); - auto zero = v0::Constant::create(element::i32, Shape{}, {0}); - auto input_dims = rg.make(input_shape, zero, rank.get_length()); - std::vector non_used_dims; - for (auto i = 0; i < rank.get_length(); i++) { - if (std::find(advanced_ids.begin(), advanced_ids.end(), i) == advanced_ids.end()) { - non_used_dims.push_back(i); - } - } - std::vector permutation_dims; - permutation_dims.insert(permutation_dims.end(), advanced_ids.begin(), advanced_ids.end()); - permutation_dims.insert(permutation_dims.end(), non_used_dims.begin(), non_used_dims.end()); - auto transpose_dims = v0::Constant::create(element::i32, Shape{permutation_dims.size()}, permutation_dims); - auto transposed_input = rg.make(input_node, transpose_dims); - auto flatten_input = flatten(rg, transposed_input, adv_idx_count); - auto cum_adv_index = masked_indicies[advanced_ids[adv_idx_count - 1]]; - cum_adv_index = rg.make(cum_adv_index, element::i32); - auto multiplier = input_dims->output(advanced_ids[adv_idx_count - 1]); - for (int i = static_cast(adv_idx_count) - 2; i > -1; i--) { - auto input_id = advanced_ids[i]; - auto m_idx = rg.make(masked_indicies[input_id], element::i32); - auto adv_index = rg.make(m_idx, multiplier); - cum_adv_index = rg.make(cum_adv_index, adv_index); - multiplier = rg.make(multiplier, input_dims->output(input_id)); - } - 
std::shared_ptr gather = rg.make(flatten_input, cum_adv_index, zero); - OutputVector concat_dims; - // check if all advanced indices are consecutive. - std::vector consequence_dims; - auto cum_adv_index_shape_tensor = rg.make(cum_adv_index, element::i32); - for (size_t i = advanced_ids[0]; i <= advanced_ids[advanced_ids.size() - 1]; i++) { - consequence_dims.push_back(i); - } - // unfold regular index axes - if (advanced_ids == consequence_dims) { - OutputVector folded_adv_idx_shape_vector; - auto minus_one = v0::Constant::create(element::i32, Shape{1}, {-1}); - folded_adv_idx_shape_vector.push_back(minus_one); - for (auto i : non_used_dims) { - folded_adv_idx_shape_vector.push_back(input_dims->output(i)); - } - auto folded_adv_idx_shape = rg.make(folded_adv_idx_shape_vector, 0); - gather = rg.make(gather, folded_adv_idx_shape, false); - std::vector adv_idx_permute; - for (size_t i = 1; i < advanced_ids[0] + 1; i++) { - adv_idx_permute.push_back(i); - } - adv_idx_permute.push_back(0); - for (size_t i = advanced_ids[0] + 1; i < (rank.get_length() - adv_idx_count + 1); i++) { - adv_idx_permute.push_back(i); - } - // Transpose folded advanced indexed axis to its original location. - auto permute_indicies = - v0::Constant::create(element::i32, Shape{adv_idx_permute.size()}, adv_idx_permute); - gather = rg.make(gather, permute_indicies); - // unfold advanced index axes - for (size_t i = 0; i < advanced_ids[0]; i++) { - concat_dims.push_back(input_dims->output(i)); - } - concat_dims.push_back(cum_adv_index_shape_tensor); - for (auto i : non_used_dims) { - if (i < advanced_ids[0]) { - continue; - } - concat_dims.push_back(input_dims->output(i)); - } - - } else { - concat_dims.push_back(cum_adv_index_shape_tensor); - for (auto i : non_used_dims) { - concat_dims.push_back(input_dims->output(i)); - } + if (use_input_as_output) { + index_op->output(0).replace(index_op->get_input_source_output(0)); + return true; } - auto final_shape = rg.make(concat_dims, 0); - gather = rg.make(gather, final_shape, false); copy_runtime_info_and_name(index_op, rg.get()); - replace_node(index_op, gather); + replace_node(index_op, new_output.get_node_shared_ptr()); return true; - } else { auto const_input = cast_fw_node(indicies, "prim::Constant"); diff --git a/src/frontends/pytorch/src/utils.cpp b/src/frontends/pytorch/src/utils.cpp index 852de6e90fa25b..752b9accb71d01 100644 --- a/src/frontends/pytorch/src/utils.cpp +++ b/src/frontends/pytorch/src/utils.cpp @@ -17,6 +17,7 @@ #include "openvino/op/gather.hpp" #include "openvino/op/gather_nd.hpp" #include "openvino/op/mod.hpp" +#include "openvino/op/multiply.hpp" #include "openvino/op/non_zero.hpp" #include "openvino/op/range.hpp" #include "openvino/op/reduce_prod.hpp" @@ -24,6 +25,7 @@ #include "openvino/op/select.hpp" #include "openvino/op/shape_of.hpp" #include "openvino/op/slice.hpp" +#include "openvino/op/split.hpp" #include "openvino/op/squeeze.hpp" #include "openvino/op/subtract.hpp" #include "openvino/op/transpose.hpp" @@ -664,6 +666,197 @@ Output masked_select(const NodeContext& context, const Output& data, return context.mark_node(std::make_shared(data, masked_id)); } +Output flatten(ov::pass::NodeRegistry& rg, const Output& value, size_t axis) { + // First dimension of output tensor is the product of [d_0, ... d_{axis-1}] dimensions of + // input tensor. 
The last dimension is the product of the rest of input tensor dimensions: + // [d_{axis}, ..., d_n] + Output output_shape; + if (axis == 0) { + output_shape = v0::Constant::create(element::i32, Shape{2}, {1, -1}); + } else if (axis == 1) { + output_shape = v0::Constant::create(element::i32, Shape{2}, {0, -1}); + } else { + const auto value_shape = rg.make(value, element::i32); + const auto value_rank = rg.make(value_shape, element::i32); + const auto axis_node = v0::Constant::create(element::i32, Shape{1}, {axis}); + auto start = v0::Constant::create(element::i32, Shape{1}, {0}); + auto step = v0::Constant::create(element::i32, Shape{1}, {1}); + const auto first_part_dims = rg.make(value_shape, start, axis_node, step); + auto zero = v0::Constant::create(element::i32, {}, {0}); + auto first_part_dims_length = rg.make(first_part_dims, zero, true); + + auto remaining_part_length = v0::Constant::create(element::i32, {1}, {-1}); + + output_shape = rg.make(OutputVector{first_part_dims_length, remaining_part_length}, 0); + } + return rg.make(value, output_shape, true); +} + +bool index_tensor_on_list(ov::pass::NodeRegistry& rg, + const Output& data, + const ov::OutputVector& indices, + const ov::Rank& rank, + Output& new_output, + bool& use_input_as_output) { + // Multiple tensors as indices. Each tensor could either be + // 1. prim::Constant() + // representing ":" in python indexing. E.g. tensor[:, :] + // 2. prim::Constant[value=...] or tensor output + // representing advanced indexing. E.g. tensor[[0, 1], [2, 0]]. + // For more info on advanced indexing, + // check https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#advanced-indexing + + // Consider a general case of + // t: [x_1, y_1, y_2, ..., x_m, ..., y_n] + // where t is a tensor of rank m+n, {x_i} are axes where tensor index is provided, and {y_i} are axes for + // ":". Same results can be achieved through transposing t into + // t: [x_1, x_2, ..., x_m, y_1, y_2, ..., y_n] + // and use gather + // t: [x_1 * x_2 * ... * x_m, y_1 * y_2 * ... * y_n] + // tensor index = \sum_{i=1}^m (ind_i * \prod_{j=i+1}^m (x_j)) + // After gather, reshape and transpose back. + std::vector advanced_ids; + std::vector is_masked_bool; + OutputVector masked_indicies; + // for case when index is bool e.g. 
x[x>0], replace index with non_zero + for (size_t i = 0; i < indices.size(); ++i) { + // skip dimensions where index is None + bool is_none = false; + if (!indices[i].get_node_shared_ptr()) { + is_none = true; + } + if (auto const_input = cast_fw_node(indices[i].get_node_shared_ptr(), "prim::Constant")) { + const auto& attrs = const_input->get_attrs(); + if (attrs.find("none_value") != attrs.end()) { + is_none = true; + } + } + if (is_none) { + masked_indicies.push_back(indices[i]); + is_masked_bool.push_back(false); + continue; + } + auto id_dtype = indices[i].get_element_type(); + if (id_dtype == element::boolean || id_dtype == element::u8) { + auto idx = rg.make(indices[i], element::u8); + auto nonzero = rg.make(idx, element::i32); + auto input_order = rg.make(element::i32, Shape{2}, std::vector{1, 0}); + auto masked_id = rg.make(nonzero, input_order); + masked_indicies.push_back(masked_id); + is_masked_bool.push_back(true); + } else { + masked_indicies.push_back(indices[i]); + is_masked_bool.push_back(false); + } + advanced_ids.push_back(i); + } + + // all indicies prim::Constant(None), return input as is + if (advanced_ids.size() == 0) { + new_output = data; + use_input_as_output = true; + return true; + } + // perform gather for single element case + if (advanced_ids.size() == 1) { + auto index = masked_indicies[advanced_ids[0]]; + if (is_masked_bool[advanced_ids[0]]) { + auto gather = rg.make(data, index); + new_output = gather->output(0); + use_input_as_output = false; + return true; + } + index = rg.make(index, element::i32); + auto dim = rg.make(element::i32, Shape{}, static_cast(advanced_ids[0])); + auto gather = rg.make(data, index, dim); + new_output = gather->output(0); + use_input_as_output = false; + return true; + } + // index transformation supports only tensors with static rank + if (rank.is_dynamic()) { + return false; + } + auto adv_idx_count = advanced_ids.size(); + auto input_shape = rg.make(data, element::i32); + auto zero = rg.make(element::i32, Shape{}, 0); + auto input_dims = rg.make(input_shape, zero, rank.get_length()); + std::vector non_used_dims; + for (auto i = 0; i < rank.get_length(); i++) { + if (std::find(advanced_ids.begin(), advanced_ids.end(), i) == advanced_ids.end()) { + non_used_dims.push_back(i); + } + } + std::vector permutation_dims; + permutation_dims.insert(permutation_dims.end(), advanced_ids.begin(), advanced_ids.end()); + permutation_dims.insert(permutation_dims.end(), non_used_dims.begin(), non_used_dims.end()); + auto transpose_dims = rg.make(element::i32, Shape{permutation_dims.size()}, permutation_dims); + auto transposed_input = rg.make(data, transpose_dims); + auto flatten_input = flatten(rg, transposed_input, adv_idx_count); + auto cum_adv_index = masked_indicies[advanced_ids[adv_idx_count - 1]]; + cum_adv_index = rg.make(cum_adv_index, element::i32); + auto multiplier = input_dims->output(advanced_ids[adv_idx_count - 1]); + for (int i = static_cast(adv_idx_count) - 2; i > -1; i--) { + auto input_id = advanced_ids[i]; + auto m_idx = rg.make(masked_indicies[input_id], element::i32); + auto adv_index = rg.make(m_idx, multiplier); + cum_adv_index = rg.make(cum_adv_index, adv_index); + multiplier = rg.make(multiplier, input_dims->output(input_id)); + } + std::shared_ptr gather = rg.make(flatten_input, cum_adv_index, zero); + OutputVector concat_dims; + // check if all advanced indices are consecutive. 
+ std::vector consequence_dims; + auto cum_adv_index_shape_tensor = rg.make(cum_adv_index, element::i32); + for (size_t i = advanced_ids[0]; i <= advanced_ids[advanced_ids.size() - 1]; i++) { + consequence_dims.push_back(i); + } + // unfold regular index axes + if (advanced_ids == consequence_dims) { + OutputVector folded_adv_idx_shape_vector; + auto minus_one = rg.make(element::i32, Shape{1}, -1); + folded_adv_idx_shape_vector.push_back(minus_one); + for (auto i : non_used_dims) { + folded_adv_idx_shape_vector.push_back(input_dims->output(i)); + } + auto folded_adv_idx_shape = rg.make(folded_adv_idx_shape_vector, 0); + gather = rg.make(gather, folded_adv_idx_shape, false); + std::vector adv_idx_permute; + for (size_t i = 1; i < advanced_ids[0] + 1; i++) { + adv_idx_permute.push_back(i); + } + adv_idx_permute.push_back(0); + for (size_t i = advanced_ids[0] + 1; i < (rank.get_length() - adv_idx_count + 1); i++) { + adv_idx_permute.push_back(i); + } + // Transpose folded advanced indexed axis to its original location. + auto permute_indicies = rg.make(element::i32, Shape{adv_idx_permute.size()}, adv_idx_permute); + gather = rg.make(gather, permute_indicies); + // unfold advanced index axes + for (size_t i = 0; i < advanced_ids[0]; i++) { + concat_dims.push_back(input_dims->output(i)); + } + concat_dims.push_back(cum_adv_index_shape_tensor); + for (auto i : non_used_dims) { + if (i < advanced_ids[0]) { + continue; + } + concat_dims.push_back(input_dims->output(i)); + } + + } else { + concat_dims.push_back(cum_adv_index_shape_tensor); + for (auto i : non_used_dims) { + concat_dims.push_back(input_dims->output(i)); + } + } + auto final_shape = rg.make(concat_dims, 0); + gather = rg.make(gather, final_shape, false); + new_output = gather->output(0); + use_input_as_output = false; + return true; +} + } // namespace pytorch } // namespace frontend } // namespace ov diff --git a/src/frontends/pytorch/src/utils.hpp b/src/frontends/pytorch/src/utils.hpp index f4104a83ae3252..9346b9e18b94a3 100644 --- a/src/frontends/pytorch/src/utils.hpp +++ b/src/frontends/pytorch/src/utils.hpp @@ -129,6 +129,15 @@ Output concat_list_from_inputs(const NodeContext& context, size_t begin, s Output masked_select(const NodeContext& context, const Output& data, const Output& mask); +Output flatten(ov::pass::NodeRegistry& rg, const Output& value, size_t axis); + +bool index_tensor_on_list(ov::pass::NodeRegistry& rg, + const Output& data, + const ov::OutputVector& indices, + const ov::Rank& rank, + Output& new_output, + bool& use_input_as_output); + namespace op { template OutputVector inplace_op(const NodeContext& context) { diff --git a/tests/layer_tests/pytorch_tests/test_index_tensor.py b/tests/layer_tests/pytorch_tests/test_index_tensor.py new file mode 100644 index 00000000000000..d2055b5f5a4ec5 --- /dev/null +++ b/tests/layer_tests/pytorch_tests/test_index_tensor.py @@ -0,0 +1,49 @@ +# Copyright (C) 2018-2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +import pytest + +from pytorch_layer_test_class import PytorchLayerTest + + +class TestIndexTensor(PytorchLayerTest): + def _prepare_input(self, input_shape): + import numpy as np + return (np.random.randn(*input_shape).astype(np.float32),) + + def create_model(self, indices_list): + import torch + + class aten_index_tensor(torch.nn.Module): + def __init__(self, indices_list): + super(aten_index_tensor, self).__init__() + self.indices_list = indices_list + + def forward(self, x): + return torch.ops.aten.index.Tensor(x, self.indices_list) + + ref_net = 
None + + adjusted_indices_list = [] + for indices in indices_list: + if indices is not None: + adjusted_indices_list.append(torch.tensor(indices, dtype=torch.int32)) + continue + adjusted_indices_list.append(None) + + return aten_index_tensor(adjusted_indices_list), ref_net, None + + @pytest.mark.nightly + @pytest.mark.precommit_torch_export + @pytest.mark.parametrize(('input_shape', 'indices_list'), [ + ([3, 7], [[0], [5, 3, 0]]), + ([3, 7, 6], [[0], None, None]), + ([3, 7, 6], [[0], None, [5, 0, 3]]), + ([3, 7, 6], [[0, 2, 1], None, [5, 0, 3]]), + ([3, 7, 6], [[0, 2, 1], [4], [5, 0, 3]]), + ]) + def test_index_tensor(self, input_shape, indices_list, ie_device, precision, ir_version): + if not PytorchLayerTest.use_torch_export(): + pytest.skip(reason='aten.index.Tensor test is supported only on torch.export()') + self._test(*self.create_model(indices_list), ie_device, precision, ir_version, + kwargs_to_prepare_input={'input_shape': input_shape}) diff --git a/tests/layer_tests/pytorch_tests/test_upsample.py b/tests/layer_tests/pytorch_tests/test_upsample.py index 34ffb9880c7f62..aa5cec1080f7d0 100644 --- a/tests/layer_tests/pytorch_tests/test_upsample.py +++ b/tests/layer_tests/pytorch_tests/test_upsample.py @@ -43,6 +43,7 @@ def forward(self, x): ]) @pytest.mark.nightly @pytest.mark.precommit + @pytest.mark.precommit_torch_export @pytest.mark.skipif(platform == 'darwin', reason="Ticket - 122182") def test_upsample1d(self, mode, size, scale, ie_device, precision, ir_version): if ie_device == "GPU" and mode == "linear": @@ -96,6 +97,7 @@ def forward(self, x): ]) @pytest.mark.nightly @pytest.mark.precommit + @pytest.mark.precommit_torch_export def test_upsample2d(self, mode, size, scale, ie_device, precision, ir_version): self._test(*self.create_model(size, scale, mode), ie_device, precision, ir_version, trace_model=True, **{"custom_eps": 1e-3}) @@ -213,6 +215,7 @@ def forward(self, x): @pytest.mark.parametrize("mode", ['nearest', 'bilinear', 'bicubic']) @pytest.mark.nightly @pytest.mark.precommit + @pytest.mark.precommit_torch_export def test_upsample2d_list_sizes(self, mode, ie_device, precision, ir_version): self._test(*self.create_model(mode), ie_device, precision, ir_version, trace_model=True) From e3ad821bcca52b2ff86550d9354655f405f1e401 Mon Sep 17 00:00:00 2001 From: Alexandra Sidorova Date: Tue, 22 Oct 2024 10:17:52 +0400 Subject: [PATCH 23/24] [CPU] Implemented "jit_exp_emitter" (#26974) ### Details: - *Previously, we used dnnl-injector for `Exp` op which require 2 `aux_vec_regs`. The snippets kernel have some pool of aux vec registers which can be used by emitters in their implementations. However, dnnl cannot work with user-provided aux registers and always spill them on stack while plugin emitters can do it. 
To avoid extra push-pop in Snippets kernel (it leads to performance degradations), we implemented own emitter for `Exp` with the same logic to have opportunity to pass free aux vec registers* - *Updated `jit_erf_emitter`: reused new `jit_exp_emitter` to compute exponent and now we work only with `vmm_dst` to avoid `vmm_src` data corruption (input registers must not be corrupted)* ### Tickets: - *155236* --- .../plugin/x64/jit_dnnl_ext_emitters.hpp | 13 - .../plugin/x64/jit_eltwise_emitters.cpp | 252 ++++++++++-------- .../plugin/x64/jit_eltwise_emitters.hpp | 27 ++ src/plugins/intel_cpu/src/nodes/eltwise.cpp | 7 +- 4 files changed, 175 insertions(+), 124 deletions(-) diff --git a/src/plugins/intel_cpu/src/emitters/plugin/x64/jit_dnnl_ext_emitters.hpp b/src/plugins/intel_cpu/src/emitters/plugin/x64/jit_dnnl_ext_emitters.hpp index 835605756f9014..7a4d1e31277e3b 100644 --- a/src/plugins/intel_cpu/src/emitters/plugin/x64/jit_dnnl_ext_emitters.hpp +++ b/src/plugins/intel_cpu/src/emitters/plugin/x64/jit_dnnl_ext_emitters.hpp @@ -64,19 +64,6 @@ class jit_elu_emitter : public jit_dnnl_emitter { } }; -class jit_exp_emitter : public jit_dnnl_emitter { -public: - jit_exp_emitter(dnnl::impl::cpu::x64::jit_generator *host, dnnl::impl::cpu::x64::cpu_isa_t host_isa, const std::shared_ptr& n, - ov::element::Type exec_prc = ov::element::f32) - : jit_dnnl_emitter(host, host_isa, n, exec_prc) { - kind = dnnl_eltwise_exp; - alpha = 0.f; - beta = 0.f; - - set_injector(); - } -}; - class jit_abs_emitter : public jit_dnnl_emitter { public: jit_abs_emitter(dnnl::impl::cpu::x64::jit_generator *host, dnnl::impl::cpu::x64::cpu_isa_t host_isa, const std::shared_ptr& n, diff --git a/src/plugins/intel_cpu/src/emitters/plugin/x64/jit_eltwise_emitters.cpp b/src/plugins/intel_cpu/src/emitters/plugin/x64/jit_eltwise_emitters.cpp index fb74c196f6a289..0331a3ee4908b9 100644 --- a/src/plugins/intel_cpu/src/emitters/plugin/x64/jit_eltwise_emitters.cpp +++ b/src/plugins/intel_cpu/src/emitters/plugin/x64/jit_eltwise_emitters.cpp @@ -1822,29 +1822,25 @@ void jit_negative_emitter::emit_isa(const std::vector &in_vec_idxs, cons h->uni_vsubps(vmm_dst, vmm_dst, vmm_src); } -/// ERF /// -jit_erf_emitter::jit_erf_emitter(x64::jit_generator* host, x64::cpu_isa_t host_isa, ov::element::Type exec_prc) + +/// EXP /// +jit_exp_emitter::jit_exp_emitter(x64::jit_generator* host, x64::cpu_isa_t host_isa, ov::element::Type exec_prc) : jit_emitter(host, host_isa, exec_prc) { prepare_table(); } -jit_erf_emitter::jit_erf_emitter(x64::jit_generator* host, - x64::cpu_isa_t host_isa, - const std::shared_ptr& node, - ov::element::Type exec_prc) +jit_exp_emitter::jit_exp_emitter(x64::jit_generator* host, x64::cpu_isa_t host_isa, const std::shared_ptr& node, ov::element::Type exec_prc) : jit_emitter(host, host_isa, exec_prc) { prepare_table(); } -size_t jit_erf_emitter::get_inputs_num() const { return 1; } +size_t jit_exp_emitter::get_inputs_num() const { return 1; } -std::set> jit_erf_emitter::get_supported_precisions(const std::shared_ptr& node) { +std::set> jit_exp_emitter::get_supported_precisions(const std::shared_ptr& node) { return {{element::f32}}; } -void jit_erf_emitter::emit_impl( - const std::vector &in_vec_idxs, - const std::vector &out_vec_idxs) const { +void jit_exp_emitter::emit_impl(const std::vector &in_vec_idxs, const std::vector &out_vec_idxs) const { if (host_isa_ == x64::sse41) { emit_isa(in_vec_idxs, out_vec_idxs); } else if (host_isa_ == x64::avx2) { @@ -1857,20 +1853,16 @@ void jit_erf_emitter::emit_impl( } template -void 
jit_erf_emitter::emit_isa(const std::vector &in_vec_idxs, const std::vector &out_vec_idxs) const { +void jit_exp_emitter::emit_isa(const std::vector &in_vec_idxs, const std::vector &out_vec_idxs) const { using Vmm = typename conditional3::type; Vmm vmm_src = Vmm(in_vec_idxs[0]); Vmm vmm_dst = Vmm(out_vec_idxs[0]); - Vmm vmm_mask = Vmm(aux_vec_idxs[0]); - Vmm vmm_aux0 = Vmm(aux_vec_idxs[0]); - Vmm vmm_aux1 = Vmm(aux_vec_idxs[1]); - Vmm vmm_aux2 = Vmm(aux_vec_idxs[2]); - Vmm vmm_aux3 = Vmm(aux_vec_idxs[3]); - Vmm vmm_aux4 = Vmm(aux_vec_idxs[4]); + Vmm vmm_mask = need_vmm_mask() ? Vmm(aux_vec_idxs[0]) : Vmm(); + Vmm vmm_aux0 = Vmm(aux_vec_idxs[0 + static_cast(need_vmm_mask())]); + Vmm vmm_aux1 = Vmm(aux_vec_idxs[1 + static_cast(need_vmm_mask())]); - auto compute_cmp_mask = [&](const Vmm &vmm_src, - const Xbyak::Operand &compare_operand, int cmp_predicate) { + auto compute_cmp_mask = [&](const Vmm &vmm_src, const Xbyak::Operand &compare_operand, int cmp_predicate) { if (host_isa_ == x64::avx512_core) { h->vcmpps(k_mask, vmm_src, compare_operand, cmp_predicate); } else { @@ -1886,66 +1878,123 @@ void jit_erf_emitter::emit_isa(const std::vector &in_vec_idxs, const std } }; - auto exp_compute_vector_fwd = [&](const Vmm &vmm_src) { - // get mask of values lower than log(FLT_MIN) to zero them in the output - compute_cmp_mask(vmm_src, table_val("exp_ln_flt_min_f"), _cmp_lt_os); - - h->uni_vminps(vmm_src, vmm_src, table_val("exp_ln_flt_max_f")); - h->uni_vmaxps(vmm_src, vmm_src, table_val("exp_ln_flt_min_f")); - h->uni_vmovups(vmm_aux1, vmm_src); - - // calculate exp(x) - // fx = x * log2ef + 0.5 - h->uni_vmulps(vmm_src, vmm_src, table_val("exp_log2ef")); - h->uni_vaddps(vmm_src, vmm_src, table_val("half")); - - // tmp = floorf(fx) - const auto _op_floor = 1u; - h->uni_vroundps(vmm_aux2, vmm_src, _op_floor); - - // keep vmm_src = fx for further computations - h->uni_vmovups(vmm_src, vmm_aux2); - - // x = x - fx * ln2 - h->uni_vfnmadd231ps(vmm_aux1, vmm_aux2, table_val("ln2f")); - - // compute 2^n - h->uni_vcvtps2dq(vmm_aux2, vmm_src); - h->uni_vpaddd(vmm_aux2, vmm_aux2, table_val("exponent_bias")); - const int n_mantissa_bits = 23; - h->uni_vpslld(vmm_aux2, vmm_aux2, n_mantissa_bits); //Vmm(6) = 2^-fx - - // use vmm_src as tmp vmm_zero when applying mask - h->uni_vpxor(vmm_src, vmm_src, vmm_src); - // set zeroes at those points which were < log(FLT_MIN) - blend_with_mask(vmm_aux2, vmm_src); - - // compute polynomial - h->uni_vmovups(vmm_src, table_val("ex_pol5")); - h->uni_vfmadd213ps(vmm_src, vmm_aux1, table_val("ex_pol4")); - h->uni_vfmadd213ps(vmm_src, vmm_aux1, table_val("ex_pol3")); - h->uni_vfmadd213ps(vmm_src, vmm_aux1, table_val("ex_pol2")); - h->uni_vfmadd213ps(vmm_src, vmm_aux1, table_val("ex_pol1")); - h->uni_vfmadd213ps(vmm_src, vmm_aux1, table_val("one")); - // y = y * 2^n - h->uni_vmulps(vmm_src, vmm_src, vmm_aux2); - }; + h->uni_vmovups(vmm_aux1, table_val("ln_flt_min_f")); + // get mask of values lower than log(FLT_MIN) to zero them in the output + compute_cmp_mask(vmm_src, vmm_aux1, _cmp_lt_os); - auto abs_compute_vector_fwd = [&](const Vmm &vmm_src) { - // compute abs(x) = _mm_and_ps(x, 01111..111)); - h->uni_vandps(vmm_src, vmm_src, table_val("positive_mask")); - }; + h->uni_vminps(vmm_dst, vmm_src, table_val("ln_flt_max_f")); + h->uni_vmaxps(vmm_dst, vmm_dst, vmm_aux1); + h->uni_vmovups(vmm_aux0, vmm_dst); + + // calculate exp(x) + // fx = x * log2ef + 0.5 + h->uni_vmulps(vmm_dst, vmm_dst, table_val("log2ef")); + h->uni_vaddps(vmm_dst, vmm_dst, table_val("half")); + + // tmp = 
floorf(fx) + const auto _op_floor = 1u; + h->uni_vroundps(vmm_aux1, vmm_dst, _op_floor); + + // keep vmm_dst = fx for further computations + h->uni_vmovups(vmm_dst, vmm_aux1); + + // x = x - fx * ln2 + h->uni_vfnmadd231ps(vmm_aux0, vmm_aux1, table_val("ln2f")); + + // compute 2^n + h->uni_vcvtps2dq(vmm_aux1, vmm_dst); + h->uni_vpaddd(vmm_aux1, vmm_aux1, table_val("exponent_bias")); + const int n_mantissa_bits = 23; + h->uni_vpslld(vmm_aux1, vmm_aux1, n_mantissa_bits); + + // use vmm_dst as tmp vmm_zero when applying mask + h->uni_vpxor(vmm_dst, vmm_dst, vmm_dst); + // set zeroes at those points which were < log(FLT_MIN) + blend_with_mask(vmm_aux1, vmm_dst); + + // compute polynomial + h->uni_vmovups(vmm_dst, table_val("pol5")); + h->uni_vfmadd213ps(vmm_dst, vmm_aux0, table_val("pol4")); + h->uni_vfmadd213ps(vmm_dst, vmm_aux0, table_val("pol3")); + h->uni_vfmadd213ps(vmm_dst, vmm_aux0, table_val("pol2")); + h->uni_vfmadd213ps(vmm_dst, vmm_aux0, table_val("pol1")); + h->uni_vfmadd213ps(vmm_dst, vmm_aux0, table_val("one")); + // y = y * 2^n + h->uni_vmulps(vmm_dst, vmm_dst, vmm_aux1); +} + +void jit_exp_emitter::register_table_entries() { + push_arg_entry_of("pol1", 0x3f7ffffb, true); // p1 = 0.999999701f + push_arg_entry_of("pol2", 0x3efffee3, true); // p2 = 0.499991506f + push_arg_entry_of("pol3", 0x3e2aad40, true); // p3 = 0.166676521f + push_arg_entry_of("pol4", 0x3d2b9d0d, true); // p4 = 0.0418978221f + push_arg_entry_of("pol5", 0x3c07cfce, true); // p5 = 0.00828929059f + + push_arg_entry_of("one", CONST_1_F, true); + push_arg_entry_of("half", 0x3f000000, true); + push_arg_entry_of("ln2f", 0x3f317218, true); + push_arg_entry_of("log2ef", 0x3fb8aa3b, true); + push_arg_entry_of("ln_flt_max_f", 0x42b17218, true); + push_arg_entry_of("ln_flt_min_f", 0xc2aeac50, true); + push_arg_entry_of("exponent_bias", 0x0000007f, true); +} + +size_t jit_exp_emitter::aux_vecs_count() const { + return need_vmm_mask() ? 3 : 2; +} + +/// ERF /// +jit_erf_emitter::jit_erf_emitter(x64::jit_generator* host, x64::cpu_isa_t host_isa, ov::element::Type exec_prc) + : jit_emitter(host, host_isa, exec_prc) { + m_exp_emitter.reset(new jit_exp_emitter(host, host_isa, exec_prc)); + prepare_table(); +} + +jit_erf_emitter::jit_erf_emitter(x64::jit_generator* host, x64::cpu_isa_t host_isa, const std::shared_ptr& node, ov::element::Type exec_prc) + : jit_erf_emitter(host, host_isa, exec_prc) {} + +size_t jit_erf_emitter::get_inputs_num() const { return 1; } + +std::set> jit_erf_emitter::get_supported_precisions(const std::shared_ptr& node) { + return {{element::f32}}; +} + +void jit_erf_emitter::emit_impl(const std::vector &in_vec_idxs, const std::vector &out_vec_idxs) const { + if (host_isa_ == x64::sse41) { + emit_isa(in_vec_idxs, out_vec_idxs); + } else if (host_isa_ == x64::avx2) { + emit_isa(in_vec_idxs, out_vec_idxs); + } else if (host_isa_ == x64::avx512_core) { + emit_isa(in_vec_idxs, out_vec_idxs); + } else { + OV_CPU_JIT_EMITTER_THROW("Unsupported ISA ", host_isa_); + } +} + +template +void jit_erf_emitter::emit_isa(const std::vector &in_vec_idxs, const std::vector &out_vec_idxs) const { + using Vmm = typename conditional3::type; + Vmm vmm_src = Vmm(in_vec_idxs[0]); + Vmm vmm_dst = Vmm(out_vec_idxs[0]); + + Vmm vmm_aux0 = Vmm(aux_vec_idxs[0]); + Vmm vmm_aux1 = Vmm(aux_vec_idxs[1]); + Vmm vmm_aux2 = Vmm(aux_vec_idxs[2]); + Vmm vmm_aux3 = Vmm(aux_vec_idxs[3]); // IMPORTANT: we use vmm_aux3 to save `x` as exp_compute does not use it. 
h->uni_vmovups(vmm_aux3, vmm_src); // -exp(-x*x) - h->uni_vmulps(vmm_src, vmm_src, vmm_src); - h->uni_vxorps(vmm_src, vmm_src, table_val("sign_mask")); + h->uni_vmulps(vmm_dst, vmm_src, vmm_src); + h->uni_vxorps(vmm_dst, vmm_dst, table_val("sign_mask")); - exp_compute_vector_fwd(vmm_src); + // pass the current `aux_vec_idxs` to `exp_emitter` excepting `vmm_aux3` + auto exp_aux_vec_idxs = aux_vec_idxs; + exp_aux_vec_idxs.erase(std::find(exp_aux_vec_idxs.begin(), exp_aux_vec_idxs.end(), static_cast(vmm_aux3.getIdx()))); + m_exp_emitter->emit_code({static_cast(vmm_dst.getIdx())}, {static_cast(vmm_dst.getIdx())}, exp_aux_vec_idxs); - h->uni_vxorps(vmm_src, vmm_src, table_val("sign_mask")); + h->uni_vxorps(vmm_dst, vmm_dst, table_val("sign_mask")); // get sign h->uni_vmovups(vmm_aux0, vmm_aux3); @@ -1954,60 +2003,49 @@ void jit_erf_emitter::emit_isa(const std::vector &in_vec_idxs, const std // abs(x) h->uni_vmovups(vmm_aux1, vmm_aux3); // compute abs(x) = _mm_and_ps(x, 01111..111)); - abs_compute_vector_fwd(vmm_aux1); + h->uni_vandps(vmm_aux1, vmm_aux1, table_val("positive_mask")); // t = 1 / (p*x + 1) h->uni_vmovups(vmm_aux2, table_val("approx_const")); h->uni_vfmadd213ps(vmm_aux2, vmm_aux1, table_val("one")); - h->uni_vmovups(vmm_aux4, table_val("one")); - h->uni_vdivps(vmm_aux4, vmm_aux4, vmm_aux2); + h->uni_vmovups(vmm_aux3, table_val("one")); + h->uni_vdivps(vmm_aux3, vmm_aux3, vmm_aux2); // -exp(-x*x)*t - h->uni_vmulps(vmm_src, vmm_src, vmm_aux4); + h->uni_vmulps(vmm_dst, vmm_dst, vmm_aux3); // compute polynomialial r - h->uni_vmovups(vmm_aux1, table_val("erf_pol5")); - h->uni_vfmadd213ps(vmm_aux1, vmm_aux4, table_val("erf_pol4")); - h->uni_vfmadd213ps(vmm_aux1, vmm_aux4, table_val("erf_pol3")); - h->uni_vfmadd213ps(vmm_aux1, vmm_aux4, table_val("erf_pol2")); - h->uni_vfmadd213ps(vmm_aux1, vmm_aux4, table_val("erf_pol1")); + h->uni_vmovups(vmm_aux1, table_val("pol5")); + h->uni_vfmadd213ps(vmm_aux1, vmm_aux3, table_val("pol4")); + h->uni_vfmadd213ps(vmm_aux1, vmm_aux3, table_val("pol3")); + h->uni_vfmadd213ps(vmm_aux1, vmm_aux3, table_val("pol2")); + h->uni_vfmadd213ps(vmm_aux1, vmm_aux3, table_val("pol1")); // erf = sign * (1 - r * t * exp(-x*x)) - h->uni_vfmadd213ps(vmm_src, vmm_aux1, table_val("one")); - h->uni_vxorps(vmm_dst, vmm_src, vmm_aux0); + h->uni_vfmadd213ps(vmm_dst, vmm_aux1, table_val("one")); + h->uni_vxorps(vmm_dst, vmm_dst, vmm_aux0); } void jit_erf_emitter::register_table_entries() { push_arg_entry_of("approx_const", 0x3ea7ba05, true); // 0.3275911 - push_arg_entry_of("one_over_sqrt_two", 0x3f3504f3, true); - push_arg_entry_of("sign_mask", 0x80000000, true); - - push_arg_entry_of("ex_pol1", 0x3f7ffffb, true); // p1 = 0.999999701f - push_arg_entry_of("ex_pol2", 0x3efffee3, true); // p2 = 0.499991506f - push_arg_entry_of("ex_pol3", 0x3e2aad40, true); // p3 = 0.166676521f - push_arg_entry_of("ex_pol4", 0x3d2b9d0d, true); // p4 = 0.0418978221f - push_arg_entry_of("ex_pol5", 0x3c07cfce, true); // p5 = 0.00828929059f - - push_arg_entry_of("erf_pol1", 0x3e827906, true); // p1 = 0.254829592f - push_arg_entry_of("erf_pol2", 0xbe91a98e, true); // p2 = -0.284496736f - push_arg_entry_of("erf_pol3", 0x3fb5f0e3, true); // p3 = 1.421413741f - push_arg_entry_of("erf_pol4", 0xbfba00e3, true); // p4 = -1.453152027f - push_arg_entry_of("erf_pol5", 0x3f87dc22, true); // p5 = 1.061405429f - push_arg_entry_of("one", CONST_1_F, true); - push_arg_entry_of("half", 0x3f000000, true); - - push_arg_entry_of("exp_log2ef", 0x3fb8aa3b, true); - push_arg_entry_of("exp_ln_flt_max_f", 0x42b17218, true); 
- push_arg_entry_of("exp_ln_flt_min_f", 0xc2aeac50, true); - - push_arg_entry_of("ln2f", 0x3f317218, true); - push_arg_entry_of("exponent_bias", 0x0000007f, true); + push_arg_entry_of("sign_mask", 0x80000000, true); push_arg_entry_of("positive_mask", 0x7fffffff, true); + + push_arg_entry_of("pol1", 0x3e827906, true); // p1 = 0.254829592f + push_arg_entry_of("pol2", 0xbe91a98e, true); // p2 = -0.284496736f + push_arg_entry_of("pol3", 0x3fb5f0e3, true); // p3 = 1.421413741f + push_arg_entry_of("pol4", 0xbfba00e3, true); // p4 = -1.453152027f + push_arg_entry_of("pol5", 0x3f87dc22, true); // p5 = 1.061405429f } size_t jit_erf_emitter::aux_vecs_count() const { - return 5ul; + return 4ul; +} + +void jit_erf_emitter::emit_data() const { + jit_emitter::emit_data(); + m_exp_emitter->emit_data(); } /// SOFT SIGN /// diff --git a/src/plugins/intel_cpu/src/emitters/plugin/x64/jit_eltwise_emitters.hpp b/src/plugins/intel_cpu/src/emitters/plugin/x64/jit_eltwise_emitters.hpp index 606b0ef1ef90c8..c8c4b06d6f3347 100644 --- a/src/plugins/intel_cpu/src/emitters/plugin/x64/jit_eltwise_emitters.hpp +++ b/src/plugins/intel_cpu/src/emitters/plugin/x64/jit_eltwise_emitters.hpp @@ -525,6 +525,29 @@ class jit_negative_emitter : public jit_emitter { void emit_isa(const std::vector &in_vec_idxs, const std::vector &out_vec_idxs) const; }; +class jit_exp_emitter : public jit_emitter { +public: + jit_exp_emitter(dnnl::impl::cpu::x64::jit_generator *host, dnnl::impl::cpu::x64::cpu_isa_t host_isa, + ov::element::Type exec_prc = ov::element::f32); + + jit_exp_emitter(dnnl::impl::cpu::x64::jit_generator *host, dnnl::impl::cpu::x64::cpu_isa_t host_isa, const std::shared_ptr& n, + ov::element::Type exec_prc = ov::element::f32); + + size_t get_inputs_num() const override; + static std::set> get_supported_precisions(const std::shared_ptr& node = nullptr); + +private: + void emit_impl(const std::vector &in_vec_idxs, const std::vector &out_vec_idxs) const override; + + template + void emit_isa(const std::vector &in_vec_idxs, const std::vector &out_vec_idxs) const; + + bool need_vmm_mask() const { return host_isa_ != dnnl::impl::cpu::x64::avx512_core; } + + void register_table_entries() override; + size_t aux_vecs_count() const override; +}; + class jit_erf_emitter : public jit_emitter { public: jit_erf_emitter(dnnl::impl::cpu::x64::jit_generator *host, dnnl::impl::cpu::x64::cpu_isa_t host_isa, @@ -533,6 +556,8 @@ class jit_erf_emitter : public jit_emitter { jit_erf_emitter(dnnl::impl::cpu::x64::jit_generator *host, dnnl::impl::cpu::x64::cpu_isa_t host_isa, const std::shared_ptr& n, ov::element::Type exec_prc = ov::element::f32); + void emit_data() const override; + size_t get_inputs_num() const override; static std::set> get_supported_precisions(const std::shared_ptr& node = nullptr); @@ -546,6 +571,8 @@ class jit_erf_emitter : public jit_emitter { void register_table_entries() override; size_t aux_vecs_count() const override; + + std::unique_ptr m_exp_emitter {nullptr}; }; class jit_soft_sign_emitter : public jit_emitter { diff --git a/src/plugins/intel_cpu/src/nodes/eltwise.cpp b/src/plugins/intel_cpu/src/nodes/eltwise.cpp index f2f6ce503bd5e4..ed4d936fa49ae6 100644 --- a/src/plugins/intel_cpu/src/nodes/eltwise.cpp +++ b/src/plugins/intel_cpu/src/nodes/eltwise.cpp @@ -244,7 +244,6 @@ std::set> eltwise_precision_helper::get_supported_pre OV_CASE(Algorithm::EltwiseAbs, jit_dnnl_aux_emitter), OV_CASE(Algorithm::EltwiseSqrt, jit_dnnl_aux_emitter), OV_CASE(Algorithm::EltwiseSoftRelu, jit_dnnl_aux_emitter), - 
OV_CASE(Algorithm::EltwiseExp, jit_dnnl_aux_emitter), OV_CASE(Algorithm::EltwiseClamp, jit_dnnl_aux_emitter), OV_CASE(Algorithm::EltwiseSwish, jit_dnnl_aux_emitter), OV_CASE(Algorithm::EltwiseHswish, jit_dnnl_aux_emitter), @@ -262,6 +261,7 @@ std::set> eltwise_precision_helper::get_supported_pre OV_CASE(Algorithm::EltwiseMod, jit_mod_emitter), OV_CASE(Algorithm::EltwiseMaximum, jit_maximum_emitter), OV_CASE(Algorithm::EltwiseMinimum, jit_minimum_emitter), + OV_CASE(Algorithm::EltwiseExp, jit_exp_emitter), OV_CASE(Algorithm::EltwiseSquaredDifference, jit_squared_difference_emitter), OV_CASE(Algorithm::EltwisePowerDynamic, jit_power_dynamic_emitter), OV_CASE(Algorithm::EltwiseEqual, jit_equal_emitter), @@ -623,7 +623,6 @@ struct jit_uni_eltwise_generic : public jit_uni_eltwise_kernel, public jit_gener OV_CASE(Algorithm::EltwiseAbs, jit_dnnl_aux_emitter), OV_CASE(Algorithm::EltwiseSqrt, jit_dnnl_aux_emitter), OV_CASE(Algorithm::EltwiseSoftRelu, jit_dnnl_aux_emitter), - OV_CASE(Algorithm::EltwiseExp, jit_dnnl_aux_emitter), OV_CASE(Algorithm::EltwiseClamp, jit_dnnl_aux_emitter), OV_CASE(Algorithm::EltwiseSwish, jit_dnnl_aux_emitter), OV_CASE(Algorithm::EltwiseHswish, jit_dnnl_aux_emitter), @@ -641,6 +640,7 @@ struct jit_uni_eltwise_generic : public jit_uni_eltwise_kernel, public jit_gener OV_CASE(Algorithm::EltwiseMod, jit_mod_emitter), OV_CASE(Algorithm::EltwiseMaximum, jit_maximum_emitter), OV_CASE(Algorithm::EltwiseMinimum, jit_minimum_emitter), + OV_CASE(Algorithm::EltwiseExp, jit_exp_emitter), OV_CASE(Algorithm::EltwiseSquaredDifference, jit_squared_difference_emitter), OV_CASE(Algorithm::EltwisePowerDynamic, jit_power_dynamic_emitter), OV_CASE(Algorithm::EltwiseEqual, jit_equal_emitter), @@ -1213,7 +1213,6 @@ const std::map& Eltwise::getIn }}, {ov::op::v0::Exp::get_type_info_static(), [](const std::shared_ptr& op, Eltwise& node) { node.algorithm = Algorithm::EltwiseExp; - node.onednnAlgorithm = dnnl::algorithm::eltwise_exp; }}, {SwishNode::get_type_info_static(), [](const std::shared_ptr& op, Eltwise& node) { auto swishOp = getNgraphOpAs(op); @@ -1873,7 +1872,6 @@ class EltwiseRefExecutor : public EltwiseRefBaseExecutor { case Algorithm::EltwiseAbs: case Algorithm::EltwiseSqrt: case Algorithm::EltwiseSoftRelu: - case Algorithm::EltwiseExp: case Algorithm::EltwiseClamp: case Algorithm::EltwiseSwish: case Algorithm::EltwiseHswish: @@ -1893,6 +1891,7 @@ class EltwiseRefExecutor : public EltwiseRefBaseExecutor { case Algorithm::EltwiseMod: *dst_ptr_f = src_f[0] - truncf(src_f[0] / src_f[1]) * src_f[1]; break; case Algorithm::EltwiseMaximum: *dst_ptr_f = std::max(src_f[0], src_f[1]); break; case Algorithm::EltwiseMinimum: *dst_ptr_f = std::min(src_f[0], src_f[1]); break; + case Algorithm::EltwiseExp: *dst_ptr_f = expf(src_f[0]); break; case Algorithm::EltwiseSquaredDifference: *dst_ptr_f = powf((src_f[0] - src_f[1]), 2.f); break; case Algorithm::EltwisePowerDynamic: *dst_ptr_f = powf(src_f[0], src_f[1]); break; case Algorithm::EltwiseEqual: *dst_ptr_f = src_f[0] == src_f[1]; break; From adeb3d2e0296db45a745ad6c02d4566570b18750 Mon Sep 17 00:00:00 2001 From: Alexey Moskalev Date: Tue, 22 Oct 2024 11:43:25 +0400 Subject: [PATCH 24/24] Adding CODE_OF_CONDUCT.md (#27100) Adding CODE_OF_CONDUCT.md to meet LF requirements. 
--- CODE_OF_CONDUCT.md | 119 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 119 insertions(+) create mode 100644 CODE_OF_CONDUCT.md diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md new file mode 100644 index 00000000000000..5044453266940d --- /dev/null +++ b/CODE_OF_CONDUCT.md @@ -0,0 +1,119 @@ +# Contributor Covenant Code of Conduct + +## Our Pledge + +We as members, contributors, and leaders pledge to make participation in our +community a harassment-free experience for everyone, regardless of age, body +size, visible or invisible disability, ethnicity, sex characteristics, gender +identity and expression, level of experience, education, socio-economic status, +nationality, personal appearance, race, religion, or sexual identity +and orientation. + +We pledge to act and interact in ways that contribute to an open, welcoming, +diverse, inclusive, and healthy community. + +## Our Standards + +Examples of behavior that contributes to a positive environment for our +community include: + +* Demonstrating empathy and kindness toward other people +* Being respectful of differing opinions, viewpoints, and experiences +* Giving and gracefully accepting constructive feedback +* Accepting responsibility and apologizing to those affected by our mistakes, + and learning from the experience +* Focusing on what is best not just for us as individuals, but for the + overall community + +Examples of unacceptable behavior include: + +* The use of sexualized language or imagery, and sexual attention or + advances of any kind +* Trolling, insulting or derogatory comments, and personal or political attacks +* Public or private harassment +* Publishing others' private information, such as a physical or email + address, without their explicit permission +* Other conduct which could reasonably be considered inappropriate in a + professional setting + +## Enforcement Responsibilities + +Community leaders are responsible for clarifying and enforcing our standards of +acceptable behavior and will take appropriate and fair corrective action in +response to any behavior that they deem inappropriate, threatening, offensive, +or harmful. + +Community leaders have the right and responsibility to remove, edit, or reject +comments, commits, code, wiki edits, issues, and other contributions that are +not aligned to this Code of Conduct, and will communicate reasons for moderation +decisions when appropriate. + +## Scope + +This Code of Conduct applies within all community spaces, and also applies when +an individual is officially representing the community in public spaces. +Examples of representing our community include using an official email address, +posting via an official social media account, or acting as an appointed +representative at an online or offline event. + +## Enforcement + +Instances of abusive, harassing, or otherwise unacceptable behavior may be +reported to the community leaders responsible for enforcement at +openvino_codeofconduct At intel DOT com. +All complaints will be reviewed and investigated promptly and fairly. + +All community leaders are obligated to respect the privacy and security of the +reporter of any incident. + +## Enforcement Guidelines + +Community leaders will follow these Community Impact Guidelines in determining +the consequences for any action they deem in violation of this Code of Conduct: + +### 1. Correction + +**Community Impact**: Use of inappropriate language or other behavior deemed +unprofessional or unwelcome in the community. 
+ +**Consequence**: A private, written warning from community leaders, providing +clarity around the nature of the violation and an explanation of why the +behavior was inappropriate. A public apology may be requested. + +### 2. Warning + +**Community Impact**: A violation through a single incident or series +of actions. + +**Consequence**: A warning with consequences for continued behavior. No +interaction with the people involved, including unsolicited interaction with +those enforcing the Code of Conduct, for a specified period of time. This +includes avoiding interactions in community spaces as well as external channels +like social media. Violating these terms may lead to a temporary or +permanent ban. + +### 3. Temporary Ban + +**Community Impact**: A serious violation of community standards, including +sustained inappropriate behavior. + +**Consequence**: A temporary ban from any sort of interaction or public +communication with the community for a specified period of time. No public or +private interaction with the people involved, including unsolicited interaction +with those enforcing the Code of Conduct, is allowed during this period. +Violating these terms may lead to a permanent ban. + +### 4. Permanent Ban + +**Community Impact**: Demonstrating a pattern of violation of community +standards, including sustained inappropriate behavior, harassment of an +individual, or aggression toward or disparagement of classes of individuals. + +**Consequence**: A permanent ban from any sort of public interaction within +the community. + +## Attribution + +This Code of Conduct is adapted from the [Contributor Covenant][homepage], +version 2.0, available at +[https://www.contributor-covenant.org/version/2/0/code_of_conduct.html][v2.0].