Merge branch 'develop' into memory-logger-tool
nikita-savelyevv committed Jul 17, 2024
2 parents 45226e5 + d113c2a commit 7043204
Showing 68 changed files with 792 additions and 649 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/nightly.yml
@@ -11,4 +11,6 @@ jobs:
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29 # v4.1.6
- uses: AlexanderDokuchaev/md-dead-link-check@76ecefc7f64753bba30a36179f46d903e9f77669 # v0.8
- uses: AlexanderDokuchaev/md-dead-link-check@cc3ed55268899a1a6d5fd7068abbc4591eab1f74 # v0.9
with:
config: md_dead_link_check.toml
4 changes: 3 additions & 1 deletion .github/workflows/pre-commit-linters.yml
@@ -24,4 +24,6 @@ jobs:
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29 # v4.1.6
- uses: AlexanderDokuchaev/md-dead-link-check@76ecefc7f64753bba30a36179f46d903e9f77669 # v0.8
- uses: AlexanderDokuchaev/md-dead-link-check@cc3ed55268899a1a6d5fd7068abbc4591eab1f74 # v0.9
with:
config: md_dead_link_check.toml
92 changes: 52 additions & 40 deletions README.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/Algorithms.md
@@ -11,6 +11,7 @@
- Symmetric 8 bit compression mode
- Symmetric and asymmetric 4 bit compression mode
- NF4 compression mode
- E2M1 weights with E8M0 scales compression mode
- Mixed precision weights compression
- Grouped weights compression

16 changes: 13 additions & 3 deletions docs/usage/post_training_compression/weights_compression/Usage.md
@@ -9,7 +9,7 @@ The Weights Compression algorithm is aimed at compressing the weights of the mod
#### Supported modes

By default, weights are compressed asymmetrically to 8-bit integer data type - "INT8_ASYM" mode.
OpenVINO backend also supports 3 modes of mixed precision weight quantization with a 4-bit data type as a primary precision - INT4_SYM, INT4_ASYM and NF4. The primary precision in case of INT4_SYM mode is signed 4-bit integer and weights are quantized to it [symmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization) without zero point. In case of INT4_ASYM mode - unsigned 4-bit integer and weight are quantized to it [asymmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization) with a typical non-fixed zero point. In case of NF4 mode - [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without zero point.
OpenVINO backend also supports 4 modes of mixed precision weight quantization with a 4-bit data type as a primary precision - INT4_SYM, INT4_ASYM, NF4 and E2M1. The primary precision in case of INT4_SYM mode is signed 4-bit integer and weights are quantized to it [symmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization) without zero point. In case of INT4_ASYM mode - unsigned 4-bit integer and weights are quantized to it [asymmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization) with a typical non-fixed zero point. In case of NF4 mode - [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without zero point. In case of E2M1 mode - [e2m1](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data type without zero point and with an 8-bit [E8M0](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) scale.
All 4-bit modes support grouped quantization, where a small group of weights (e.g. 128) in the channel dimension shares quantization parameters (scale).
All embeddings, convolutions and last linear layers are always compressed to the 8-bit integer data type. To quantize embeddings and last linear layers to 4-bit, use `all_layers=True`.
The percentage of the remaining layers compressed to 4-bit can be configured by the "ratio" parameter, e.g. ratio=0.9 means that 90% of the layers are compressed to the corresponding 4-bit data type and the rest to the 8-bit asymmetric integer data type (see the sketch below).
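
A minimal sketch of how these parameters combine is shown below. The `model` object is assumed to be an already loaded OpenVINO or PyTorch model, and the specific mode, ratio and group size values are illustrative rather than taken from this commit:

```python
from nncf import compress_weights, CompressWeightsMode

# Illustrative values only: 90% of eligible layers are compressed to 4-bit NF4
# with 128 weights per quantization group; embeddings and last linear layers
# are included because all_layers=True; the remaining layers stay 8-bit.
compressed_model = compress_weights(
    model,
    mode=CompressWeightsMode.NF4,
    ratio=0.9,
    group_size=128,
    all_layers=True,
)
```
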
@@ -40,7 +40,7 @@ compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM) #
- Generally, `INT4_SYM` mode is the fastest mixed-precision mode, but it may lead to a significant accuracy degradation or perplexity increase.
Compressing weights asymmetrically (`INT4_ASYM` mode) is a way to increase accuracy; however, in turn it slows down inference a bit.
If the accuracy or perplexity is still not satisfying, there are 2 more hyper-parameters to tune: `group_size` and `ratio`. Please refer to the [example](https://github.com/openvinotoolkit/nncf/blob/develop/examples/llm_compression/openvino/tiny_llama_find_hyperparams) for how to automatically tune these parameters.
Lower group size and less ratio of 4-bit layers usually improve accuracy at the sacrifice of inference speed.
A lower group size and a smaller ratio of 4-bit layers usually improve accuracy at the cost of inference speed. To disable grouped quantization and quantize weights per-channel, set `group_size = -1`.
Below is an example of how to compress the weights of 90% of the layers to 4-bit integer asymmetrically with a group size of 64, and
the rest of the layers to the 8-bit asymmetric integer data type. The same parametrization is applicable for `INT4_SYM` mode.

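The referenced snippet itself is not included in this excerpt; under the parameters just described it would look roughly like the following sketch (again assuming `model` is already loaded):

```python
from nncf import compress_weights, CompressWeightsMode

# Sketch of the described setup: 90% of the layers are compressed to unsigned
# 4-bit integers asymmetrically with a group size of 64, and the remaining
# layers fall back to the 8-bit asymmetric integer data type.
compressed_model = compress_weights(
    model,
    mode=CompressWeightsMode.INT4_ASYM,
    ratio=0.9,
    group_size=64,
)
```
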
@@ -144,6 +144,15 @@ from nncf import compress_weights, CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.NF4)
```

- `E2M1` mode can be considered for improving accuracy, but currently models quantized to e2m1 should not be expected to be faster than models
quantized to 8-bit asymmetric integer. Here is an example of how to compress weights to the e2m1 data type with group size = 32 (recommended).
Different `group_size` and `ratio` values are also supported.

```python
from nncf import compress_weights, CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.E2M1, group_size=32, all_layers=True)
```

#### Evaluation results

Here is the perplexity and model size before and after weight compression for different language models on the [Lambada OpenAI dataset](https://github.com/openai/gpt-2/issues/131#issuecomment-497136199).
@@ -512,8 +521,9 @@ Here is the perplexity and accuracy with data-free and data-aware mixed-precisio
- The algorithm is supported for OpenVINO and PyTorch models.
- The compression applies in-place.
- The compressed model is not trainable.
- INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection is available for OpenVINO backend only.
- INT4_SYM, INT4_ASYM, NF4 and E2M1 modes, grouped quantization and mixed precision selection are available for OpenVINO backend only.
- NF4 support is experimental - models quantized to nf4 should not be expected to be faster than models quantized to 8-bit integer.
- E2M1 support is experimental - models quantized to e2m1 should not be expected to be faster than models quantized to 8-bit integer.

#### Additional resources

2 changes: 1 addition & 1 deletion examples/llm_compression/openvino/tiny_llama/main.py
@@ -67,7 +67,7 @@ def transform_fn(data, model, tokenizer):
)
model.save_pretrained(OUTPUT_DIR)

model = OVModelForCausalLM.from_pretrained(OUTPUT_DIR)
model = OVModelForCausalLM.from_pretrained(OUTPUT_DIR, ov_config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": "0"})
input_ids = tokenizer("What is PyTorch?", return_tensors="pt").to(device=model.device)

start_t = time.time()
(changes in another file; the file path is not rendered in this view)
@@ -241,7 +241,12 @@ def gen_pkv(num_heads, head_dim, num_layers):

def main():
model_id = "TinyLlama/TinyLlama-1.1B-step-50K-105b" # <YOUR_MODEL_ID>
ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}
ov_config = {
"PERFORMANCE_HINT": "LATENCY",
"NUM_STREAMS": "1",
"CACHE_DIR": "",
"DYNAMIC_QUANTIZATION_GROUP_SIZE": "0",
}
model = OVModelForCausalLM.from_pretrained(
model_id,
export=True,
(changes in another file; the file path is not rendered in this view)
@@ -22,6 +22,7 @@

import nncf

tfds.display_progress_bar(enable=False)
ROOT = Path(__file__).parent.resolve()
WEIGHTS_URL = "https://huggingface.co/alexsu52/mobilenet_v2_imagenette/resolve/main/tf_model.h5"
DATASET_CLASSES = 10
2 changes: 2 additions & 0 deletions md_dead_link_check.toml
@@ -0,0 +1,2 @@
[tool.md_dead_link_check]
exclude_files = ["ReleaseNotes.md"]
2 changes: 2 additions & 0 deletions nncf/common/graph/patterns/patterns.py
@@ -330,6 +330,8 @@ class HWFusedPatternNames(Enum):
ACTIVATIONS_SCALE_SHIFT = PatternDesc("activations_scale_shift")
ARITHMETIC_ACTIVATIONS = PatternDesc("arithmetic_activations")
ARITHMETIC_ACTIVATIONS_BATCH_NORM = PatternDesc("arithmetic_activations_batch_norm")
# StyleGan2
ARITHMETIC_ACTIVATIONS_ARITHMETIC = PatternDesc("arithmetic_activations_arithmetic")
ARITHMETIC_ACTIVATIONS_SCALE_SHIFT = PatternDesc("arithmetic_activations_scale_shift")
ARITHMETIC_BATCH_NORM = PatternDesc("arithmetic_batch_norm")
ARITHMETIC_BATCH_NORM_ACTIVATIONS = PatternDesc("arithmetic_batch_norm_activations")
34 changes: 1 addition & 33 deletions nncf/common/insertion_point_graph.py
@@ -12,7 +12,7 @@
from collections import defaultdict
from copy import deepcopy
from enum import Enum
from typing import Dict, List, Optional, Set
from typing import Dict, List, Set

import networkx as nx

@@ -23,7 +23,6 @@
from nncf.common.graph.layer_attributes import Dtype
from nncf.common.graph.operator_metatypes import INPUT_NOOP_METATYPES
from nncf.common.graph.patterns import GraphPattern
from nncf.common.logging import nncf_logger


class InsertionPointGraphNodeType(Enum):
@@ -393,34 +392,3 @@ def get_pre_hook_node_key(node_key: str, input_port_id: int = 0) -> str:
@staticmethod
def get_post_hook_node_key(node_key: str) -> str:
return InsertionPointGraph.POST_HOOK_ID_PREFIX + node_key


class ConstantNodesFilter:
@staticmethod
def filter(ip_graph: InsertionPointGraph, start_traversing_node_keys: Optional[List[str]]) -> InsertionPointGraph:
"""
Removes all Constant nodes from InsertionPointGraph, making it inference graph.
The traversing starts from the input nodes and nodes with weights.
:param ip_graph: The original InsertionPointGraph.
:param start_traversing_node_keys: Keys of the nodes from which the traversing will be start.
:return: InsertionPointGraph without Constant nodes.
"""
input_nodes = ip_graph.get_input_nodes()
if not input_nodes:
nncf_logger.debug("Skipped filtering - no input nodes found")
return ip_graph
weight_nodes = []
if start_traversing_node_keys is not None:
weight_nodes = [
ip_graph.get_merged_node_from_single_node_key(weight_node) for weight_node in start_traversing_node_keys
]
visited_nodes = set()
start_nodes = input_nodes + weight_nodes
for node in start_nodes:
for node_from, node_to in nx.bfs_edges(ip_graph, source=node):
visited_nodes.add(node_from)
visited_nodes.add(node_to)
constant_nodes = [node for node in ip_graph.nodes if node not in visited_nodes]
ip_graph.remove_nodes_from(constant_nodes)
return ip_graph
2 changes: 1 addition & 1 deletion nncf/openvino/graph/metatypes/openvino_metatypes.py
@@ -129,7 +129,7 @@ class OVEluMetatype(OVOpMetatype):
@OV_OPERATOR_METATYPES.register()
class OVPReluMetatype(OVOpMetatype):
name = "PReluOp"
op_names = ["PReLU"]
op_names = ["PRelu"]


@OV_OPERATOR_METATYPES.register()