Merge branch 'develop' into memory-logger-tool
nikita-savelyevv committed Jul 17, 2024
2 parents 45226e5 + d113c2a commit 7043204
Showing 68 changed files with 792 additions and 649 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/nightly.yml
@@ -11,4 +11,6 @@ jobs:
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29 # v4.1.6
- uses: AlexanderDokuchaev/md-dead-link-check@76ecefc7f64753bba30a36179f46d903e9f77669 # v0.8
- uses: AlexanderDokuchaev/md-dead-link-check@cc3ed55268899a1a6d5fd7068abbc4591eab1f74 # v0.9
with:
config: md_dead_link_check.toml
4 changes: 3 additions & 1 deletion .github/workflows/pre-commit-linters.yml
@@ -24,4 +24,6 @@ jobs:
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29 # v4.1.6
- uses: AlexanderDokuchaev/md-dead-link-check@76ecefc7f64753bba30a36179f46d903e9f77669 # v0.8
- uses: AlexanderDokuchaev/md-dead-link-check@cc3ed55268899a1a6d5fd7068abbc4591eab1f74 # v0.9
with:
config: md_dead_link_check.toml
92 changes: 52 additions & 40 deletions README.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/Algorithms.md
@@ -11,6 +11,7 @@
- Symmetric 8 bit compression mode
- Symmetric and asymmetric 4 bit compression mode
- NF4 compression mode
- E2M1 weights with E8M0 scales compression mode
- Mixed precision weights compression
- Grouped weights compression

16 changes: 13 additions & 3 deletions docs/usage/post_training_compression/weights_compression/Usage.md
@@ -9,7 +9,7 @@ The Weights Compression algorithm is aimed at compressing the weights of the mod
#### Supported modes

By default, weights are compressed asymmetrically to 8-bit integer data type - "INT8_ASYM" mode.
OpenVINO backend also supports 3 modes of mixed precision weight quantization with a 4-bit data type as a primary precision - INT4_SYM, INT4_ASYM and NF4. The primary precision in case of INT4_SYM mode is signed 4-bit integer and weights are quantized to it [symmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization) without zero point. In case of INT4_ASYM mode - unsigned 4-bit integer and weight are quantized to it [asymmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization) with a typical non-fixed zero point. In case of NF4 mode - [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without zero point.
OpenVINO backend also supports 4 modes of mixed precision weight quantization with a 4-bit data type as a primary precision - INT4_SYM, INT4_ASYM, NF4 and E2M1. The primary precision in case of INT4_SYM mode is signed 4-bit integer and weights are quantized to it [symmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization) without zero point. In case of INT4_ASYM mode - unsigned 4-bit integer and weights are quantized to it [asymmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization) with a typical non-fixed zero point. In case of NF4 mode - [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without zero point. In case of E2M1 mode - [e2m1](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data type without zero point and with an 8-bit [E8M0](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) scale.
All 4-bit modes support grouped quantization, where a small group of weights (e.g. 128) in the channel dimension shares quantization parameters (scale).
All embeddings, convolutions and last linear layers are always compressed to the 8-bit integer data type. To quantize embeddings and last linear layers to 4-bit, use `all_layers=True`.
The percentage of the remaining layers compressed to 4-bit can be configured by the "ratio" parameter, e.g. ratio=0.9 means that 90% of the layers are compressed to the corresponding 4-bit data type and the rest to the 8-bit asymmetric integer data type (see the sketch below).
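
A minimal sketch of how these parameters combine is shown below. The `model` object is assumed to be an already loaded OpenVINO or PyTorch model, and the specific mode, ratio and group size values are illustrative rather than taken from this commit:

```python
from nncf import compress_weights, CompressWeightsMode

# Illustrative values only: 90% of eligible layers are compressed to 4-bit NF4
# with 128 weights per quantization group; embeddings and last linear layers
# are included because all_layers=True; the remaining layers stay 8-bit.
compressed_model = compress_weights(
    model,
    mode=CompressWeightsMode.NF4,
    ratio=0.9,
    group_size=128,
    all_layers=True,
)
```
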
@@ -40,7 +40,7 @@ compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM) #
- Generally, `INT4_SYM` mode is the fastest mixed-precision mode, but it may lead to a significant accuracy degradation or perplexity increase.
Compressing weights asymmetrically (`INT4_ASYM` mode) is a way to increase accuracy; however, in turn it slows down inference a bit.
If the accuracy or perplexity is still not satisfying, there are 2 more hyper-parameters to tune: `group_size` and `ratio`. Please refer to the [example](https://github.com/openvinotoolkit/nncf/blob/develop/examples/llm_compression/openvino/tiny_llama_find_hyperparams) for how to automatically tune these parameters.
Lower group size and less ratio of 4-bit layers usually improve accuracy at the sacrifice of inference speed.
A lower group size and a smaller ratio of 4-bit layers usually improve accuracy at the cost of inference speed. To disable grouped quantization and quantize weights per-channel, set `group_size = -1`.
Below is an example of how to compress the weights of 90% of the layers to 4-bit integer asymmetrically with a group size of 64, and
the rest of the layers to the 8-bit asymmetric integer data type. The same parametrization is applicable for `INT4_SYM` mode.

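The referenced snippet itself is not included in this excerpt; under the parameters just described it would look roughly like the following sketch (again assuming `model` is already loaded):

```python
from nncf import compress_weights, CompressWeightsMode

# Sketch of the described setup: 90% of the layers are compressed to unsigned
# 4-bit integers asymmetrically with a group size of 64, and the remaining
# layers fall back to the 8-bit asymmetric integer data type.
compressed_model = compress_weights(
    model,
    mode=CompressWeightsMode.INT4_ASYM,
    ratio=0.9,
    group_size=64,
)
```
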
@@ -144,6 +144,15 @@ from nncf import compress_weights, CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.NF4)
```

- `E2M1` mode can be considered for improving accuracy, but currently models quantized to e2m1 should not be expected to be faster than models
quantized to 8-bit asymmetric integer. Here is an example of how to compress weights to the e2m1 data type with group size = 32 (recommended).
Different `group_size` and `ratio` values are also supported.

```python
from nncf import compress_weights, CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.E2M1, group_size=32, all_layers=True)
```

#### Evaluation results

Here is the perplexity and model size before and after weight compression for different language models on the [Lambada OpenAI dataset](https://github.com/openai/gpt-2/issues/131#issuecomment-497136199).
@@ -512,8 +521,9 @@ Here is the perplexity and accuracy with data-free and data-aware mixed-precisio
- The algorithm is supported for OpenVINO and PyTorch models.
- The compression applies in-place.
- The compressed model is not trainable.
- INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed precision selection is available for OpenVINO backend only.
- INT4_SYM, INT4_ASYM, NF4 and E2M1 modes, grouped quantization and mixed precision selection are available for OpenVINO backend only.
- NF4 support is experimental - models quantized to nf4 should not be expected to be faster than models quantized to 8-bit integer.
- E2M1 support is experimental - models quantized to e2m1 should not be expected to be faster than models quantized to 8-bit integer.

#### Additional resources

2 changes: 1 addition & 1 deletion examples/llm_compression/openvino/tiny_llama/main.py
@@ -67,7 +67,7 @@ def transform_fn(data, model, tokenizer):
)
model.save_pretrained(OUTPUT_DIR)

model = OVModelForCausalLM.from_pretrained(OUTPUT_DIR)
model = OVModelForCausalLM.from_pretrained(OUTPUT_DIR, ov_config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": "0"})
input_ids = tokenizer("What is PyTorch?", return_tensors="pt").to(device=model.device)

start_t = time.time()
(changes in another file; the file path is not rendered in this view)
@@ -241,7 +241,12 @@ def gen_pkv(num_heads, head_dim, num_layers):

def main():
model_id = "TinyLlama/TinyLlama-1.1B-step-50K-105b" # <YOUR_MODEL_ID>
ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}
ov_config = {
"PERFORMANCE_HINT": "LATENCY",
"NUM_STREAMS": "1",
"CACHE_DIR": "",
"DYNAMIC_QUANTIZATION_GROUP_SIZE": "0",
}
model = OVModelForCausalLM.from_pretrained(
model_id,
export=True,
(changes in another file; the file path is not rendered in this view)
@@ -22,6 +22,7 @@

import nncf

tfds.display_progress_bar(enable=False)
ROOT = Path(__file__).parent.resolve()
WEIGHTS_URL = "https://huggingface.co/alexsu52/mobilenet_v2_imagenette/resolve/main/tf_model.h5"
DATASET_CLASSES = 10
2 changes: 2 additions & 0 deletions md_dead_link_check.toml
@@ -0,0 +1,2 @@
[tool.md_dead_link_check]
exclude_files = ["ReleaseNotes.md"]
2 changes: 2 additions & 0 deletions nncf/common/graph/patterns/patterns.py
@@ -330,6 +330,8 @@ class HWFusedPatternNames(Enum):
ACTIVATIONS_SCALE_SHIFT = PatternDesc("activations_scale_shift")
ARITHMETIC_ACTIVATIONS = PatternDesc("arithmetic_activations")
ARITHMETIC_ACTIVATIONS_BATCH_NORM = PatternDesc("arithmetic_activations_batch_norm")
# StyleGan2
ARITHMETIC_ACTIVATIONS_ARITHMETIC = PatternDesc("arithmetic_activations_arithmetic")
ARITHMETIC_ACTIVATIONS_SCALE_SHIFT = PatternDesc("arithmetic_activations_scale_shift")
ARITHMETIC_BATCH_NORM = PatternDesc("arithmetic_batch_norm")
ARITHMETIC_BATCH_NORM_ACTIVATIONS = PatternDesc("arithmetic_batch_norm_activations")
34 changes: 1 addition & 33 deletions nncf/common/insertion_point_graph.py
@@ -12,7 +12,7 @@
from collections import defaultdict
from copy import deepcopy
from enum import Enum
from typing import Dict, List, Optional, Set
from typing import Dict, List, Set

import networkx as nx

@@ -23,7 +23,6 @@
from nncf.common.graph.layer_attributes import Dtype
from nncf.common.graph.operator_metatypes import INPUT_NOOP_METATYPES
from nncf.common.graph.patterns import GraphPattern
from nncf.common.logging import nncf_logger


class InsertionPointGraphNodeType(Enum):
@@ -393,34 +392,3 @@ def get_pre_hook_node_key(node_key: str, input_port_id: int = 0) -> str:
@staticmethod
def get_post_hook_node_key(node_key: str) -> str:
return InsertionPointGraph.POST_HOOK_ID_PREFIX + node_key


class ConstantNodesFilter:
@staticmethod
def filter(ip_graph: InsertionPointGraph, start_traversing_node_keys: Optional[List[str]]) -> InsertionPointGraph:
"""
Removes all Constant nodes from InsertionPointGraph, making it inference graph.
The traversing starts from the input nodes and nodes with weights.
:param ip_graph: The original InsertionPointGraph.
:param start_traversing_node_keys: Keys of the nodes from which the traversing will be start.
:return: InsertionPointGraph without Constant nodes.
"""
input_nodes = ip_graph.get_input_nodes()
if not input_nodes:
nncf_logger.debug("Skipped filtering - no input nodes found")
return ip_graph
weight_nodes = []
if start_traversing_node_keys is not None:
weight_nodes = [
ip_graph.get_merged_node_from_single_node_key(weight_node) for weight_node in start_traversing_node_keys
]
visited_nodes = set()
start_nodes = input_nodes + weight_nodes
for node in start_nodes:
for node_from, node_to in nx.bfs_edges(ip_graph, source=node):
visited_nodes.add(node_from)
visited_nodes.add(node_to)
constant_nodes = [node for node in ip_graph.nodes if node not in visited_nodes]
ip_graph.remove_nodes_from(constant_nodes)
return ip_graph
2 changes: 1 addition & 1 deletion nncf/openvino/graph/metatypes/openvino_metatypes.py
@@ -129,7 +129,7 @@ class OVEluMetatype(OVOpMetatype):
@OV_OPERATOR_METATYPES.register()
class OVPReluMetatype(OVOpMetatype):
name = "PReluOp"
op_names = ["PReLU"]
op_names = ["PRelu"]


@OV_OPERATOR_METATYPES.register()