Computation of compression parameters via OpenVINO models #2727

Open
nikita-savelyevv wants to merge 77 commits into develop from compress-via-openvino
Conversation

nikita-savelyevv (Collaborator) commented Jun 11, 2024

Changes

  • Implemented OpenVINO model graphs which are used for calculation of compressed and decompressed weights. Since these models are compiled, calculation becomes significantly faster, especially for larger models and int4 compression.
  • This functionality is exposed through two methods in weight_lowering.py (see the usage sketch after this list):
    • do_int_quantization() is used for computing a compressed weight. Possible signatures:
      • weight -> compressed_weight, scale, (zero_point for asymmetric compression)
      • weight, scale, (zero_point) -> compressed_weight, scale, (zero_point)
    • calculate_quantized_dequantized_weight() is used for computing a decompressed weight. Possible signatures:
      • weight -> decompressed_weight
      • weight, scale, (zero_point) -> decompressed_weight
      • weight -> decompressed_weight, compressed_weight, scale, (zero_point)
      • weight, scale, (zero_point) -> decompressed_weight, compressed_weight, scale, (zero_point)
    • Output scale and zero_point are the same as the ones given as input (if they were given at all).
    • Computation is done via OV models only if the openvino package is installed and the input tensors are not torch tensors.
  • Introduced a new NNCF Tensor backend for storing instances of openvino.Tensor. The implementation for this backend is limited to only the required functionality, e.g. addition of OV Tensors is not supported because it is not needed.
    • Introduction of OV Tensors is required for seamless handling of tensors in bf16, u4 and i4 data types. For example, bf16 constants are read from an OpenVINO LLM and given as inputs to a compressing OpenVINO model. u4 and i4 compressed weights are seamlessly inserted into the resulting compressed OpenVINO model.
    • Added a tensor.to_backend() method to convert an NNCF Tensor from one backend to another. Currently only OV<->NumPy conversion is required.
  • All calculations are aligned with the reference numpy implementation. Some performance and memory sacrifices had to be made to achieve this alignment.
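
A minimal usage sketch of the two entry points described above (function names taken from this description; the compression config argument and the exact keyword/return conventions are assumptions for illustration, not the exact PR signatures):

from nncf.quantization.algorithms.weight_compression.weight_lowering import (
    do_int_quantization,
    calculate_quantized_dequantized_weight,
)

# Compress a weight; scale (and zero_point for asymmetric modes) are computed when not given.
compressed_weight, scale, zero_point = do_int_quantization(weight, config)

# Re-run with precomputed parameters; the returned scale/zero_point equal the given ones.
compressed_weight, scale, zero_point = do_int_quantization(weight, config, scale, zero_point)

# Compute the decompressed weight; the other signatures additionally return the intermediate results.
decompressed_weight = calculate_quantized_dequantized_weight(weight, config, scale, zero_point)

# tensor.to_backend() converts between NNCF Tensor backends; the enum value below is illustrative.
# numpy_weight = ov_weight.to_backend(TensorBackend.numpy)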

Data-free asymmetric compression: [image]

Data-free symmetric compression: [image]

Data-aware compression: [image]

Reason for changes

Reducing model compression time. Only the OpenVINO model compression backend is affected.

Related tickets

139047

Tests

  • tests/openvino/native/quantization/test_ov_modeling_compression.py::test_quantization_alignment -- checks alignment with the reference numpy implementation
  • tests/openvino/native/test_openvino_modeling.py -- checks OV modeling framework hyperparameters
  • tests/openvino/native/test_tensor.py -- NNCF OV Tensor backend tests

Validation jobs:

@github-actions bot added the NNCF Common, NNCF OpenVINO, and NNCF PTQ labels Jun 11, 2024
@nikita-savelyevv force-pushed the compress-via-openvino branch 4 times, most recently from 55cafaa to a68a63d on July 3, 2024 18:31
@nikita-savelyevv force-pushed the compress-via-openvino branch 4 times, most recently from 6b98ddd to 3d9faa4 on July 16, 2024 14:19
@nikita-savelyevv force-pushed the compress-via-openvino branch 6 times, most recently from 1c85732 to b527cac on September 6, 2024 11:11
@github-actions bot added the documentation label Sep 6, 2024
@nikita-savelyevv force-pushed the compress-via-openvino branch 2 times, most recently from ac3ea02 to 2a3a63c on September 11, 2024 12:59
@nikita-savelyevv force-pushed the compress-via-openvino branch 2 times, most recently from fe30c13 to 19ea412 on October 21, 2024 08:52
@nikita-savelyevv force-pushed the compress-via-openvino branch 3 times, most recently from eef34f8 to ca3447c on October 26, 2024 13:40
@nikita-savelyevv changed the title from "Generalize weight compression via OpenVINO submodels" to "Computation of compression parameters via OpenVINO models" Dec 12, 2024
@nikita-savelyevv marked this pull request as ready for review December 13, 2024 10:57
@nikita-savelyevv requested a review from a team as a code owner December 13, 2024 10:57


@lru_cache(None)
def log_once(level: int, message: str) -> None:
Contributor:

NNCF already has a solution for single logging with DuplicateFilter:

dup_filter = DuplicateFilter() # so that the overflow fix warning is only logged once
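
For reference, a minimal sketch of the lru_cache-based approach from this diff (the logger name is assumed for illustration):

import logging
from functools import lru_cache

nncf_logger = logging.getLogger("nncf")

@lru_cache(None)
def log_once(level: int, message: str) -> None:
    # The (level, message) pair is the cache key, so repeated calls with the same
    # arguments hit the cache and the message is emitted only once.
    nncf_logger.log(level, message)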

return item in self._cache


def cache_results(cache: ResultsCacheContainer) -> Callable: # type: ignore
Contributor:

It looks like you implemented a general solution for function output caching based on memoization. functools already provides such an implementation: https://docs.python.org/dev/library/functools.html#functools.cache. What do you think about using it?
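
For context, a minimal sketch of the suggested functools.cache approach (the function name here is hypothetical):

from functools import cache

@cache  # equivalent to functools.lru_cache(maxsize=None); requires hashable arguments
def get_compiled_model(model_key: tuple):
    # Build and compile the OpenVINO model for this key; subsequent calls with
    # the same key return the cached result.
    ...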

return divide_node


def create_ov_const_from_tensor(x: Tensor, dtype: ov.Type, name: Optional[str] = None) -> Constant:
Contributor:

Suggested change
def create_ov_const_from_tensor(x: Tensor, dtype: ov.Type, name: Optional[str] = None) -> Constant:
def create_ov_const_from_tensor(x: Tensor, dtype: ov.Type, name: Optional[str] = None) -> op.Constant:

@@ -14,6 +14,7 @@
import numpy as np
import openvino.runtime as ov
import openvino.runtime.opset13 as opset
from openvino._pyopenvino.op import Constant
Contributor:

Suggested change
from openvino._pyopenvino.op import Constant
import openvino.runtime.op as op

@@ -107,16 +110,17 @@ def cnt_if_op(model: ov.Model, cnt: int) -> int:
return cnt_if_op(model, 0)


def get_const_value(const_node: ov.Node) -> np.ndarray:
def get_const_value(const_node: ov.Node, cast_bf16_to_fp32: Optional[bool] = True) -> np.ndarray:
Contributor:

Suggested change
def get_const_value(const_node: ov.Node, cast_bf16_to_fp32: Optional[bool] = True) -> np.ndarray:
def get_const_value(const_node: ov.Node, cast_bf16_to_fp32: bool = True) -> np.ndarray:

@@ -40,12 +40,23 @@ def num_bits(self):
"""
return 8 if self.mode in [CompressWeightsMode.INT8_SYM, CompressWeightsMode.INT8_ASYM] else 4

@property
def is_int_asym(self):
Contributor:

Suggested change
def is_int_asym(self):
def is_asymmetric_mode(self):

# Infer the model
inputs = [inp.data for inp in inputs]
if ov_model_params.return_ov_tensors:
infer_request = compiled_model.create_infer_request()
Contributor:

If you use the cache, I believe you can cache the infer request to avoid creating an instance on every call. Did you try it?
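
A sketch of what such reuse could look like (the helper and cache below are hypothetical, not part of the PR):

_infer_request_cache: dict = {}

def get_infer_request(compiled_model):
    # Create the infer request once per compiled model and reuse it afterwards,
    # instead of calling create_infer_request() on every inference.
    key = id(compiled_model)  # keyed by id() for brevity; a real cache should track the model's lifetime
    if key not in _infer_request_cache:
        _infer_request_cache[key] = compiled_model.create_infer_request()
    return _infer_request_cache[key]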


# Infer the model
inputs = [inp.data for inp in inputs]
if ov_model_params.return_ov_tensors:
Contributor:

Could you briefly explain why you use different APIs for model inference, such as model(inputs) and an infer request? Is there any advantage to this?
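
For context, the two inference styles being compared (behavior as commonly understood for the openvino Python API; worth double-checking):

# Calling the compiled model directly returns outputs converted to numpy arrays.
results = compiled_model(inputs)

# An explicit infer request exposes the raw output openvino.Tensor objects,
# which preserves data types such as bf16/u4 instead of converting them.
request = compiled_model.create_infer_request()
request.infer(inputs)
first_output = request.get_output_tensor(0)  # openvino.Tensor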

@@ -0,0 +1,519 @@
# Copyright (c) 2024 Intel Corporation
Contributor:

Do you think this approach offers a good opportunity for extension? I mean, if a developer wants to add a new function, what should they implement?

compressed_weights = calculate_quantized_weight(weight, config, scale, zero_point)
return compressed_weights, scale, zero_point

from nncf.quantization.algorithms.weight_compression.openvino_modeling import OVModelParameters
Contributor:

In my opinion, a developer has to write a lot of code to use a function powered by OpenVINO, and I assume that in this form few people will use it. You should think about how to simplify it.

@@ -0,0 +1,123 @@
# Copyright (c) 2024 Intel Corporation
@alexsu52 (Contributor) commented Jan 9, 2025:

Please rename ov.py -> ov_numeric.py

Labels: documentation, NNCF Common, NNCF OpenVINO, NNCF PTQ