
Support OpenVINO int8 static quantization #3025

Merged: 12 commits into UKPLab:master on Nov 1, 2024

Conversation

@l-bat (Contributor) commented Oct 28, 2024

Add Post-Training Static Quantization support for OpenVINO models

Usage examples:

To quantize a Hugging Face Hub model:

from sentence_transformers import SentenceTransformer, export_static_quantized_openvino_model
from optimum.intel import OVQuantizationConfig

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", backend="openvino")
quantization_config = OVQuantizationConfig()
export_static_quantized_openvino_model(
    model, quantization_config, "sentence-transformers/all-MiniLM-L6-v2", push_to_hub=True, create_pr=True
)

To quantize a local model:

from sentence_transformers import SentenceTransformer, export_static_quantized_openvino_model
from optimum.intel import OVQuantizationConfig

model = SentenceTransformer("path/to/my/mpnet-legal-finetuned", backend="openvino")
quantization_config = OVQuantizationConfig()
export_static_quantized_openvino_model(model, quantization_config, "path/to/my/mpnet-legal-finetuned")
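
For completeness, a minimal sketch of loading the exported quantized model back with the OpenVINO backend; the file name follows the "qint8_quantized" suffix used elsewhere in this PR and is an assumption about the default output name:

from sentence_transformers import SentenceTransformer

# Load the statically quantized export; the file name below assumes the
# default "qint8_quantized" suffix produced by the export function.
model = SentenceTransformer(
    "path/to/my/mpnet-legal-finetuned",
    backend="openvino",
    model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"},
)
embeddings = model.encode(["The weather is lovely today."])
print(embeddings.shape)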

@l-bat (Contributor, Author) commented Oct 28, 2024

@AlexKoff88, please take a look

@AlexKoff88 commented:

@tomaarsen, following up on our conversation on LinkedIn: we have prepared an integration of quantization with OpenVINO. Could you please review it?

@tomaarsen (Collaborator) commented Oct 28, 2024

Thanks a bunch for this! I think this is looking quite solid already. I have written a few comments and I'm running local tests now; I will update the benchmark figures & recommendations based on my findings.

  • Tom Aarsen

@tomaarsen (Collaborator) commented Oct 28, 2024

I'm also getting this warning; can we do something about it?

Parameter 'function'=<function export_static_quantized_openvino_model.<locals>.preprocess_function at 0x0000025F839C4720> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.

Another time I got it twice:

Parameter 'function'=<function export_static_quantized_openvino_model.<locals>.preprocess_function at 0x0000020F87B020C0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
WARNING:datasets.fingerprint:Parameter 'function'=<function export_static_quantized_openvino_model.<locals>.preprocess_function at 0x0000020F87B020C0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
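
For context, the warning appears because datasets cannot pickle the locally defined preprocess_function for fingerprinting. A minimal sketch of the workaround that apparently ended up in this PR (disabling the datasets cache and re-enabling it afterwards, per a later comment in this thread); the helper name and the toy preprocessing step are hypothetical:

import datasets

def map_without_fingerprint_warning(dataset: datasets.Dataset) -> datasets.Dataset:
    # Local closures can't be pickled for fingerprinting, which triggers the
    # warning above; disabling caching skips fingerprinting entirely.
    was_enabled = datasets.is_caching_enabled()
    datasets.disable_caching()
    try:
        def preprocess_function(example):
            # Hypothetical stand-in for the real tokenization step
            return {"sentence": example["sentence"].lower()}
        return dataset.map(preprocess_function)
    finally:
        # Re-enable the cache after the function ends, as done in this PR
        if was_enabled:
            datasets.enable_caching()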

Commits:
  • Also update the performance ratio lower bound from 94% to 99%
  • Indenting was off; "all-MiniLM-L6-v2" had to be updated to "sentence-transformers/all-MiniLM-L6-v2" in a few places; and updated recommendation
@tomaarsen (Collaborator) commented Oct 28, 2024

I've made a few changes to help this along:

  1. Re-formatted to make sure the CI won't complain
  2. Patched save_or_push_to_hub_model (I see now that my commit description is wrong), as it didn't upload the bin file.
  3. Added a benchmark figure: OV int8 quantization looks extremely solid!
  4. Updated the recommendation in the docs accordingly

This is huge, well done! The future docs:
[screenshots: the updated benchmark figure and recommendations in the documentation]

Could you please have a look at the remaining items, i.e.:

  1. The preprocess_function hash warning
  2. Exposing dataset_name, dataset_config_name, etc. (a hypothetical sketch follows below)
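
For item 2, a hypothetical sketch of what a call with the exposed calibration-dataset arguments could look like; the parameter names beyond dataset_name and dataset_config_name, and the glue/sst2 default mentioned later in this thread, are assumptions:

from sentence_transformers import SentenceTransformer, export_static_quantized_openvino_model
from optimum.intel import OVQuantizationConfig

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", backend="openvino")
export_static_quantized_openvino_model(
    model,
    OVQuantizationConfig(),
    "path/to/output",
    dataset_name="glue",         # calibration dataset; glue/sst2 is the default per this thread
    dataset_config_name="sst2",
    dataset_split="train",       # assumed parameter name
    column_name="sentence",      # assumed parameter name
)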

  • Tom Aarsen

@AlexKoff88 commented:

BTW, it should work on Intel GPU as well (e.g. integrated graphics), and it will be even faster if you make the input shape static; see the sketch below.
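
A minimal sketch of making the input shape static with optimum-intel before compiling for the GPU; the concrete shape (batch size 1, sequence length 384) is an illustrative assumption:

from optimum.intel import OVModelForFeatureExtraction

# Load without compiling, fix the input shapes, then compile for the iGPU.
ov_model = OVModelForFeatureExtraction.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2",
    export=True,    # convert from the original Transformers weights
    compile=False,  # defer compilation until after reshaping
)
ov_model.reshape(1, 384)  # static shape: batch size 1, sequence length 384
ov_model.to("gpu")
ov_model.compile()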

@tomaarsen (Collaborator) commented:

Alright, thanks for sharing, I'll do some experiments!

@tomaarsen (Collaborator) commented Oct 28, 2024

I'm having issues with the iGPU - not something we have to worry about now; it's not a dealbreaker for this PR. I get the following traceback:

Traceback (most recent call last):
  File "c:\code\sentence-transformers\demo_3025_load.py", line 6, in <module>
    model = SentenceTransformer(
            ^^^^^^^^^^^^^^^^^^^^
  File "c:\code\sentence-transformers\sentence_transformers\SentenceTransformer.py", line 306, in __init__
    modules, self.module_kwargs = self._load_sbert_model(
                                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\code\sentence-transformers\sentence_transformers\SentenceTransformer.py", line 1722, in _load_sbert_model
    module = module_class(model_name_or_path, cache_dir=cache_folder, backend=self.backend, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\code\sentence-transformers\sentence_transformers\models\Transformer.py", line 76, in __init__
    self._load_model(model_name_or_path, config, cache_dir, backend, **model_args)
  File "c:\code\sentence-transformers\sentence_transformers\models\Transformer.py", line 114, in _load_model
    self._load_openvino_model(model_name_or_path, config, cache_dir, **model_args)
  File "c:\code\sentence-transformers\sentence_transformers\models\Transformer.py", line 159, in _load_openvino_model
    self.auto_model: OVModelForFeatureExtraction = OVModelForFeatureExtraction.from_pretrained(
                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\optimum\intel\openvino\modeling_base.py", line 465, in from_pretrained
    return super().from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\optimum\modeling_base.py", line 438, in from_pretrained
    return from_pretrained_method(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\optimum\intel\openvino\modeling_base.py", line 390, in _from_pretrained
    return cls(
           ^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\optimum\intel\openvino\modeling.py", line 363, in __init__
    super().__init__(model, config, **kwargs)
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\optimum\intel\openvino\modeling.py", line 124, in __init__
    super().__init__(model, config, **kwargs)
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\optimum\intel\openvino\modeling_base.py", line 158, in __init__
    self.compile()
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\optimum\intel\openvino\modeling_base.py", line 667, in compile
    self.request = self._compile_model(self.model, self._device, ov_config, self.model_save_dir)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\optimum\intel\openvino\modeling_base.py", line 275, in _compile_model
    compiled_model = core.compile_model(model, device.upper() if device is not None else device, config=ov_config)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\openvino\runtime\ie_api.py", line 543, in compile_model
    super().compile_model(model, device_name, {} if config is None else config),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Exception from src\inference\src\cpp\core.cpp:107:
Exception from src\inference\src\dev\plugin.cpp:53:
Check 'false' failed at src\plugins\intel_gpu\src\plugin\program_builder.cpp:185:
[GPU] ProgramBuilder build failed!
Program build failed(0_part_7):

If I run:

import time
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

pr_number = 86
model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    revision=f"refs/pr/{pr_number}",
    backend="openvino",
    model_kwargs={
        "file_name": r"openvino\openvino_model_qint8_quantized.xml",
        "device": "GPU",
    },
)

@AlexKoff88 commented:

> I'm having issues with the iGPU - not something we have to worry about now; it's not a dealbreaker for this PR. [quoted traceback and reproduction script omitted]

Thanks @tomaarsen, we will take a look on our end, but it could be a driver problem (or a missing driver). I wonder if you have one installed. Please find details here: https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html

cc'ed @vladimir-paramuzov, @sshlyapn

@tomaarsen (Collaborator) commented:

Driver seems to be good to go - no updates. I'm on Windows currently.
[screenshot: GPU driver status showing no available updates]

@AlexKoff88 commented:

> Driver seems to be good to go - no updates. I'm on Windows currently.

I tried this code on the iGPU of my Intel Core Ultra 7 165U laptop and it works, but I used the most recent nightly version of OpenVINO:

pip install --force-reinstall --pre openvino openvino-tokenizers openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly

And of course, the driver I used was verified beforehand.

@tomaarsen (Collaborator) commented:

Thanks for taking care of the comments, @l-bat. I built on top of your changes to address some final nitpicks:

  1. Adding None to the typing of the new dataset-related arguments.
  2. Specifying that we're loading sst2 from glue by default.
  3. Updating the datasets cache disabling so that the cache is re-enabled after the function ends.
  4. If quantization_config is None, it is now set to OVQuantizationConfig(), because otherwise we get a warning that the user should specify the quantization_config (which I think is unnecessary); a sketch follows below.
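
A minimal sketch of that defaulting behavior (signature abbreviated, function body elided):

from optimum.intel import OVQuantizationConfig

def export_static_quantized_openvino_model(model, quantization_config=None, model_name_or_path="", **kwargs):
    # Fall back to a default config instead of letting optimum-intel warn
    # that no quantization_config was specified.
    if quantization_config is None:
        quantization_config = OVQuantizationConfig()
    ...  # quantization and saving elided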

I think this is about ready to be merged, what do you think @l-bat @AlexKoff88?

  • Tom Aarsen

@l-bat (Contributor, Author) commented Oct 31, 2024

Thank you, @tomaarsen, for implementing the changes and addressing the remaining details. We agree this is ready for merging. Please let us know if there's anything specific you're still waiting on from our side to proceed.

@tomaarsen (Collaborator) commented Nov 1, 2024

I think we're all set! Ideally, I'd like to include some other PRs in the upcoming release, but I do intend to release this soon (presumably within 1-2 weeks).

Thanks a bunch for leading this work; it should be very valuable!

I'll merge this once the tests go green again after my final nits.

  • Tom Aarsen

@tomaarsen merged commit b9316f9 into UKPLab:master on Nov 1, 2024
11 checks passed