
Support OpenVINO int8 static quantization #3025

Merged: 12 commits into UKPLab:master on Nov 1, 2024

Conversation

@l-bat (Contributor) commented Oct 28, 2024

Add Post-Training Static Quantization support for OpenVINO models

Usage examples:

To quantize a Hugging Face Hub model:

from sentence_transformers import SentenceTransformer, export_static_quantized_openvino_model
from optimum.intel import OVQuantizationConfig

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", backend="openvino")
quantization_config = OVQuantizationConfig()
export_static_quantized_openvino_model(
    model, quantization_config, "sentence-transformers/all-MiniLM-L6-v2", push_to_hub=True, create_pr=True
)

To quantize a local model:

from sentence_transformers import SentenceTransformer, export_static_quantized_openvino_model
from optimum.intel import OVQuantizationConfig

model = SentenceTransformer("path/to/my/mpnet-legal-finetuned", backend="openvino")
quantization_config = OVQuantizationConfig()
export_static_quantized_openvino_model(model, quantization_config, "path/to/my/mpnet-legal-finetuned")
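
For completeness, a minimal sketch of loading the exported quantized model back with the OpenVINO backend; the file name follows the "qint8_quantized" suffix used elsewhere in this PR and is an assumption about the default output name:

from sentence_transformers import SentenceTransformer

# Load the statically quantized export; the file name below assumes the
# default "qint8_quantized" suffix produced by the export function.
model = SentenceTransformer(
    "path/to/my/mpnet-legal-finetuned",
    backend="openvino",
    model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"},
)
embeddings = model.encode(["The weather is lovely today."])
print(embeddings.shape)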

@l-bat (Contributor, Author) commented Oct 28, 2024

@AlexKoff88, please take a look

@AlexKoff88 commented:

@tomaarsen, following up on our conversation on LinkedIn: we have prepared an integration of quantization with OpenVINO. Could you please review it?

@tomaarsen (Collaborator) commented Oct 28, 2024

Thanks a bunch for this! I think this is looking quite solid already. I have written a few comments and I'm running local tests now; I will update the benchmark figures & recommendations based on my findings.

  • Tom Aarsen

@tomaarsen (Collaborator) commented Oct 28, 2024

I'm also getting this warning; can we do something about it?

Parameter 'function'=<function export_static_quantized_openvino_model.<locals>.preprocess_function at 0x0000025F839C4720> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.

Another time I got it twice:

Parameter 'function'=<function export_static_quantized_openvino_model.<locals>.preprocess_function at 0x0000020F87B020C0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
WARNING:datasets.fingerprint:Parameter 'function'=<function export_static_quantized_openvino_model.<locals>.preprocess_function at 0x0000020F87B020C0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
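
For context, the warning appears because datasets cannot pickle the locally defined preprocess_function for fingerprinting. A minimal sketch of the workaround that apparently ended up in this PR (disabling the datasets cache and re-enabling it afterwards, per a later comment in this thread); the helper name and the toy preprocessing step are hypothetical:

import datasets

def map_without_fingerprint_warning(dataset: datasets.Dataset) -> datasets.Dataset:
    # Local closures can't be pickled for fingerprinting, which triggers the
    # warning above; disabling caching skips fingerprinting entirely.
    was_enabled = datasets.is_caching_enabled()
    datasets.disable_caching()
    try:
        def preprocess_function(example):
            # Hypothetical stand-in for the real tokenization step
            return {"sentence": example["sentence"].lower()}
        return dataset.map(preprocess_function)
    finally:
        # Re-enable the cache after the function ends, as done in this PR
        if was_enabled:
            datasets.enable_caching()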

Commits:
  • Also update the performance ratio lower bound from 94% to 99%
  • Indenting was off; "all-MiniLM-L6-v2" had to be updated to "sentence-transformers/all-MiniLM-L6-v2" in a few places; and updated recommendation
@tomaarsen (Collaborator) commented Oct 28, 2024

I've made a few changes to help this along:

  1. Re-formatted to make sure the CI won't complain
  2. Patched save_or_push_to_hub_model (I see now that my commit description is wrong), as it didn't upload the bin file.
  3. Added a benchmark figure: OV int8 quantization looks extremely solid!
  4. Updated the recommendation in the docs accordingly

This is huge, well done! The future docs:
[screenshots: the updated benchmark figure and recommendations in the documentation]

Could you please have a look at the remaining items, i.e.:

  1. The preprocess_function hash warning
  2. Exposing dataset_name, dataset_config_name, etc. (a hypothetical sketch follows below)
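
For item 2, a hypothetical sketch of what a call with the exposed calibration-dataset arguments could look like; the parameter names beyond dataset_name and dataset_config_name, and the glue/sst2 default mentioned later in this thread, are assumptions:

from sentence_transformers import SentenceTransformer, export_static_quantized_openvino_model
from optimum.intel import OVQuantizationConfig

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", backend="openvino")
export_static_quantized_openvino_model(
    model,
    OVQuantizationConfig(),
    "path/to/output",
    dataset_name="glue",         # calibration dataset; glue/sst2 is the default per this thread
    dataset_config_name="sst2",
    dataset_split="train",       # assumed parameter name
    column_name="sentence",      # assumed parameter name
)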

  • Tom Aarsen

@AlexKoff88 commented:

BTW, it should work on Intel GPU as well (e.g. integrated graphics), and it will be even faster if you make the input shape static; see the sketch below.
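
A minimal sketch of making the input shape static with optimum-intel before compiling for the GPU; the concrete shape (batch size 1, sequence length 384) is an illustrative assumption:

from optimum.intel import OVModelForFeatureExtraction

# Load without compiling, fix the input shapes, then compile for the iGPU.
ov_model = OVModelForFeatureExtraction.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2",
    export=True,    # convert from the original Transformers weights
    compile=False,  # defer compilation until after reshaping
)
ov_model.reshape(1, 384)  # static shape: batch size 1, sequence length 384
ov_model.to("gpu")
ov_model.compile()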

@tomaarsen (Collaborator) commented:

Alright, thanks for sharing, I'll do some experiments!

@tomaarsen (Collaborator) commented Oct 28, 2024

I'm having issues with the iGPU - not something we have to worry about now; it's not a dealbreaker for this PR. I get the following traceback:

Traceback (most recent call last):
  File "c:\code\sentence-transformers\demo_3025_load.py", line 6, in <module>
    model = SentenceTransformer(
            ^^^^^^^^^^^^^^^^^^^^
  File "c:\code\sentence-transformers\sentence_transformers\SentenceTransformer.py", line 306, in __init__
    modules, self.module_kwargs = self._load_sbert_model(
                                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\code\sentence-transformers\sentence_transformers\SentenceTransformer.py", line 1722, in _load_sbert_model
    module = module_class(model_name_or_path, cache_dir=cache_folder, backend=self.backend, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\code\sentence-transformers\sentence_transformers\models\Transformer.py", line 76, in __init__
    self._load_model(model_name_or_path, config, cache_dir, backend, **model_args)
  File "c:\code\sentence-transformers\sentence_transformers\models\Transformer.py", line 114, in _load_model
    self._load_openvino_model(model_name_or_path, config, cache_dir, **model_args)
  File "c:\code\sentence-transformers\sentence_transformers\models\Transformer.py", line 159, in _load_openvino_model
    self.auto_model: OVModelForFeatureExtraction = OVModelForFeatureExtraction.from_pretrained(
                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\optimum\intel\openvino\modeling_base.py", line 465, in from_pretrained
    return super().from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\optimum\modeling_base.py", line 438, in from_pretrained
    return from_pretrained_method(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\optimum\intel\openvino\modeling_base.py", line 390, in _from_pretrained
    return cls(
           ^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\optimum\intel\openvino\modeling.py", line 363, in __init__
    super().__init__(model, config, **kwargs)
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\optimum\intel\openvino\modeling.py", line 124, in __init__
    super().__init__(model, config, **kwargs)
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\optimum\intel\openvino\modeling_base.py", line 158, in __init__
    self.compile()
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\optimum\intel\openvino\modeling_base.py", line 667, in compile
    self.request = self._compile_model(self.model, self._device, ov_config, self.model_save_dir)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\optimum\intel\openvino\modeling_base.py", line 275, in _compile_model
    compiled_model = core.compile_model(model, device.upper() if device is not None else device, config=ov_config)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\openvino\runtime\ie_api.py", line 543, in compile_model
    super().compile_model(model, device_name, {} if config is None else config),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Exception from src\inference\src\cpp\core.cpp:107:
Exception from src\inference\src\dev\plugin.cpp:53:
Check 'false' failed at src\plugins\intel_gpu\src\plugin\program_builder.cpp:185:
[GPU] ProgramBuilder build failed!
Program build failed(0_part_7):

If I run:

import time
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

pr_number = 86
model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    revision=f"refs/pr/{pr_number}",
    backend="openvino",
    model_kwargs={
        "file_name": r"openvino\openvino_model_qint8_quantized.xml",
        "device": "GPU",
    },
)

@AlexKoff88 commented:

> I'm having issues with the iGPU - not something we have to worry about now; it's not a dealbreaker for this PR. [quoted traceback and reproduction script omitted]

Thanks @tomaarsen, we will take a look on our end, but it could be a driver problem (or a missing driver). I wonder if you have one installed. Please find details here: https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html

cc'ed @vladimir-paramuzov, @sshlyapn

@tomaarsen (Collaborator) commented:

Driver seems to be good to go - no updates. I'm on Windows currently.
[screenshot: GPU driver status showing no available updates]

@AlexKoff88 commented:

> Driver seems to be good to go - no updates. I'm on Windows currently.

I tried this code on the iGPU of my Intel Core Ultra 7 165U laptop and it works, but I used the most recent nightly version of OpenVINO:

pip install --force-reinstall --pre openvino openvino-tokenizers openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly

And of course, the driver I used was verified beforehand.

@tomaarsen (Collaborator) commented:

Thanks for taking care of the comments, @l-bat. I built on top of your changes to address some final nitpicks:

  1. Adding None to the typing of the new dataset-related arguments.
  2. Specifying that we're loading sst2 from glue by default.
  3. Updating the datasets cache disabling so that the cache is re-enabled after the function ends.
  4. If quantization_config is None, it is now set to OVQuantizationConfig(), because otherwise we get a warning that the user should specify the quantization_config (which I think is unnecessary); a sketch follows below.
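
A minimal sketch of that defaulting behavior (signature abbreviated, function body elided):

from optimum.intel import OVQuantizationConfig

def export_static_quantized_openvino_model(model, quantization_config=None, model_name_or_path="", **kwargs):
    # Fall back to a default config instead of letting optimum-intel warn
    # that no quantization_config was specified.
    if quantization_config is None:
        quantization_config = OVQuantizationConfig()
    ...  # quantization and saving elided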

I think this is about ready to be merged, what do you think @l-bat @AlexKoff88?

  • Tom Aarsen

@l-bat (Contributor, Author) commented Oct 31, 2024

Thank you, @tomaarsen, for implementing the changes and addressing the remaining details. We agree this is ready for merging. Please let us know if there's anything specific you're still waiting on from our side to proceed.

@tomaarsen (Collaborator) commented Nov 1, 2024

I think we're all set! Ideally, I'd like to include some other PRs in the upcoming release, but I do intend to release this soon (presumably within 1-2 weeks).

Thanks a bunch for leading this work; it should be very valuable!

I'll merge this once the tests go green again after my final nits.

  • Tom Aarsen

@tomaarsen merged commit b9316f9 into UKPLab:master on Nov 1, 2024
11 checks passed