-
Notifications
You must be signed in to change notification settings - Fork 203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate pipeline #334
Generate pipeline #334
Conversation
b139e69
to
b8026a9
Compare
text_generation/causal_lm/cpp/generate_pipeline/generate_pipeline.hpp
Outdated
Show resolved
Hide resolved
a6f16ea
to
d551da3
Compare
text_generation/causal_lm/cpp/generate_pipeline/generate_pipeline.hpp
Outdated
Show resolved
Hide resolved
text_generation/causal_lm/cpp/generate_pipeline/generate_pipeline.hpp
Outdated
Show resolved
Hide resolved
text_generation/causal_lm/cpp/generate_pipeline/generate_pipeline.hpp
Outdated
Show resolved
Hide resolved
text_generation/causal_lm/cpp/generate_pipeline/generate_pipeline.hpp
Outdated
Show resolved
Hide resolved
text_generation/causal_lm/cpp/generate_pipeline/generation_config.hpp
Outdated
Show resolved
Hide resolved
text_generation/causal_lm/cpp/generate_pipeline/generation_config.hpp
Outdated
Show resolved
Hide resolved
text_generation/causal_lm/cpp/generate_pipeline/generation_config.hpp
Outdated
Show resolved
Hide resolved
text_generation/causal_lm/cpp/generate_pipeline/generate_sample.cpp
Outdated
Show resolved
Hide resolved
text_generation/causal_lm/cpp/generate_pipeline/generate_sample.cpp
Outdated
Show resolved
Hide resolved
text_generation/causal_lm/cpp/generate_pipeline/generate_sample.cpp
Outdated
Show resolved
Hide resolved
text_generation/causal_lm/cpp/generate_pipeline/generate_sample.cpp
Outdated
Show resolved
Hide resolved
text_generation/causal_lm/cpp/generate_pipeline/generate_pipeline.hpp
Outdated
Show resolved
Hide resolved
text_generation/causal_lm/cpp/generate_pipeline/generate_pipeline.hpp
Outdated
Show resolved
Hide resolved
text_generation/causal_lm/cpp/generate_pipeline/generate_pipeline.hpp
Outdated
Show resolved
Hide resolved
src/README.md
Outdated
pipe = ov_ov_genai.LLMPipeline(model_path) | ||
|
||
config = {'num_groups': 3, 'group_size': 5, 'diversity_penalty': 1.5} | ||
pipe.set_generation_cofnig(config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried and it does not work like this for me, should it work ? It woks for me if config is GenerationConfig object and I got with get_generation_config before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to add description about options , which we can configure in config ?
fix ignore_eos fix batched detokenization add generation config validation removed CPU and redundant getting KV cache
* Leftovers * Leftovers * retrigger
* Split text samples to sepparate folders * correct path * correct * correct path
* Assume GenAI is installec * put --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/pre-release back
// todo: if input_ids is a ov::Tensor/numpy tensor | ||
|
||
.def("get_tokenizer", &LLMPipeline::get_tokenizer) | ||
.def("start_chat", &LLMPipeline::start_chat) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this api start_chat/finish_chat
on models form notebook https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot
Two models work fine: llama-3-8b-instruct INT8 and tiny-llama-1b-chat INT8
But with other models I had 2 types of errors:
- phi-3-mini-instruct INT8, red-pajama-3b-chat INT4, gemma-2b-it INT4 and gemma-7b-it FP16 fails on the first call of pipe.generate() with error
msg = pipe(q)
RuntimeError: bad_expected_access
- notus-7b-v1 INT4 notus-7b-v1 FP16 neural-chat-7b-v3-1 INT8mistral-7b INT4 zephyr-7b-beta INT8 - first question is okay, fails on the first call of pipe.generate() with error:
# RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:223:
# Exception from src/plugins/intel_cpu/src/node.cpp:1626:
# Shape inference of Select node with name __module.model/aten::masked_fill/Select_1 failed: Exception from src/plugins/intel_cpu/src/shape_inference/custom/eltwise.cpp:45:
# Eltwise shape infer input shapes dim index: 3 mismatch
several calls of pipe.generate() without start_chat() don't lead to fails
device (str): Device to run the model on (e.g., CPU, GPU). Default is 'CPU'. | ||
)") | ||
|
||
.def(py::init([](py::object infer_request, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to create an infer request from outside with core.compile_model().create_infer_request() and put it here and got an error:
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
1. openvino_genai.py_generate_pipeline.LLMPipeline(model_path: str, device: str = 'CPU')
2. openvino_genai.py_generate_pipeline.LLMPipeline(model_path: str, tokenizer: ov::genai::Tokenizer, device: str = 'CPU')
3. openvino_genai.py_generate_pipeline.LLMPipeline(infer_request: object, tokenizer: ov::genai::Tokenizer, config: Optional[ov::genai::GenerationConfig])
Is it right behavior ? How could I create the infer_requst to put it here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exact python line could help. The error message suggests, it's possible to do, but the args were incorrect:
3. openvino_genai.py_generate_pipeline.LLMPipeline(infer_request: object, tokenizer: ov::genai::Tokenizer, config: Optional[ov::genai::GenerationConfig])
python -m pip install --upgrade-strategy eager -r text_generation/causal_lm/cpp/requirements.txt | ||
python -m pip install ./thirdparty/openvino_tokenizers/[transformers] | ||
sudo apt-get install libtbb-dev | ||
python -m pip install --upgrade-strategy eager -r ./samples/cpp/requirements.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Wovchena here we install older OV first, then override with newer one.
Should we change lines order? BTW, why not to install OV Tokenizers simple PyPi?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They install identical OV. ./samples/cpp/requirements.txt has --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/pre-release
./samples/cpp/multinomial_causal_lm | ||
# Don't install prompt_lookup_decoding_lm and speculative_decoding_lm because they don't use openvino_genai library and arent verifyed yet. | ||
DESTINATION samples/cpp/ COMPONENT cpp_samples_genai) | ||
install(FILES ./samples/cpp/requirements.txt DESTINATION samples/cpp/ COMPONENT cpp_samples_genai) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Wovchena
why these 2 install rules are not part of samples/cpp/CMakeLists.txt
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This rule shouldn't be included in the resulting package
install(FILES ./samples/cpp/requirements.txt DESTINATION samples/cpp/ COMPONENT cpp_samples_genai) | ||
install(FILES LICENSE DESTINATION licensing COMPONENT licensing_genai RENAME LICENSE-GENAI) | ||
install(FILES third-party-programs.txt DESTINATION licensing COMPONENT licensing_genai RENAME third-party-programs-genai.txt) | ||
if(MSVC AND NOT DEFINED CPACK_GENERATOR) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Wovchena
MSVC is compiler / cmake generator
WIN32 is platform.
Should we use WIN32
here?
|
||
find_package(OpenVINOGenAI REQUIRED PATHS | ||
"${CMAKE_BINARY_DIR}" # Reuse the package from the build. | ||
${OpenVINO_DIR} # GenAI may be installed alogside OpenVINO. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Wovchena
do we need it now?
setupvars.sh
exposes OpenVINOGenAI_DIR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OpenVINO_DIR a good guess anyway.
) | ||
add_executable(beam_search_causal_lm beam_search_causal_lm.cpp) | ||
target_link_libraries(beam_search_causal_lm PRIVATE openvino::genai) | ||
target_compile_features(beam_search_causal_lm PRIVATE cxx_std_17) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Wovchena
I don't see that we use some C++17 features in this sample
// previous prompt generation in chat dialog stops with the end of sentence token, | ||
// need to append this token to the current prompt | ||
if (is_chat_conversation && !m_is_cache_empty) | ||
text = m_tokenizer.get_eos_token() + text; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we agreed to drop it (to emulate skip_scial_tokens=True
)
return result; | ||
} | ||
|
||
std::string apply_chat_template(const std::vector<std::pair<std::string, std::string>>& prompts) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pavel-esir
should it be vector<unordered_map> ?
LLM return logits with probabilities of each token, these probabilities can be converted to tokens/words with different technics: greedy decoding, beam search decoding, random sampling, etc. This requires writing user unfriendly post-processing even for the simplest scenario of greedy decoding. In order to make live easier we we combined all decoding scenarios into a single function call, where the decoding method and parameters are specified by arguments. In this PR we provide a user friendly API for text generation inspired by `generate` method from HuggingFace transformers library. - [x] enable calling tokenizers/detokenizers from LLMPipeline - [ ] add callback for streaming mode - done partially, need to improve - [x] rewritten samples with the current approach: [causal_lm/cpp/generate_pipeline/generate_sample.cpp#L73-L83](https://github.com/pavel-esir/openvino.genai/blob/generate_pipeline/text_generation/causal_lm/cpp/generate_pipeline/generate_sample.cpp#L73-L83) - [x] Multibatch greedy decoding - [ ] Speculative decoding - [ ] Grouped Beam Search decoding: ready for batch 1, need to rebase multibatch support after merging openvinotoolkit#349 - [x] Random sampling Example 1: Greedy search generation ``` LLMPipeline pipe(model_path, device); // Will try to load config from generation_config.json. // but if not found default velues for gready search will be used GenerationConfig config = pipe.generation_config(); cout << pipe(prompt, config.max_new_tokens(20)); ``` Example 2: TextStreaming mode ``` LLMPipeline pipe(model_path, device); GenerationConfig config = pipe.generation_config(); auto text_streamer = TextStreamer{pipe}; auto text_streamer_callback = [&text_streamer](std::vector<int64_t>&& tokens, LLMPipeline& pipe){ text_streamer.put(tokens[0]); }; pipe(prompt, config.max_new_tokens(20).set_callback(text_streamer_callback)); text_streamer.end(); ``` CVS-132907 CVS-137920 --------- Co-authored-by: Wovchena <[email protected]> Co-authored-by: Ilya Lavrenov <[email protected]> Co-authored-by: Alexander Suvorov <[email protected]> Co-authored-by: Yaroslav Tarkan <[email protected]> Co-authored-by: Xiake Sun <[email protected]> Co-authored-by: wenyi5608 <[email protected]> Co-authored-by: Ekaterina Aidova <[email protected]> Co-authored-by: guozhong wang <[email protected]> Co-authored-by: Chen Peter <[email protected]>
LLM return logits with probabilities of each token, these probabilities can be converted to tokens/words with different technics: greedy decoding, beam search decoding, random sampling, etc. This requires writing user unfriendly post-processing even for the simplest scenario of greedy decoding. In order to make live easier we we combined all decoding scenarios into a single function call, where the decoding method and parameters are specified by arguments. In this PR we provide a user friendly API for text generation inspired by `generate` method from HuggingFace transformers library. - [x] enable calling tokenizers/detokenizers from LLMPipeline - [ ] add callback for streaming mode - done partially, need to improve - [x] rewritten samples with the current approach: [causal_lm/cpp/generate_pipeline/generate_sample.cpp#L73-L83](https://github.com/pavel-esir/openvino.genai/blob/generate_pipeline/text_generation/causal_lm/cpp/generate_pipeline/generate_sample.cpp#L73-L83) - [x] Multibatch greedy decoding - [ ] Speculative decoding - [ ] Grouped Beam Search decoding: ready for batch 1, need to rebase multibatch support after merging #349 - [x] Random sampling Example 1: Greedy search generation ``` LLMPipeline pipe(model_path, device); // Will try to load config from generation_config.json. // but if not found default velues for gready search will be used GenerationConfig config = pipe.generation_config(); cout << pipe(prompt, config.max_new_tokens(20)); ``` Example 2: TextStreaming mode ``` LLMPipeline pipe(model_path, device); GenerationConfig config = pipe.generation_config(); auto text_streamer = TextStreamer{pipe}; auto text_streamer_callback = [&text_streamer](std::vector<int64_t>&& tokens, LLMPipeline& pipe){ text_streamer.put(tokens[0]); }; pipe(prompt, config.max_new_tokens(20).set_callback(text_streamer_callback)); text_streamer.end(); ``` CVS-132907 CVS-137920 --------- Co-authored-by: Wovchena <[email protected]> Co-authored-by: Ilya Lavrenov <[email protected]> Co-authored-by: Alexander Suvorov <[email protected]> Co-authored-by: Yaroslav Tarkan <[email protected]> Co-authored-by: Xiake Sun <[email protected]> Co-authored-by: wenyi5608 <[email protected]> Co-authored-by: Ekaterina Aidova <[email protected]> Co-authored-by: guozhong wang <[email protected]> Co-authored-by: Chen Peter <[email protected]>
commit adec0e0 Author: Irina Efode <[email protected]> Date: Tue Jun 11 14:32:45 2024 +0400 Remove extra token desc commit a64f30a Author: Irina Efode <[email protected]> Date: Tue Jun 11 13:36:01 2024 +0400 Working sampler commit 05048ff Author: Irina Efode <[email protected]> Date: Tue Jun 11 13:23:43 2024 +0400 check commit e349418 Merge: bfaa55a 0b1ce98 Author: Irina Efode <[email protected]> Date: Mon Jun 10 23:11:58 2024 +0400 Merge remote-tracking branch 'ilavrenov_upstream/ct-beam-search' into penalties commit 0b1ce98 Merge: 16d857e 2da1556 Author: Ilya Lavrenov <[email protected]> Date: Mon Jun 10 18:52:20 2024 +0400 Merge pull request openvinotoolkit#21 from iefode/n_support Support num_return_seq for multinomial case commit bfaa55a Author: Irina Efode <[email protected]> Date: Mon Jun 10 17:42:01 2024 +0400 Fix tests commit fa0efb6 Author: Irina Efode <[email protected]> Date: Mon Jun 10 16:41:04 2024 +0400 Config tests commit 7551303 Author: Irina Efode <[email protected]> Date: Mon Jun 10 15:34:14 2024 +0400 Implement LogitTransformers. todo config check commit 16d857e Merge: 76148c5 1ee4f38 Author: Ilya Lavrenov <[email protected]> Date: Mon Jun 10 10:41:27 2024 +0200 Merge remote-tracking branch 'upstream/master' into ct-beam-search commit 1ee4f38 Author: guozhong wang <[email protected]> Date: Sun Jun 9 18:26:57 2024 +0800 Add option --prompt_index (openvinotoolkit#481) Run the corresponding prompt according to the option prompt index commit 9902928 Author: Pavel Esir <[email protected]> Date: Fri Jun 7 20:57:47 2024 +0200 Generate pipeline (openvinotoolkit#334) LLM return logits with probabilities of each token, these probabilities can be converted to tokens/words with different technics: greedy decoding, beam search decoding, random sampling, etc. This requires writing user unfriendly post-processing even for the simplest scenario of greedy decoding. In order to make live easier we we combined all decoding scenarios into a single function call, where the decoding method and parameters are specified by arguments. In this PR we provide a user friendly API for text generation inspired by `generate` method from HuggingFace transformers library. - [x] enable calling tokenizers/detokenizers from LLMPipeline - [ ] add callback for streaming mode - done partially, need to improve - [x] rewritten samples with the current approach: [causal_lm/cpp/generate_pipeline/generate_sample.cpp#L73-L83](https://github.com/pavel-esir/openvino.genai/blob/generate_pipeline/text_generation/causal_lm/cpp/generate_pipeline/generate_sample.cpp#L73-L83) - [x] Multibatch greedy decoding - [ ] Speculative decoding - [ ] Grouped Beam Search decoding: ready for batch 1, need to rebase multibatch support after merging openvinotoolkit#349 - [x] Random sampling Example 1: Greedy search generation ``` LLMPipeline pipe(model_path, device); // Will try to load config from generation_config.json. // but if not found default velues for gready search will be used GenerationConfig config = pipe.generation_config(); cout << pipe(prompt, config.max_new_tokens(20)); ``` Example 2: TextStreaming mode ``` LLMPipeline pipe(model_path, device); GenerationConfig config = pipe.generation_config(); auto text_streamer = TextStreamer{pipe}; auto text_streamer_callback = [&text_streamer](std::vector<int64_t>&& tokens, LLMPipeline& pipe){ text_streamer.put(tokens[0]); }; pipe(prompt, config.max_new_tokens(20).set_callback(text_streamer_callback)); text_streamer.end(); ``` CVS-132907 CVS-137920 --------- Co-authored-by: Wovchena <[email protected]> Co-authored-by: Ilya Lavrenov <[email protected]> Co-authored-by: Alexander Suvorov <[email protected]> Co-authored-by: Yaroslav Tarkan <[email protected]> Co-authored-by: Xiake Sun <[email protected]> Co-authored-by: wenyi5608 <[email protected]> Co-authored-by: Ekaterina Aidova <[email protected]> Co-authored-by: guozhong wang <[email protected]> Co-authored-by: Chen Peter <[email protected]> commit 2da1556 Author: Irina Efode <[email protected]> Date: Thu Jun 6 19:24:45 2024 +0400 library/src/continuous_batching_pipeline.cpp commit 7b48fa4 Author: Irina Efode <[email protected]> Date: Thu Jun 6 15:03:05 2024 +0400 enable streaming for greedy commit 5c601e0 Author: Irina Efode <[email protected]> Date: Thu Jun 6 13:29:47 2024 +0400 Comments commit 4f73d36 Author: Irina Efode <[email protected]> Date: Wed Jun 5 22:46:04 2024 +0400 Enable frequency and presence penalties commit 5e49c46 Author: Irina Efode <[email protected]> Date: Wed Jun 5 11:56:31 2024 +0400 Fix python tests commit eb4a219 Author: Irina Efode <[email protected]> Date: Tue Jun 4 22:38:43 2024 +0400 fix assert place commit f4d8461 Author: Irina Efode <[email protected]> Date: Tue Jun 4 22:22:37 2024 +0400 Correct accumulation commit 55448a1 Merge: 1128792 76148c5 Author: Irina Efode <[email protected]> Date: Tue Jun 4 18:56:42 2024 +0400 Merge remote-tracking branch 'ilavrenov_upstream/ct-beam-search' into n_support commit 1128792 Author: Irina Efode <[email protected]> Date: Tue Jun 4 18:52:38 2024 +0400 test commit e245041 Author: Irina Efode <[email protected]> Date: Tue Jun 4 18:52:03 2024 +0400 Apply comments commit 561cde0 Author: guozhong wang <[email protected]> Date: Tue Jun 4 16:27:08 2024 +0800 using sdpa for statble diffusion (openvinotoolkit#458) Co-authored-by: Chen Peter <[email protected]> commit 04510d4 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Jun 3 17:37:41 2024 +0000 Bump optimum[openvino] from 1.19.2 to 1.20.0 in /text_generation/causal_lm/cpp (openvinotoolkit#467) commit db4a88f Merge: e5d33f5 b63bda2 Author: Irina Efode <[email protected]> Date: Mon Jun 3 13:17:32 2024 +0400 Merge remote-tracking branch 'ilavrenov_upstream/ct-beam-search' into n_support commit e5d33f5 Merge: fe29df9 bcdcefc Author: Irina Efode <[email protected]> Date: Fri May 31 14:11:13 2024 +0400 Merge remote-tracking branch 'ilavrenov_upstream/ct-beam-search' into n_support commit fe29df9 Author: Irina Efode <[email protected]> Date: Fri May 31 14:06:51 2024 +0400 Tests + Readme commit 7af72aa Author: Irina Efode <[email protected]> Date: Wed May 29 15:16:23 2024 +0400 Squashed commit of the following: commit 28af66d Author: Anastasiia Pnevskaia <[email protected]> Date: Tue May 28 15:40:15 2024 +0200 Added cache_size to python binding of scheduler config. commit 65a793a Author: Anastasiia Pnevskaia <[email protected]> Date: Tue May 28 15:12:16 2024 +0200 Fixed tests. commit 033558e Author: Irina Efode <[email protected]> Date: Wed May 29 00:40:48 2024 +0400 One more change commit dbae0bf Merge: f992591 2c2799f Author: Irina Efode <[email protected]> Date: Wed May 29 00:38:52 2024 +0400 Merge master, without py tests commit a5b14c7 Author: Lyalyushkin Nikolay <[email protected]> Date: Tue May 28 16:15:42 2024 +0200 grammar corrector support in WWB (openvinotoolkit#462) This PR introduces support for `AutoForSeq2SeqLM` models in WWB. Previously, WWB only supported `AutoForCasualLM`, assuming that the `generate` method copies the prompt to the output. But AutoForSeq2SeqLM generates output differently: there is no copy of the prompt, and it directly generates output. The fix was checked on the [example](https://gist.github.com/ljaljushkin/5a489a27cd0020ddbd42ea7ae54be688). It evaluates grammar correction with Seq2Seq model using WWB. commit f992591 Author: Irina Efode <[email protected]> Date: Tue May 28 17:39:17 2024 +0400 tmp commit 7e771f1 Author: Liwenke <[email protected]> Date: Tue May 28 15:24:15 2024 +0800 Note for wikitext data set connection issue (openvinotoolkit#452) Co-authored-by: Chen Peter <[email protected]> commit 24ef06e Author: guozhong wang <[email protected]> Date: Tue May 28 14:23:19 2024 +0800 Force to generate more tokens (openvinotoolkit#457) commit 1ed7539 Author: guozhong wang <[email protected]> Date: Tue May 28 09:44:45 2024 +0800 Correct flan-t5 output size (openvinotoolkit#451) openvinotoolkit#358 --------- Co-authored-by: Chen Peter <[email protected]> commit b5a9f28 Author: Irina Efode <[email protected]> Date: Mon May 27 23:48:03 2024 +0400 Extend in beam support commit edc53e5 Author: Irina Efode <[email protected]> Date: Fri May 24 17:59:48 2024 +0400 remove extra commit 9038308 Author: Irina Efode <[email protected]> Date: Fri May 24 16:20:13 2024 +0400 Improve multinomial commit c453e3e Author: Irina Efode <[email protected]> Date: Fri May 24 15:42:48 2024 +0400 Support num_return_seq for multinomial case commit e6f05c6 Author: guozhong wang <[email protected]> Date: Thu May 23 11:36:50 2024 +0800 Output median min and avg values to csv (openvinotoolkit#450) Co-authored-by: Chen Peter <[email protected]> commit 25909cc Author: guozhong wang <[email protected]> Date: Thu May 23 11:12:27 2024 +0800 verify beam search 1st token optimization (openvinotoolkit#426) The minimum version of transformers to get 1st and 2nd tokens latency is v4.40-release. commit 03e78fe Author: Chen Peter <[email protected]> Date: Wed May 22 13:06:11 2024 +0800 Revert "Force to generate "inference count" tokens" (openvinotoolkit#455) Reverts openvinotoolkit#289 to unblock the release. Since it causes the performance regression of some models. (WIP to investigate the reason) commit 05a0f36 Author: Ekaterina Aidova <[email protected]> Date: Tue May 21 20:33:26 2024 +0400 fix path based configuration (openvinotoolkit#456) commit 41b07d3 Author: Ekaterina Aidova <[email protected]> Date: Fri May 17 06:20:18 2024 +0400 Fix md5 hash for env that does not support usedforsecurity arg (openvinotoolkit#445) I got an error running benchmarking on my working machine (python3.8, ubuntu20) due to unsupported args for hashlib. ``` [ ERROR ] An exception occurred [ INFO ] Traceback (most recent call last): File "benchmark.py", line 532, in main iter_data_list, pretrain_time = CASE_TO_BENCH[model_args['use_case']](model_path, framework, args.device, model_args, args.num_iters) File "benchmark.py", line 194, in run_text_generation_benchmark run_text_generation(input_text, num, model, tokenizer, args, iter_data_list, warmup_md5, prompt_idx, bench_hook, model_precision, proc_id) File "benchmark.py", line 131, in run_text_generation result_md5_list.append(hashlib.md5(result_text.encode(), usedforsecurity=False).hexdigest()) TypeError: openssl_md5() takes at most 1 argument (2 given) ``` Based on this [StackOverflow issue](https://stackoverflow.com/questions/54717862/how-do-i-know-if-the-usedforsecurity-flag-is-supported-by-hashlib-md5), not all clients support this argument and usage hashlib.new("md5") vs hashlib.md5 should be safe for usage in both cases commit d473e96 Author: guozhong wang <[email protected]> Date: Fri May 17 10:09:27 2024 +0800 output no hook data warning when it is text gen model (openvinotoolkit#449) commit cad3abb Author: guozhong wang <[email protected]> Date: Thu May 16 17:28:49 2024 +0800 Fix an attempt to add a string value to a numerical value (openvinotoolkit#447) commit 93f7670 Author: Ekaterina Aidova <[email protected]> Date: Thu May 16 11:49:08 2024 +0400 update optimum intel commit in llm bench (openvinotoolkit#444) commit d73346c Author: Yaroslav Tarkan <[email protected]> Date: Wed May 15 14:24:30 2024 +0300 Fix noise images generated for '--num' > 1 in Stable Diffusion sample (openvinotoolkit#441) Fixes openvinotoolkit#405
LLM return logits with probabilities of each token, these probabilities can be converted to tokens/words with different technics: greedy decoding, beam search decoding, random sampling, etc.
This requires writing user unfriendly post-processing even for the simplest scenario of greedy decoding. In order to make live easier we we combined all decoding scenarios into a single function call, where the decoding method and parameters are specified by arguments.
In this PR we provide a user friendly API for text generation inspired by
generate
method from HuggingFace transformers library.Example 1: Greedy search generation
Example 2: TextStreaming mode
CVS-132907 CVS-137920