
Fix ChatGLM Detokenization #36

Merged
merged 2 commits into openvinotoolkit:master from fix-chatglm on Feb 23, 2024
Conversation

apaniukov (Collaborator)

New special tokens were added to the ChatGLM repository, which caused SentencePiece to crash during decoding because their indices were not present in the main vocabulary (the tokens were not marked as special in the repository and were therefore filtered out). This PR includes those tokens in the vocab and also aligns the vocab sizes better.
The pass rate had to be lowered because the ChatGLM3 decoder inserts spaces between special tokens while SentencePiece does not; there is no functional difference between the actual texts.
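The failure mode is easy to reproduce with the sentencepiece API directly: asking the processor to decode an id outside its vocabulary raises an error. Below is a minimal sketch of the kind of workaround described above, routing out-of-vocabulary ids through a side table of added tokens. The model path, token ids, and token strings are illustrative assumptions, not the exact values from this PR:

```python
# Sketch: decode ids that may include tokens missing from the
# SentencePiece vocabulary (e.g. special tokens added to the ChatGLM
# repo without being registered). Ids/strings below are hypothetical.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # assumed path
vocab_size = sp.get_piece_size()

# Hypothetical ChatGLM-style additions that SentencePiece does not know about.
added_tokens = {64790: "[gMASK]", 64792: "sop"}

def decode_with_added_tokens(ids):
    """Decode ids, emitting added tokens from the side table instead of
    letting SentencePiece fail on out-of-range indices."""
    pieces, buffer = [], []
    for i in ids:
        if i < vocab_size:
            buffer.append(i)  # regular id: batch it for SentencePiece
        else:
            if buffer:  # flush pending regular ids first
                pieces.append(sp.decode(buffer))
                buffer = []
            pieces.append(added_tokens.get(i, ""))
    if buffer:
        pieces.append(sp.decode(buffer))
    return "".join(pieces)
```

Note that joining with an empty string matches SentencePiece's behavior; the reference ChatGLM3 decoder instead inserts spaces around special tokens, which is the cosmetic difference that lowered the pass rate here.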

apaniukov and others added 2 commits February 22, 2024 21:07
@apaniukov apaniukov merged commit a459c16 into openvinotoolkit:master Feb 23, 2024
11 checks passed
@apaniukov apaniukov deleted the fix-chatglm branch March 6, 2024 15:41
mryzhov pushed a commit to mryzhov/openvino_tokenizers_public that referenced this pull request Mar 7, 2024
mryzhov added a commit that referenced this pull request Mar 13, 2024
* master fixes

* Fix ChatGLM Detokenization (#36)


* updated test results

---------

Co-authored-by: Artur Paniukov <[email protected]>