
Fix UnicodeDecodeError for BPE-based Models (especially GLM-4) #6357

Merged · 7 commits merged into oobabooga:dev on Sep 3, 2024

Conversation

GralchemOz (Contributor)

Description
This PR addresses a UnicodeDecodeError encountered in text-generation-webui when using models with byte pair encoding (BPE) tokenizers, such as GLM-4-9b. The error occurs when a multi-byte character is split across tokens, so an incomplete byte sequence is passed to the UTF-8 decoder.

Issue Example
When the model outputs a character like "簌", which is UTF-8 encoded as b'\xe7\xb0\xa8', the web UI tries to decode the output after each token. If only a partial sequence is available (e.g., b'\xe7\xb0'), an error is raised: UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data.
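The failure is easy to reproduce with plain bytes.decode, and one common way to avoid it is to decode incrementally so that a trailing partial sequence is buffered until the remaining bytes arrive. The snippet below is a minimal sketch of that idea, not necessarily the exact approach taken in this PR:

```python
import codecs

# The full character decodes fine...
assert b"\xe7\xb0\xa8".decode("utf-8") == "簌"

# ...but decoding after every token can hand the decoder only part of it:
# b"\xe7\xb0".decode("utf-8") raises
# UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1:
# unexpected end of data

# An incremental decoder buffers the incomplete tail instead of raising.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
print(repr(decoder.decode(b"\xe7\xb0")))  # '' -- waits for the missing byte
print(repr(decoder.decode(b"\xa8")))      # '簌' -- sequence is now complete
```

Alternatively, decoding with errors="replace" or errors="ignore" suppresses the exception, at the cost of emitting (or dropping) a replacement character for the incomplete tail.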

@oobabooga (Owner)

Thanks for the fix

@oobabooga oobabooga merged commit 4c74c7a into oobabooga:dev Sep 3, 2024