Fix UnicodeDecodeError for BPE-based Models (especially GLM-4) #6357
Description
This PR addresses a UnicodeDecodeError encountered in text-generation-webui when using models that rely on byte pair encoding (BPE) tokenizers, such as GLM-4-9b. The error occurs when the UI streams a multi-byte character and an incomplete byte sequence is passed to the UTF-8 decoder.
Issue Example
When the model outputs a character like "簌", which is UTF-8 encoded as b'\xe7\xb0\x8c', the web UI tries to decode the output after each token. If only a partial sequence is available (e.g., b'\xe7\xb0'), a UnicodeDecodeError is raised: UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data.
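
The standard remedy for this class of error is to buffer trailing partial bytes until the rest of the sequence arrives, which Python's incremental UTF-8 decoder does out of the box. The sketch below is illustrative only, not the PR's actual patch; the function name stream_decode and the chunking setup are hypothetical.

```python
import codecs

def stream_decode(byte_chunks):
    """Decode a stream of UTF-8 byte chunks, holding back incomplete
    multi-byte sequences instead of raising UnicodeDecodeError.

    Hypothetical helper for illustration; not the code from the PR.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in byte_chunks:
        # Returns only fully decodable text; partial bytes stay buffered.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush the buffer; raises here only if the stream truly ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# "簌" (b'\xe7\xb0\x8c') split across two token outputs, as in the issue:
chunks = [b"\xe7\xb0", b"\x8c"]
print("".join(stream_decode(chunks)))  # prints 簌 once the sequence completes
```

A naive b'\xe7\xb0'.decode('utf-8') raises exactly the error quoted above, whereas the incremental decoder simply returns an empty string for that chunk and emits the character when the final byte arrives.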