Fix UnicodeDecodeError for BPE-based Models (especially GLM-4) #6357
Description
This PR addresses a UnicodeDecodeError encountered in text-generation-webui when using models that rely on byte pair encoding (BPE) tokenizers, such as GLM-4-9b. The error occurs when the UI streams a multi-byte character and an incomplete byte sequence is passed to the UTF-8 decoder.
Issue Example
When the model outputs a character like "簌", which is UTF-8 encoded as b'\xe7\xb0\x8c', the web UI tries to decode the output after each token. If only a partial sequence is available (e.g., b'\xe7\xb0'), a UnicodeDecodeError is raised: UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data.
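
The standard remedy for this class of error is to buffer trailing partial bytes until the rest of the sequence arrives, which Python's incremental UTF-8 decoder does out of the box. The sketch below is illustrative only, not the PR's actual patch; the function name stream_decode and the chunking setup are hypothetical.

```python
import codecs

def stream_decode(byte_chunks):
    """Decode a stream of UTF-8 byte chunks, holding back incomplete
    multi-byte sequences instead of raising UnicodeDecodeError.

    Hypothetical helper for illustration; not the code from the PR.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in byte_chunks:
        # Returns only fully decodable text; partial bytes stay buffered.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush the buffer; raises here only if the stream truly ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# "簌" (b'\xe7\xb0\x8c') split across two token outputs, as in the issue:
chunks = [b"\xe7\xb0", b"\x8c"]
print("".join(stream_decode(chunks)))  # prints 簌 once the sequence completes
```

A naive b'\xe7\xb0'.decode('utf-8') raises exactly the error quoted above, whereas the incremental decoder simply returns an empty string for that chunk and emits the character when the final byte arrives.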