Invalid encoding #1761

thewh1teagle · 2024-01-12T20:52:43Z

When transcribing Hebrew audio files, I get an invalid UTF-8 error from whisper-rs, so I guess it fails to decode some of the text.
It happens only when getting the text of individual tokens with the function
whisper.cpp#L5988::whisper_full_get_token_text_from_state

but with
whisper.cpp#L5972::whisper_full_get_segment_text_from_state

it works
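
A minimal sketch of why the two calls can behave differently, assuming the per-token text is a raw byte slice that may end in the middle of a multi-byte character. The helpers `decode_each`, `decode_joined`, and `sample_tokens` are hypothetical names for illustration and are not part of whisper.cpp or whisper-rs; the split point is chosen by hand, not taken from the real tokenizer.

```rust
/// Decode every token on its own, the way a per-token getter would be used.
fn decode_each(tokens: &[Vec<u8>]) -> Vec<Result<String, std::str::Utf8Error>> {
    tokens
        .iter()
        .map(|t| std::str::from_utf8(t).map(|s| s.to_owned()))
        .collect()
}

/// Decode the concatenation of all tokens, the way a per-segment getter would be used.
fn decode_joined(tokens: &[Vec<u8>]) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(tokens.concat())
}

/// Hypothetical token split: each Hebrew letter in "שלום" takes 2 bytes in UTF-8,
/// so cutting after byte 3 leaves both halves with an incomplete character.
fn sample_tokens() -> Vec<Vec<u8>> {
    let bytes = "שלום".as_bytes();
    vec![bytes[..3].to_vec(), bytes[3..].to_vec()]
}
```

With this split, `decode_each(&sample_tokens())` yields an error for both pieces, while `decode_joined(&sample_tokens())` returns the full word.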

tazz4843/whisper-rs#115
audio.mp3


bobqianic · 2024-01-13T11:13:02Z

I plan to address this issue over the weekend. Many users have reported it, and it seems to stem from the absence of a tokenizer in the decoding stage. #1313 (comment)

bobqianic · 2024-02-05T17:06:14Z

Unfortunately, we won't be able to resolve this issue because of where it comes from: BPE tokenization can split a single Unicode character across several subtokens, so an individual token may hold only part of a character's bytes. However, the updated version I've introduced in proposal #1768 offers a workaround: you can set max_len=1, which ensures you receive the smallest valid segment. Alternatively, you can keep using the whisper_full_get_token_text_from_state function, adding a buffer and applying the newly proposed whisper_utf8_is_valid function to check that the buffer is valid before decoding it.
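
The buffering approach could look roughly like this on the Rust side. This is a sketch under assumptions, not the code from #1768: it uses `std::str::from_utf8` from the standard library as the validity check in place of `whisper_utf8_is_valid`, and it assumes the caller can get each token's raw bytes. `Utf8TokenBuffer` and `collect_text` are hypothetical names.

```rust
/// Accumulates raw token bytes and releases text only once the buffer is valid UTF-8.
#[derive(Default)]
struct Utf8TokenBuffer {
    pending: Vec<u8>,
}

impl Utf8TokenBuffer {
    /// Append one token's bytes. Returns the decoded text if the buffer
    /// now forms valid UTF-8, otherwise keeps the bytes and returns None.
    fn push(&mut self, token_bytes: &[u8]) -> Option<String> {
        self.pending.extend_from_slice(token_bytes);
        match std::str::from_utf8(&self.pending) {
            Ok(text) => {
                let out = text.to_owned();
                self.pending.clear();
                Some(out)
            }
            // Incomplete character so far: wait for the next token.
            Err(_) => None,
        }
    }
}

/// Feed a sequence of token byte slices through the buffer and collect
/// every piece of text that became valid along the way.
fn collect_text(tokens: &[&[u8]]) -> Vec<String> {
    let mut buf = Utf8TokenBuffer::default();
    tokens.iter().filter_map(|t| buf.push(t)).collect()
}
```

One caveat with this simple version: if a token contains bytes that can never become valid UTF-8, the buffer keeps growing, so real code would probably want a size cap or a lossy fallback.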

bobqianic added the bug and enhancement labels Jan 13, 2024

bobqianic linked a pull request Jan 14, 2024 that will close this issue: Fix the decoding issues #1768 (Open, 11 tasks)

bobqianic removed a link to a pull request Feb 5, 2024: Fix the decoding issues #1768

bobqianic removed the bug label Feb 5, 2024
