When transcribing Hebrew audio files, I get an invalid UTF-8 error from whisper-rs, so I guess it fails to decode some of the text.
It happens only when getting the text of individual tokens with the function whisper.cpp#L5988::whisper_full_get_token_text_from_state
I plan to address this issue over the weekend. Many users have reported it, and it seems to stem from the absence of a tokenizer in the decoding stage. #1313 (comment)
Unfortunately, we won't be able to resolve this issue due to its origin: BPE tokenization, which divides Unicode characters into subtokens, resulting in incomplete tokens. However, in the updated version I've introduced in proposal #1768, there's a workaround. You have the option to set max_len=1, ensuring you receive the smallest valid segment. Alternatively, you can continue utilizing the whisper_full_get_token_text_from_state function. Adding a buffer and applying the newly recommended whisper_utf8_is_valid function to verify the buffer's validity is also a viable approach.
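The buffering approach mentioned above can be sketched roughly as follows. This is an illustrative sketch only, not code from whisper.cpp or whisper-rs: the helper name `merge_token_bytes` and the sample token split are hypothetical, and Python's built-in UTF-8 decoder stands in for the validity check. It assumes the caller can obtain the raw bytes of each token (for example from whisper_full_get_token_text_from_state) before any string conversion.

```python
def merge_token_bytes(token_bytes):
    """Group raw token byte strings into chunks that each decode as valid UTF-8.

    Hypothetical helper: a token's bytes are held in a buffer until the
    buffer as a whole is valid UTF-8, and only then emitted as text.
    """
    chunks = []
    buffer = b""
    for piece in token_bytes:
        buffer += piece
        try:
            text = buffer.decode("utf-8")
        except UnicodeDecodeError:
            # The buffer ends in the middle of a multi-byte character
            # (or holds invalid bytes); wait for the next token.
            continue
        chunks.append(text)
        buffer = b""
    if buffer:
        # Leftover bytes never became valid; surface them with U+FFFD
        # replacement characters instead of raising.
        chunks.append(buffer.decode("utf-8", errors="replace"))
    return chunks


# Assumed example: the Hebrew word "שלום" (two bytes per letter in UTF-8),
# split at arbitrary byte boundaries the way BPE subtokens might split it.
tokens = [b"\xd7", b"\xa9\xd7", b"\x9c", b"\xd7\x95\xd7\x9d"]
print(merge_token_bytes(tokens))
```

The trade-off is that emitted chunks may span several tokens, so per-token timestamps or probabilities would have to be merged in the same way. A buffer that contains genuinely invalid bytes, rather than an incomplete character, is never flushed until the end in this sketch.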
With whisper.cpp#L5972::whisper_full_get_segment_text_from_state, however, the same transcription works.
tazz4843/whisper-rs#115
audio.mp3