Malformed multi-byte UTF8 characters #39
Comments
This hasn't been touched in a while, so let me take a peek at the issue. No promises.
I tested the provided audio file and got this transcription: Is this correct? If so, it seems like the problem was fixed upstream in
@UsernamesLame, I tested it on my end as well without any issue.
@raivisdejus before we close this forever, could you confirm this is what it's intended to spit out?
Tiny model with Latvian ( Problematic part is in
Still couldn't replicate the issue. I used the tiny model with the lv language, as described:
model = Model('tiny')
res = model.transcribe(media="./whisper-latvian.wav", language="lv")
print(res)
Here are the results:
[t0=0, t1=700, text=Mani uzstrauts, laikabstākļi, tapēc uz jūru, es diezvajī braukša.]
Please re-test with the latest version of
As noted in ggerganov/whisper.cpp#1798, a multi-byte UTF-8 character is sometimes split across multiple tokens, with some of its bytes in the first token and the rest in the second.
A sample audio file where this happens is here: https://github.com/chidiwilliams/buzz/blob/main/testdata/whisper-latvian.wav
If we can get the raw bytes of the segment "text", we can work around this by gluing two tokens together when they have this issue. The current version fails with
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: unexpected end of data
on the whisper_full_get_segment_text call. A solution could be to add a function like whisper_full_get_segment_bytes that would return the raw bytes of the segment text for manual processing.
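To illustrate the "gluing" idea, here is a minimal sketch that assumes the raw bytes of each piece are already available (for example from something like the proposed whisper_full_get_segment_bytes, which does not exist at the time of writing). The helper name decode_split_chunks and the sample byte chunks are made up for illustration; only the standard-library codecs module is used.

```python
import codecs


def decode_split_chunks(chunks):
    """Decode byte chunks in order, carrying an incomplete UTF-8
    sequence at the end of one chunk over into the next one."""
    # An incremental decoder buffers trailing bytes that do not yet
    # form a complete character instead of raising immediately.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # Flush: bytes still incomplete at the very end become U+FFFD.
    pieces.append(decoder.decode(b"", final=True))
    return pieces


# Hypothetical example: the character "ā" is encoded as 0xC4 0x81,
# and here the two bytes end up in different chunks.
chunks = [b"laikapst\xc4", b"\x81kli"]

# Decoding a chunk on its own reproduces the reported error class.
try:
    chunks[1 - 1][-1:].decode("utf-8")
except UnicodeDecodeError as exc:
    print("naive decode failed:", exc.reason)

print("".join(decode_split_chunks(chunks)))
```

The trade-off is that the repaired character is attributed to the second chunk rather than the first, which may matter if per-token timestamps are needed.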