Malformed multi-byte UTF8 characters #39

Closed
raivisdejus opened this issue Jul 16, 2024 · 7 comments

@raivisdejus

As noted in ggerganov/whisper.cpp#1798, a multi-byte UTF-8 character is sometimes split across tokens, with some of its bytes in the first token and the rest in the second.

Sample audio that reproduces this is here: https://github.com/chidiwilliams/buzz/blob/main/testdata/whisper-latvian.wav

If we could get at the raw bytes of the segment "text", we could work around this by gluing two tokens together whenever one of them ends in an incomplete character. The current version fails on the whisper_full_get_segment_text call with UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: unexpected end of data.

A solution could be to add some function like whisper_full_get_segment_bytes that would return raw bytes of the segment text for manual processing.
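
The gluing step could look roughly like the sketch below. It assumes the raw bytes are already available as a list of chunks (which is exactly what the proposed function would have to provide; nothing here calls pywhispercpp or whisper.cpp), and glue_utf8_chunks is a hypothetical helper name. It relies on Python's standard incremental UTF-8 decoder, which holds back an incomplete trailing sequence until the next chunk completes it.

```python
import codecs

def glue_utf8_chunks(chunks):
    """Decode byte chunks whose multi-byte characters may straddle chunk boundaries.

    Hypothetical helper: `chunks` stands in for the raw per-token or
    per-segment bytes that the proposed function would return.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    pieces = []
    for chunk in chunks:
        # An incomplete trailing sequence is buffered, not raised as an error;
        # it is emitted once the next chunk supplies the remaining bytes.
        pieces.append(decoder.decode(chunk))
    # final=True raises UnicodeDecodeError if bytes are still pending at the end.
    pieces.append(decoder.decode(b"", final=True))
    return pieces

# Simulated split: the two bytes of "ļ" (0xC4 0xBC) land in different chunks.
chunks = ["laikabstāk".encode("utf-8") + b"\xc4", b"\xbc" + "i,".encode("utf-8")]
pieces = glue_utf8_chunks(chunks)
text = "".join(pieces)
```

With this approach the split character ends up attached to the later chunk, so per-chunk text boundaries shift by one character; whether that matters for timestamps is a separate question.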

@UsernamesLame
Contributor

This hasn't been touched in a while, so let me take a peek at the issue. No promises.

@UsernamesLame
Contributor

I tested the provided audio file and got this transcription: Money, Ustrouts like a Pestakli, Tapets Uzior, is D.S. by Brausch.

Is this correct? If so, it seems the problem was fixed upstream in whisper.cpp itself, and we can close this.

@abdeladim-s
Owner

@UsernamesLame, I tested it on my end as well without any issue.
So I will close this for now!

@UsernamesLame
Contributor

@raivisdejus before we close this forever, could you confirm this is what it's intended to spit out?

@raivisdejus
Author

The tiny model with Latvian (lv) as the language should produce something similar to: Mani uzstrauts, laikabstākļi, tapēc uz jūru, es diezvajī braukša.

The problematic part is in laikabstākļi, where ļ is returned from whisper.cpp split across two segments: the first segment ends with its first byte, b'\xc4', and the second starts with its second byte, b'\xbc'.
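
For reference, the failure can be reproduced in plain Python without whisper.cpp at all. This is only an illustration of the byte-level behaviour described above, using the two bytes quoted in this comment; the variable names are arbitrary.

```python
first, second = b"\xc4", b"\xbc"  # the two halves of "ļ" (U+013C) in UTF-8

# Decoding the first half on its own fails: 0xC4 announces a two-byte
# sequence, but no continuation byte follows.
try:
    first.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)

# Concatenating the raw bytes before decoding yields the intended character.
combined = (first + second).decode("utf-8")
```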

@abdeladim-s
Owner

I still couldn't replicate the issue. I used the tiny model with the lv language, as described:

from pywhispercpp.model import Model

model = Model('tiny')
res = model.transcribe(media="./whisper-latvian.wav", language="lv")
print(res)

Here are the results:

[t0=0, t1=700, text=Mani uzstrauts, laikabstākļi, tapēc uz jūru, es diezvajī braukša.]

@UsernamesLame
Contributor

The tiny model with Latvian (lv) as the language should produce something similar to: Mani uzstrauts, laikabstākļi, tapēc uz jūru, es diezvajī braukša.

The problematic part is in laikabstākļi, where ļ is returned from whisper.cpp split across two segments: the first segment ends with its first byte, b'\xc4', and the second starts with its second byte, b'\xbc'.

Please re-test with the latest version of pywhispercpp. We're unable to reproduce this, so I assume it's fixed in whisper / whisper.cpp?
