Invalid utf-8 #115

Closed
thewh1teagle opened this issue Jan 11, 2024 · 11 comments · Fixed by #130

Comments

@thewh1teagle
Contributor

image
The song in Hebrew:
muminim hebrew.zip

@tazz4843
Owner

This is an upstream issue, not something we can control. I run into this myself with my own services, and I just log it and ignore the output.

Doing some digging I found the following:
ggerganov/whisper.cpp#1098
ggerganov/whisper.cpp#1118

@thewh1teagle
Contributor Author

@tazz4843
How can I ignore the errors and keep only the part of the data that was transcribed? Or will it not work at all for some languages?
I can't transcribe some languages at all.

@thewh1teagle
Contributor Author

thewh1teagle commented Jan 13, 2024

I checked whisper.cpp with its CLI example.
The issue shows up there too, but only in the terminal.
If I write the output of whisper.cpp to a file, it works well,
so I think it's still an encoding issue in whisper-rs.
It happens here:
whisper_state.rs#L481

@tazz4843
Owner

tazz4843 commented Jan 14, 2024

We don't do anything with the string, so a failure on our side would have to be a bug in Rust's std string library, which there's essentially no chance of. That means whisper.cpp must be returning an invalid UTF-8 string. We could return the raw bytes on error, but those are of limited use unless you parse only up to the index where validation fails (which would be a valid use case, and if you want this added I can do so).
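
A minimal sketch of the "parse only up to the index where it fails" idea, assuming the caller already holds the raw bytes. `valid_prefix` is a hypothetical helper, not part of whisper-rs; it uses only `std::str::from_utf8` and `Utf8Error::valid_up_to` from the standard library.

```rust
use std::str;

/// Hypothetical helper (not a whisper-rs API): returns the longest valid
/// UTF-8 prefix of `bytes` and the number of trailing bytes left undecoded.
fn valid_prefix(bytes: &[u8]) -> (&str, usize) {
    match str::from_utf8(bytes) {
        Ok(s) => (s, 0),
        Err(e) => {
            let end = e.valid_up_to();
            // `valid_up_to` guarantees that bytes[..end] is valid UTF-8,
            // so this second decode cannot fail.
            (str::from_utf8(&bytes[..end]).unwrap(), bytes.len() - end)
        }
    }
}

fn main() {
    // A Hebrew word followed by a truncated two-byte sequence (0xD7 alone).
    let mut raw = "שלום".as_bytes().to_vec();
    raw.push(0xD7);

    let (text, dropped) = valid_prefix(&raw);
    println!("{text} ({dropped} byte(s) dropped)");
}
```

Everything after the first invalid byte is lost with this approach, so it fits best when the damage is expected at the very end of a segment.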

@magnus-ISU

UTF-8 is designed specifically to be able to recover from invalid strings, right?
image
You could discard whatever is invalid (seems best to me), or, as this crate does (I think; it is dense and I didn't care to verify after glancing at the code), decode invalid code points as the valid UTF-8 they would have been had their prefixes been right.

0xxxxxxx -> great, we're back to ASCII, continue
10xxxxxx -> crap, invalid
110xxxxx -> great, back to valid input
10xxxxxx -> end of the last char
10xxxxxx -> invalid
11110xxx -> start of 4 byte char
11xxxxxx -> invalid
11110xxx -> start of 4 byte char
10xxxxxx
10xxxxxx
10xxxxxx -> end of valid 4 byte char

You could still parse out of there 0xxxxxxx 110xxxxx 10xxxxxx 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, and assuming what you had was one invalid code point and a ton of crap, it will probably be fine.
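
The "discard whatever is invalid" option can be sketched roughly as follows. `strip_invalid` is a hypothetical name, not something from whisper-rs or whisper.cpp; the loop relies on the standard library's `Utf8Error::valid_up_to` and `Utf8Error::error_len` to skip each bad sequence and resume at the next byte that could start a character.

```rust
/// Hypothetical helper: keeps every valid UTF-8 sequence in `bytes`
/// and silently drops the invalid ones.
fn strip_invalid(mut bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len());
    loop {
        match std::str::from_utf8(bytes) {
            Ok(s) => {
                out.push_str(s);
                return out;
            }
            Err(e) => {
                // Keep the valid part in front of the error.
                let (valid, rest) = bytes.split_at(e.valid_up_to());
                out.push_str(std::str::from_utf8(valid).unwrap());
                match e.error_len() {
                    // Skip the invalid sequence and keep scanning.
                    Some(n) => bytes = &rest[n..],
                    // The input ended in the middle of a character.
                    None => return out,
                }
            }
        }
    }
}

fn main() {
    // Mirrors the byte pattern above: ASCII, a stray continuation byte,
    // a valid 2-byte char, another stray byte, a truncated 4-byte lead,
    // then a complete 4-byte char.
    let raw = b"a\x80\xC3\xA9\x80\xF0\xF0\x9F\x98\x80";
    println!("{}", strip_invalid(raw));
}
```

Unlike a lossy conversion that inserts replacement characters, this drops the bad bytes without leaving a marker, so the caller cannot tell afterwards where data was lost.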

@tazz4843
Owner

There is String::from_utf8_lossy for that, which does throw away information to get a valid UTF-8 string.
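
For illustration, a short sketch of what `String::from_utf8_lossy` does with bad input: each invalid sequence is replaced by U+FFFD (the replacement character), so the original bytes cannot be recovered from the result. The `lossy` wrapper is a hypothetical name used only for this example.

```rust
/// Hypothetical wrapper around the standard library's lossy conversion.
fn lossy(bytes: &[u8]) -> String {
    // `from_utf8_lossy` returns a Cow<str>; `into_owned` makes it a String.
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // 0xFF can never appear in valid UTF-8.
    let text = lossy(b"ab\xFFcd");
    assert_eq!(text, "ab\u{FFFD}cd");
    println!("{text}");
}
```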

@thewh1teagle
Contributor Author

I still experience this issue, and I'm not sure whether it's in my control or whether whisper-rs needs to be changed.
thewh1teagle/vibe#34
Can I ignore these utf-8 errors?

@tazz4843
Owner

Remind me in a few days and I can add a function to infallibly convert.

@thewh1teagle
Contributor Author

Hey, just a reminder.
Many people have opened issues related to this in vibe/issues, so I hope we can solve it.
I think it's better to receive some invalid characters than to fail the whole transcription.

@tazz4843
Owner

tazz4843 commented Apr 6, 2024

Should be solved in f4ea0d9

@tazz4843 tazz4843 closed this as completed Apr 6, 2024
@thewh1teagle
Contributor Author

thewh1teagle commented Apr 11, 2024

> Should be solved in f4ea0d9

Thanks, I wasn't able to use it, but it helped me understand where the problem is, so I added
github.com/thewh1teagle/whisper-rs/ee93930 and it looks like that fixed the issue (I don't even see invalid characters).
I can create a PR from that if you want :)
