Adding fix for multi-byte segments in whisper.cpp #734
Transcription in Latvian sometimes failed with the error
Failed utf-8 codec can't decode byte 0xc4 in position 0: unexpected end of data
This appears to be the problem referenced in ggerganov/whisper.cpp#1798, where the bytes of a multi-byte UTF-8 character are returned in separate segments and the UTF-8 decoder fails on the incomplete sequence. This PR fixes that issue.
This PR also fixes an issue where, with the "Word-level timings" setting enabled, words are split into separate segments, which makes the feature less usable in real-world situations. The change combines whisper.cpp segments at word boundaries, using a space as the boundary.
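To illustrate the idea, here is a minimal sketch of merging at space boundaries. It is not the code from this PR: the `(start_ms, end_ms, raw_bytes)` segment shape and the `merge_segments` name are assumptions made for the example. Because a space is a single ASCII byte, decoding only the bytes accumulated between spaces avoids decoding half of a multi-byte character, provided the overall stream is valid UTF-8.

```python
from typing import List, Tuple

# Hypothetical shapes for illustration only, not the PR's actual types.
Segment = Tuple[int, int, bytes]  # (start_ms, end_ms, raw_bytes)
Word = Tuple[int, int, str]       # (start_ms, end_ms, text)


def merge_segments(segments: List[Segment]) -> List[Word]:
    """Join raw segments into words, splitting only where a segment starts with a space."""
    words: List[Word] = []
    buf = b""
    start = end = 0
    for seg_start, seg_end, raw in segments:
        # A leading space marks the start of a new word: flush what we have.
        if buf and raw.startswith(b" "):
            words.append((start, end, buf.decode("utf-8", errors="replace").strip()))
            buf = b""
        if not buf:
            start = seg_start
        buf += raw
        end = seg_end
    if buf:
        words.append((start, end, buf.decode("utf-8", errors="replace").strip()))
    return words


# "ā" is encoded as the two bytes 0xC4 0x81, split here across two segments.
segments = [
    (0, 100, b" m"),
    (100, 150, b"\xc4"),
    (150, 300, b"\x81ja"),
    (300, 400, b" ir"),
]
merged = merge_segments(segments)
```

Decoding the second segment on its own (`b"\xc4".decode("utf-8")`) raises the `UnicodeDecodeError` quoted above, while the merged result carries one entry per word with the start time of its first segment and the end time of its last.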
What remains unclear is how to handle languages where a space is not a reliable word boundary. If anyone has relevant input on word boundaries in languages such as Chinese, I am happy to adjust the solution.