Adding fix for multi-byte segments in whisper.cpp #734

raivisdejus · 2024-05-10T12:03:59Z

Sometimes transcription in Latvian failed with the error "Failed utf-8 codec can't decode byte 0xc4 in position 0: unexpected end of data". This appears to be the problem described in ggerganov/whisper.cpp#1798, where a multi-byte UTF-8 character is split across separate segments and the UTF-8 decoder fails to process the incomplete bytes. This PR fixes the issue.
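To illustrate the failure mode for reviewers, here is a minimal sketch, not the code in this PR. It assumes segment text arrives as raw bytes; the function name `decode_segments` and the sample data are hypothetical. The Latvian letter "ā" is encoded in UTF-8 as the two bytes 0xc4 0x81, so decoding a segment that ends after 0xc4 raises exactly the error quoted above. Buffering the bytes until they form valid UTF-8 avoids it.

```python
def decode_segments(raw_segments):
    """Decode byte segments, holding back bytes that end mid-character.

    Hypothetical helper for illustration; not taken from buzz or whisper.cpp.
    """
    decoded = []
    buffer = b""
    for raw in raw_segments:
        buffer += raw
        try:
            decoded.append(buffer.decode("utf-8"))
            buffer = b""
        except UnicodeDecodeError:
            # The buffer is not valid UTF-8 yet (for example, it ends in the
            # middle of a multi-byte character), so keep the bytes and wait
            # for the next segment.
            continue
    if buffer:
        # Leftover bytes never became valid UTF-8; substitute replacement
        # characters instead of raising.
        decoded.append(buffer.decode("utf-8", errors="replace"))
    return decoded


# "ābols" with the first character split across two segments.
segments = [b"\xc4", b"\x81bols"]
print(decode_segments(segments))
```

Decoding `b"\xc4"` on its own raises `UnicodeDecodeError`, while the combined buffer decodes cleanly.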

This PR also fixes an issue where, with the "Word-level timings" setting enabled, words are split into separate segments, making the feature less usable in real-world situations. The changes in this PR combine whisper.cpp segments at word boundaries, using a space as the boundary.
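The merging idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in `buzz/transcriber/whisper_cpp.py`: it assumes each piece is a hypothetical `(start, end, text)` tuple and that a piece beginning with a space starts a new word.

```python
def merge_pieces_into_words(pieces):
    """Combine sub-word pieces into whole words at space boundaries.

    `pieces` is a list of (start, end, text) tuples; this layout is an
    assumption made for the example.
    """
    words = []
    for start, end, text in pieces:
        if words and not text.startswith(" "):
            # No leading space: this piece continues the previous word, so
            # extend its text and end time.
            prev_start, _, prev_text = words[-1]
            words[-1] = (prev_start, end, prev_text + text)
        else:
            # Leading space (or the very first piece): start a new word.
            words.append((start, end, text))
    # Drop the boundary spaces from the final word text.
    return [(start, end, text.strip()) for start, end, text in words]


pieces = [(0, 10, " Lab"), (10, 20, "dien"), (20, 35, " pas"), (35, 50, "aule")]
print(merge_pieces_into_words(pieces))
```

A rule like this relies on spaces marking word boundaries, so it would not split text in languages written without spaces.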

The unclear part concerns languages where a space may not be a proper word boundary. If anyone has relevant comments on word boundaries in languages like Chinese, I am happy to adjust the solution.

codecov · 2024-05-14T22:36:39Z

Codecov Report

Attention: Patch coverage is 78.94737%, with 8 lines in your changes missing coverage. Please review.

Project coverage is 81.30%. Comparing base (d483864) to head (5b85a81).
Report is 3 commits behind head on main.

❗ Current head 5b85a81 differs from pull request most recent head 3513158. Consider uploading reports for the commit 3513158 to get more accurate results

Files	Patch %	Lines
buzz/transcriber/whisper_cpp.py	78.94%	8 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #734      +/-   ##
==========================================
- Coverage   81.97%   81.30%   -0.68%     
==========================================
  Files          83       81       -2     
  Lines        3840     3610     -230     
==========================================
- Hits         3148     2935     -213     
+ Misses        692      675      -17

Flag	Coverage Δ
Linux	`?`
Windows	`81.30% <78.94%> (-0.07%)`	⬇️
macOS	`?`

Flags with carried forward coverage won't be shown.


chidiwilliams · 2024-05-14T23:16:25Z

Awesome, thank you. I've given you commit access to the repo, if you're interested in joining as well. Cheers.

Adding fix for multi-byte segments in whisper.cpp

5b85a81

Merge branch 'main' into fix-multibyte-word-timestamps

3513158

chidiwilliams enabled auto-merge (squash) May 14, 2024 23:16

chidiwilliams merged commit 38f5d26 into chidiwilliams:main May 14, 2024
9 of 11 checks passed

raivisdejus deleted the fix-multibyte-word-timestamps branch May 15, 2024 04:50
