Streaming Output Repetition #1702

ethan-vuong · 2023-12-29T21:07:01Z

I would like to use whisper.cpp to transcribe real-time audio and relay the transcript back to a user. I am on an M1 Mac.

Steps to reproduce:
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
bash ./models/download-ggml-model.sh base.en
make base.en
brew install sdl2
make stream

To start the live transcription: ./stream -m ./models/ggml-base.en.bin --file output.txt

Here is the output when I play this video in the background:

[Start speaking]
Understand. It's difficult to overstate how important be
being mission driven is. So I want to emphasize it one last time. Derivative companies, companies that copy and existing ideas.
idea with very few new insights. Don't excite people and they don't compel the teams to work hard enough to be successful.
Paul Graham is going to talk about how to get startup ideas next week. It's something that a lot of founders struggle with.
with, but it's something I believe you can get better with it. Better out of practice. And it's definitely worth trying to get better at.
The hardest part about coming up with great ideas is that the best ideas often look terrible at the beginning.
The 13th search engine and without all the features of a web portal, most people thought that was pointless.
was done and anyway it didn't matter that much. Portals were where the value was at. The tenth social network and limited
only to college students with no money, also terrible. My space had won and who wants college students as customers. Or a way to stand is true.
strangers and couches. That just sounds terrible all around. These all sounded really bad, but they turned out to be good.
If they had sounded really good, there would have been too many people working on them.

Here is what is in output.txt:
Understand.
Understand. It's difficult to overstate how important be
being mission driven is, so I want to emphasize it one last time.
being mission driven is. So I want to emphasize it one last time. Derivative companies, companies that copy and existing ideas.
idea with very few new insights.
idea with very few new insights. Don't excite people and they don't compel the teams to work hard enough to be successful.
Paul Graham is going to talk about how to get started.
Paul Graham is going to talk about how to get startup ideas next week. It's something that a lot of founders struggle with.
with, but it's something I believe you can get better with it. Better out.
with, but it's something I believe you can get better with it. Better out of practice. And it's definitely worth trying to get better at.
The hardest part about coming up with great ideas.
The hardest part about coming up with great ideas is that the best ideas often look terrible at the beginning.
The 13th search engine and without all the features of a webinar.
The 13th search engine and without all the features of a web portal, most people thought that was pointless.
was done and anyway it didn't matter that much.
was done and anyway it didn't matter that much. Portals were where the value was at. The tenth social network and limited
only to college students with no money. Also terrible. MySpace@w
only to college students with no money, also terrible. My space had won and who wants college students as customers. Or a way to stand is true.
and strangers, couches. That just sounds terrible all around.
strangers and couches. That just sounds terrible all around. These all sounded really bad, but they turned out to be good.
If they had sounded really good, there would have been too many people.
If they had sounded really good, there would have been too many people working on them.

As you can see, the output file contains a lot of repetition: each chunk is written twice, first as a partial draft and then as a longer revision. I think that is OK for displaying live feedback, but I'm not sure how to decide what the 'final' transcript should be. Any ideas?
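One possible post-processing approach, sketched here in Python rather than inside whisper.cpp: treat a line as a draft when the line that follows it starts with (nearly) the same words, and keep only the last revision. The function names `final_lines` and `is_draft_of` and the 0.8 similarity threshold are assumptions made for this illustration; none of them come from whisper.cpp, and the threshold may need tuning on other recordings.

```python
import re
from difflib import SequenceMatcher


def words(line):
    """Lowercase a line, drop punctuation, and split it into words."""
    return re.sub(r"[^\w\s]", "", line.lower()).split()


def is_draft_of(draft, later, threshold=0.8):
    """Return True if `later` begins with roughly the same words as `draft`.

    Only the first len(draft) words of `later` are compared, so a longer
    revision still counts as a match. The threshold is an assumed value.
    """
    a = words(draft)
    b = words(later)[:len(a)]
    if not a or not b:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold


def final_lines(lines):
    """Keep each line unless the next line looks like a revision of it."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    kept = []
    for cur, nxt in zip(lines, lines[1:] + [""]):
        if not is_draft_of(cur, nxt):
            kept.append(cur)
    return kept
```

Calling `final_lines(open("output.txt"))` on the file above should leave one line per chunk. Comparing words instead of raw characters is what lets a draft such as "being mission driven is, so I want..." match its revision "being mission driven is. So I want...", where only punctuation and capitalization changed.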

chengyjonathan · 2024-07-26T05:22:35Z

+1, experiencing the same thing.

I wrote a quick push-to-talk wrapper around the C++ stream prototype and was seeing this:

[00:00:00.000 --> 00:00:02.000] Testing 1, 2, 3
[00:00:00.000 --> 00:00:02.000] Testing, one, two, three.
[00:00:00.000 --> 00:00:02.260] Testing, one, two, three.
[00:00:00.000 --> 00:00:02.240] Testing, 1, 2, 3.
[00:00:00.000 --> 00:00:02.240] Testing, one, two, three.
[00:00:02.240 --> 00:00:03.360] Just currently seeing what.
[00:00:00.000 --> 00:00:02.280] Testing, 1, 2, 3.
[00:00:02.280 --> 00:00:04.000] Just currently seeing what happens.
[00:00:00.000 --> 00:00:02.280] Testing, 1, 2, 3.
[00:00:02.280 --> 00:00:04.480] Just currently seeing what happens when we...
[00:00:00.000 --> 00:00:02.280] Testing, 1, 2, 3.
[00:00:02.280 --> 00:00:05.880] Just currently seeing what happens when we get rid of the

whisper_full_with_state: input is too short - 690 ms < 1000 ms. consider padding the input audio with silence
[00:00:00.000 --> 00:00:03.120] rid of the whole single single.
[00:00:00.000 --> 00:00:01.700] rid of the whole single segment port.
[00:00:00.000 --> 00:00:02.080] rid of the whole single segment portion.
[00:00:00.000 --> 00:00:02.080] rid of the whole single segment portion.
[00:00:00.000 --> 00:00:02.640] rid of the whole single segment portion.
[00:00:00.000 --> 00:00:02.640] rid of the whole single segment portion.
[00:00:02.640 --> 00:00:03.720] I really wonder what's in your mind.
[00:00:00.000 --> 00:00:02.640] rid of the whole single segment portion.
[00:00:02.640 --> 00:00:04.160] I really wonder what single segment--
[00:00:00.000 --> 00:00:02.640] rid of the whole single segment portion.
[00:00:02.640 --> 00:00:04.640] I really wonder what single segment was trying to do.
[00:00:00.000 --> 00:00:02.640] rid of the whole single segment portion.
[00:00:02.640 --> 00:00:05.200] I really wonder what single segment was trying to do here.

whisper_full_with_state: input is too short - 690 ms < 1000 ms. consider padding the input audio with silence
[00:00:00.000 --> 00:00:01.200] here because it looks awesome.
[00:00:00.000 --> 00:00:03.280] here because it looks like
[00:00:00.000 --> 00:00:04.160] here because it looks like it's
[00:00:00.000 --> 00:00:04.160] here because it looks like it's
[00:00:00.000 --> 00:00:05.040] here because it looks like it's weird
[00:00:00.000 --> 00:00:03.540] here because it looks like it's weirdly a
[00:00:00.000 --> 00:00:04.160] here because it looks like it's weirdly effective.
[00:00:00.000 --> 00:00:04.760] here because it looks like it's weirdly affecting what
[00:00:00.000 --> 00:00:07.040] here because it looks like it's weirdly affecting what's going
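For the timestamped log above, a rough way to recover one line per audio window is sketched below in Python. It rests on assumptions read off this log only, not on documented whisper.cpp behavior: every re-transcription pass restarts at `00:00:00.000`, and a blank line or a warning line marks the point where the audio window is reset. The function name `last_passes` is hypothetical.

```python
import re

# Matches lines such as "[00:00:02.280 --> 00:00:04.000] some text"
SEGMENT = re.compile(r"^\[(\d+):(\d+):(\d+\.\d+) --> [\d:.]+\]\s*(.*)$")


def last_passes(log_text):
    """Return the text of the last transcription pass in each block."""
    results = []   # joined text of the final pass of each block
    current = []   # segment texts of the pass currently being read
    for raw in log_text.splitlines():
        match = SEGMENT.match(raw.strip())
        if not match:
            # Assumed: blank lines and warnings end a block.
            if current:
                results.append(" ".join(current))
                current = []
            continue
        hours, minutes, seconds, text = match.groups()
        start = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
        if start == 0.0:
            # A segment starting at zero begins a new pass over the same
            # audio, so the previous pass was only a draft.
            current = []
        current.append(text.strip())
    if current:
        results.append(" ".join(current))
    return results
```

This does not remove the words that overlap between consecutive windows (for example "rid of the" in the log above); that would need a separate merging step.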

bobqianic added the bug label Jan 15, 2024

bobqianic linked a pull request Jan 15, 2024 that will close this issue

Fix the decoding issues #1768

Open

11 tasks
