Streaming Output Repetition #1702

Open
ethan-vuong opened this issue Dec 29, 2023 · 1 comment · May be fixed by #1768
Labels
bug Something isn't working

Comments

@ethan-vuong

I would like to use whisper.cpp to transcribe real-time audio and relay the transcript back to a user. I am using an M1 Mac.

Steps to reproduce:
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
bash ./models/download-ggml-model.sh base.en
make base.en
brew install sdl2
make stream

To start the live transcription: ./stream -m ./models/ggml-base.en.bin --file output.txt

Here is the output when I play this video in the background:

[Start speaking]
Understand. It's difficult to overstate how important be
being mission driven is. So I want to emphasize it one last time. Derivative companies, companies that copy and existing ideas.
idea with very few new insights. Don't excite people and they don't compel the teams to work hard enough to be successful.
Paul Graham is going to talk about how to get startup ideas next week. It's something that a lot of founders struggle with.
with, but it's something I believe you can get better with it. Better out of practice. And it's definitely worth trying to get better at.
The hardest part about coming up with great ideas is that the best ideas often look terrible at the beginning.
The 13th search engine and without all the features of a web portal, most people thought that was pointless.
was done and anyway it didn't matter that much. Portals were where the value was at. The tenth social network and limited
only to college students with no money, also terrible. My space had won and who wants college students as customers. Or a way to stand is true.
strangers and couches. That just sounds terrible all around. These all sounded really bad, but they turned out to be good.
If they had sounded really good, there would have been too many people working on them.

Here is what is in output.txt:
Understand.
Understand. It's difficult to overstate how important be
being mission driven is, so I want to emphasize it one last time.
being mission driven is. So I want to emphasize it one last time. Derivative companies, companies that copy and existing ideas.
idea with very few new insights.
idea with very few new insights. Don't excite people and they don't compel the teams to work hard enough to be successful.
Paul Graham is going to talk about how to get started.
Paul Graham is going to talk about how to get startup ideas next week. It's something that a lot of founders struggle with.
with, but it's something I believe you can get better with it. Better out.
with, but it's something I believe you can get better with it. Better out of practice. And it's definitely worth trying to get better at.
The hardest part about coming up with great ideas.
The hardest part about coming up with great ideas is that the best ideas often look terrible at the beginning.
The 13th search engine and without all the features of a webinar.
The 13th search engine and without all the features of a web portal, most people thought that was pointless.
was done and anyway it didn't matter that much.
was done and anyway it didn't matter that much. Portals were where the value was at. The tenth social network and limited
only to college students with no money. Also terrible. MySpace@w
only to college students with no money, also terrible. My space had won and who wants college students as customers. Or a way to stand is true.
and strangers, couches. That just sounds terrible all around.
strangers and couches. That just sounds terrible all around. These all sounded really bad, but they turned out to be good.
If they had sounded really good, there would have been too many people.
If they had sounded really good, there would have been too many people working on them.

As you can see, the output file contains a lot of repetition: most lines appear first as a partial result and then again in a longer, revised form. I think that is fine for displaying live feedback, but I'm not sure how to decide what the 'final' transcript should be. Any ideas?
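
One possible way to derive a 'final' transcript from a file like this is to post-process it: whenever a line starts with the same words as the line before it, treat it as a revision and keep only the newer one. The sketch below is a heuristic, not something whisper.cpp provides. The names `is_revision` and `finalize` and the `min_shared` threshold are made up for illustration. It assumes each partial line is immediately followed by its revision, which holds for most of the file above but not for pairs whose opening words differ (for example the "and strangers, couches" line).

```python
import re


def _words(line):
    # Lowercase and drop punctuation so "is, so" and "is. So" compare equal.
    return re.findall(r"[a-z0-9']+", line.lower())


def is_revision(prev, cur, min_shared=2):
    """Return True if `cur` looks like a revised or extended version of `prev`.

    Heuristic (an assumption, not whisper.cpp behavior): the two lines
    share at least `min_shared` leading words, or all of `prev` when
    `prev` is shorter than that.
    """
    a, b = _words(prev), _words(cur)
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared >= min(min_shared, len(a))


def finalize(lines):
    """Collapse runs of partial/revised lines, keeping the latest of each run."""
    final = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if final and is_revision(final[-1], line):
            final[-1] = line  # newer hypothesis replaces the older one
        else:
            final.append(line)
    return final
```

Passing the lines of `output.txt` to `finalize` (for example `finalize(open("output.txt"))`) would return one entry per run of repeated lines. Raising `min_shared` makes the merge more conservative, at the cost of leaving more near-duplicates in place.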

bobqianic added the bug label on Jan 15, 2024
bobqianic linked a pull request on Jan 15, 2024 that will close this issue
@chengyjonathan

+1 Experiencing the same thing

I wrote a silly push-to-talk thing on top of the cpp stream prototype and was seeing this:

[00:00:00.000 --> 00:00:02.000] Testing 1, 2, 3
[00:00:00.000 --> 00:00:02.000] Testing, one, two, three.
[00:00:00.000 --> 00:00:02.260] Testing, one, two, three.
[00:00:00.000 --> 00:00:02.240] Testing, 1, 2, 3.
[00:00:00.000 --> 00:00:02.240] Testing, one, two, three.
[00:00:02.240 --> 00:00:03.360] Just currently seeing what.
[00:00:00.000 --> 00:00:02.280] Testing, 1, 2, 3.
[00:00:02.280 --> 00:00:04.000] Just currently seeing what happens.
[00:00:00.000 --> 00:00:02.280] Testing, 1, 2, 3.
[00:00:02.280 --> 00:00:04.480] Just currently seeing what happens when we...
[00:00:00.000 --> 00:00:02.280] Testing, 1, 2, 3.
[00:00:02.280 --> 00:00:05.880] Just currently seeing what happens when we get rid of the

whisper_full_with_state: input is too short - 690 ms < 1000 ms. consider padding the input audio with silence
[00:00:00.000 --> 00:00:03.120] rid of the whole single single.
[00:00:00.000 --> 00:00:01.700] rid of the whole single segment port.
[00:00:00.000 --> 00:00:02.080] rid of the whole single segment portion.
[00:00:00.000 --> 00:00:02.080] rid of the whole single segment portion.
[00:00:00.000 --> 00:00:02.640] rid of the whole single segment portion.
[00:00:00.000 --> 00:00:02.640] rid of the whole single segment portion.
[00:00:02.640 --> 00:00:03.720] I really wonder what's in your mind.
[00:00:00.000 --> 00:00:02.640] rid of the whole single segment portion.
[00:00:02.640 --> 00:00:04.160] I really wonder what single segment--
[00:00:00.000 --> 00:00:02.640] rid of the whole single segment portion.
[00:00:02.640 --> 00:00:04.640] I really wonder what single segment was trying to do.
[00:00:00.000 --> 00:00:02.640] rid of the whole single segment portion.
[00:00:02.640 --> 00:00:05.200] I really wonder what single segment was trying to do here.

whisper_full_with_state: input is too short - 690 ms < 1000 ms. consider padding the input audio with silence
[00:00:00.000 --> 00:00:01.200] here because it looks awesome.
[00:00:00.000 --> 00:00:03.280] here because it looks like
[00:00:00.000 --> 00:00:04.160] here because it looks like it's
[00:00:00.000 --> 00:00:04.160] here because it looks like it's
[00:00:00.000 --> 00:00:05.040] here because it looks like it's weird
[00:00:00.000 --> 00:00:03.540] here because it looks like it's weirdly a
[00:00:00.000 --> 00:00:04.160] here because it looks like it's weirdly effective.
[00:00:00.000 --> 00:00:04.760] here because it looks like it's weirdly affecting what
[00:00:00.000 --> 00:00:07.040] here because it looks like it's weirdly affecting what's going
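
For timestamped output like the above, where every pass appears to re-emit its segments starting from `00:00:00.000`, one rough way to recover a single transcript is to keep only the last pass of each audio window. The sketch below is a post-processing heuristic, not part of whisper.cpp. The function name `last_pass_per_window` is hypothetical, and it assumes that any non-segment line (a blank line or a log message) marks the boundary between windows, which matches the log above but may not hold in general.

```python
import re

# Matches lines such as "[00:00:02.240 --> 00:00:03.360] some text".
SEG = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)$"
)


def last_pass_per_window(lines):
    """Return, for each audio window, the segment texts of its last pass.

    A new pass is assumed to start whenever a segment begins at
    00:00:00.000. A non-segment line is assumed to end the window.
    """
    windows = []  # finished windows, each a list of segment texts
    passes = []   # passes seen so far in the current window
    for raw in lines:
        m = SEG.match(raw.strip())
        if not m:
            # Blank line or log message: close the current window, if any.
            if passes:
                windows.append(passes[-1])
                passes = []
            continue
        start, _end, text = m.groups()
        if start == "00:00:00.000" or not passes:
            passes.append([])  # the decoder restarted from the window start
        passes[-1].append(text)
    if passes:
        windows.append(passes[-1])
    return windows
```

Joining the texts of each returned window gives one line per window. Text cut off at a window edge (such as "get rid of the" followed by "rid of the whole") would still overlap and need separate handling.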

3 participants