Implement word-level timestamps approach proposed by OpenAI #375
Comments
This came up in my "explore" feed as a way to implement accurate word-level timestamps.
There is a functioning implementation of the attention weights approach here: https://github.com/linto-ai/whisper-timestamped, which might be a useful reference for implementing it in whisper.cpp.
The whisper Python package itself provides a word-level timestamp output option, which could serve as a reference. I tested it from the command line, and it generated 5 files in the test_out folder.
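The exact command isn't quoted here. As a point of reference, a minimal sketch of the equivalent usage of the Python package (the "base" model size, audio file name, and output path below are placeholders, not the original values) would be something like:

```python
# Sketch only: model size, audio file, and output path are placeholders.
import json
import os

import whisper

model = whisper.load_model("base")
result = model.transcribe("test.mp3", word_timestamps=True)

os.makedirs("test_out", exist_ok=True)
with open("test_out/test.json", "w", encoding="utf-8") as f:
    json.dump(result, f, ensure_ascii=False, indent=2)
```

In recent versions the command-line entry point writes the result in several formats (txt, vtt, srt, tsv, json), which would account for the five output files.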
In the test.json file, the content is: {
"text": " So, here's a great city of New York, and I realized now, going out public is a big, busy place. I'm going to be recognized, people are going to know why I am. Now I'm here, I'm on vacation, I'm with my family, I just want to have the money back. I just want to be a normal person, so... I'm going to go to the kitchen. I'm at this girl's girl, she seemed to hit me tight. The need to surface she was more about her perfect life. It's not the best thing, she drives the main thing. And when I'm dreaming her to scream and daddy make it. She's a gold digger lover, she got it from her mom. She's never stepfather body, all they should want. She's a gold digger lover, she's a gold digger lover. If you recognize me now, don't you? I'm the only one. So, real life, nobody really knows who the heck I am. So, I have a plan, but I gotta make myself known. I gotta do this somehow, I gotta get my name out there. She's a gold digger lover, she's a gold digger lover. She's a gold digger lover, she's a gold digger lover. Have you heard of her? You never heard of her? Oh, it's great. She won't do anything like gold in time and fame. But pop up on a mark, the birds will fade away with a ramycine. Last blow, sing, last tip of two, she could last. By the way, do you know what's going to be? Can I get caught by a goodness? I don't know. She's a gold digger lover, she's been on the cover, she's a brand. No, pop up on a mark, you just look her up, she's great. She's a gold digger lover, she's a gold digger lover, she's a gold digger lover. Thank you. Thank you. Thank you. I thought it was a party party party, can you? So, okay, New York City, you may not know me yet. And all, as I've learned, you may not have heard of Lindsey Sterling, he hit my violinist before. What? Think you're looking up for me. Think you're on the bright side. Hello, how you doing? Yes. Okay. So, subscribe to my YouTube channel. Stop it this drought. I'm just gonna be fine. I got some great stuff coming through away. Do. Ace that. More come. Yeah. Let me surely sign me out.",
"segments": [
{
"id": 0,
"seek": 0,
"start": 3.6,
"end": 10.6,
"text": " So, here's a great city of New York, and I realized now, going out public is a big, busy place.",
"tokens": [
50364, 407, 11, 510, 311, 257, 869, 2307, 295, 1873, 3609, 11, 293, 286, 5334, 586, 11, 516, 484, 1908, 307, 257, 955, 11, 5856, 1081, 13, 50914
],
"temperature": 0.0,
"avg_logprob": -0.5041744733097577,
"compression_ratio": 1.5665024630541873,
"no_speech_prob": 0.08891408145427704,
"words": [
{"word": " So,","start": 3.6,"end": 3.96,"probability": 0.5301069021224976},
{"word": " here's","start": 3.42,"end": 4.32,"probability": 0.6140210628509521},
{"word": " a","start": 4.32,"end": 4.42,"probability": 0.1545887440443039},
{"word": " great","start": 4.42,"end": 4.7,"probability": 0.6114427447319031},
{"word": " city","start": 4.7,"end": 5.08,"probability": 0.9124268293380737},
{"word": " of","start": 5.08,"end": 5.36,"probability": 0.9507943987846375},
{"word": " New","start": 5.36,"end": 5.44,"probability": 0.9982349872589111},
{"word": " York,","start": 5.44,"end": 6.18,"probability": 0.9951660633087158},
{"word": " and","start": 6.44,"end": 6.56,"probability": 0.9580233097076416},
{"word": " I","start": 6.56,"end": 6.66,"probability": 0.5875958204269409},
{"word": " realized","start": 6.66,"end": 7.02,"probability": 0.5471060872077942},
{"word": " now,","start": 7.02,"end": 7.86,"probability": 0.6020179390907288},
{"word": " going","start": 8.04,"end": 8.12,"probability": 0.7494494318962097},
{"word": " out","start": 8.12,"end": 8.38,"probability": 0.9883183240890503},
{"word": " public","start": 8.38,"end": 8.72,"probability": 0.6699197888374329},
{"word": " is","start": 8.72,"end": 8.98,"probability": 0.3241350054740906},
{"word": " a","start": 8.98,"end": 9.14,"probability": 0.7641012072563171},
{"word": " big,","start": 9.14,"end": 9.5,"probability": 0.4375719726085663},
{"word": " busy","start": 9.5,"end": 9.94,"probability": 0.6939781308174133},
{"word": " place.","start": 9.94,"end": 10.6,"probability": 0.8924348950386047}
]
},
{
"id": 1,
"seek": 0,
"start": 11.7,
"end": 15.16,
"text": " I'm going to be recognized, people are going to know why I am.",
"tokens": [
50914, 286, 478, 516, 281, 312, 9823, 11, 561, 366, 516, 281, 458, 983, 286, 669, 13, 51114
],
"temperature": 0.0,
"avg_logprob": -0.5041744733097577,
"compression_ratio": 1.5665024630541873,
"no_speech_prob": 0.08891408145427704,
"words": [
{"word": " I'm","start": 11.7,"end": 11.8,"probability": 0.980172872543335},
{"word": " going","start": 11.8,"end": 11.94,"probability": 0.32428041100502014},
{"word": " to","start": 11.94,"end": 12.04,"probability": 0.9828474521636963},
{"word": " be","start": 12.04,"end": 12.16,"probability": 0.9843984842300415},
{"word": " recognized,","start": 12.16,"end": 12.58,"probability": 0.3810001611709595},
{"word": " people","start": 13.22,"end": 13.5,"probability": 0.9561352729797363},
{"word": " are","start": 13.5,"end": 13.6,"probability": 0.9821558594703674},
{"word": " going","start": 13.6,"end": 13.78,"probability": 0.7550729513168335},
{"word": " to","start": 13.78,"end": 13.8,"probability": 0.9977655410766602},
{"word": " know","start": 13.8,"end": 14.0,"probability": 0.9933110475540161},
{"word": " why","start": 14.0,"end": 14.32,"probability": 0.7471684813499451},
{"word": " I","start": 14.32,"end": 14.58,"probability": 0.31861186027526855},
{"word": " am.","start": 14.58,"end": 15.16,"probability": 0.9440820217132568}
]
}
],
"language": "en"
}

From a practical point of view, the JSON word-timestamp file is quite useful.
The method currently used to get per-word timestamps is pretty bad. The Python version is substantially better. I'm struggling to sort out how to do it in the whisper.cpp version, but it seems like "whisper_exp_compute_token_level_timestamps" needs to be replaced with something similar to what's in the "timing.py" of OpenAI's implementation.
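For reference, here is a rough Python/NumPy sketch of how timing.py-style code turns the cross-attention weights into the matrix that DTW is run over. It paraphrases rather than copies OpenAI's implementation; the function name, the exact normalization, and the default filter width are illustrative:

```python
import numpy as np
from scipy.ndimage import median_filter


def attention_to_cost(attn: np.ndarray, medfilt_width: int = 7) -> np.ndarray:
    """Turn cross-attention weights from the alignment heads into a DTW cost
    matrix. `attn` has shape (n_alignment_heads, n_text_tokens, n_audio_frames)
    and is assumed to already be softmax-normalized over the frame axis."""
    # standardize each token's attention distribution so heads are comparable
    mean = attn.mean(axis=-1, keepdims=True)
    std = attn.std(axis=-1, keepdims=True) + 1e-8
    w = (attn - mean) / std
    # smooth along the time axis to suppress spiky attention
    w = median_filter(w, size=(1, 1, medfilt_width))
    # average the heads and negate: high attention -> low cost
    return -w.mean(axis=0)


# cost = attention_to_cost(attn)
# (a sketch of the DTW step over `cost` follows a later comment in this thread)
```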
I'd love to help with implementing OpenAI's per-word timestamps approach based on DTW and cross-attention weights in whisper.cpp. I think the main steps required for this consist of collecting the cross-attention weights from the relevant decoder heads and running DTW over them to align tokens with audio frames.
Is this on the roadmap, and is anyone willing to collaborate on this?
I think the roadmap is pretty open to whatever you want to contribute. I don't know of anyone else working on it. I did take a look at trying to implement it, but found that I just don't know the inner workings of GGML and PyTorch well enough to build something that won't be a total mess. I'm definitely willing to collaborate on it, but I'm not sure how much use I can be.
Would be great to implement this in whisper.cpp. From what I remember, DTW is a dynamic programming algorithm, and its implementation should be part of whisper.cpp.
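As a reference for such a port, a minimal sketch of the DTW step in plain Python follows. The function and variable names are illustrative rather than taken from either codebase; the cost matrix is assumed to be built from the alignment-head attention weights as sketched above, and 20 ms per encoder output frame is the spacing used by the model:

```python
import numpy as np

AUDIO_TIME_PER_FRAME = 0.02  # each encoder output frame covers ~20 ms of audio


def dtw_path(cost: np.ndarray):
    """Plain O(N*M) dynamic time warping over a (n_tokens, n_frames) cost
    matrix. Returns the monotonic alignment path as (token_idx, frame_idx)
    pairs. cost[i, j] should be low where token i attends to frame j."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    trace = np.zeros((n + 1, m + 1), dtype=np.int8)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = (acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            best = int(np.argmin(candidates))
            acc[i, j] = cost[i - 1, j - 1] + candidates[best]
            trace[i, j] = best  # 0 = diagonal, 1 = advance token, 2 = advance frame

    # backtrack from the bottom-right corner to recover the path
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        if trace[i, j] == 0:
            i, j = i - 1, j - 1
        elif trace[i, j] == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path


def token_times(path):
    """Map the alignment path to approximate (start, end) times per token."""
    times = {}
    for tok, frame in path:
        t = frame * AUDIO_TIME_PER_FRAME
        start, _ = times.get(tok, (t, t))
        times[tok] = (start, t + AUDIO_TIME_PER_FRAME)
    return times


# Usage, with `cost` built from the alignment-head attention weights as above:
# path = dtw_path(cost)
# print(token_times(path))   # {token_index: (start_sec, end_sec), ...}
```

OpenAI's implementation then groups the aligned tokens into words using the tokenizer before reporting per-word times.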
I would like to try my hand at this. Would you be willing to offer me some guidance, @ggerganov? I'll probably start as suggested, implementing the DTW algorithm. What I will probably need help figuring out is the information collection; in particular:
Considering the conversion from PyTorch to ggml, would the alignment head indexes still point to the same attention heads?
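On the question of which attention heads to read from: in the Python package, the alignment heads for each model ship as a base85-encoded, gzip-compressed boolean mask of shape (n_text_layer, n_text_head) (the _ALIGNMENT_HEADS table in whisper/__init__.py, applied via set_alignment_heads). Below is a sketch of decoding such a dump into explicit (layer, head) pairs; the dump bytes and the layer/head counts in the usage comment are placeholders, not values copied from the package:

```python
import base64
import gzip

import numpy as np


def alignment_head_indices(dump: bytes, n_text_layer: int, n_text_head: int):
    """Decode a base85-encoded, gzip-compressed boolean mask into explicit
    (layer, head) index pairs, mirroring what set_alignment_heads does in
    the Python package."""
    raw = gzip.decompress(base64.b85decode(dump))
    mask = np.frombuffer(raw, dtype=bool).reshape(n_text_layer, n_text_head)
    return [(int(layer), int(head)) for layer, head in zip(*np.nonzero(mask))]


# Usage (placeholder arguments -- the real dump strings live in whisper/__init__.py):
# heads = alignment_head_indices(_ALIGNMENT_HEADS["base.en"], n_text_layer=6, n_text_head=8)
```

Whether those indices carry over unchanged to a ggml model depends on the conversion script preserving the decoder layer and head ordering, so that would need to be verified rather than assumed.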
See notebook, section "Word-level timestamps using attention weights":
https://github.com/openai/whisper/blob/main/notebooks/Multilingual_ASR.ipynb