WIP: Add timestamp #52
Conversation
What kind of timestamp format do you recommend? I am considering using the one from https://github.com/alphacep/vosk-api/blob/master/src/vosk_api.h#L168

An example message is given below (in JSON format):

{
  "timestamp": [
    {"token": "_AFTER", "start": 0.0},
    {"token": "_E", "start": 0.28},
    {"token": "AR", "start": 0.44},
    {"token": "LY", "start": 0.56},
    {"token": "_NIGHT", "start": 0.88}
  ],
  "text": "AFTER EARLY NIGHT"
}

The message would be constructed in Python, so it should be fairly straightforward to change its format. Note that only the start time of each BPE token is given; you have to figure out the start and end time of a word from it.
This makes sense to me.
To create a CTM, both the start time and the duration of each word are needed. The issue will be with words that have a long silence in between, unless the silence token is kept.
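For reference, a CTM line carries an utterance id, a channel, a start time, a duration, and the word. A minimal sketch of emitting CTM lines from word-level (word, start, end) tuples might look like this; the utterance id, channel "1", and helper name are placeholders.

```python
# Sketch: write CTM lines ("<utt-id> <channel> <start> <duration> <word>")
# from word-level (word, start, end) tuples. Utterance id and channel are
# illustrative placeholders.

def to_ctm(utt_id, words, channel="1"):
    lines = []
    for word, start, end in words:
        duration = end - start  # CTM needs duration, not end time
        lines.append(f"{utt_id} {channel} {start:.2f} {duration:.2f} {word}")
    return "\n".join(lines)


print(to_ctm("1089-134686-0001", [("AFTER", 0.0, 0.28), ("EARLY", 0.28, 0.88)]))
# 1089-134686-0001 1 0.00 0.28 AFTER
# 1089-134686-0001 1 0.28 0.60 EARLY
```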
One problem with this is that the first word is always labeled with start: 0.0, regardless of how much silence/non-speech there is before that word.
No. It is 0 for this specific test wave only because the first decoded token happens to be emitted on the first frame.
In the test file, the first word "AFTER" starts at approximately 0.402 s, which is well after the first frame, I think. Can you try the attached file? I added ~5 seconds of silence before the first word and it still gives start: 0.0. Here is what I got (on a different model than yours):

Original file (1089-134686-0001.wav): [('▁after', 0.0), ('▁e', 0.56), ('ar', 0.68), ('ly', 0.8), ('▁night', 1.04) .....
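The silence-prepending experiment above can be reproduced with a short script like the following sketch, which assumes a PCM wav file (the file names in the comment are illustrative, and the helper name is made up):

```python
# Sketch: prepend digital silence to a PCM wav file, for testing whether
# the first token is still emitted at frame 0. Assumes an uncompressed
# (PCM) wav; file names below are illustrative.
import wave


def prepend_silence(src_path, dst_path, seconds):
    """Write dst_path = `seconds` of silence followed by the audio in src_path."""
    with wave.open(src_path, "rb") as r:
        params = r.getparams()
        audio = r.readframes(r.getnframes())
    n_silence_frames = int(seconds * params.framerate)
    silence = b"\x00" * (n_silence_frames * params.sampwidth * params.nchannels)
    with wave.open(dst_path, "wb") as w:
        w.setparams(params)
        w.writeframes(silence + audio)


# e.g. prepend_silence("1089-134686-0001.wav", "padded.wav", 5.0)
```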
Yes, you are right. I can reproduce the results. Both greedy search and modified_beam_search emit the first token on the first frame. I think part of the reason is that the model uses global attention: it can see all the remaining frames at frame 0. I will try the streaming model to test it.
The following are the results for streaming greedy search using the pre-trained model. You can see that the first token is no longer decoded at frame 0. Moreover, the model tends to delay the output by about 0.4 s.

Without prepended silence:

With prepended silence:
The following table summarizes the results so far for non-streaming and streaming decoding. There is a delay of about 0.3 to 0.4 seconds for the streaming model.
If we assume that the chunk size is 40 ms and we have 4 chunks, then the left context is 160 ms, but 320 ms would be double that. Any intuition for why we get an offset in this range?
Maybe we can try the models trained with delay penalty and see if it helps.
The following are the timestamps of a model trained with delay penalty (see k2-fsa/k2#976). The encoder is the reworked conformer. You can see from the table that for the model trained without delay penalty, the delay is about 0.3 to 0.4 seconds; for the model trained with penalty=0.0015, the delay decreases to 0.1 to 0.2 seconds, so the delay penalty does help symbols emit earlier. Apart from the delay penalty, the decoding chunk size and left context also affect the symbol delay: the 4th, 5th, and 6th columns were decoded with chunk-size=8 and left-context=32, while the 7th column was decoded with chunk-size=16 and left-context=64. Comparing the 6th and 7th columns, when the decoding chunk is larger, the symbol delay is slightly larger.

No penalty
Penalty=0.001
Penalty=0.0015
Penalty=0.0015 (left-context=64, chunk-size=16)
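The delay figures discussed above could be quantified with a small helper like the one below. This is purely illustrative (the helper and its numbers are made up, merely in the ~0.3 s ballpark discussed in the thread): it averages the difference between predicted and reference start times over the words both alignments share.

```python
# Illustrative sketch: mean emission delay, computed as the average
# difference between predicted and reference word start times for words
# present in both alignments. Numbers below are hypothetical.

def mean_delay(predicted, reference):
    """predicted/reference: dicts mapping word -> start time in seconds."""
    common = predicted.keys() & reference.keys()
    return sum(predicted[w] - reference[w] for w in common) / len(common)


pred = {"AFTER": 0.76, "EARLY": 1.32, "NIGHT": 1.64}
ref = {"AFTER": 0.40, "EARLY": 1.04, "NIGHT": 1.36}
print(round(mean_delay(pred, ref), 2))  # → 0.31
```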
I just found that a model trained using modified_transducer (from optimized_transducer)
I don't see a recent comment there, and I don't understand what you are saying. Is this a Sherpa-specific issue? Are you saying that, empirically, the first token never seems to appear on frame 0?
k2-fsa/icefall#239 uses a model trained with the standard RNN-T loss plus modified_transducer with prob 0.25 to do forced alignment. You can see that the first token does not start at the first frame. This PR uses a reworked conformer model trained with pruned RNN-T; it is also non-streaming. #52 (comment) shows that for the given audio, the model from this PR always predicts the first token on the first frame, no matter whether you prepend it with 5 seconds of noise or not. This PR and k2-fsa/icefall#239 are testing the same audio. One difference between this PR and k2-fsa/icefall#239
Already supported.
Here are some initial results.
For the following test wav (note: its name should be 1089-134686-0002.wav, not 1089-134686-0001.wav):

https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/blob/main/test_wavs/1089-134686-0001.wav

The ground-truth word timestamps from https://github.com/CorentinJ/librispeech-alignments are:

The predicted results of this PR, using the pre-trained model from https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/tree/main/exp, are:
For ease of reference, the above results are listed in the following table:
Since the model's subsampling factor is 4 and the frame shift is 0.01 s, the resolution of the timestamps is 0.04 s.
You can see that the predicted timestamps of the words in the middle of the utterance are very close to the ones from https://github.com/CorentinJ/librispeech-alignments.
Also note that we are not doing forced alignment here. The words are decoded using greedy search, and it happens that all words are predicted correctly.
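The frame-to-time arithmetic above can be made explicit with a one-line helper (the names are illustrative): with a subsampling factor of 4 and a 0.01 s frame shift, output frame i maps to i × 4 × 0.01 seconds, giving the 0.04 s resolution.

```python
# Sketch: map a model output-frame index to a timestamp in seconds,
# assuming subsampling factor 4 and a 0.01 s frame shift as stated above.

SUBSAMPLING_FACTOR = 4
FRAME_SHIFT_S = 0.01  # seconds per input frame


def frame_to_seconds(frame_index):
    # Round to the timestamp resolution (0.04 s) to avoid float noise.
    return round(frame_index * SUBSAMPLING_FACTOR * FRAME_SHIFT_S, 2)


print(frame_to_seconds(7))  # → 0.28
```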