Extract framewise alignment information by the pretrained model #188
Could you first follow the …? The pre-trained model is called …
Thank you for your clarifications! I can now obtain three files.
After checking with tokens.txt, I found the transcript. I am just wondering what the time interval between two elements in this list is, so that I can calculate the time corresponding to each word in the transcript. Many thanks for your help in advance!
These numbers are the token IDs corresponding to the output frames. To get the results of #39, you have to do a few extra things:
(1) Note that the subsampling factor of the model is 4, so output frames 0, 1, 2 correspond to input frames 0, 4, 8.
(2) The default frame shift is 10 ms, so an input frame index converts to time in seconds by multiplying it by 0.01.
(3) You have to use …
The time interval between two consecutive output frames is therefore 0.04 s.
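Putting the arithmetic above into code (a minimal sketch; the constants come straight from this thread: subsampling factor 4, input frame shift 10 ms):

```python
# Convert CTC output-frame indices to time in seconds.
# Constants taken from this thread: the model subsamples by a factor of 4
# and the default input frame shift is 10 ms, so consecutive output frames
# are 0.04 s apart.

SUBSAMPLING_FACTOR = 4
FRAME_SHIFT_SECONDS = 0.01  # 10 ms

def output_frame_to_seconds(frame_index: int) -> float:
    """Map an output-frame index to its time in seconds."""
    return frame_index * SUBSAMPLING_FACTOR * FRAME_SHIFT_SECONDS

# Output frames 0, 1, 2 correspond to input frames 0, 4, 8,
# i.e. 0.00 s, 0.04 s, 0.08 s.
```

So the time of each entry in the framewise token-ID list is just its index times 0.04 s.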
Thank you very much for your detailed explanations. I have obtained almost the same results as #39, except for the first wordpiece …
Could you explain how you determine the starting frame for the first word in the general case?
The alignment is never going to be exact in any end-to-end setup, especially one like transformers that consumes unlimited left/right context.
Hi guys, I also met the same problem when doing alignment by myself. Could I ask why the model from #17 used in #39 can determine the starting frame for the first word accurately (e.g., 0.48 s compared with the ground truth 0.5 s), while the pretrained model from https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09 gets a worse result (e.g., 0 s compared with 0.5 s)? It seems that both the model in #17 and the pretrained model are CTC models. What possible differences between these two models lead to this result?
Too-powerful models can give poor alignments as they transform the data too much. Often the best alignments are from GMM systems.
Could you try the pre-trained model from the following repo? https://github.com/csukuangfj/icefall-asr-conformer-ctc-bpe-500
That model has a higher WER on test-clean than the one from https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09 (2.56 vs. 2.42). They have basically the same model configuration, so you can load the pre-trained model with the same code, without modifications. I just tried it on the first utterance of …
You can see the first token does not start at the very beginning.
Dan's comment may explain why those two models produce different alignments.
Sorry to bother you again. I just wonder whether the pretrained model can now be used to extract framewise alignment information for our own datasets. I can see in …
You can try that and look at the resulting alignments. You will probably need to train your own model.
Possible steps are:
(2) Follow https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/prepare.sh to extract features for your dataset.
(3) Adapt https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/tdnn_lstm_ctc/asr_datamodule.py to your dataset.
(4) Train a model for your dataset. Please see https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/conformer_ctc/train.py
(5) Get alignments. Please see https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/conformer_ctc/ali.py
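As an illustration of step (5), a framewise alignment (one token ID per output frame) can be collapsed into per-token start times roughly like this. This is a hand-written sketch, not icefall's ali.py: it assumes the blank token has ID 0 and the 0.04 s output-frame spacing discussed earlier in the thread; check both assumptions against your own lang dir and model.

```python
# Sketch: collapse a framewise CTC alignment into (token_id, start_time)
# pairs. Assumptions (verify for your setup): blank ID is 0, and output
# frames are 0.04 s apart (subsampling factor 4 x 10 ms frame shift).

BLANK_ID = 0
SECONDS_PER_OUTPUT_FRAME = 0.04

def alignment_to_segments(frame_ids):
    """Return (token_id, start_time_seconds) for each new non-blank token."""
    segments = []
    prev = BLANK_ID
    for i, tok in enumerate(frame_ids):
        # A token starts where a non-blank ID first differs from the
        # previous frame; blanks reset the state so repeated tokens
        # separated by a blank are counted twice, as CTC requires.
        if tok != BLANK_ID and tok != prev:
            segments.append((tok, i * SECONDS_PER_OUTPUT_FRAME))
        prev = tok
    return segments

# e.g. [0, 0, 7, 7, 0, 12] -> token 7 starts at 0.08 s, token 12 at 0.20 s
```

The token IDs can then be mapped back to wordpieces via tokens.txt to get per-word timings.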
Hi,
I am new to icefall. I would like to extract framewise alignment information like what is shown in #39 with the pretrained model from https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09. I tried to follow the README in egs/librispeech/ASR/conformer_ctc/README.md. However, when I ran egs/librispeech/ASR/conformer_ctc/ali.py as
./conformer_ctc/ali.py --exp-dir ./conformer_ctc/exp --lang-dir ./data/lang_bpe_500 --epoch 20 --avg 10 --max-duration 300 --dataset train-clean-100 --out-dir data/ali
I found there are no checkpoint files in, e.g., ./conformer_ctc/exp uploaded for the pretrained model to average.
I wonder whether I missed something and/or where I can find an example of extracting framewise alignment information with the pretrained model to get results similar to those shown in #39. Many thanks for your help in advance!
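For context on why the script fails here: --epoch 20 --avg 10 tells the script to average the parameters of several per-epoch training checkpoints, which a pretrained release usually does not ship; it typically contains only a single, already-averaged model file. Conceptually, the averaging is just element-wise parameter averaging. The toy sketch below shows the idea over plain dicts of floats (the real icefall code averages PyTorch state dicts of tensors):

```python
# Toy sketch of checkpoint averaging (what --avg does conceptually).
# Real icefall checkpoints are PyTorch state dicts of tensors; here we
# average plain dicts of floats just to show the idea.

def average_checkpoints(state_dicts):
    """Element-wise average of a list of parameter dicts with identical keys."""
    n = len(state_dicts)
    return {key: sum(sd[key] for sd in state_dicts) / n
            for key in state_dicts[0]}
```

So with a pretrained release, the averaging has already been done for you, and the script's checkpoint-loading path needs per-epoch files that simply are not there.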