Extract framewise alignment information by the pretrained model #188

Open
TianyuCao opened this issue Jan 23, 2022 · 10 comments

Comments

@TianyuCao

Hi,

I am new to Icefall. I would like to extract framewise alignment information like what is shown in #39 with the pretrained model from https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09. I tried to follow the README at egs/librispeech/ASR/conformer_ctc/README.md. However, when I tried to run egs/librispeech/ASR/conformer_ctc/ali.py with "./conformer_ctc/ali.py --exp-dir ./conformer_ctc/exp --lang-dir ./data/lang_bpe_500 --epoch 20 --avg 10 --max-duration 300 --dataset train-clean-100 --out-dir data/ali", I found there are no checkpoint files (e.g., in conformer_ctc/exp) uploaded for the pretrained model to average.

I wonder whether I missed something, and where I can find an example of extracting framewise alignment information with the pretrained model to get results similar to those shown in #39. Many thanks for your help in advance!

@csukuangfj
Collaborator

csukuangfj commented Jan 23, 2022

Could you first follow the README.md in https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09 to download the pre-trained model?

The pre-trained model is called pretrained.pt. You can create a symlink to it at conformer_ctc/exp/epoch-999.pt
and use --epoch 999 --avg 1 when invoking conformer_ctc/ali.py.
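For concreteness, here is a minimal sketch of that symlink step in Python; the location of the cloned Hugging Face repo and of pretrained.pt inside it are assumptions, so adjust the paths to your setup.

```python
# A minimal sketch, assuming the Hugging Face repo was cloned into the
# current directory and contains exp/pretrained.pt (adjust paths as needed).
from pathlib import Path

pretrained = Path(
    "icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09/exp/pretrained.pt"
)
exp_dir = Path("conformer_ctc/exp")
exp_dir.mkdir(parents=True, exist_ok=True)

link = exp_dir / "epoch-999.pt"
if not link.exists():
    link.symlink_to(pretrained.resolve())

# After this, invoke conformer_ctc/ali.py with --epoch 999 --avg 1.
```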

@TianyuCao
Author

TianyuCao commented Jan 26, 2022

Thank you for your clarifications! I can now obtain three files, aux_labels_test-clean.h5, labels_test-clean.h5 and cuts_test-clean, by using --epoch 999 --avg 1 when invoking conformer_ctc/ali.py. However, when trying to read the data from aux_labels_test-clean.h5 for the test audio in #39, e.g., librispeech/LibriSpeech/test-clean/8224/274384/8224-274384-0008.flac, I just get something like this:

[4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 286, 0, 0, 0, 0, 0, 298, 0, 0, 0, 0, 276, 0, 12, 0, 0, 5, 0, 0, 28, 12, 0, 27, 0, 0, 0, 0, 209, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 59, 0, 0, 0, 0, 210, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 134, 0, 0, 58, 0, 0, 72, 0, 0, 0, 0, 161, 0, 0, 340, 0, 0, 0, 207, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 187, 0, 150, 0, 8, 0, 0, 0, 0, 42, 0, 0, 0, 0, 74, 0, 0, 0, 0, 66, 0, 0, 0, 0, 0, 0, 0, 263, 0, 0, 0, 0, 0, 29, 0, 0, 0, 78, 0, 0, 38, 0, 29, 0, 0, 0, 209, 0, 0, 0, 0, 10, 0, 0, 0, 4, 0, 0, 0, 167, 0, 0, 14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 236, 0, 0, 0, 0, 0, 10, 0, 0, 0, 4, 0, 0, 0, 139, 0, 13, 0, 0, 275, 0, 0, 29, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 137, 0, 0, 0, 92, 0, 0, 0, 0, 4, 0, 0, 0, 0, 59, 0, 3, 0, 48, 0, 17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 110, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 42, 0, 0, 0, 17, 0, 0, 29, 0, 0, 0, 62, 0, 0, 0, 0, 0, 127, 0, 0, 58, 0, 8, 0, 0, 0, 42, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

After checking with tokens.txt, I recovered the transcript. I am just wondering what the time interval between two consecutive elements of this list is, so that I can compute the time of each word in the transcript. Many thanks for your help in advance!
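For reference, a rough sketch of how such an HDF5 file can be inspected with h5py follows; the file path and the assumption that each cut's alignment is stored as its own integer dataset are illustrative, not taken from ali.py.

```python
# A minimal sketch, assuming aux_labels_test-clean.h5 contains one integer
# array of framewise token IDs per cut (inspect f.keys() to see how ali.py
# actually keyed the datasets; 0 is the blank token ID).
import h5py

with h5py.File("data/ali/aux_labels_test-clean.h5", "r") as f:
    for key in list(f.keys())[:3]:   # look at the first few entries
        token_ids = f[key][()]
        print(key, token_ids[:30])
```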

@csukuangfj
Collaborator

However, when trying to read the data from aux_labels_test-clean.h5 for the test audio in #39, e.g., librispeech/LibriSpeech/test-clean/8224/274384/8224-274384-0008.flac, I just get something like this:

[4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 286, 0, 0, 0, 0, 0, 298, 0, 0, 0, 0, 276, 0, 12, 0, 0, 5, 0, 0, 28, 12, 0, 27, 0, 0, 0, 0, 209, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 59, 0, 0, 0, 0, 210, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 134, 0,

These numbers are the corresponding token IDs for the output frames. To get the results of #39, you have to do a few extra things.

(1) Note that the subsampling factor of the model is 4, so output frames 0, 1, 2 correspond to input frames 0, 4, 8.
You have to use interpolation to get the alignments for input frames 1, 2, 3, 5, 6, 7, etc.

(2) The default frame shift is 10 ms, so you can convert an input frame index (after the interpolation in (1)) to time in seconds by multiplying by 0.01.

(3) You have to use tokens.txt to map those integer token IDs to the corresponding symbols.

I am just wondering what the time interval between two consecutive elements of this list is, so that I can compute the time of each word in the transcript

The time slot between two consecutive output frames is 0.04 s.
As we are using wordpieces, and the first wordpiece of a word starts with an underscore ▁, you can use this information to find the starting frame of a word. Unfortunately, it is not easy to find the ending frame of a word.
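Putting points (1)-(3) and the 0.04 s output frame shift together, here is a minimal sketch of mapping a framewise token-ID list to approximate word start times. The tokens.txt path, the helper names, and the assumption that tokens.txt contains "symbol id" pairs with 0 as the blank are illustrative, not part of ali.py.

```python
# A minimal sketch, assuming tokens.txt has lines of the form "<symbol> <id>"
# with ID 0 being the blank, and that each output frame covers
# SUBSAMPLING_FACTOR * FRAME_SHIFT = 0.04 s of audio.
FRAME_SHIFT = 0.01       # seconds per input frame (10 ms)
SUBSAMPLING_FACTOR = 4   # output frame i starts at input frame 4 * i


def load_tokens(tokens_txt: str) -> dict:
    id2sym = {}
    with open(tokens_txt, encoding="utf-8") as f:
        for line in f:
            sym, idx = line.split()
            id2sym[int(idx)] = sym
    return id2sym


def word_start_times(token_ids, id2sym):
    """Return (wordpiece, start_time_in_seconds) for wordpieces that begin
    with "▁", i.e. the first piece of each word. Consecutive repeats of the
    same non-blank ID are treated as a single emission."""
    starts = []
    prev_tid = 0
    for frame, tid in enumerate(token_ids):
        if tid == 0:          # blank
            prev_tid = 0
            continue
        sym = id2sym[tid]
        if sym.startswith("▁") and tid != prev_tid:
            starts.append((sym, frame * SUBSAMPLING_FACTOR * FRAME_SHIFT))
        prev_tid = tid
    return starts


if __name__ == "__main__":
    id2sym = load_tokens("data/lang_bpe_500/tokens.txt")
    # A shortened prefix of the alignment above, just for illustration:
    ali = [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 286, 0, 0]
    print(word_start_times(ali, id2sym))
```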

@TianyuCao
Author

Thank you very much for your detailed explanations. I have obtained almost the same results as #39, except that for the first wordpiece ▁THE (token ID 4), using 0 * 0.04 = 0 s means the first word "THE" starts immediately in this audio, which does not match the alignment information from https://github.com/CorentinJ/librispeech-alignments (0.500 s) or the result shown in #39 (0.48 s).

['▁THE', '▁GOOD', '▁NA', 'TURE', 'TURE', 'D', 'D', 'D', '▁A', '▁A', 'U', 'D', 'D', 'I', 'I', 'ENCE', '▁IN', '▁P', 'ITY', '▁TO', '▁FA', 'LL', 'LL', 'EN', '▁MA', 'J', 'J', 'EST', 'Y', 'Y', '▁SH', 'OW', 'ED', 'ED', '▁FOR', '▁ON', 'CE', 'CE', '▁GREAT', 'ER', '▁DE', 'F', 'F', 'ER', 'ER', 'ENCE', '▁TO', '▁THE', '▁K', 'ING', 'ING', '▁THAN', '▁TO', '▁THE', '▁MI', 'N', 'IST', 'ER', '▁AND', '▁SU', 'NG', 'NG', '▁THE', '▁P', '▁P', 'S', 'S', 'AL', 'AL', 'M', 'M', '▁WHICH', '▁THE', '▁FOR', 'M', 'ER', '▁HAD', '▁CA', 'LL', 'LL', 'ED', 'ED', '▁FOR']
[0.0, 0.64, 0.88, 1.08, 1.12, 1.16, 1.2, 1.24, 1.28, 1.32, 1.4000000000000001, 1.44, 1.48, 1.52, 1.56, 1.72, 2.0, 2.24, 2.44, 2.64, 2.88, 3.0, 3.04, 3.12, 3.3200000000000003, 3.44, 3.48, 3.6, 3.72, 3.7600000000000002, 4.68, 4.76, 4.84, 4.88, 5.04, 5.24, 5.44, 5.48, 5.76, 6.0, 6.16, 6.28, 6.32, 6.36, 6.4, 6.5200000000000005, 6.72, 6.88, 7.04, 7.16, 7.2, 7.88, 8.120000000000001, 8.28, 8.44, 8.52, 8.64, 8.76, 9.64, 9.92, 10.08, 10.120000000000001, 10.28, 10.48, 10.52, 10.56, 10.6, 10.64, 10.68, 10.72, 10.76, 11.120000000000001, 11.4, 11.56, 11.72, 11.84, 12.0, 12.24, 12.36, 12.4, 12.44, 12.48, 12.6]

Could you let me know how you determine the starting frame of the first word in the general case?

@danpovey
Collaborator

The alignment is never going to be exact in any end-to-end setup, especially one like a transformer that consumes unlimited left/right context.

@Jianjie-Shi

Hi guys,

I also ran into the same problem when doing alignment myself. Could I ask why the model from #17 used in #39 can determine the starting frame of the first word "the" accurately, e.g., 0.48 s compared with the ground truth 0.5 s, while the pretrained model from https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09 gives a worse result, e.g., 0 s compared with 0.5 s?

It seems that both the model in #17 and the pretrained model from https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09 are CTC models. What differences between these two models could lead to this result?

@danpovey
Collaborator

danpovey commented Jan 28, 2022 via email

@csukuangfj
Collaborator

Could I ask why the model from #17 used in #39 can determine the starting frame of the first word "the" accurately

Could you try the pre-trained model from the following repo? https://github.com/csukuangfj/icefall-asr-conformer-ctc-bpe-500

That model has a higher WER on test-clean than the one from https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09, i.e., 2.56 vs. 2.42.

They have basically the same model configuration, i.e., you can load the pre-trained model with the same code without modifications. I just tried it on the first utterance of test-clean using the master branch; the debug output is shown below:

(Pdb) p supervisions
{'text': ["NO I'VE MADE UP MY MIND ABOUT IT IF I'M MABEL I'LL STAY DOWN HERE"], 'sequence_idx': tensor([0], dtype=torch.int32), 'start_frame': tensor([0], dtype=torch.int32), 'num_frames': tensor([487], dtype=torch.int32), 'cut': [MonoCut(id='260-123440-0011-1193-0',
start=0, duration=4.87, channel=0, supervisions=[SupervisionSegment(id='260-123440-0011', recording_id='260-123440-0011', start=0.0, duration=4.87, channel=0, text="NO I'VE MADE UP MY MIND ABOUT IT IF I'M MABEL I'LL STAY DOWN HERE", language='English', speaker='260',
gender=None, custom=None, alignment=None)], features=Features(type='fbank', num_frames=487, num_features=80, frame_shift=0.01, sampling_rate=16000, start=0, duration=4.87, storage_type='lilcom_hdf5', storage_path='data/fbank/feats_test-clean/feats-5.h5', storage_key='575aacae-38c5-45ec-9db9-0e3085e490be', recording_id=None, channels=0), recording=Recording(id='260-123440-0011', sources=[AudioSource(type='file', channels=[0], source='data/LibriSpeech/test-clean/260/123440/260-123440-0011.flac')], sampling_rate=16000, num_samples=77920, duration=4.87, transforms=None), custom=None)]}
(Pdb) p labels_ali
[[0, 0, 0, 0, 0, 0, 0, 94, 0, 0, 0, 0, 0, 0, 19, 45, 45, 75, 0, 300, 0, 0, 0, 0, 176, 0, 0, 105, 0, 0, 0, 139, 0, 0, 68, 0, 0, 0, 0, 250, 0, 0, 0, 0, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 165, 0, 0, 0, 0, 19, 0, 45, 45, 17, 0, 0, 161, 0, 0, 41, 41, 131, 131, 0, 0, 0, 0, 0, 19, 0, 0, 45, 58, 58, 58, 0, 0, 0, 0, 277, 0, 0, 0, 16, 16, 0, 0, 294, 0, 0, 0, 0, 0, 0, 22, 0, 0, 26, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

You can see the first token does not start at the very beginning.
And you can compare with the timestamps from https://github.com/CorentinJ/librispeech-alignments; I list them below for easier reference.

260-123440-0011 ",NO,I'VE,MADE,UP,MY,MIND,ABOUT,IT,,IF,,I'M,MABEL,,I'LL,STAY,DOWN,HERE," "0.220,0.600,0.760,0.950,1.060,1.190,1.480,1.840,2.000,2.220,2.440,2.470,2.720,3.200,3.230,3.500,3.950,4.220,4.660,4.87"
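As a quick sanity check on the numbers above, the position of the first non-blank entry in labels_ali can be converted to seconds with the 0.04 s output frame shift discussed earlier:

```python
# The leading portion of labels_ali shown above: seven blanks, then token 94.
labels_ali = [0, 0, 0, 0, 0, 0, 0, 94]
first_frame = next(i for i, t in enumerate(labels_ali) if t != 0)
print(first_frame * 0.04)  # 0.28 s, close to the ~0.22 s of leading silence
                           # in the CorentinJ reference alignment above
```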


Dan's comment may explain why those two models produce different alignments.

Too-powerful models can give poor alignments as they transform the data too much. Often the best alignments are from GMM systems.

@TianyuCao
Author

TianyuCao commented Feb 9, 2022

Sorry to bother you again. I just wonder whether the pretrained model can be used to extract framewise alignment information for our own datasets now. I can see that in ali.py, only LibriSpeech datasets can be used to compute alignments. If I need to compute alignments for my own datasets, what steps should I take, e.g., to generate fbank features and manifests for my datasets?

```python
parser.add_argument(
    "--dataset",
    type=str,
    required=True,
    help="""The name of the dataset to compute alignments for.
    Possible values are:
    - test-clean.
    - test-other
    - train-clean-100
    - train-clean-360
    - train-other-500
    - dev-clean
    - dev-other
    """,
)
```

@csukuangfj
Collaborator

I just wonder whether the pretrained model can be used to extract framewise alignment information for our own datasets now.

You can try that and look at the resulting alignments. You will probably need to train your own model.


If I need to compute alignments for my own datasets, what steps should I take, e.g., to generate fbank features and manifests for my datasets?

Possible steps are:
(1) Prepare your data. Please see https://lhotse.readthedocs.io/en/latest/corpus.html#adding-new-corpora for more information. You can find various recipes for different datasets in https://github.com/lhotse-speech/lhotse/tree/master/lhotse/recipes

(2) Follow https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/prepare.sh to extract features for your dataset (a minimal sketch of steps (1) and (2) is shown after this list).

(3) Adapt https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/tdnn_lstm_ctc/asr_datamodule.py to your dataset.

(4) Train a model for your dataset. Please see https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/conformer_ctc/train.py

(5) Get alignments. Please see https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/conformer_ctc/ali.py
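To make steps (1) and (2) a bit more concrete, below is a minimal, hedged sketch of building lhotse manifests for a single custom utterance and computing 80-dim fbank features. The file paths, IDs, and transcript are placeholders; the recipes and prepare.sh linked above remain the authoritative reference.

```python
# A minimal sketch of steps (1) and (2): build lhotse manifests for your own
# data and compute 80-dim fbank features. Paths, IDs, and the transcript are
# illustrative placeholders.
from lhotse import (
    CutSet,
    Fbank,
    FbankConfig,
    Recording,
    RecordingSet,
    SupervisionSegment,
    SupervisionSet,
)

recording = Recording.from_file("my_data/utt1.wav", recording_id="utt1")
supervision = SupervisionSegment(
    id="utt1",
    recording_id="utt1",
    start=0.0,
    duration=recording.duration,
    channel=0,
    text="HELLO WORLD",
    language="English",
)

recordings = RecordingSet.from_recordings([recording])
supervisions = SupervisionSet.from_segments([supervision])
cuts = CutSet.from_manifests(recordings=recordings, supervisions=supervisions)

# Compute and store fbank features, then save the cuts manifest so it can be
# plugged into an asr_datamodule adapted from the LibriSpeech recipe.
cuts = cuts.compute_and_store_features(
    extractor=Fbank(FbankConfig(num_mel_bins=80)),
    storage_path="data/fbank/feats_my_data",
    num_jobs=1,
)
cuts.to_file("data/fbank/cuts_my_data.jsonl.gz")
```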
