Extract framewise alignment information by the pretrained model #188

Open
TianyuCao opened this issue Jan 23, 2022 · 10 comments

Comments

@TianyuCao

Hi,

I am new to Icefall. I would like to extract framewise alignment information like what is shown in #39 with the pretrained model from https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09. I tried to follow the README at egs/librispeech/ASR/conformer_ctc/README.md. However, when I tried to run egs/librispeech/ASR/conformer_ctc/ali.py with "./conformer_ctc/ali.py --exp-dir ./conformer_ctc/exp --lang-dir ./data/lang_bpe_500 --epoch 20 --avg 10 --max-duration 300 --dataset train-clean-100 --out-dir data/ali", I found there are no checkpoint files (e.g., in conformer_ctc/exp) uploaded for the pretrained model to average.

I wonder whether I missed something, and where I can find an example of extracting framewise alignment information with the pretrained model to get results similar to those shown in #39. Many thanks for your help in advance!

@csukuangfj
Collaborator

csukuangfj commented Jan 23, 2022

Could you first follow the README.md in https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09 to download the pre-trained model?

The pre-trained model is called pretrained.pt. You can create a symlink to it at conformer_ctc/exp/epoch-999.pt
and use --epoch 999 --avg 1 when invoking conformer_ctc/ali.py.
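For concreteness, here is a minimal sketch of that symlink step in Python; the location of the cloned Hugging Face repo and of pretrained.pt inside it are assumptions, so adjust the paths to your setup.

```python
# A minimal sketch, assuming the Hugging Face repo was cloned into the
# current directory and contains exp/pretrained.pt (adjust paths as needed).
from pathlib import Path

pretrained = Path(
    "icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09/exp/pretrained.pt"
)
exp_dir = Path("conformer_ctc/exp")
exp_dir.mkdir(parents=True, exist_ok=True)

link = exp_dir / "epoch-999.pt"
if not link.exists():
    link.symlink_to(pretrained.resolve())

# After this, invoke conformer_ctc/ali.py with --epoch 999 --avg 1.
```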

@TianyuCao
Author

TianyuCao commented Jan 26, 2022

Thank you for your clarifications! I can now obtain three files, aux_labels_test-clean.h5, labels_test-clean.h5 and cuts_test-clean, by using --epoch 999 --avg 1 when invoking conformer_ctc/ali.py. However, when trying to read the data from aux_labels_test-clean.h5 for the test audio in #39, e.g., librispeech/LibriSpeech/test-clean/8224/274384/8224-274384-0008.flac, I just get something like this:

[4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 286, 0, 0, 0, 0, 0, 298, 0, 0, 0, 0, 276, 0, 12, 0, 0, 5, 0, 0, 28, 12, 0, 27, 0, 0, 0, 0, 209, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 59, 0, 0, 0, 0, 210, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 134, 0, 0, 58, 0, 0, 72, 0, 0, 0, 0, 161, 0, 0, 340, 0, 0, 0, 207, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 187, 0, 150, 0, 8, 0, 0, 0, 0, 42, 0, 0, 0, 0, 74, 0, 0, 0, 0, 66, 0, 0, 0, 0, 0, 0, 0, 263, 0, 0, 0, 0, 0, 29, 0, 0, 0, 78, 0, 0, 38, 0, 29, 0, 0, 0, 209, 0, 0, 0, 0, 10, 0, 0, 0, 4, 0, 0, 0, 167, 0, 0, 14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 236, 0, 0, 0, 0, 0, 10, 0, 0, 0, 4, 0, 0, 0, 139, 0, 13, 0, 0, 275, 0, 0, 29, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 137, 0, 0, 0, 92, 0, 0, 0, 0, 4, 0, 0, 0, 0, 59, 0, 3, 0, 48, 0, 17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 110, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 42, 0, 0, 0, 17, 0, 0, 29, 0, 0, 0, 62, 0, 0, 0, 0, 0, 127, 0, 0, 58, 0, 8, 0, 0, 0, 42, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

After checking with tokens.txt, I recovered the transcript. I am just wondering what the time interval between two consecutive elements of this list is, so that I can compute the time of each word in the transcript. Many thanks for your help in advance!
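For reference, a rough sketch of how such an HDF5 file can be inspected with h5py follows; the file path and the assumption that each cut's alignment is stored as its own integer dataset are illustrative, not taken from ali.py.

```python
# A minimal sketch, assuming aux_labels_test-clean.h5 contains one integer
# array of framewise token IDs per cut (inspect f.keys() to see how ali.py
# actually keyed the datasets; 0 is the blank token ID).
import h5py

with h5py.File("data/ali/aux_labels_test-clean.h5", "r") as f:
    for key in list(f.keys())[:3]:   # look at the first few entries
        token_ids = f[key][()]
        print(key, token_ids[:30])
```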

@csukuangfj
Collaborator

However, when trying to read the data from aux_labels_test-clean.h5 for the test audio in #39, e.g., librispeech/LibriSpeech/test-clean/8224/274384/8224-274384-0008.flac, I just get something like this:

[4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 286, 0, 0, 0, 0, 0, 298, 0, 0, 0, 0, 276, 0, 12, 0, 0, 5, 0, 0, 28, 12, 0, 27, 0, 0, 0, 0, 209, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 59, 0, 0, 0, 0, 210, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 134, 0,

These numbers are the corresponding token IDs for the output frames. To get the results of #39, you have to do a few extra things.

(1) Note that the subsampling factor of the model is 4, so output frames 0, 1, 2 correspond to input frames 0, 4, 8.
You have to use interpolation to get the alignments for input frames 1, 2, 3, 5, 6, 7, etc.

(2) The default frame shift is 10 ms, so you can convert an input frame index (after the interpolation in (1)) to time in seconds by multiplying by 0.01.

(3) You have to use tokens.txt to map those integer token IDs to the corresponding symbols.

I am just wondering what the time interval between two consecutive elements of this list is, so that I can compute the time of each word in the transcript

The time slot between two consecutive output frames is 0.04 s.
As we are using wordpieces, and the first wordpiece of a word starts with an underscore ▁, you can use this information to find the starting frame of a word. Unfortunately, it is not easy to find the ending frame of a word.
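Putting points (1)-(3) and the 0.04 s output frame shift together, here is a minimal sketch of mapping a framewise token-ID list to approximate word start times. The tokens.txt path, the helper names, and the assumption that tokens.txt contains "symbol id" pairs with 0 as the blank are illustrative, not part of ali.py.

```python
# A minimal sketch, assuming tokens.txt has lines of the form "<symbol> <id>"
# with ID 0 being the blank, and that each output frame covers
# SUBSAMPLING_FACTOR * FRAME_SHIFT = 0.04 s of audio.
FRAME_SHIFT = 0.01       # seconds per input frame (10 ms)
SUBSAMPLING_FACTOR = 4   # output frame i starts at input frame 4 * i


def load_tokens(tokens_txt: str) -> dict:
    id2sym = {}
    with open(tokens_txt, encoding="utf-8") as f:
        for line in f:
            sym, idx = line.split()
            id2sym[int(idx)] = sym
    return id2sym


def word_start_times(token_ids, id2sym):
    """Return (wordpiece, start_time_in_seconds) for wordpieces that begin
    with "▁", i.e. the first piece of each word. Consecutive repeats of the
    same non-blank ID are treated as a single emission."""
    starts = []
    prev_tid = 0
    for frame, tid in enumerate(token_ids):
        if tid == 0:          # blank
            prev_tid = 0
            continue
        sym = id2sym[tid]
        if sym.startswith("▁") and tid != prev_tid:
            starts.append((sym, frame * SUBSAMPLING_FACTOR * FRAME_SHIFT))
        prev_tid = tid
    return starts


if __name__ == "__main__":
    id2sym = load_tokens("data/lang_bpe_500/tokens.txt")
    # A shortened prefix of the alignment above, just for illustration:
    ali = [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 286, 0, 0]
    print(word_start_times(ali, id2sym))
```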

@TianyuCao
Author

Thank you very much for your detailed explanations. I have obtained almost the same results as #39, except that for the first wordpiece ▁THE (token ID 4), using 0 * 0.04 = 0 s means the first word "THE" starts immediately in this audio, which does not match the alignment information from https://github.com/CorentinJ/librispeech-alignments (0.500 s) or the result shown in #39 (0.48 s).

['▁THE', '▁GOOD', '▁NA', 'TURE', 'TURE', 'D', 'D', 'D', '▁A', '▁A', 'U', 'D', 'D', 'I', 'I', 'ENCE', '▁IN', '▁P', 'ITY', '▁TO', '▁FA', 'LL', 'LL', 'EN', '▁MA', 'J', 'J', 'EST', 'Y', 'Y', '▁SH', 'OW', 'ED', 'ED', '▁FOR', '▁ON', 'CE', 'CE', '▁GREAT', 'ER', '▁DE', 'F', 'F', 'ER', 'ER', 'ENCE', '▁TO', '▁THE', '▁K', 'ING', 'ING', '▁THAN', '▁TO', '▁THE', '▁MI', 'N', 'IST', 'ER', '▁AND', '▁SU', 'NG', 'NG', '▁THE', '▁P', '▁P', 'S', 'S', 'AL', 'AL', 'M', 'M', '▁WHICH', '▁THE', '▁FOR', 'M', 'ER', '▁HAD', '▁CA', 'LL', 'LL', 'ED', 'ED', '▁FOR']
[0.0, 0.64, 0.88, 1.08, 1.12, 1.16, 1.2, 1.24, 1.28, 1.32, 1.4000000000000001, 1.44, 1.48, 1.52, 1.56, 1.72, 2.0, 2.24, 2.44, 2.64, 2.88, 3.0, 3.04, 3.12, 3.3200000000000003, 3.44, 3.48, 3.6, 3.72, 3.7600000000000002, 4.68, 4.76, 4.84, 4.88, 5.04, 5.24, 5.44, 5.48, 5.76, 6.0, 6.16, 6.28, 6.32, 6.36, 6.4, 6.5200000000000005, 6.72, 6.88, 7.04, 7.16, 7.2, 7.88, 8.120000000000001, 8.28, 8.44, 8.52, 8.64, 8.76, 9.64, 9.92, 10.08, 10.120000000000001, 10.28, 10.48, 10.52, 10.56, 10.6, 10.64, 10.68, 10.72, 10.76, 11.120000000000001, 11.4, 11.56, 11.72, 11.84, 12.0, 12.24, 12.36, 12.4, 12.44, 12.48, 12.6]

Could you let me know how you determine the starting frame of the first word in the general case?

@danpovey
Collaborator

The alignment is never going to be exact in any end-to-end setup, especially one like a transformer that consumes unlimited left/right context.

@Jianjie-Shi

Hi guys,

I also ran into the same problem when doing alignment myself. Could I ask why the model from #17 used in #39 can determine the starting frame of the first word "the" accurately, e.g., 0.48 s compared with the ground truth 0.5 s, while the pretrained model from https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09 gives a worse result, e.g., 0 s compared with 0.5 s?

It seems that both the model in #17 and the pretrained model from https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09 are CTC models. What differences between these two models could lead to this result?

@danpovey
Collaborator

danpovey commented Jan 28, 2022 via email

@csukuangfj
Collaborator

Could I ask why the model from #17 used in #39 can determine the starting frame of the first word "the" accurately

Could you try the pre-trained model from the following repo? https://github.com/csukuangfj/icefall-asr-conformer-ctc-bpe-500

That model has a higher WER on test-clean than the one from https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09, i.e., 2.56 vs. 2.42.

They have basically the same model configuration, i.e., you can load the pre-trained model with the same code without modifications. I just tried it on the first utterance of test-clean using the master branch; the debug output is shown below:

(Pdb) p supervisions
{'text': ["NO I'VE MADE UP MY MIND ABOUT IT IF I'M MABEL I'LL STAY DOWN HERE"], 'sequence_idx': tensor([0], dtype=torch.int32), 'start_frame': tensor([0], dtype=torch.int32), 'num_frames': tensor([487], dtype=torch.int32), 'cut': [MonoCut(id='260-123440-0011-1193-0',
start=0, duration=4.87, channel=0, supervisions=[SupervisionSegment(id='260-123440-0011', recording_id='260-123440-0011', start=0.0, duration=4.87, channel=0, text="NO I'VE MADE UP MY MIND ABOUT IT IF I'M MABEL I'LL STAY DOWN HERE", language='English', speaker='260',
gender=None, custom=None, alignment=None)], features=Features(type='fbank', num_frames=487, num_features=80, frame_shift=0.01, sampling_rate=16000, start=0, duration=4.87, storage_type='lilcom_hdf5', storage_path='data/fbank/feats_test-clean/feats-5.h5', storage_key='575aacae-38c5-45ec-9db9-0e3085e490be', recording_id=None, channels=0), recording=Recording(id='260-123440-0011', sources=[AudioSource(type='file', channels=[0], source='data/LibriSpeech/test-clean/260/123440/260-123440-0011.flac')], sampling_rate=16000, num_samples=77920, duration=4.87, transforms=None), custom=None)]}
(Pdb) p labels_ali
[[0, 0, 0, 0, 0, 0, 0, 94, 0, 0, 0, 0, 0, 0, 19, 45, 45, 75, 0, 300, 0, 0, 0, 0, 176, 0, 0, 105, 0, 0, 0, 139, 0, 0, 68, 0, 0, 0, 0, 250, 0, 0, 0, 0, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 165, 0, 0, 0, 0, 19, 0, 45, 45, 17, 0, 0, 161, 0, 0, 41, 41, 131, 131, 0, 0, 0, 0, 0, 19, 0, 0, 45, 58, 58, 58, 0, 0, 0, 0, 277, 0, 0, 0, 16, 16, 0, 0, 294, 0, 0, 0, 0, 0, 0, 22, 0, 0, 26, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

You can see the first token does not start at the very beginning.
And you can compare with the timestamps from https://github.com/CorentinJ/librispeech-alignments; I list them below for easier reference.

260-123440-0011 ",NO,I'VE,MADE,UP,MY,MIND,ABOUT,IT,,IF,,I'M,MABEL,,I'LL,STAY,DOWN,HERE," "0.220,0.600,0.760,0.950,1.060,1.190,1.480,1.840,2.000,2.220,2.440,2.470,2.720,3.200,3.230,3.500,3.950,4.220,4.660,4.87"
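As a quick sanity check on the numbers above, the position of the first non-blank entry in labels_ali can be converted to seconds with the 0.04 s output frame shift discussed earlier:

```python
# The leading portion of labels_ali shown above: seven blanks, then token 94.
labels_ali = [0, 0, 0, 0, 0, 0, 0, 94]
first_frame = next(i for i, t in enumerate(labels_ali) if t != 0)
print(first_frame * 0.04)  # 0.28 s, close to the ~0.22 s of leading silence
                           # in the CorentinJ reference alignment above
```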


Dan's comment may explain why those two models produce different alignments.

Too-powerful models can give poor alignments as they transform the data too much. Often the best alignments are from GMM systems.

@TianyuCao
Author

TianyuCao commented Feb 9, 2022

Sorry to bother you again. I just wonder whether the pretrained model can be used to extract framewise alignment information for our own datasets now. I can see that in ali.py, only LibriSpeech datasets can be used to compute alignments. If I need to compute alignments for my own datasets, what steps should I take, e.g., to generate fbank features and manifests for my datasets?

```python
parser.add_argument(
    "--dataset",
    type=str,
    required=True,
    help="""The name of the dataset to compute alignments for.
    Possible values are:
    - test-clean.
    - test-other
    - train-clean-100
    - train-clean-360
    - train-other-500
    - dev-clean
    - dev-other
    """,
)
```

@csukuangfj
Collaborator

I just wonder whether the pretrained model can be used to extract framewise alignment information for our own datasets now.

You can try that and look at the resulting alignments. You will probably need to train your own model.


If I need to compute alignments for my own datasets, what steps should I take, e.g., to generate fbank features and manifests for my datasets?

Possible steps are:
(1) Prepare your data. Please see https://lhotse.readthedocs.io/en/latest/corpus.html#adding-new-corpora for more information. You can find various recipes for different datasets in https://github.com/lhotse-speech/lhotse/tree/master/lhotse/recipes

(2) Follow https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/prepare.sh to extract features for your dataset (a minimal sketch of steps (1) and (2) is shown after this list).

(3) Adapt https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/tdnn_lstm_ctc/asr_datamodule.py to your dataset.

(4) Train a model for your dataset. Please see https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/conformer_ctc/train.py

(5) Get alignments. Please see https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/conformer_ctc/ali.py
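To make steps (1) and (2) a bit more concrete, below is a minimal, hedged sketch of building lhotse manifests for a single custom utterance and computing 80-dim fbank features. The file paths, IDs, and transcript are placeholders; the recipes and prepare.sh linked above remain the authoritative reference.

```python
# A minimal sketch of steps (1) and (2): build lhotse manifests for your own
# data and compute 80-dim fbank features. Paths, IDs, and the transcript are
# illustrative placeholders.
from lhotse import (
    CutSet,
    Fbank,
    FbankConfig,
    Recording,
    RecordingSet,
    SupervisionSegment,
    SupervisionSet,
)

recording = Recording.from_file("my_data/utt1.wav", recording_id="utt1")
supervision = SupervisionSegment(
    id="utt1",
    recording_id="utt1",
    start=0.0,
    duration=recording.duration,
    channel=0,
    text="HELLO WORLD",
    language="English",
)

recordings = RecordingSet.from_recordings([recording])
supervisions = SupervisionSet.from_segments([supervision])
cuts = CutSet.from_manifests(recordings=recordings, supervisions=supervisions)

# Compute and store fbank features, then save the cuts manifest so it can be
# plugged into an asr_datamodule adapted from the LibriSpeech recipe.
cuts = cuts.compute_and_store_features(
    extractor=Fbank(FbankConfig(num_mel_bins=80)),
    storage_path="data/fbank/feats_my_data",
    num_jobs=1,
)
cuts.to_file("data/fbank/cuts_my_data.jsonl.gz")
```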
