Support batch-wise forced-alignment #970

Merged: 6 commits, Mar 28, 2023

Conversation

yaozengwei
Collaborator

This PR adds support for computing forced alignments with transducer models. It is adapted from the function modified_beam_search in beam_search.py. We assume that the maximum number of symbols per frame is 1. The best hypothesis among the k beams is used as the result. We obtain both word-level and token-level alignments.

It is based on #239, but it supports batch-wise computation.
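The sketch below illustrates the idea of turning the frame index at which each token is emitted during the beam search into a timestamp. It is a simplified illustration, not the actual implementation; the subsampling factor of 4 and the 0.01 s frame shift are assumptions (the frame shift matches the example manifest below, while the subsampling factor depends on the encoder).

```python
# Minimal sketch (not the icefall implementation): convert the encoder-output
# frame index at which each token was emitted into a time in seconds.

subsampling_factor = 4  # assumed encoder subsampling factor
frame_shift = 0.01      # assumed feature frame shift in seconds


def frames_to_seconds(frame_indices):
    """Map encoder-output frame indices to times in seconds."""
    return [idx * subsampling_factor * frame_shift for idx in frame_indices]


# Example: tokens and the encoder frames at which they were emitted.
tokens = ["_HE", "_HAD", "_CA", "LL", "ED"]
frames = [29, 37, 41, 45, 48]
for tok, t in zip(tokens, frames_to_seconds(frames)):
    print(f"{tok}\t{t:.2f}s")

# Word-level times can then be taken as the time of the first token of each
# word, i.e. the tokens carrying the "_" word-boundary marker in a BPE vocabulary.
```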

@yaozengwei
Collaborator Author

yaozengwei commented Mar 28, 2023

After running the script compute_ali.py, the token-level and word-level alignments will be saved to a new cuts manifest. We can access them via cut.supervisions[0].alignment["token"] and cut.supervisions[0].alignment["word"], respectively.

For example:

{"id": "2300-131720-0022-1106-0", "start": 0, "duration": 13.665, "channel": 0, "supervisions": [{"id": "2300-131720-0022", "recording_id": "2300-131720-0022", "start": 0.0, "duration": 13.665, "channel": 0, "text": "MEANWHILE HE HAD CALLED UPON ME TO MAKE A REPORT OF THE THREE WIRE SYSTEM KNOWN IN ENGLAND AS THE HOPKINSON BOTH DOCTOR JOHN HOPKINSON AND MISTER EDISON BEING INDEPENDENT INVENTORS AT PRACTICALLY THE SAME TIME", "language": "English", "speaker": "2300", "alignment": {"word": [["MEANWHILE", 0.12, null, null], ["HE", 1.16, null, null], ["HAD", 1.48, null, null], ["CALLED", 1.64, null, null], ["UPON", 2.04, null, null], ["ME", 2.32, null, null], ["TO", 2.48, null, null], ["MAKE", 2.68, null, null], ["A", 2.84, null, null], ["REPORT", 2.96, null, null], ["OF", 3.4, null, null], ["THE", 3.52, null, null], ["THREE", 3.64, null, null], ["WIRE", 3.88, null, null], ["SYSTEM", 4.2, null, null], ["KNOWN", 4.68, null, null], ["IN", 4.96, null, null], ["ENGLAND", 5.12, null, null], ["AS", 5.6, null, null], ["THE", 5.84, null, null], ["HOPKINSON", 6.0, null, null], ["BOTH", 6.96, null, null], ["DOCTOR", 7.36, null, null], ["JOHN", 7.76, null, null], ["HOPKINSON", 8.08, null, null], ["AND", 9.0, null, null], ["MISTER", 9.28, null, null], ["EDISON", 9.6, null, null], ["BEING", 10.2, null, null], ["INDEPENDENT", 10.44, null, null], ["INVENTORS", 11.08, null, null], ["AT", 11.68, null, null], ["PRACTICALLY", 11.84, null, null], ["THE", 12.4, null, null], ["SAME", 13.52, null, null], ["TIME", 13.56, null, null]], "token": [["_MEAN", 0.12, null, null], ["W", 0.36, null, null], ["HI", 0.4, null, null], ["LE", 0.44, null, null], ["_HE", 1.16, null, null], ["_HAD", 1.48, null, null], ["_CA", 1.64, null, null], ["LL", 1.8, null, null], ["ED", 1.92, null, null], ["_UPON", 2.04, null, null], ["_ME", 2.32, null, null], ["_TO", 2.48, null, null], ["_MAKE", 2.68, null, null], ["_A", 2.84, null, null], ["_RE", 2.96, null, null], ["PORT", 3.12, null, null], ["_OF", 3.4, null, null], ["_THE", 3.52, null, null], ["_THREE", 3.64, null, null], ["_WI", 3.88, null, null], ["RE", 4.04, null, null], ["_S", 4.2, null, null], ["Y", 4.24, null, null], ["S", 4.36, null, null], ["TE", 4.44, null, null], ["M", 4.52, null, null], ["_KNOW", 4.68, null, null], ["N", 4.84, null, null], ["_IN", 4.96, null, null], ["_E", 5.12, null, null], ["NG", 5.2, null, null], ["LA", 5.32, null, null], ["ND", 5.44, null, null], ["_AS", 5.6, null, null], ["_THE", 5.84, null, null], ["_HO", 6.0, null, null], ["P", 6.16, null, null], ["K", 6.28, null, null], ["IN", 6.36, null, null], ["S", 6.48, null, null], ["ON", 6.64, null, null], ["_BO", 6.96, null, null], ["TH", 7.2, null, null], ["_DO", 7.36, null, null], ["C", 7.52, null, null], ["T", 7.6, null, null], ["OR", 7.64, null, null], ["_JO", 7.76, null, null], ["H", 7.92, null, null], ["N", 7.96, null, null], ["_HO", 8.08, null, null], ["P", 8.24, null, null], ["K", 8.32, null, null], ["IN", 8.4, null, null], ["S", 8.52, null, null], ["ON", 8.68, null, null], ["_AND", 9.0, null, null], ["_MISTER", 9.28, null, null], ["_", 9.6, null, null], ["ED", 9.64, null, null], ["IS", 9.8, null, null], ["ON", 10.0, null, null], ["_BE", 10.2, null, null], ["ING", 10.32, null, null], ["_IN", 10.44, null, null], ["DE", 10.6, null, null], ["PE", 10.72, null, null], ["ND", 10.84, null, null], ["ENT", 10.88, null, null], ["_IN", 11.08, null, null], ["V", 11.2, null, null], ["ENT", 11.28, null, null], ["OR", 11.44, null, null], ["S", 11.56, null, null], ["_AT", 11.68, null, null], ["_P", 11.84, null, null], 
["RA", 11.88, null, null], ["C", 11.96, null, null], ["T", 12.08, null, null], ["IC", 12.12, null, null], ["AL", 12.24, null, null], ["LY", 12.28, null, null], ["_THE", 12.4, null, null], ["_SAME", 13.52, null, null], ["_TIME", 13.56, null, null]]}}], "features": {"type": "kaldi-fbank", "num_frames": 1367, "num_features": 80, "frame_shift": 0.01, "sampling_rate": 16000, "start": 0, "duration": 13.665, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/librispeech_feats_test-clean/feats-0.lca", "storage_key": "3011415,46294,45937,34209", "channels": 0}, "recording": {"id": "2300-131720-0022", "sources": [{"type": "file", "channels": [0], "source": "/root/fangjun/open-source-2/icefall-jsonl/egs/librispeech/ASR/download/LibriSpeech/test-clean/2300/131720/2300-131720-0022.flac"}], "sampling_rate": 16000, "num_samples": 218640, "duration": 13.665, "channel_ids": [0]}, "type": "MonoCut"}

In test_compute_ali.py, we compare the computed word-level alignments to reference alignments generated with torchaudio (see add_alignments.sh and https://pytorch.org/audio/stable/tutorials/forced_alignment_tutorial.html). It reports the mean and standard deviation of the absolute differences. Taking the test-clean subset as an example, it prints:

  • For the word-level alignments abs difference on dataset test-clean, mean: 0.13s, std: 0.13s
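The comparison amounts to taking the absolute difference of the word start times and summarizing it. A small illustrative sketch (with made-up numbers, not the actual test code):

```python
import numpy as np

# Word start times from compute_ali.py and from the torchaudio-based reference
# alignments (illustrative values only).
computed_starts = np.array([0.12, 1.16, 1.48, 1.64, 2.04])
reference_starts = np.array([0.10, 1.18, 1.50, 1.60, 2.00])

diff = np.abs(computed_starts - reference_starts)
print(f"mean: {diff.mean():.2f}s, std: {diff.std():.2f}s")
```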

@yaozengwei yaozengwei merged commit bcc5923 into k2-fsa:master Mar 28, 2023