Add force alignment for stateless transducer. #239
Conversation
> ## How to get framewise token alignment
>
> Assume that you already have a trained model. If not, you can either
This readme shows the usage of this PR.

## How to get framewise token alignment

Assume that you already have a trained model. If not, you can either …

Caution: If you are going to use your own trained model, remember …

The following shows how to get framewise token alignment using the above model:

```bash
git clone https://github.com/k2-fsa/icefall
cd icefall/egs/librispeech/ASR
mkdir tmp
sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/csukuangfj/icefall-asr-librispeech-transducer-stateless-multi-datasets-bpe-500-2022-03-01 ./tmp/
ln -s $PWD/tmp/exp/pretrained.pt $PWD/tmp/epoch-999.pt

./transducer_stateless/compute_ali.py \
  --exp-dir ./tmp/exp \
  --bpe-model ./tmp/data/lang_bpe_500/bpe.model \
  --epoch 999 \
  --avg 1 \
  --max-duration 100 \
  --dataset dev-clean \
  --out-dir data/ali
```

After running the above commands, you will find the following two files … You can find usage examples in …

## How to get word starting time from framewise token alignment

Assume you have run the above commands to get the framewise token alignment. Then run:

```bash
./transducer_stateless/test_compute_ali.py \
  --bpe-model ./tmp/data/lang_bpe_500/bpe.model \
  --ali-dir data/ali \
  --dataset dev-clean
```

Caution: Since the frame shift is 10ms and the subsampling factor …

Note: The script …

You will get the following output: …

For the row: …

You can compare the above word starting time with the one …

We reformat it below for readability: …
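As a rough illustration of the timing arithmetic involved here, the sketch below converts an encoder-frame index from the alignment to seconds. The 10 ms frame shift is stated in this PR; the subsampling factor of 4 and the function name are assumptions for illustration, not values taken from the PR's code.

```python
# Sketch: map an encoder output-frame index to a time in seconds.
# ASSUMPTIONS: 10 ms feature frame shift (stated in the PR) and an
# encoder subsampling factor of 4 (an assumption about the model setup).

FRAME_SHIFT_S = 0.01  # 10 ms per feature frame
SUBSAMPLING = 4       # one encoder output frame covers 4 feature frames

def frame_to_seconds(frame_index: int) -> float:
    """Time in seconds of a given encoder output frame."""
    return frame_index * SUBSAMPLING * FRAME_SHIFT_S

print(frame_to_seconds(25))  # encoder frame 25 -> 1.0 s
```

With these assumed constants, each encoder frame spans 40 ms, which is also why word starting times from the alignment are quantized to 40 ms steps.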
From #188
Some small comments
```python
# in the lattice.
@dataclass
class AlignItem:
    # log prob of this item originating from the start item
```
when you say "start item", do you mean the preceding item (i.e. is it the log-prob p(this item | preceding item), or is it some kind of total log-prob from the start node of the lattice to here?
I mean the total log prob from the starting node. An `AlignItem` can be considered the ending point of a path originating from the starting node, and its `log_prob` is the `tot_log_prob` of this path.
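To make the point concrete, here is a hypothetical sketch of the idea (the field names and the two-field layout are illustrative, not the exact code of this PR): each item records the *total* log prob accumulated from the start node, so extending a path adds the step's log prob to that running total.

```python
# Illustrative sketch, NOT the PR's actual class: an AlignItem is the
# end point of a partial path through the lattice, and log_prob is the
# total log probability from the start node, not a per-step transition
# probability.
from dataclasses import dataclass, field
from typing import List

@dataclass
class AlignItem:
    # Total log prob of the path from the start node to this item.
    log_prob: float
    # Tokens emitted along that path (here, len(ys) tracks the frame t).
    ys: List[int] = field(default_factory=list)

# Extending a path by one step adds that step's log prob to the total:
start = AlignItem(log_prob=0.0, ys=[])
step = AlignItem(log_prob=start.log_prob + (-1.5), ys=start.ys + [42])
print(step.log_prob)  # -1.5, i.e. the total along the extended path
```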
```python
# AlignItem is a node in the lattice, where its
# len(ys) equals to `t` and pos_u is the u coordinate
# in the lattice.
```
people may not know what the u co-ordinate refers to here (actually, I don't). Is it the word position?
In transducer training, the output shape is usually denoted as (N, T, U, V), instead of the (N, S, T, V) used in k2, I think. The notation `t` and `u` is from https://arxiv.org/pdf/1211.3711.pdf?ref=hackernoon.com. Since our modelling units are BPE tokens, `u` actually denotes the token position.
OK, please clarify all this in the docs.
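As a toy illustration of the (t, u) coordinates discussed above (this is explanatory code, not from the PR): `t` indexes encoder output frames and `u` indexes emitted tokens. With `--max-sym-per-frame` equal to 1 (the modified transducer this PR assumes), every frame emits exactly one symbol, either a blank (advance `t` only) or a token (advance both `t` and `u`), so each path through the lattice has exactly T steps.

```python
# Toy walk through the (t, u) grid under max-sym-per-frame = 1.
# Each entry of `emissions` is None for a blank, or a token id.

def walk(emissions):
    """Return the (t, u) lattice nodes visited by this emission sequence."""
    t, u, path = 0, 0, [(0, 0)]
    for tok in emissions:
        t += 1          # every frame advances t under modified transducer
        if tok is not None:
            u += 1      # emitting a real token also advances u
        path.append((t, u))
    return path

# 4 frames, with tokens emitted at frames 0 and 2:
print(walk([7, None, 9, None]))  # [(0, 0), (1, 1), (2, 1), (3, 2), (4, 2)]
```

The frame index `t` at which `u` increments is exactly what the framewise token alignment records.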
```python
    return ans


def get_word_starting_frame(
```
Rename to `get_word_starting_frames`.
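A hedged sketch of what such a helper might do (the function name follows the suggested rename, but the signature, the blank symbol, and the input format are assumptions, not the PR's actual code): given one BPE token string per encoder frame, a word starts wherever a token carries the BPE word-boundary prefix `▁`.

```python
# Sketch only: assumes the alignment is a list with one token string per
# encoder frame, using "<blk>" for blank frames (both are assumptions).
# In BPE, a word-initial piece starts with "\u2581" (the ▁ symbol
# mentioned in this PR).

def get_word_starting_frames(ali):
    """Return the encoder-frame indices where a new word begins."""
    return [t for t, tok in enumerate(ali) if tok.startswith("\u2581")]

ali = ["<blk>", "\u2581HE", "LLO", "<blk>", "\u2581WOR", "LD", "<blk>"]
print(get_word_starting_frames(ali))  # [1, 4]
```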
NOTE: This PR assumes that `--max-sym-per-frame` is 1. Will create another PR for `--max-sym-per-frame` > 1.

Do you think it would help if (optional) silences are re-introduced? Or something like they did here with Wav2Vec2, where there is an extra …

For many purposes we don't need accurate alignments; just having some alignment is usually enough.
We assume that `--max-sym-per-frame` is 1, i.e., the model is trained with the modified transducer. The alignment results seem reasonable compared with the ones from https://github.com/CorentinJ/librispeech-alignments.

The following compares the word alignments for one utterance from dev-test with the ones from https://github.com/CorentinJ/librispeech-alignments. (Note: The pre-trained model is from #213.)
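To quantify a comparison like the one above, one could report the mean absolute difference between the word starting times from this PR and those from a reference alignment such as librispeech-alignments. This is an illustrative sketch; the numbers below are made up, not measured results.

```python
# Sketch: mean absolute difference between two lists of word starting
# times in seconds.  Values below are fabricated for illustration only.

def mean_abs_diff(ours, ref):
    assert len(ours) == len(ref), "alignments must cover the same words"
    return sum(abs(a - b) for a, b in zip(ours, ref)) / len(ours)

ours = [0.20, 0.84, 1.32]  # hypothetical starts from this PR
ref  = [0.18, 0.80, 1.40]  # hypothetical reference starts
print(round(mean_abs_diff(ours, ref), 3))  # 0.047
```

Given the 40 ms frame granularity of the alignment, differences well under one encoder frame would already be about as good as one can expect.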
TODOs:

I am going to save only the framewise alignment for tokens. Since each word begins with the underscore-like symbol `▁` (UTF-8 bytes `\xe2\x96\x81`), it is straightforward to get the beginning of a word from the framewise alignment. As there is no silence token in the BPE model, the starting time of a word is not that accurate.