# Add force alignment for stateless transducer. #239

**Merged** · 7 commits · Mar 12, 2022

## Conversation

@csukuangfj (Collaborator) commented Mar 6, 2022:

We assume that --max-sym-per-frame is 1, i.e., the model is trained with modified transducer.

The alignment results seem reasonable compared with the ones from https://github.com/CorentinJ/librispeech-alignments

#213

The following compares the word alignments for one utterance from dev-clean with the ones from https://github.com/CorentinJ/librispeech-alignments
(Note: The pre-trained model is from #213)

[Screenshot: word alignment comparison for one utterance]

TODOs:

  • Add more documentation to the implementation
  • Save alignments to file (using lhotse's TemporalArray)

I am going to save only the framewise alignment for tokens. Since each word begins with the word-boundary symbol ▁ (UTF-8 \xe2\x96\x81), it is straightforward to get the beginning of a word from the framewise alignment. As there is no silence token in the BPE model, the starting time of a word is not that accurate.
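For illustration, here is a minimal sketch of that idea (the helper name, the blank ID 0, the 10 ms frame shift, and the 4x subsampling factor are assumptions for illustration, not the PR's exact code):

```python
import sentencepiece as spm

def word_starting_times(ali, sp, frame_shift=0.01, subsampling=4):
    """`ali` is a framewise token alignment: one token ID per frame."""
    times = []
    for frame, token in enumerate(ali):
        if token == 0:  # skip blank frames
            continue
        # Word-initial BPE pieces start with the boundary symbol ▁ (U+2581).
        if sp.id_to_piece(int(token)).startswith("▁"):
            times.append(frame * frame_shift * subsampling)
    return times

sp = spm.SentencePieceProcessor()
sp.load("./tmp/data/lang_bpe_500/bpe.model")
```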

@csukuangfj changed the title from "WIP: Add force alignment for stateless transducer." to "Add force alignment for stateless transducer." on Mar 7, 2022

> ## How to get framewise token alignment
>
> Assume that you already have a trained model. If not, you can either
@csukuangfj (Collaborator, Author) commented:

This README shows the usage of this PR.

@csukuangfj (Collaborator, Author) commented:

The README at https://github.com/k2-fsa/icefall/blob/2aca0d536c1e530b78f29afde9fe3e7fc22f685e/egs/librispeech/ASR/transducer_stateless/README.md
shows the usage of this PR.

## How to get framewise token alignment

Assume that you already have a trained model. If not, you can either
train one by yourself or download a pre-trained model from hugging face:
https://huggingface.co/csukuangfj/icefall-asr-librispeech-transducer-stateless-multi-datasets-bpe-500-2022-03-01

Caution: If you are going to use your own trained model, remember
to set --modified-transducer-prob to a nonzero value since the
force alignment code assumes that --max-sym-per-frame is 1.

The following shows how to get framewise token alignment using the above
pre-trained model.

```bash
git clone https://github.com/k2-fsa/icefall
cd icefall/egs/librispeech/ASR
mkdir tmp
sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/csukuangfj/icefall-asr-librispeech-transducer-stateless-multi-datasets-bpe-500-2022-03-01 ./tmp/

ln -s $PWD/tmp/exp/pretrained.pt $PWD/tmp/exp/epoch-999.pt

./transducer_stateless/compute_ali.py \
        --exp-dir ./tmp/exp \
        --bpe-model ./tmp/data/lang_bpe_500/bpe.model \
        --epoch 999 \
        --avg 1 \
        --max-duration 100 \
        --dataset dev-clean \
        --out-dir data/ali
```

After running the above commands, you will find the following two files
in the folder ./data/ali:

```
-rw-r--r-- 1 xxx xxx 412K Mar  7 15:45 cuts_dev-clean.json.gz
-rw-r--r-- 1 xxx xxx 2.9M Mar  7 15:45 token_ali_dev-clean.h5
```

You can find usage examples in ./test_compute_ali.py about
extracting framewise token alignment information from the above
two files.
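As a rough illustration of what such usage might look like (the custom field name token_alignment is an assumption; test_compute_ali.py has the authoritative version):

```python
from lhotse import load_manifest

# Load the cuts manifest produced by compute_ali.py. Each cut is
# assumed to carry the framewise token alignment as a lhotse custom
# TemporalArray field (here called `token_alignment` -- an assumed
# name; see compute_ali.py for the real one).
cuts = load_manifest("data/ali/cuts_dev-clean.json.gz")

for cut in cuts:
    # Returns a 1-D numpy array with one token ID per output frame.
    token_ali = cut.load_custom("token_alignment")
    print(cut.id, token_ali[:10])
    break
```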

## How to get word starting time from framewise token alignment

Assume you have run the above commands to get framewise token alignment
using a pre-trained model from tmp/exp/epoch-999.pt. You can use the following
commands to obtain word starting time.

```bash
./transducer_stateless/test_compute_ali.py \
        --bpe-model ./tmp/data/lang_bpe_500/bpe.model \
        --ali-dir data/ali \
        --dataset dev-clean
```

Caution: Since the frame shift is 10 ms and the subsampling factor
of the model is 4, the time resolution is 0.04 seconds.

Note: The script test_compute_ali.py is for illustration only;
it processes only one batch and then exits.

You will get the following output:

```
5694-64029-0022-1998-0
[('THE', '0.20'), ('LEADEN', '0.36'), ('HAIL', '0.72'), ('STORM', '1.00'), ('SWEPT', '1.48'), ('THEM', '1.88'), ('OFF', '2.00'), ('THE', '2.24'), ('FIELD', '2.36'), ('THEY', '3.20'), ('FELL', '3.36'), ('BACK', '3.64'), ('AND', '3.92'), ('RE', '4.04'), ('FORMED', '4.20')]

3081-166546-0040-308-0
[('IN', '0.32'), ('OLDEN', '0.60'), ('DAYS', '1.00'), ('THEY', '1.40'), ('WOULD', '1.56'), ('HAVE', '1.76'), ('SAID', '1.92'), ('STRUCK', '2.60'), ('BY', '3.16'), ('A', '3.36'), ('BOLT', '3.44'), ('FROM', '3.84'), ('HEAVEN', '4.04')]

2035-147960-0016-1283-0
[('A', '0.44'), ('SNAKE', '0.52'), ('OF', '0.84'), ('HIS', '0.96'), ('SIZE', '1.12'), ('IN', '1.60'), ('FIGHTING', '1.72'), ('TRIM', '2.12'), ('WOULD', '2.56'), ('BE', '2.76'), ('MORE', '2.88'), ('THAN', '3.08'), ('ANY', '3.28'), ('BOY', '3.56'), ('COULD', '3.88'), ('HANDLE', '4.04')]

2428-83699-0020-1734-0
[('WHEN', '0.28'), ('THE', '0.48'), ('TRAP', '0.60'), ('DID', '0.88'), ('APPEAR', '1.08'), ('IT', '1.80'), ('LOOKED', '1.96'), ('TO', '2.24'), ('ME', '2.36'), ('UNCOMMONLY', '2.52'), ('LIKE', '3.16'), ('AN', '3.40'), ('OPEN', '3.56'), ('SPRING', '3.92'), ('CART', '4.28')]

8297-275154-0026-2108-0
[('LET', '0.44'), ('ME', '0.72'), ('REST', '0.92'), ('A', '1.32'), ('LITTLE', '1.40'), ('HE', '1.80'), ('PLEADED', '2.00'), ('IF', '3.04'), ("I'M", '3.28'), ('NOT', '3.52'), ('IN', '3.76'), ('THE', '3.88'), ('WAY', '4.00')]

652-129742-0007-1002-0
[('SURROUND', '0.28'), ('WITH', '0.80'), ('A', '0.92'), ('GARNISH', '1.00'), ('OF', '1.44'), ('COOKED', '1.56'), ('AND', '1.88'), ('DICED', '4.16'), ('CARROTS', '4.28'), ('TURNIPS', '4.44'), ('GREEN', '4.60'), ('PEAS', '4.72')]
```

For the row:

```
5694-64029-0022-1998-0
[('THE', '0.20'), ('LEADEN', '0.36'), ('HAIL', '0.72'), ('STORM', '1.00'), ('SWEPT', '1.48'), ('THEM', '1.88'), ('OFF', '2.00'), ('THE', '2.24'), ('FIELD', '2.36'), ('THEY', '3.20'), ('FELL', '3.36'), ('BACK', '3.64'), ('AND', '3.92'), ('RE', '4.04'), ('FORMED', '4.20')]
```
  • 5694-64029-0022-1998-0 is the cut ID.
  • ('THE', '0.20') means the word THE starts at 0.20 seconds.
  • ('LEADEN', '0.36') means the word LEADEN starts at 0.36 seconds.

You can compare the above word starting times with the ones
from https://github.com/CorentinJ/librispeech-alignments:

```
5694-64029-0022 ",THE,LEADEN,HAIL,STORM,SWEPT,THEM,OFF,THE,FIELD,,THEY,FELL,BACK,AND,RE,FORMED," "0.230,0.360,0.670,1.010,1.440,1.860,1.990,2.230,2.350,2.870,3.230,3.390,3.660,3.960,4.060,4.160,4.850,4.9"
```

We reformat it below for readability:

```
5694-64029-0022 ",THE,LEADEN,HAIL,STORM,SWEPT,THEM,OFF,THE,FIELD,,THEY,FELL,BACK,AND,RE,FORMED,"
"0.230,0.360,0.670,1.010,1.440,1.860,1.990,2.230,2.350,2.870,3.230,3.390,3.660,3.960,4.060,4.160,4.850,4.9"
  the  leaden hail  storm swept them  off   the   field  sil   they  fell  back  and   re   formed  sil
```

@csukuangfj (Collaborator, Author) commented:

From #188
@TianyuCao @Jianjie-Shi
You may find this PR useful.

@csukuangfj (Collaborator, Author) commented:

The framewise token alignment is saved into a 1-D array using lhotse's TemporalArray, and it should be straightforward to convert it to other formats, e.g., CTM or TextGrid.
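For example, a hedged sketch of a CTM conversion (to_ctm is a hypothetical helper; durations are naively approximated as the gap to the next word's start, which the PR itself does not compute):

```python
# Convert (word, start_time) pairs, as printed by test_compute_ali.py
# above, into CTM lines. Durations are approximated as the gap to the
# next word's start; the final word gets a fixed fallback duration.
def to_ctm(cut_id, word_starts, channel=1, last_dur=0.2):
    lines = []
    for i, (word, start) in enumerate(word_starts):
        start = float(start)
        if i + 1 < len(word_starts):
            dur = float(word_starts[i + 1][1]) - start
        else:
            dur = last_dur  # duration of the last word is unknown
        lines.append(f"{cut_id} {channel} {start:.2f} {dur:.2f} {word}")
    return "\n".join(lines)

print(to_ctm("5694-64029-0022-1998-0",
             [("THE", "0.20"), ("LEADEN", "0.36"), ("HAIL", "0.72")]))
```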


This PR is ready for review, I think.

@danpovey (Collaborator) left a review:

Some small comments

```python
# in the lattice.
@dataclass
class AlignItem:
# log prob of this item originating from the start item
```
@danpovey (Collaborator) commented:

When you say "start item", do you mean the preceding item (i.e., is it the log prob p(this item | preceding item)), or is it some kind of total log prob from the start node of the lattice to here?

@csukuangfj (Collaborator, Author) commented:

I mean the total log prob from the starting node.

AlignItem can be considered the end point of a path originating from the starting node, and its log_prob is the total log prob of this path.
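A stripped-down sketch of the data structure as described above (field meanings are taken from the quoted diff comments and this reply; anything beyond that is assumed):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AlignItem:
    # Total log prob of the partial path from the lattice's starting
    # node up to this item (not a single-transition probability).
    log_prob: float
    # Framewise token alignment along this partial path, blanks (0)
    # included, so len(ys) equals the frame index t.
    ys: List[int]
    # The u coordinate: number of non-blank tokens emitted so far,
    # i.e. len([y for y in ys if y != 0]).
    pos_u: int
```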


```python
# AlignItem is a node in the lattice, where its
# len(ys) equals to `t` and pos_u is the u coordinate
# in the lattice.
```
@danpovey (Collaborator) commented:

People may not know what the u coordinate refers to here (actually, I don't). Is it the word position?

@csukuangfj (Collaborator, Author) commented:

In transducer training, the output shape is usually denoted as (N, T, U, V), instead of the (N, S, T, V) used in k2, I think.
The notation t and u is from https://arxiv.org/pdf/1211.3711.pdf

[Screenshot: figure from the transducer paper showing the output lattice with t and u axes]

Since our modelling units are BPE tokens, u actually denotes the token position.
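As a quick illustration of this notation (a toy additive joiner for illustration only, not icefall's actual model code):

```python
import torch

# N = batch, T = acoustic frames (t), U = token positions (u), V = vocab.
N, T, U, V = 2, 100, 20, 500
encoder_out = torch.randn(N, T, 1, V)  # varies along t (acoustic frames)
decoder_out = torch.randn(N, 1, U, V)  # varies along u (emitted tokens)
logits = encoder_out + decoder_out     # broadcasts to the (N, T, U, V) lattice
assert logits.shape == (N, T, U, V)
```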

@danpovey (Collaborator) commented:

OK, please clarify all this in the docs.

```python
return ans


def get_word_starting_frame(
```
@danpovey (Collaborator) commented:

Rename to get_word_starting_frames.

@csukuangfj (Collaborator, Author) commented:

NOTE: This PR assumes that --max-sym-per-frame is 1. Will create another PR for --max-sym-per-frame > 1.

@pzelasko (Collaborator) commented:

> I am going to save only the framewise alignment for tokens. Since each word begins with the word-boundary symbol ▁ (UTF-8 \xe2\x96\x81), it is straightforward to get the beginning of a word from the framewise alignment. As there is no silence token in the BPE model, the starting time of a word is not that accurate.

Do you think it would help if (optional) silences are re-introduced? Or something like they did with Wav2Vec2, where there is an extra | token between all the words, which I think helps to get more precise timestamps (https://pytorch.org/audio/stable/tutorials/forced_alignment_tutorial.html)?

@danpovey (Collaborator) commented:

For many purposes we don't need accurate alignments; just having some alignment is usually enough.
I don't think silences will mesh well with things like RNN-T; the main issue is that they are not in the supervision, and anyway the algorithms often assume a linear sequence, so introducing optional symbols is not really feasible.
