Add force alignment for stateless transducer. #239
Conversation
> ## How to get framewise token alignment
>
> Assume that you already have a trained model. If not, you can either
This readme shows the usage of this PR.

## How to get framewise token alignment

Assume that you already have a trained model. If not, you can either …

Caution: If you are going to use your own trained model, remember …

The following shows how to get framewise token alignment using the above model:

```bash
git clone https://github.com/k2-fsa/icefall
cd icefall/egs/librispeech/ASR
mkdir tmp
sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/csukuangfj/icefall-asr-librispeech-transducer-stateless-multi-datasets-bpe-500-2022-03-01 ./tmp/
ln -s $PWD/tmp/exp/pretrained.pt $PWD/tmp/epoch-999.pt

./transducer_stateless/compute_ali.py \
  --exp-dir ./tmp/exp \
  --bpe-model ./tmp/data/lang_bpe_500/bpe.model \
  --epoch 999 \
  --avg 1 \
  --max-duration 100 \
  --dataset dev-clean \
  --out-dir data/ali
```

After running the above commands, you will find the following two files … You can find usage examples in …

## How to get word starting time from framewise token alignment

Assume you have run the above commands to get the framewise token alignment. Then run:

```bash
./transducer_stateless/test_compute_ali.py \
  --bpe-model ./tmp/data/lang_bpe_500/bpe.model \
  --ali-dir data/ali \
  --dataset dev-clean
```

Caution: Since the frame shift is 10ms and the subsampling factor …

Note: The script …

You will get the following output: …

For the row: …

You can compare the above word starting time with the one …

We reformat it below for readability: …
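As a rough illustration of the timing arithmetic involved here, the sketch below converts an encoder-frame index from the alignment to seconds. The 10 ms frame shift is stated in this PR; the subsampling factor of 4 and the function name are assumptions for illustration, not values taken from the PR's code.

```python
# Sketch: map an encoder output-frame index to a time in seconds.
# ASSUMPTIONS: 10 ms feature frame shift (stated in the PR) and an
# encoder subsampling factor of 4 (an assumption about the model setup).

FRAME_SHIFT_S = 0.01  # 10 ms per feature frame
SUBSAMPLING = 4       # one encoder output frame covers 4 feature frames

def frame_to_seconds(frame_index: int) -> float:
    """Time in seconds of a given encoder output frame."""
    return frame_index * SUBSAMPLING * FRAME_SHIFT_S

print(frame_to_seconds(25))  # encoder frame 25 -> 1.0 s
```

With these assumed constants, each encoder frame spans 40 ms, which is also why word starting times from the alignment are quantized to 40 ms steps.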
From #188
Some small comments
```python
# in the lattice.
@dataclass
class AlignItem:
    # log prob of this item originating from the start item
```
when you say "start item", do you mean the preceding item (i.e. is it the log-prob p(this item | preceding item), or is it some kind of total log-prob from the start node of the lattice to here?
I mean the total log prob from the starting node. An `AlignItem` can be considered the ending point of a path originating from the starting node, and its `log_prob` is the `tot_log_prob` of this path.
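To make the point concrete, here is a hypothetical sketch of the idea (the field names and the two-field layout are illustrative, not the exact code of this PR): each item records the *total* log prob accumulated from the start node, so extending a path adds the step's log prob to that running total.

```python
# Illustrative sketch, NOT the PR's actual class: an AlignItem is the
# end point of a partial path through the lattice, and log_prob is the
# total log probability from the start node, not a per-step transition
# probability.
from dataclasses import dataclass, field
from typing import List

@dataclass
class AlignItem:
    # Total log prob of the path from the start node to this item.
    log_prob: float
    # Tokens emitted along that path (here, len(ys) tracks the frame t).
    ys: List[int] = field(default_factory=list)

# Extending a path by one step adds that step's log prob to the total:
start = AlignItem(log_prob=0.0, ys=[])
step = AlignItem(log_prob=start.log_prob + (-1.5), ys=start.ys + [42])
print(step.log_prob)  # -1.5, i.e. the total along the extended path
```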
```python
# AlignItem is a node in the lattice, where its
# len(ys) equals to `t` and pos_u is the u coordinate
# in the lattice.
```
people may not know what the u co-ordinate refers to here (actually, I don't). Is it the word position?
In transducer training, the output shape is usually denoted as (N, T, U, V), instead of the (N, S, T, V) used in k2, I think. The notation `t` and `u` is from https://arxiv.org/pdf/1211.3711.pdf?ref=hackernoon.com. Since our modelling units are BPE tokens, `u` actually denotes the token position.
OK, please clarify all this in the docs.
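As a toy illustration of the (t, u) coordinates discussed above (this is explanatory code, not from the PR): `t` indexes encoder output frames and `u` indexes emitted tokens. With `--max-sym-per-frame` equal to 1 (the modified transducer this PR assumes), every frame emits exactly one symbol, either a blank (advance `t` only) or a token (advance both `t` and `u`), so each path through the lattice has exactly T steps.

```python
# Toy walk through the (t, u) grid under max-sym-per-frame = 1.
# Each entry of `emissions` is None for a blank, or a token id.

def walk(emissions):
    """Return the (t, u) lattice nodes visited by this emission sequence."""
    t, u, path = 0, 0, [(0, 0)]
    for tok in emissions:
        t += 1          # every frame advances t under modified transducer
        if tok is not None:
            u += 1      # emitting a real token also advances u
        path.append((t, u))
    return path

# 4 frames, with tokens emitted at frames 0 and 2:
print(walk([7, None, 9, None]))  # [(0, 0), (1, 1), (2, 1), (3, 2), (4, 2)]
```

The frame index `t` at which `u` increments is exactly what the framewise token alignment records.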
```python
    return ans


def get_word_starting_frame(
```
Rename to `get_word_starting_frames`.
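A hedged sketch of what such a helper might do (the function name follows the suggested rename, but the signature, the blank symbol, and the input format are assumptions, not the PR's actual code): given one BPE token string per encoder frame, a word starts wherever a token carries the BPE word-boundary prefix `▁`.

```python
# Sketch only: assumes the alignment is a list with one token string per
# encoder frame, using "<blk>" for blank frames (both are assumptions).
# In BPE, a word-initial piece starts with "\u2581" (the ▁ symbol
# mentioned in this PR).

def get_word_starting_frames(ali):
    """Return the encoder-frame indices where a new word begins."""
    return [t for t, tok in enumerate(ali) if tok.startswith("\u2581")]

ali = ["<blk>", "\u2581HE", "LLO", "<blk>", "\u2581WOR", "LD", "<blk>"]
print(get_word_starting_frames(ali))  # [1, 4]
```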
NOTE: This PR assumes that `--max-sym-per-frame` is 1. Will create another PR for `--max-sym-per-frame` > 1.

Do you think it would help if (optional) silences are re-introduced? Or something like they did here with Wav2Vec2, where there is an extra …

For many purposes we don't need accurate alignments; just having some alignment is usually enough.
We assume that `--max-sym-per-frame` is 1, i.e., the model is trained with the modified transducer. The alignment results seem reasonable compared with the ones from https://github.com/CorentinJ/librispeech-alignments.

The following compares the word alignments for one utterance from dev-test with the ones from https://github.com/CorentinJ/librispeech-alignments. (Note: The pre-trained model is from #213.)
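To quantify a comparison like the one above, one could report the mean absolute difference between the word starting times from this PR and those from a reference alignment such as librispeech-alignments. This is an illustrative sketch; the numbers below are made up, not measured results.

```python
# Sketch: mean absolute difference between two lists of word starting
# times in seconds.  Values below are fabricated for illustration only.

def mean_abs_diff(ours, ref):
    assert len(ours) == len(ref), "alignments must cover the same words"
    return sum(abs(a - b) for a, b in zip(ours, ref)) / len(ours)

ours = [0.20, 0.84, 1.32]  # hypothetical starts from this PR
ref  = [0.18, 0.80, 1.40]  # hypothetical reference starts
print(round(mean_abs_diff(ours, ref), 3))  # 0.047
```

Given the 40 ms frame granularity of the alignment, differences well under one encoder frame would already be about as good as one can expect.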
TODOs:

I am going to save only the framewise alignment for tokens. Since each word begins with the underscore-like symbol `▁` (UTF-8 bytes `\xe2\x96\x81`), it is straightforward to get the beginning of a word from the framewise alignment. As there is no silence token in the BPE model, the starting time of a word is not that accurate.