
[WIP]: Implement token level shallow fusion #609

Closed
wants to merge 3 commits

Conversation

@csukuangfj (Collaborator) commented Oct 10, 2022

We have been trying to use a word-level G and LG for RNN-T decoding, but we have only tried this with fast_beam_search. However, a word-level G or LG cannot handle OOV words.

This PR tries to use a token-level G for shallow fusion with modified_beam_search. I am using OpenFst to manipulate the n-gram G on the CPU as it is easier to implement.
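
For reference, here is a rough sketch (not the actual code in this PR) of what token-level shallow fusion inside the beam search looks like: each hypothesis keeps a state in the token-level G, and candidate tokens are ranked by the RNN-T log-probability plus a scaled n-gram log-probability. The `score_token` helper below is only a placeholder for the FST arc lookup.

```python
# Rough sketch of token-level shallow fusion during beam search.
# Illustrative only; `score_token` stands in for the actual FST lookup
# (following the arc labeled `token`, with backoff).
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hypothesis:
    ys: List[int]             # decoded token IDs so far
    log_prob: float           # accumulated RNN-T log-probability
    lm_state: int = 0         # current state in the token-level G
    lm_log_prob: float = 0.0  # accumulated n-gram log-probability


def score_token(lm_state: int, token: int) -> Tuple[float, int]:
    """Placeholder: return (log P(token | lm_state), next_state)."""
    return -2.3, lm_state  # dummy values for illustration


def fused_score(hyp: Hypothesis, token: int, rnnt_log_prob: float,
                ngram_lm_scale: float) -> float:
    """Score used to rank/prune candidate extensions of `hyp`."""
    lm_log_prob, _ = score_token(hyp.lm_state, token)
    return hyp.log_prob + rnnt_log_prob + ngram_lm_scale * lm_log_prob
```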

@ezerhouni (Collaborator)

@csukuangfj Looks very promising. Ping me if you need an extra hand.

@csukuangfj (Collaborator, Author)

> @csukuangfj Looks very promising. Ping me if you need an extra hand.

@ezerhouni

Thanks! I will draft a version without batch-size support. If it gives promising results, we will need your help to implement a version that supports batches.

@ezerhouni (Collaborator)

@csukuangfj Do you have any update on this issue? I am very eager to try it out!

@csukuangfj (Collaborator, Author)

> @csukuangfj Do you have any update on this issue? I am very eager to try it out!

Yes, but the results are not good so far. I will post them tonight.

@csukuangfj (Collaborator, Author)

Steps for reproducing the following results:

cd egs/librispeech/ASR
git lfs install
git clone https://huggingface.co/csukuangfj/icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03
mkdir tmp3-3
cd tmp3-3
ln -s $PWD/../icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/pretrained-iter-468000-avg-16.pt epoch-99.pt
cd ..

./generate-lm.sh

for lm_scale in  0.01 0.2 0.4 ; do
./lstm_transducer_stateless2/decode.py \
  --epoch 99 \
  --avg 1 \
  --use-averaged-model 0 \
  --exp-dir ./tmp3-3 \
  --max-duration 600 \
  --num-encoder-layers 12 \
  --rnn-hidden-size 1024 \
  --decoding-method modified_beam_search2 \
  --beam 8 \
  --max-contexts 4 \
  --ngram-lm-scale $lm_scale
done

You will find the results inside ./tmp3-3/modified_beam_search2


| ngram_lm_scale | test-clean | test-other |
|----------------|------------|------------|
| 0 (baseline)   | 2.73       | 7.15       |
| -0.01          | 2.73       | 7.17       |
| 0.01           | 2.74       | 7.15       |
| -0.05          | 2.75       | 7.19       |
| 0.2            | 2.76       | 7.28       |
| -0.1            | 2.77       | 7.23       |
| -0.2            | 2.83       | 7.46       |
| -0.3            | 3.01       | 7.75       |

I am using a tri-gram LM. Note that the cost on the final state of the FST is not considered.

I will recheck the code in case it contains some bugs.
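
For completeness, one way the missing final-state cost could be added (assuming the OpenFst Python bindings, pywrapfst; the PR itself may use a different wrapper) is to look up `G.final()` for the LM state of each finished hypothesis:

```python
# Sketch only: adding the FST final cost at the end of decoding,
# assuming pywrapfst (the OpenFst Python bindings).
import pywrapfst as openfst


def final_log_prob(G: openfst.Fst, state: int) -> float:
    """Log-prob of terminating in `state`; -inf if the state is not final."""
    w = G.final(state)  # tropical weight, i.e. -log prob
    if w == openfst.Weight.zero(G.weight_type()):
        return float("-inf")
    return -float(w)


# Each finished hypothesis would then get
#   hyp.lm_log_prob += final_log_prob(G, hyp.lm_state)
# before the final ranking.
```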

@ezerhouni (Collaborator)

@csukuangfj Thanks!

@danpovey (Collaborator)

I expect that unless there is some kind of domain mismatch, we will not see much or any improvement. (Unless we try super-large LMs. I seem to remember Liyong had some experiment with a 5-gram or something like that?)

@csukuangfj (Collaborator, Author)

> I expect that unless there is some kind of domain mismatch, we will not see much or any improvement. (Unless we try super-large LMs. I seem to remember Liyong had some experiment with a 5-gram or something like that?)

I think Liyong was using fast_beam_search + (L or LG) in #472.

We have never tried to use a token-level G with modified beam search, I think.

@ezerhouni (Collaborator)

> > I expect that unless there is some kind of domain mismatch, we will not see much or any improvement. (Unless we try super-large LMs. I seem to remember Liyong had some experiment with a 5-gram or something like that?)
>
> I think Liyong was using fast_beam_search + (L or LG) in #472.
>
> We have never tried to use a token-level G with modified beam search, I think.

My 2 cents is that we need a very large LM (like a 5-gram). I will try it tomorrow and let you know.

@pkufool (Collaborator) commented Oct 19, 2022

> > I expect that unless there is some kind of domain mismatch, we will not see much or any improvement. (Unless we try super-large LMs. I seem to remember Liyong had some experiment with a 5-gram or something like that?)
>
> I think Liyong was using fast_beam_search + (L or LG) in #472.
>
> We have never tried to use a token-level G with modified beam search, I think.

@glynpu Liyong did try using a token-level G with beam search; he did not make a PR, though. The results are in our weekly meeting notes (the 20th week), as follows:

[screenshot of the results table from the week-20 meeting notes]

The results show that we cannot get an improvement from a pruned LM.

@glynpu (Collaborator) commented Oct 19, 2022

> @glynpu Liyong did try using a token-level G with beam search; he did not make a PR, though. The results are in our weekly meeting notes (the 20th week), as follows:

The results came from a word-level LM.
I was using KenLM at that time; here is the related code:
glynpu@3a9ff31
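
For readers unfamiliar with KenLM, per-token scoring with its Python bindings looks roughly like the sketch below (not the linked commit's code; the ARPA file name and token string are made up):

```python
# Minimal KenLM usage sketch; "token.5gram.arpa" is a hypothetical
# token-level ARPA LM, and the sentence is a space-joined BPE sequence.
import kenlm

model = kenlm.Model("token.5gram.arpa")

sentence = "▁HE LLO ▁WOR LD"
# full_scores yields (log10 prob, matched n-gram order, is_oov)
# for every token plus the end-of-sentence symbol.
for log10_prob, ngram_order, is_oov in model.full_scores(sentence):
    print(log10_prob, ngram_order, is_oov)
```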

@ezerhouni (Collaborator)

@csukuangfj Quick update:
I am testing with a 5-gram at the moment. I am getting
test-clean: 2.68
test-other: 7.11

I am still doing some tests and will do a more thorough review of the code.

@ezerhouni (Collaborator) commented Oct 19, 2022

N-gram: 5
Beam size: 4

| ngram_lm_scale | test-clean | test-other |
|----------------|------------|------------|
| 0 (baseline)   | 2.73       | 7.15       |
| 0.01           | 2.74       | 7.15       |
| 0.1            | 2.68       | 7.11       |
| 0.2            | 2.68       | 7.14       |

N-gram: 5
Beam size: 8

| ngram_lm_scale | test-clean | test-other |
|----------------|------------|------------|
| 0 (baseline)   | 2.72       | 7.15       |
| 0.01           | 2.71       | 7.14       |
| 0.1            | 2.71       | 7.11       |
| 0.2            | 2.68       | 7.06       |
| 0.3            | 2.74       | 7.28       |

@csukuangfj (Collaborator, Author)

@ezerhouni

Thanks! Are you using ./generate-lm.sh to generate the 5-gram LM or are you using an LM trained on an external dataset?

@ezerhouni (Collaborator)

> @ezerhouni
>
> Thanks! Are you using ./generate-lm.sh to generate the 5-gram LM or are you using an LM trained on an external dataset?

I am using ./generate-lm.sh. I am trying a 7-gram to see whether it helps or not.

@ezerhouni (Collaborator)

@csukuangfj
I tried a 7-gram and it seems to improve a bit (2.67/7.03), but I am not sure it is worth it.

@danpovey (Collaborator)

I think the main use case of this is when there is a domain mismatch from the training corpus to the target domain.
We can also try dividing the scores on the LM arcs by the corresponding scores given by a low-order LM estimated on the training data.
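
In log space that division becomes a subtraction, i.e. a density-ratio style fusion. A hedged sketch of the scoring rule (names and default scales below are illustrative, not from this PR):

```python
# Density-ratio style fusion sketch: subtract a low-order "source" LM
# (estimated on the training transcripts) from the external LM score.
# All names and default scales here are illustrative.
def density_ratio_score(rnnt_log_prob: float,
                        ext_lm_log_prob: float,
                        src_lm_log_prob: float,
                        ext_scale: float = 0.3,
                        src_scale: float = 0.1) -> float:
    # log P_fused = log P_rnnt + a * log P_ext - b * log P_src
    return (rnnt_log_prob
            + ext_scale * ext_lm_log_prob
            - src_scale * src_lm_log_prob)
```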

@csukuangfj (Collaborator, Author)

> @csukuangfj I tried a 7-gram and it seems to improve a bit (2.67/7.03), but I am not sure it is worth it.

Sorry for the late reply. I thought I had replied last night.

I think a 7-gram is more than enough. Thanks for your experiments. The results show that the code works with an n-gram LM, though we don't gain much from it. The next step is to use it to decode with a graph constructed from lists of specific words/phrases that we want to recognize.
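
A possible shape for such a graph (a sketch only, assuming the OpenFst Python bindings; token IDs and the bonus value are made up) is an acceptor whose arcs along each phrase's token sequence carry a negative cost, so matching hypotheses get boosted:

```python
# Sketch: a token-level boosting acceptor for a list of phrases.
# Token IDs and the bonus value are hypothetical.
from typing import List

import pywrapfst as openfst


def build_boost_fst(phrases: List[List[int]], bonus: float = 1.5) -> openfst.Fst:
    f = openfst.VectorFst()  # tropical semiring: cost = -log prob
    start = f.add_state()
    f.set_start(start)
    f.set_final(start)  # hypotheses matching no phrase are not penalized
    for tokens in phrases:
        cur = start
        for tok in tokens:
            nxt = f.add_state()
            arc = openfst.Arc(tok, tok,
                              openfst.Weight(f.weight_type(), -bonus), nxt)
            f.add_arc(cur, arc)
            cur = nxt
        f.set_final(cur)
    return f


# e.g. boost two (hypothetical) BPE token sequences
G_boost = build_boost_fst([[52, 7, 391], [12, 88]])
```

A real biasing graph would also need failure/backoff arcs so that partial matches do not accumulate the bonus, but the basic idea is the same.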

@ezerhouni (Collaborator)

> > @csukuangfj I tried a 7-gram and it seems to improve a bit (2.67/7.03), but I am not sure it is worth it.
>
> Sorry for the late reply. I thought I had replied last night.
>
> I think a 7-gram is more than enough. Thanks for your experiments. The results show that the code works with an n-gram LM, though we don't gain much from it. The next step is to use it to decode with a graph constructed from lists of specific words/phrases that we want to recognize.

I agree; I think a 5-gram is enough. I was thinking of using it for detecting OOV words. I will let you know once I have more results (unless you have something else in mind).

@csukuangfj (Collaborator, Author)

> > > @csukuangfj I tried a 7-gram and it seems to improve a bit (2.67/7.03), but I am not sure it is worth it.
> >
> > Sorry for the late reply. I thought I had replied last night.
> > I think a 7-gram is more than enough. Thanks for your experiments. The results show that the code works with an n-gram LM, though we don't gain much from it. The next step is to use it to decode with a graph constructed from lists of specific words/phrases that we want to recognize.
>
> I agree; I think a 5-gram is enough. I was thinking of using it for detecting OOV words. I will let you know once I have more results (unless you have something else in mind).

By the way, @marcoyang1998 is using the RNN-LM model that you provided for conformer CTC for shallow fusion, and he can get a WER of 2.46 on test-clean without any specific tuning.

@ezerhouni (Collaborator)

> > > > @csukuangfj I tried a 7-gram and it seems to improve a bit (2.67/7.03), but I am not sure it is worth it.
> > >
> > > Sorry for the late reply. I thought I had replied last night.
> > > I think a 7-gram is more than enough. Thanks for your experiments. The results show that the code works with an n-gram LM, though we don't gain much from it. The next step is to use it to decode with a graph constructed from lists of specific words/phrases that we want to recognize.
> >
> > I agree; I think a 5-gram is enough. I was thinking of using it for detecting OOV words. I will let you know once I have more results (unless you have something else in mind).
>
> By the way, @marcoyang1998 is using the RNN-LM model that you provided for conformer CTC for shallow fusion, and he can get a WER of 2.46 on test-clean without any specific tuning.

Sounds interesting! If I am not mistaken, we can't add new words on the fly to an already trained RNN-LM, can we?

@csukuangfj (Collaborator, Author)

> Sounds interesting! If I am not mistaken, we can't add new words on the fly to an already trained RNN-LM, can we?

The RNN-LM is at the token level, so as long as the new word can be represented by BPE tokens, it can be rescored by the RNN-LM, I think.
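
For example (a sketch; the model path and word are arbitrary), one can check how the BPE model segments an unseen word before scoring it with a token-level LM:

```python
# Sketch: segmenting an unseen word with the BPE model so that a
# token-level LM can score it. "bpe.model" is a placeholder path.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bpe.model")

word = "SUPERCALIFRAGILISTIC"
print(sp.encode(word, out_type=str))  # BPE pieces, e.g. ['▁SUPER', 'CAL', ...]
print(sp.encode(word, out_type=int))  # corresponding token IDs
```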

@ezerhouni (Collaborator)

> The RNN-LM is at the token level, so as long as the new word can be represented by BPE tokens, it can be rescored by the RNN-LM, I think.

Indeed, but we can't "boost" specific words (or combinations of specific tokens).

@csukuangfj (Collaborator, Author)

> > The RNN-LM is at the token level, so as long as the new word can be represented by BPE tokens, it can be rescored by the RNN-LM, I think.
>
> Indeed, but we can't "boost" specific words (or combinations of specific tokens).

Yes, you are right. That is why we are trying to integrate FSTs into decoding.

@ezerhouni (Collaborator)

@csukuangfj I have a batch version (à la modified_beam_search). I took your commits and added mine on top of them (with a rebase). I will create a new PR if that's OK.

@csukuangfj (Collaborator, Author) commented Oct 21, 2022

> @csukuangfj I have a batch version (à la modified_beam_search). I took your commits and added mine on top of them (with a rebase). I will create a new PR if that's OK.

Yes, thanks! I will close this PR once you create the new one.

@csukuangfj (Collaborator, Author)

See #630

@csukuangfj closed this on Oct 21, 2022
@csukuangfj deleted the shallow-fusion branch on Oct 21, 2022