Add streaming Emformer stateless RNN-T. #390
Conversation
I have uploaded the pretrained model, training logs, decoding logs, and decoding results to https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01

You can use the pretrained model in https://github.com/k2-fsa/sherpa, which is an ASR server in Python supporting both streaming and non-streaming ASR. The following is a YouTube video demonstrating its use in sherpa.
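If you want to fetch that Hugging Face repository programmatically, here is a minimal sketch using huggingface_hub.snapshot_download; it assumes huggingface_hub is installed, and a plain `git lfs` clone of the repo works just as well.

```python
# Minimal sketch: download the pretrained-model repo from Hugging Face.
# Assumes huggingface_hub is installed (pip install huggingface_hub);
# alternatively, `git lfs install` followed by a git clone of the repo URL works too.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="csukuangfj/icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01"
)
print("Model files downloaded to:", local_dir)
```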
Hi @csukuangfj, I am trying to use the non-torchscripted checkpoint that you released on HuggingFace (
What is your decoding command?
I just rechecked it with the following commands:

cd egs/librispeech/ASR/
mkdir t
cd t
ln -s /ceph-fj/fangjun/open-source-2/icefall-models//icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01/exp/pretrained-epoch-39-avg-6-use-averaged-model-1.pt epoch-99.pt
ln -s /ceph-fj/fangjun/open-source-2/icefall-models//icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01/data/lang_bpe_500/bpe.model ./
cd ../
./pruned_stateless_emformer_rnnt2/decode.py \
  --epoch 99 \
  --avg 1 \
  --use-averaged-model 0 \
  --exp-dir ./t/ \
  --bpe-model ./t/bpe.model \
  --max-duration 50 \
  --decoding-method greedy_search \
  --num-encoder-layers 18 \
  --left-context-length 128 \
  --segment-length 8 \
  --right-context-length 4

It gives me the following output:
Note: I am using the Emformer model from the following commit of the torchaudio repo:
Thanks for your prompt reply, it was very helpful! I was using a mismatching BPE model 🙄 Sorry for bothering you.
Never mind. Glad to hear it works for you.
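Since the issue above was caused by a mismatching BPE model, a quick sanity check is to compare the vocabulary size of bpe.model with the number of rows in the checkpoint's decoder embedding. The snippet below is only a sketch: it assumes an icefall-style checkpoint that stores its weights under a "model" key with the decoder embedding at "decoder.embedding.weight"; those key names are assumptions, not taken from this PR.

```python
# Sanity-check sketch: verify that bpe.model matches the checkpoint.
# Assumed layout: {"model": state_dict} with the decoder embedding stored
# under "decoder.embedding.weight". Adjust key names for your checkpoint.
import sentencepiece as spm
import torch

sp = spm.SentencePieceProcessor()
sp.load("t/bpe.model")  # the BPE model symlinked above

ckpt = torch.load("t/epoch-99.pt", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # some checkpoints store weights at the top level
embed = state_dict["decoder.embedding.weight"]

print("BPE vocab size:", sp.get_piece_size())
print("Decoder embedding rows:", embed.shape[0])
assert sp.get_piece_size() == embed.shape[0], "BPE model does not match the checkpoint"
```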
This PR uses the Emformer model from torchaudio, which requires torchaudio >= 0.11.0.
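For reference, the decode command above passes --num-encoder-layers 18, --left-context-length 128, --segment-length 8, and --right-context-length 4. The sketch below shows how such values map onto torchaudio.models.Emformer; the input dimension, number of heads, and FFN dimension used here are placeholders, not values taken from this PR.

```python
# Minimal sketch of constructing torchaudio's Emformer (requires torchaudio >= 0.11.0).
# Only the layer count, segment length, and context lengths mirror the decode
# command above; input_dim, num_heads, and ffn_dim are placeholder values.
import torch
import torchaudio

emformer = torchaudio.models.Emformer(
    input_dim=512,            # placeholder feature dimension
    num_heads=8,              # placeholder
    ffn_dim=2048,             # placeholder
    num_layers=18,            # --num-encoder-layers 18
    segment_length=8,         # --segment-length 8
    left_context_length=128,  # --left-context-length 128
    right_context_length=4,   # --right-context-length 4
)

# Streaming-style inference on one chunk: the input includes the right-context
# frames, i.e. shape (B, segment_length + right_context_length, D).
chunk = torch.randn(1, 8 + 4, 512)
lengths = torch.tensor([8 + 4])
output, out_lengths, states = emformer.infer(chunk, lengths, states=None)
```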
Training command
Decoding command
The baseline is from
https://github.com/pytorch/audio/blob/main/examples/asr/emformer_rnnt/README.md
Note that the baseline is trained for 120 epochs with 32 GPUs. Also, the baseline uses a vocab size of 4098.
Will switch to #389
The pretrained model can be used in k2-fsa/sherpa#6 for streaming ASR.
I am uploading the training logs, decoding results, decoding logs, and pretrained model to Hugging Face.