CSJ pruned_transducer_stateless7_streaming #892
Conversation
…efall into csj_pts7stream
Thanks! Is it ready for review?
I am writing up RESULTS.md and README.md now, but will publish them once I manage to upload my results and model to HuggingFace. Also, I still couldn't wrap my head around the difference between streaming_decode.py and decode.py. Other than that, this PR is ready. EDIT: Sorry, I just realised I haven't adapted export.py. I will commit it soon.
Apologies! I meant, I couldn't find out why the results of …
I have retrained an early model with padding (30) so that, when decoding with padding 30, there are fewer insertions at the ends of utterances. This improved the simulated streaming … How can I pad the …
Thanks!
Left some minor comments.
@csukuangfj I hope it's not too late, but actually I have much better results after padding both in training and decoding. I am updating the CER tables, the Huggingface repo, and the script commands. Code-wise, the only addition is a "--pad-feature" argument in …
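For illustration, here is a minimal sketch of what such end-of-utterance padding could look like, assuming features are a (num_frames, num_bins) log-Mel tensor. The helper name pad_feature and the LOG_EPS constant are assumptions for this sketch, not the recipe's actual code:

```python
import math

import torch

# Illustrative "silence" value in log-Mel space; an assumption, not the
# recipe's actual constant.
LOG_EPS = math.log(1e-10)

def pad_feature(feature: torch.Tensor, num_pad_frames: int) -> torch.Tensor:
    """Append num_pad_frames silence-like frames to a (T, F) fbank matrix.

    This gives the streaming encoder extra right context at the end of the
    utterance, so the final tokens are less likely to be inserted/garbled.
    """
    padding = torch.full(
        (num_pad_frames, feature.size(1)), LOG_EPS, dtype=feature.dtype
    )
    return torch.cat([feature, padding], dim=0)

# e.g. feats = pad_feature(feats, 30) before decoding, in the spirit of
# --pad-feature 30
```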
CERs

These CERs are trained with padding=30, introduced via the --pad-feature option.

Trained and evaluated on disfluent transcript

Trained and evaluated on fluent transcript

Comparing disfluent and fluent models

This comparison evaluates the disfluent model on the fluent transcript (calculated by disfluent_recogs_to_fluent.py), forgiving the disfluent model's mistakes on fillers and partial words. It is meant as an illustrative metric only, so that the disfluent and fluent models can be compared.

Commands for disfluent training and decoding

The training command was:

./pruned_transducer_stateless7_streaming/train.py \
--feedforward-dims "1024,1024,2048,2048,1024" \
--world-size 8 \
--num-epochs 30 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
--max-duration 375 \
--transcript-mode disfluent \
--lang data/lang_char \
--manifest-dir /mnt/host/corpus/csj/fbank \
--pad-feature 30 \
--musan-dir /mnt/host/corpus/musan/musan/fbank

The simulated streaming decoding command was:

for chunk in 64 32; do
for m in greedy_search fast_beam_search modified_beam_search; do
python pruned_transducer_stateless7_streaming/decode.py \
--feedforward-dims "1024,1024,2048,2048,1024" \
--exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
--epoch 30 \
--avg 17 \
--max-duration 350 \
--decoding-method $m \
--manifest-dir /mnt/host/corpus/csj/fbank \
--lang data/lang_char \
--transcript-mode disfluent \
--res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30/github/sim_"$chunk"_"$m" \
--decode-chunk-len $chunk \
--pad-feature 30 \
--gpu 0
done
done

The streaming chunk-wise decoding command was:

for chunk in 64 32; do
for m in greedy_search fast_beam_search modified_beam_search; do
python pruned_transducer_stateless7_streaming/streaming_decode.py \
--feedforward-dims "1024,1024,2048,2048,1024" \
--exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
--epoch 30 \
--avg 17 \
--max-duration 350 \
--decoding-method $m \
--manifest-dir /mnt/host/corpus/csj/fbank \
--lang data/lang_char \
--transcript-mode disfluent \
--res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30/github/stream_"$chunk"_"$m" \
--decode-chunk-len $chunk \
--gpu 2 \
--num-decode-streams 40
done
done

Commands for fluent training and decoding

The training command was:

./pruned_transducer_stateless7_streaming/train.py \
--feedforward-dims "1024,1024,2048,2048,1024" \
--world-size 8 \
--num-epochs 30 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \
--max-duration 375 \
--transcript-mode fluent \
--lang data/lang_char \
--manifest-dir /mnt/host/corpus/csj/fbank \
--pad-feature 30 \
--musan-dir /mnt/host/corpus/musan/musan/fbank

The simulated streaming decoding command was:

for chunk in 64 32; do
for m in greedy_search fast_beam_search modified_beam_search; do
python pruned_transducer_stateless7_streaming/decode.py \
--feedforward-dims "1024,1024,2048,2048,1024" \
--exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \
--epoch 30 \
--avg 12 \
--max-duration 350 \
--decoding-method $m \
--manifest-dir /mnt/host/corpus/csj/fbank \
--lang data/lang_char \
--transcript-mode fluent \
--res-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30/github/sim_"$chunk"_"$m" \
--decode-chunk-len $chunk \
--pad-feature 30 \
--gpu 1
done
done

The streaming chunk-wise decoding command was:

for chunk in 64 32; do
for m in greedy_search fast_beam_search modified_beam_search; do
python pruned_transducer_stateless7_streaming/streaming_decode.py \
--feedforward-dims "1024,1024,2048,2048,1024" \
--exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \
--epoch 30 \
--avg 12 \
--max-duration 350 \
--decoding-method $m \
--manifest-dir /mnt/host/corpus/csj/fbank \
--lang data/lang_char \
--transcript-mode fluent \
--res-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30/github/stream_"$chunk"_"$m" \
--decode-chunk-len $chunk \
--gpu 3 \
--num-decode-streams 40
done
done
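As an aside, since the difference between the two scripts invoked above came up earlier in this thread: decode.py performs simulated streaming (one forward pass over the whole utterance, with the encoder's attention restricted to chunks), while streaming_decode.py decodes genuinely chunk by chunk with cached encoder states. A rough sketch of the contrast, using a hypothetical encoder interface (init_states and streaming_forward here are assumptions, not the recipe's exact API):

```python
import torch

def simulated_streaming(encoder, feats: torch.Tensor) -> torch.Tensor:
    # Whole utterance in one call; chunking happens inside the encoder
    # via attention masking, so numerics match offline computation.
    return encoder(feats.unsqueeze(0))

def chunkwise_streaming(encoder, feats: torch.Tensor, chunk: int) -> torch.Tensor:
    # Real chunk-by-chunk decoding: only past state plus the current chunk
    # is visible, so results can differ slightly from simulated streaming.
    states = encoder.init_states()
    outs = []
    for start in range(0, feats.size(0), chunk):
        piece = feats[start : start + chunk].unsqueeze(0)
        out, states = encoder.streaming_forward(piece, states)
        outs.append(out)
    return torch.cat(outs, dim=1)
```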
@csukuangfj Just to update that I have addressed your comments, and that this branch and the HuggingFace repo have been updated with the new results.
@teowenshen Thank you very much! Is it ready for merge?
@csukuangfj Yes. It is ready for merge from my end. Thanks!
Here is a demo on iPhone using the model trained from this pull request. It uses https://github.com/k2-fsa/sherpa-ncnn for deployment. 2023-02-15-sherpa-ncnn-streaming-zipformer-japanese-iPhone.mp4
Could you explain the differences between the disfluent and fluent transcripts?
Yes, sure. In the CSJ transcript, words are explicitly tagged to express a variety of information. You can refer to Table 3 of this paper to get an idea. In this recipe, the disfluent transcript keeps fillers and partial words as spoken, while the fluent transcript removes them. (Kaldi is trained on the ….) However, a problem I noticed early on while training the … For example, …
So, I uploaded the …
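To make the disfluent-to-fluent mapping concrete, here is a toy sketch in the spirit of disfluent_recogs_to_fluent.py. The tag set is simplified to the (F …) filler and (D …) fragment tags from the CSJ paper cited above, and the function is an illustration, not the recipe's actual implementation:

```python
import re

# Simplified CSJ-style tags: (F ...) marks fillers, (D ...) word fragments.
# A fluent transcript drops the tagged content that a disfluent one keeps.
TAG_RE = re.compile(r"\((?:F|D)\s+([^()]*)\)")

def to_fluent(disfluent: str) -> str:
    """Toy disfluent -> fluent conversion: delete filler/fragment spans."""
    fluent = TAG_RE.sub("", disfluent)
    return re.sub(r"\s+", " ", fluent).strip()

print(to_fluent("(F えー)今日は(D わ)私が発表します"))
# -> 今日は私が発表します
```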
Thanks for your detailed explanation!
Is the model at https://huggingface.co/TeoWenShen/icefall-asr-csj-pruned-transducer-stateless7-streaming-230208 …? If yes, could you add a README.md to it containing something like below:
You can find an example at …
I'm sorry, I've just clarified: the models from this pull request are not available for commercial use. I've taken down the models online. Can you help to delete the models from the demo page at https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition too, since it is no longer working? Since the original Huggingface link to the model has been deleted, I will send in a pull request to edit the README file. Deepest apologies for the inconvenience! I will try to write a recipe for another openly available Japanese corpus if time permits.
@teowenshen Hope that a new Japanese model will be available soon. By the way, I have converted this model to sherpa-ncnn and sherpa-onnx; shall I also take them down?
@teowenshen …
Yes, please help to take them down as well. Thanks!
I am really sorry as it is not my place to permit any use of derivatives from CSJ. I just want to say that I am truly on board with the open source spirit of k2-fsa / NGK, and personally have nothing against the commercial use of models. Once again, really sorry for the inconvenience!
@teowenshen …
See discussions at k2-fsa/icefall#892
CERs
[NOTE: These results are without padding during training. Later experiments showed that padding both during training and decoding decreased insertions at the end of utterances, at least for CSJ. See the results of those later experiments above.]
The CERs are:
Trained and evaluated on disfluent transcript
Trained and evaluated on fluent transcript
Comparing disfluent and fluent models
This comparison evaluates the disfluent model on the fluent transcript (calculated by disfluent_recogs_to_fluent.py), forgiving the disfluent model's mistakes on fillers and partial words. It is meant as an illustrative metric only, so that the disfluent and fluent models can be compared.

Commands for disfluent training and decoding
The training command was:
./pruned_transducer_stateless7_streaming/train.py \
--feedforward-dims "1024,1024,2048,2048,1024" \
--context-size 2 \
--world-size 8 \
--num-epochs 30 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir pruned_transducer_stateless7_streaming/exp_disfluent \
--max-duration 375 \
--transcript-mode disfluent \
--lang data/lang_char \
--musan-dir /mnt/host/corpus/musan/musan/fbank
Padding with 30 at decoding time (with this model trained without padding) caused many insertions at the end of utterances. The simulated streaming decoding command was:
The streaming chunk-wise decoding command was:
Commands for fluent training and decoding
The training command was:
The simulated streaming decoding command was:
The streaming chunk-wise decoding command was: