
CSJ pruned_transducer_stateless7_streaming #892

Merged · 27 commits · Feb 13, 2023

Conversation

@teowenshen (Contributor) commented Feb 9, 2023

CERs

[NOTE: These results are without padding during training. Later experiments showed that padding during both training and decoding decreased insertions at the end of utterances, at least for CSJ. See the follow-up comment below for the results of those later experiments.]

The CERs are:

Trained and evaluated on disfluent transcript

| decoding method | chunk size | eval1 | eval2 | eval3 | excluded | valid | model averaging | decoding mode |
|---|---|---|---|---|---|---|---|---|
| fast beam search | 320ms | 6.27 | 5.13 | 5.05 | 6.30 | 5.42 | --epoch 30 --avg 14 | simulated streaming |
| fast beam search | 320ms | 5.91 | 4.30 | 4.53 | 6.13 | 5.13 | --epoch 30 --avg 14 | chunk-wise |
| greedy search | 320ms | 5.81 | 4.29 | 4.63 | 6.07 | 5.13 | --epoch 30 --avg 14 | simulated streaming |
| greedy search | 320ms | 5.91 | 4.50 | 4.65 | 6.36 | 5.34 | --epoch 30 --avg 14 | chunk-wise |
| modified beam search | 320ms | 5.59 | 4.20 | 4.39 | 5.54 | 4.90 | --epoch 30 --avg 14 | simulated streaming |
| modified beam search | 320ms | 5.79 | 4.48 | 4.41 | 5.98 | 5.19 | --epoch 30 --avg 14 | chunk-wise |
| fast beam search | 640ms | 5.76 | 4.35 | 4.39 | 5.40 | 4.92 | --epoch 30 --avg 14 | simulated streaming |
| fast beam search | 640ms | 5.45 | 4.31 | 4.29 | 5.61 | 4.97 | --epoch 30 --avg 14 | chunk-wise |
| greedy search | 640ms | 5.37 | 3.94 | 4.03 | 5.22 | 4.77 | --epoch 30 --avg 14 | simulated streaming |
| greedy search | 640ms | 5.77 | 4.44 | 4.49 | 5.70 | 5.29 | --epoch 30 --avg 14 | chunk-wise |
| modified beam search | 640ms | 5.19 | 3.81 | 3.93 | 4.83 | 4.59 | --epoch 30 --avg 14 | simulated streaming |
| modified beam search | 640ms | 6.71 | 5.35 | 4.95 | 6.06 | 5.94 | --epoch 30 --avg 14 | chunk-wise |

Trained and evaluated on fluent transcript

| decoding method | chunk size | eval1 | eval2 | eval3 | excluded | valid | model averaging | decoding mode |
|---|---|---|---|---|---|---|---|---|
| fast beam search (pad 30) | 320ms | 4.72 | 3.74 | 4.21 | 5.21 | 4.39 | --epoch 30 --avg 19 | simulated streaming |
| fast beam search | 320ms | 4.63 | 3.63 | 4.18 | 5.30 | 4.31 | --epoch 30 --avg 19 | chunk-wise |
| greedy search | 320ms | 4.83 | 3.71 | 4.27 | 4.89 | 4.38 | --epoch 30 --avg 19 | simulated streaming |
| greedy search | 320ms | 4.70 | 3.87 | 4.24 | 5.39 | 4.39 | --epoch 30 --avg 19 | chunk-wise |
| modified beam search | 320ms | 4.61 | 3.55 | 4.07 | 4.89 | 4.18 | --epoch 30 --avg 19 | simulated streaming |
| modified beam search | 320ms | 4.53 | 3.73 | 3.98 | 5.90 | 4.25 | --epoch 30 --avg 19 | chunk-wise |
| fast beam search (pad 30) | 640ms | 4.33 | 3.55 | 4.03 | 4.97 | 4.33 | --epoch 30 --avg 19 | simulated streaming |
| fast beam search | 640ms | 4.21 | 3.64 | 3.93 | 5.04 | 4.18 | --epoch 30 --avg 19 | chunk-wise |
| greedy search | 640ms | 4.30 | 3.51 | 3.91 | 4.45 | 4.04 | --epoch 30 --avg 19 | simulated streaming |
| greedy search | 640ms | 4.40 | 3.83 | 4.03 | 5.14 | 4.31 | --epoch 30 --avg 19 | chunk-wise |
| modified beam search | 640ms | 4.11 | 3.29 | 3.66 | 4.33 | 3.88 | --epoch 30 --avg 19 | simulated streaming |
| modified beam search | 640ms | 4.42 | 3.91 | 3.93 | 5.62 | 4.33 | --epoch 30 --avg 19 | chunk-wise |

Comparing disfluent and fluent models

$$\texttt{CER}^{f}_d = \frac{\texttt{sub}_f + \texttt{ins} + \texttt{del}_f}{N_f}$$

This comparison evaluates the disfluent model on the fluent transcript (computed by disfluent_recogs_to_fluent.py), forgiving the disfluent model's mistakes on fillers and partial words. It is meant as an illustrative metric only, so that the disfluent and fluent models can be compared. A short sketch of the idea follows the table below.

| decoding method | chunk size | eval1 (d vs f) | eval2 (d vs f) | eval3 (d vs f) | excluded (d vs f) | valid (d vs f) | decoding mode |
|---|---|---|---|---|---|---|---|
| fast beam search | 320ms | 5.44 vs 4.72 | 4.49 vs 3.74 | 4.44 vs 4.21 | 5.14 vs 5.21 | 4.64 vs 4.39 | simulated streaming |
| fast beam search | 320ms | 5.05 vs 4.63 | 3.63 vs 3.63 | 3.91 vs 4.18 | 4.75 vs 5.30 | 4.29 vs 4.31 | chunk-wise |
| greedy search | 320ms | 4.97 vs 4.83 | 3.63 vs 3.71 | 4.02 vs 4.27 | 4.93 vs 4.89 | 4.32 vs 4.38 | simulated streaming |
| greedy search | 320ms | 5.02 vs 4.70 | 3.78 vs 3.87 | 4.02 vs 4.24 | 5.11 vs 5.39 | 4.47 vs 4.39 | chunk-wise |
| modified beam search | 320ms | 4.86 vs 4.61 | 3.62 vs 3.55 | 3.85 vs 4.07 | 4.66 vs 4.89 | 4.21 vs 4.18 | simulated streaming |
| modified beam search | 320ms | 5.05 vs 4.53 | 3.89 vs 3.73 | 3.88 vs 3.98 | 4.88 vs 5.90 | 4.48 vs 4.25 | chunk-wise |
| fast beam search | 640ms | 4.93 vs 4.33 | 3.74 vs 3.55 | 3.78 vs 4.03 | 4.31 vs 4.97 | 4.15 vs 4.33 | simulated streaming |
| fast beam search | 640ms | 4.61 vs 4.21 | 3.67 vs 3.64 | 3.66 vs 3.93 | 4.34 vs 5.04 | 4.15 vs 4.18 | chunk-wise |
| greedy search | 640ms | 4.48 vs 4.30 | 3.29 vs 3.51 | 3.43 vs 3.91 | 4.11 vs 4.45 | 3.96 vs 4.04 | simulated streaming |
| greedy search | 640ms | 4.89 vs 4.40 | 3.77 vs 3.83 | 3.87 vs 4.03 | 4.41 vs 5.14 | 4.47 vs 4.31 | chunk-wise |
| modified beam search | 640ms | 4.45 vs 4.11 | 3.28 vs 3.29 | 3.41 vs 3.66 | 3.97 vs 4.33 | 3.90 vs 3.88 | simulated streaming |
| modified beam search | 640ms | 6.10 vs 4.42 | 4.86 vs 3.91 | 4.51 vs 3.93 | 5.16 vs 5.62 | 5.34 vs 4.33 | chunk-wise |
| average of (d - f) | | 0.50 | 0.14 | -0.13 | -0.45 | 0.11 | |
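For intuition, here is a minimal sketch of the idea behind this metric. It is not the actual disfluent_recogs_to_fluent.py; the FILLERS set and the token handling are hypothetical stand-ins. Filler and partial-word tokens are dropped from the disfluent hypothesis, which is then scored against the fluent reference with a plain character error rate, giving the $\texttt{sub}_f$, $\texttt{ins}$, and $\texttt{del}_f$ counts of the formula above.

```python
# Hypothetical sketch; the real mapping lives in disfluent_recogs_to_fluent.py.
FILLERS = {"え", "えー", "あのー"}  # illustrative stand-in set of filler tokens

def drop_disfluencies(tokens: list[str]) -> str:
    """Map a disfluent hypothesis onto fluent-style text."""
    return "".join(t for t in tokens if t not in FILLERS)

def cer(ref: str, hyp: str) -> float:
    """Character error rate (sub + ins + del) / len(ref) via Levenshtein DP."""
    dp = list(range(len(hyp) + 1))  # distances for the empty-reference row
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                    # deletion
                dp[j - 1] + 1,                # insertion
                prev + (0 if r == h else 1),  # substitution or match
            )
            prev = cur
    return dp[-1] / max(len(ref), 1)

# Disfluent hypothesis scored against a fluent reference:
print(cer("この図は", drop_disfluencies(["えー", "この", "図", "は"])))  # 0.0
```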

Commands for disfluent training and decoding

The training command was:

```bash
./pruned_transducer_stateless7_streaming/train.py \
  --feedforward-dims  "1024,1024,2048,2048,1024" \
  --context-size 2 \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent \
  --max-duration 375 \
  --transcript-mode disfluent \
  --lang data/lang_char \
  --musan-dir /mnt/host/corpus/musan/musan/fbank
```

Padding with 30 frames at decoding time caused many insertions at the end of utterances, so --pad 4 is used here. (--decode-chunk-len is given in 10 ms feature frames, so 32 and 64 correspond to the 320 ms and 640 ms chunk sizes in the tables above.) The simulated streaming decoding command was:

```bash
for chunk in 64 32; do
    for m in greedy_search fast_beam_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2 \
            --epoch 30 \
            --avg 14 \
            --max-duration 250 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode disfluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2/github/sim_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --pad 4
    done
done
```

The streaming chunk-wise decoding command was:

```bash
for chunk in 64 32; do
    for m in greedy_search fast_beam_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/streaming_decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2 \
            --epoch 30 \
            --avg 14 \
            --max-duration 250 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode disfluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2/github/stream_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --num-decode-streams 40
    done
done
```

Commands for fluent training and decoding

The training command was:

```bash
./pruned_transducer_stateless7_streaming/train.py \
  --feedforward-dims  "1024,1024,2048,2048,1024" \
  --context-size 2 \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2 \
  --max-duration 375 \
  --transcript-mode fluent \
  --telegram-cred misc.ini \
  --lang data/lang_char \
  --manifest-dir $csj_fbank_dir \
  --musan-dir /mnt/host/corpus/musan/musan/fbank
```

The simulated streaming decoding command was:

```bash
for chunk in 64 32; do
    for m in greedy_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2 \
            --epoch 30 \
            --avg 19 \
            --max-duration 350 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode fluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_fluent_2/github/sim_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --pad 4
    done
    # Padding of 4 caused many deletions only in the fast_beam_search case.
    python pruned_transducer_stateless7_streaming/decode.py \
        --feedforward-dims  "1024,1024,2048,2048,1024" \
        --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2 \
        --epoch 30 \
        --avg 19 \
        --max-duration 350 \
        --decoding-method fast_beam_search \
        --manifest-dir /mnt/host/corpus/csj/fbank \
        --lang data/lang_char \
        --transcript-mode fluent \
        --res-dir pruned_transducer_stateless7_streaming/exp_fluent_2/github/sim_"$chunk"_fast_beam_search \
        --decode-chunk-len $chunk \
        --pad 30
done
```

The streaming chunk-wise decoding command was:

```bash
for chunk in 64 32; do
    for m in greedy_search fast_beam_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/streaming_decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2 \
            --epoch 30 \
            --avg 19 \
            --max-duration 250 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode fluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_fluent_2/github/stream_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --gpu 4 \
            --num-decode-streams 40
    done
done
```

@csukuangfj (Collaborator)

@teowenshen

Thanks! Is it ready for review?

@teowenshen (Contributor, Author) commented Feb 9, 2023

I am writing up RESULTS.md and README.md now, but will publish them once I manage to upload my results and model to HuggingFace. Also, I still couldn't wrap my head around the difference between streaming_decode.py and decode.py.

Other than that, this PR is ready.

EDIT: Sorry, I just realised I haven't adapted export.py. I will commit it soon.

@pkufool (Collaborator) commented Feb 9, 2023

> Also, I still couldn't wrap my head around the difference between streaming_decode.py and decode.py.

About the difference between streaming_decode.py and decode.py, please see the discussion here and the documents here.

@teowenshen (Contributor, Author)

Apologies! I meant that I couldn't figure out why the results of streaming_decode.py and decode.py have such a CER gap, since given the same chunk length and left context they should be almost the same. I have a more detailed write-up in #807, and have since retrained another model to compare.
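Roughly, decode.py simulates streaming with a single forward pass over the whole utterance using chunked attention masks, while streaming_decode.py feeds the model real chunks and carries cached states between calls. A sketch of the contrast, using a hypothetical model API (init_states and streaming_forward are assumed names, not necessarily icefall's):

```python
import torch

def simulated_streaming(model, feats: torch.Tensor) -> torch.Tensor:
    # decode.py style: one pass over (1, T, F) features; causality is
    # enforced inside the model by chunked attention masks.
    return model(feats)

def chunk_wise(model, feats: torch.Tensor, chunk_len: int) -> torch.Tensor:
    # streaming_decode.py style: real chunks, with cached convolution and
    # attention states carried across calls.
    states = model.init_states()
    outs = []
    for start in range(0, feats.size(1), chunk_len):
        out, states = model.streaming_forward(
            feats[:, start : start + chunk_len], states
        )
        outs.append(out)
    return torch.cat(outs, dim=1)
```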

@teowenshen (Contributor, Author)

I have retrained an early model with padding (30), so that decoding with padding 30 produces fewer insertions at the end of utterances. This improved the simulated streaming decode.py results, but the chunk-wise streaming_decode.py results are still bad.

How can I pad streaming_decode.py with 30 frames only, like decode.py?

@csukuangfj (Collaborator) left a review:

Thanks!

Left some minor comments.

Review comments were left on:
- egs/csj/ASR/RESULTS.md
- egs/csj/ASR/local/disfluent_recogs_to_fluent.py
- egs/csj/ASR/local/prepare_lang_char.py
- egs/csj/ASR/local/utils/tokenizer.py (two threads)
@teowenshen (Contributor, Author)

@csukuangfj I hope it's not too late, but I actually have much better results after padding both in training and decoding. I am updating the CER tables, the HuggingFace repo, and the script commands.

Code-wise, the only addition is a "--pad-feature" argument in train.py, so the existing bulk of code that you've reviewed is mostly the same. So sorry for the hassle! 🙏

@teowenshen (Contributor, Author) commented Feb 13, 2023

CERs

These CERs are from models trained with padding of 30 frames, introduced with the new --pad-feature argument:
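Conceptually, --pad-feature appends a fixed number of quiet frames to the end of each utterance's fbank features, both in training and in decoding. A minimal sketch, assuming a padding value of log(1e-10); the actual plumbing in train.py may differ:

```python
import math
import torch

LOG_EPS = math.log(1e-10)  # assumed log-fbank padding value

def pad_feature(features: torch.Tensor, num_frames: int = 30) -> torch.Tensor:
    """Append num_frames frames of near-silence to (T, F) fbank features."""
    tail = features.new_full((num_frames, features.size(1)), LOG_EPS)
    return torch.cat([features, tail], dim=0)
```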

Trained and evaluated on disfluent transcript

| decoding method | chunk size | eval1 | eval2 | eval3 | excluded | valid | model averaging | decoding mode |
|---|---|---|---|---|---|---|---|---|
| fast beam search | 320ms | 5.39 | 4.08 | 4.16 | 5.40 | 5.02 | --epoch 30 --avg 17 | simulated streaming |
| fast beam search | 320ms | 5.34 | 4.10 | 4.26 | 5.61 | 4.91 | --epoch 30 --avg 17 | chunk-wise |
| greedy search | 320ms | 5.43 | 4.14 | 4.31 | 5.48 | 4.88 | --epoch 30 --avg 17 | simulated streaming |
| greedy search | 320ms | 5.44 | 4.14 | 4.39 | 5.70 | 4.98 | --epoch 30 --avg 17 | chunk-wise |
| modified beam search | 320ms | 5.20 | 3.95 | 4.09 | 5.12 | 4.75 | --epoch 30 --avg 17 | simulated streaming |
| modified beam search | 320ms | 5.18 | 4.07 | 4.12 | 5.36 | 4.77 | --epoch 30 --avg 17 | chunk-wise |
| fast beam search | 640ms | 5.01 | 3.78 | 3.96 | 4.85 | 4.60 | --epoch 30 --avg 17 | simulated streaming |
| fast beam search | 640ms | 4.97 | 3.88 | 3.96 | 4.91 | 4.61 | --epoch 30 --avg 17 | chunk-wise |
| greedy search | 640ms | 5.02 | 3.84 | 4.14 | 5.02 | 4.59 | --epoch 30 --avg 17 | simulated streaming |
| greedy search | 640ms | 5.32 | 4.22 | 4.33 | 5.39 | 4.99 | --epoch 30 --avg 17 | chunk-wise |
| modified beam search | 640ms | 4.78 | 3.66 | 3.85 | 4.72 | 4.42 | --epoch 30 --avg 17 | simulated streaming |
| modified beam search | 640ms | 5.77 | 4.72 | 4.73 | 5.85 | 5.36 | --epoch 30 --avg 17 | chunk-wise |

Trained and evaluated on fluent transcript

| decoding method | chunk size | eval1 | eval2 | eval3 | excluded | valid | model averaging | decoding mode |
|---|---|---|---|---|---|---|---|---|
| fast beam search | 320ms | 4.19 | 3.63 | 3.77 | 4.43 | 4.09 | --epoch 30 --avg 12 | simulated streaming |
| fast beam search | 320ms | 4.06 | 3.55 | 3.66 | 4.70 | 4.04 | --epoch 30 --avg 12 | chunk-wise |
| greedy search | 320ms | 4.22 | 3.62 | 3.82 | 4.45 | 3.98 | --epoch 30 --avg 12 | simulated streaming |
| greedy search | 320ms | 4.13 | 3.61 | 3.85 | 4.67 | 4.05 | --epoch 30 --avg 12 | chunk-wise |
| modified beam search | 320ms | 4.02 | 3.43 | 3.62 | 4.43 | 3.81 | --epoch 30 --avg 12 | simulated streaming |
| modified beam search | 320ms | 3.97 | 3.43 | 3.59 | 4.99 | 3.88 | --epoch 30 --avg 12 | chunk-wise |
| fast beam search | 640ms | 3.80 | 3.31 | 3.55 | 4.16 | 3.90 | --epoch 30 --avg 12 | simulated streaming |
| fast beam search | 640ms | 3.81 | 3.34 | 3.46 | 4.58 | 3.85 | --epoch 30 --avg 12 | chunk-wise |
| greedy search | 640ms | 3.92 | 3.38 | 3.65 | 4.31 | 3.88 | --epoch 30 --avg 12 | simulated streaming |
| greedy search | 640ms | 3.98 | 3.38 | 3.64 | 4.54 | 4.01 | --epoch 30 --avg 12 | chunk-wise |
| modified beam search | 640ms | 3.72 | 3.26 | 3.39 | 4.10 | 3.65 | --epoch 30 --avg 12 | simulated streaming |
| modified beam search | 640ms | 3.78 | 3.32 | 3.45 | 4.81 | 3.81 | --epoch 30 --avg 12 | chunk-wise |

Comparing disfluent and fluent models

$$\texttt{CER}^{f}_d = \frac{\texttt{sub}_f + \texttt{ins} + \texttt{del}_f}{N_f}$$

This comparison evaluates the disfluent model on the fluent transcript (computed by disfluent_recogs_to_fluent.py), forgiving the disfluent model's mistakes on fillers and partial words. It is meant as an illustrative metric only, so that the disfluent and fluent models can be compared.

| decoding method | chunk size | eval1 (d vs f) | eval2 (d vs f) | eval3 (d vs f) | excluded (d vs f) | valid (d vs f) | decoding mode |
|---|---|---|---|---|---|---|---|
| fast beam search | 320ms | 4.54 vs 4.19 | 3.44 vs 3.63 | 3.56 vs 3.77 | 4.22 vs 4.43 | 4.22 vs 4.09 | simulated streaming |
| fast beam search | 320ms | 4.48 vs 4.06 | 3.41 vs 3.55 | 3.65 vs 3.66 | 4.26 vs 4.70 | 4.08 vs 4.04 | chunk-wise |
| greedy search | 320ms | 4.53 vs 4.22 | 3.48 vs 3.62 | 3.69 vs 3.82 | 4.38 vs 4.45 | 4.05 vs 3.98 | simulated streaming |
| greedy search | 320ms | 4.53 vs 4.13 | 3.46 vs 3.61 | 3.71 vs 3.85 | 4.48 vs 4.67 | 4.12 vs 4.05 | chunk-wise |
| modified beam search | 320ms | 4.45 vs 4.02 | 3.38 vs 3.43 | 3.57 vs 3.62 | 4.19 vs 4.43 | 4.04 vs 3.81 | simulated streaming |
| modified beam search | 320ms | 4.44 vs 3.97 | 3.47 vs 3.43 | 3.56 vs 3.59 | 4.28 vs 4.99 | 4.04 vs 3.88 | chunk-wise |
| fast beam search | 640ms | 4.14 vs 3.80 | 3.12 vs 3.31 | 3.38 vs 3.55 | 3.72 vs 4.16 | 3.81 vs 3.90 | simulated streaming |
| fast beam search | 640ms | 4.05 vs 3.81 | 3.23 vs 3.34 | 3.36 vs 3.46 | 3.65 vs 4.58 | 3.78 vs 3.85 | chunk-wise |
| greedy search | 640ms | 4.10 vs 3.92 | 3.17 vs 3.38 | 3.50 vs 3.65 | 3.87 vs 4.31 | 3.77 vs 3.88 | simulated streaming |
| greedy search | 640ms | 4.41 vs 3.98 | 3.56 vs 3.38 | 3.69 vs 3.64 | 4.26 vs 4.54 | 4.16 vs 4.01 | chunk-wise |
| modified beam search | 640ms | 4.00 vs 3.72 | 3.08 vs 3.26 | 3.33 vs 3.39 | 3.75 vs 4.10 | 3.71 vs 3.65 | simulated streaming |
| modified beam search | 640ms | 5.05 vs 3.78 | 4.22 vs 3.32 | 4.26 vs 3.45 | 5.02 vs 4.81 | 4.73 vs 3.81 | chunk-wise |
| average (d - f) | | 0.43 | -0.02 | -0.02 | -0.34 | 0.13 | |

Commands for disfluent training and decoding

The training command was:

```bash
./pruned_transducer_stateless7_streaming/train.py \
  --feedforward-dims  "1024,1024,2048,2048,1024" \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
  --max-duration 375 \
  --transcript-mode disfluent \
  --lang data/lang_char \
  --manifest-dir /mnt/host/corpus/csj/fbank \
  --pad-feature 30 \
  --musan-dir /mnt/host/corpus/musan/musan/fbank
```

The simulated streaming decoding command was:

```bash
for chunk in 64 32; do
    for m in greedy_search fast_beam_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
            --epoch 30 \
            --avg 17 \
            --max-duration 350 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode disfluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30/github/sim_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --pad-feature 30 \
            --gpu 0
    done
done
```

The streaming chunk-wise decoding command was:

```bash
for chunk in 64 32; do
    for m in greedy_search fast_beam_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/streaming_decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
            --epoch 30 \
            --avg 17 \
            --max-duration 350 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode disfluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30/github/stream_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --gpu 2 \
            --num-decode-streams 40
    done
done
```

Commands for fluent training and decoding

The training command was:

```bash
./pruned_transducer_stateless7_streaming/train.py \
  --feedforward-dims  "1024,1024,2048,2048,1024" \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \
  --max-duration 375 \
  --transcript-mode fluent \
  --lang data/lang_char \
  --manifest-dir /mnt/host/corpus/csj/fbank \
  --pad-feature 30 \
  --musan-dir /mnt/host/corpus/musan/musan/fbank
```

The simulated streaming decoding command was:

```bash
for chunk in 64 32; do
    for m in greedy_search fast_beam_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \
            --epoch 30 \
            --avg 12 \
            --max-duration 350 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode fluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30/github/sim_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --pad-feature 30 \
            --gpu 1
    done
done
```

The streaming chunk-wise decoding command was:

```bash
for chunk in 64 32; do
    for m in greedy_search fast_beam_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/streaming_decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \
            --epoch 30 \
            --avg 12 \
            --max-duration 350 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode fluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30/github/stream_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --gpu 3 \
            --num-decode-streams 40
    done
done
```

@teowenshen (Contributor, Author)

@csukuangfj Just to update that I have addressed your comments, and that this branch and the HuggingFace repo have been updated with the new results.

@csukuangfj (Collaborator)

@teowenshen Thank you very much!

Is it ready for merge?

@teowenshen (Contributor, Author)

@csukuangfj Yes. It is ready for merge from my end. Thanks!

@csukuangfj merged commit e63a8c2 into k2-fsa:master on Feb 13, 2023.
@csukuangfj (Collaborator)

@teowenshen

Here is a demo on iPhone using the model trained in this pull request.

It uses https://github.com/k2-fsa/sherpa-ncnn for deployment.

(Video attachment: 2023-02-15-sherpa-ncnn-streaming-zipformer-japanese-iPhone.mp4)

@csukuangfj (Collaborator) commented Feb 21, 2023

@teowenshen

Could you explain the differences between fluent and disfluent?

@teowenshen (Contributor, Author) commented Feb 21, 2023

Yes, sure. disfluent is the transcript mode that most other CSJ ASR recipes are based on. fluent is the mode in which disfluency elements, like fillers and partial words, are removed.

In the CSJ transcript, words are explicitly tagged to express a variety of information. You can refer to Table 3 of this paper to get an idea. In this recipe, the fluent model is trained on transcripts where all F-, D-, and D2-tags are removed, while all other tags are handled the same way as in disfluent mode (sketched roughly after the note below).

(Kaldi is trained on the disfluent transcript. ESPnet uses the disfluent transcript too, if I'm not mistaken. Basically, for academic comparison purposes, we can use the disfluent transcript mode. However, because CSJ's original utterances are much shorter than 10 s, this recipe's transcript is not exactly the same as Kaldi's, due to differences in the utterance concatenation algorithm.)
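As a toy illustration of the two transcript modes, assuming a simplified tag syntax like (F えー) for fillers and (D …) / (D2 …) for partial words; the recipe's actual parser handles much more:

```python
import re

# Simplified, hypothetical CSJ-style tags: (F ...) = filler,
# (D ...) / (D2 ...) = partial word. Real CSJ annotation is richer.
TAG = re.compile(r"\((?:F|D2?)\s+([^()]+)\)")

def to_disfluent(text: str) -> str:
    # Keep the tagged surface form; drop only the markup.
    return TAG.sub(r"\1", text)

def to_fluent(text: str) -> str:
    # Drop the tagged content entirely.
    return re.sub(r"\s+", " ", TAG.sub("", text)).strip()

s = "(F えー) この 図 は"
print(to_disfluent(s))  # えー この 図 は
print(to_fluent(s))     # この 図 は
```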

However, a problem I noticed early on while training the disfluent model is that many errors happen around fillers and partial words, and in fact those errors aren't very important in the actual text.

For example,

| Type | Disfluent | Fluent | Remarks |
|---|---|---|---|
| Filler | え (*->ー) こ の 図 は ... | こ の 図 は ... | Error in the spelling of the filler by the disfluent model. |
| Filler | え ー (*->っ) と ー | | Whole utterance consists of a filler. The disfluent model wrongly spelled the filler, while the fluent model correctly ignored the filler. |
| Partial word: Stuttering | (ね こ し ゅ ぼ ん->と な っ て し ま う) え ー (*->っ) と こ こ で ... | こ こ で ... | The disfluent model wrongly spelled the gibberish partial word and the filler, while the fluent model correctly ignored both the partial word and the filler. |
| Partial word: Speech correction | (に ほ ん->日 本) 難 易 度 ... | (*->日 本) 難 易 度 | Because the partial word itself is a full word, the disfluent model spelled it with Chinese characters although the transcript used phonetic hiragana. The fluent model unfortunately picked up on the partial word too. |

So, I uploaded the fluent model too because, in my opinion, it is more immediately useful. Plus, the training code is the same anyway.

@csukuangfj (Collaborator)

@teowenshen

Thanks for your detailed explanation!

@csukuangfj (Collaborator)

@teowenshen

Is the model at https://huggingface.co/TeoWenShen/icefall-asr-csj-pruned-transducer-stateless7-streaming-230208
suitable for commercial use?

If yes, could you add a README.md to it containing something like the following:

```
---
license: apache-2.0
---
```

You can find an example at
https://huggingface.co/csukuangfj/sherpa-ncnn-conv-emformer-transducer-2022-12-06/raw/main/README.md

@teowenshen (Contributor, Author)

I'm sorry, I have just clarified this: the models from this pull request are not available for commercial use.

I've taken the models down. Can you help to delete them from the demo page at https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition too, since it is no longer working?

Since the original HuggingFace link to the model has been deleted, I will send in a pull request to edit the README file.

Deepest apologies for the inconvenience! I will try to attempt a recipe for another openly available Japanese corpus if time permits.

@csukuangfj (Collaborator) commented Feb 22, 2023

@teowenshen
Also, can I make the exported models private and use them only to demonstrate k2-fsa?

@teowenshen (Contributor, Author)

> By the way, I have converted this model to sherpa-ncnn and sherpa-onnx, shall I also take them down?
>
> https://huggingface.co/csukuangfj/sherpa-ncnn-streaming-zipformer-ja-fluent-2023-02-14
> https://huggingface.co/csukuangfj/sherpa-ncnn-streaming-zipformer-ja-disfluent-2023-02-14

Yes, please help to take them down as well. Thanks!

> Also, can I make the exported models private and use them only to demonstrate k2-fsa?

I am really sorry, as it is not my place to permit any use of derivatives from CSJ.

I just want to say that I am truly on board with the open-source spirit of k2-fsa / NGK, and personally have nothing against the commercial use of models. Once again, really sorry for the inconvenience!

@csukuangfj (Collaborator)

@teowenshen
Thanks, I see. I am taking them down.

csukuangfj added a commit to csukuangfj/sherpa that referenced this pull request Feb 22, 2023
csukuangfj added a commit to k2-fsa/sherpa that referenced this pull request Feb 22, 2023
csukuangfj added a commit to csukuangfj/sherpa-ncnn that referenced this pull request Feb 22, 2023
csukuangfj added a commit to k2-fsa/sherpa-ncnn that referenced this pull request Feb 22, 2023
LeeTZhi pushed a commit to LeeTZhi/sherpa-ncnn that referenced this pull request Oct 27, 2023
@teowenshen deleted the csj_pts7stream branch on December 12, 2023.