
CSJ pruned_transducer_stateless7_streaming #892

Merged · 27 commits · Feb 13, 2023

Conversation

@teowenshen (Contributor) commented Feb 9, 2023

CERs

[NOTE: These results are without padding during training. Later experiments showed that padding during both training and decoding decreased insertions at the end of utterances, at least for CSJ. See the follow-up comment below for the results of those later experiments.]

The CERs are:

Trained and evaluated on disfluent transcript

| decoding method | chunk size | eval1 | eval2 | eval3 | excluded | valid | model averaging | decoding mode |
|---|---|---|---|---|---|---|---|---|
| fast beam search | 320ms | 6.27 | 5.13 | 5.05 | 6.30 | 5.42 | --epoch 30 --avg 14 | simulated streaming |
| fast beam search | 320ms | 5.91 | 4.30 | 4.53 | 6.13 | 5.13 | --epoch 30 --avg 14 | chunk-wise |
| greedy search | 320ms | 5.81 | 4.29 | 4.63 | 6.07 | 5.13 | --epoch 30 --avg 14 | simulated streaming |
| greedy search | 320ms | 5.91 | 4.50 | 4.65 | 6.36 | 5.34 | --epoch 30 --avg 14 | chunk-wise |
| modified beam search | 320ms | 5.59 | 4.20 | 4.39 | 5.54 | 4.90 | --epoch 30 --avg 14 | simulated streaming |
| modified beam search | 320ms | 5.79 | 4.48 | 4.41 | 5.98 | 5.19 | --epoch 30 --avg 14 | chunk-wise |
| fast beam search | 640ms | 5.76 | 4.35 | 4.39 | 5.40 | 4.92 | --epoch 30 --avg 14 | simulated streaming |
| fast beam search | 640ms | 5.45 | 4.31 | 4.29 | 5.61 | 4.97 | --epoch 30 --avg 14 | chunk-wise |
| greedy search | 640ms | 5.37 | 3.94 | 4.03 | 5.22 | 4.77 | --epoch 30 --avg 14 | simulated streaming |
| greedy search | 640ms | 5.77 | 4.44 | 4.49 | 5.70 | 5.29 | --epoch 30 --avg 14 | chunk-wise |
| modified beam search | 640ms | 5.19 | 3.81 | 3.93 | 4.83 | 4.59 | --epoch 30 --avg 14 | simulated streaming |
| modified beam search | 640ms | 6.71 | 5.35 | 4.95 | 6.06 | 5.94 | --epoch 30 --avg 14 | chunk-wise |

Trained and evaluated on fluent transcript

| decoding method | chunk size | eval1 | eval2 | eval3 | excluded | valid | model averaging | decoding mode |
|---|---|---|---|---|---|---|---|---|
| fast beam search (pad 30) | 320ms | 4.72 | 3.74 | 4.21 | 5.21 | 4.39 | --epoch 30 --avg 19 | simulated streaming |
| fast beam search | 320ms | 4.63 | 3.63 | 4.18 | 5.30 | 4.31 | --epoch 30 --avg 19 | chunk-wise |
| greedy search | 320ms | 4.83 | 3.71 | 4.27 | 4.89 | 4.38 | --epoch 30 --avg 19 | simulated streaming |
| greedy search | 320ms | 4.70 | 3.87 | 4.24 | 5.39 | 4.39 | --epoch 30 --avg 19 | chunk-wise |
| modified beam search | 320ms | 4.61 | 3.55 | 4.07 | 4.89 | 4.18 | --epoch 30 --avg 19 | simulated streaming |
| modified beam search | 320ms | 4.53 | 3.73 | 3.98 | 5.90 | 4.25 | --epoch 30 --avg 19 | chunk-wise |
| fast beam search (pad 30) | 640ms | 4.33 | 3.55 | 4.03 | 4.97 | 4.33 | --epoch 30 --avg 19 | simulated streaming |
| fast beam search | 640ms | 4.21 | 3.64 | 3.93 | 5.04 | 4.18 | --epoch 30 --avg 19 | chunk-wise |
| greedy search | 640ms | 4.30 | 3.51 | 3.91 | 4.45 | 4.04 | --epoch 30 --avg 19 | simulated streaming |
| greedy search | 640ms | 4.40 | 3.83 | 4.03 | 5.14 | 4.31 | --epoch 30 --avg 19 | chunk-wise |
| modified beam search | 640ms | 4.11 | 3.29 | 3.66 | 4.33 | 3.88 | --epoch 30 --avg 19 | simulated streaming |
| modified beam search | 640ms | 4.42 | 3.91 | 3.93 | 5.62 | 4.33 | --epoch 30 --avg 19 | chunk-wise |

Comparing disfluent and fluent models

$$\texttt{CER}^{f}_d = \frac{\texttt{sub}_f + \texttt{ins} + \texttt{del}_f}{N_f}$$

This comparison evaluates the disfluent model on the fluent transcript (computed by disfluent_recogs_to_fluent.py), forgiving the disfluent model's mistakes on fillers and partial words. It is meant as an illustrative metric only, so that the disfluent and fluent models can be compared. A short sketch of the idea follows the table below.

| decoding method | chunk size | eval1 (d vs f) | eval2 (d vs f) | eval3 (d vs f) | excluded (d vs f) | valid (d vs f) | decoding mode |
|---|---|---|---|---|---|---|---|
| fast beam search | 320ms | 5.44 vs 4.72 | 4.49 vs 3.74 | 4.44 vs 4.21 | 5.14 vs 5.21 | 4.64 vs 4.39 | simulated streaming |
| fast beam search | 320ms | 5.05 vs 4.63 | 3.63 vs 3.63 | 3.91 vs 4.18 | 4.75 vs 5.30 | 4.29 vs 4.31 | chunk-wise |
| greedy search | 320ms | 4.97 vs 4.83 | 3.63 vs 3.71 | 4.02 vs 4.27 | 4.93 vs 4.89 | 4.32 vs 4.38 | simulated streaming |
| greedy search | 320ms | 5.02 vs 4.70 | 3.78 vs 3.87 | 4.02 vs 4.24 | 5.11 vs 5.39 | 4.47 vs 4.39 | chunk-wise |
| modified beam search | 320ms | 4.86 vs 4.61 | 3.62 vs 3.55 | 3.85 vs 4.07 | 4.66 vs 4.89 | 4.21 vs 4.18 | simulated streaming |
| modified beam search | 320ms | 5.05 vs 4.53 | 3.89 vs 3.73 | 3.88 vs 3.98 | 4.88 vs 5.90 | 4.48 vs 4.25 | chunk-wise |
| fast beam search | 640ms | 4.93 vs 4.33 | 3.74 vs 3.55 | 3.78 vs 4.03 | 4.31 vs 4.97 | 4.15 vs 4.33 | simulated streaming |
| fast beam search | 640ms | 4.61 vs 4.21 | 3.67 vs 3.64 | 3.66 vs 3.93 | 4.34 vs 5.04 | 4.15 vs 4.18 | chunk-wise |
| greedy search | 640ms | 4.48 vs 4.30 | 3.29 vs 3.51 | 3.43 vs 3.91 | 4.11 vs 4.45 | 3.96 vs 4.04 | simulated streaming |
| greedy search | 640ms | 4.89 vs 4.40 | 3.77 vs 3.83 | 3.87 vs 4.03 | 4.41 vs 5.14 | 4.47 vs 4.31 | chunk-wise |
| modified beam search | 640ms | 4.45 vs 4.11 | 3.28 vs 3.29 | 3.41 vs 3.66 | 3.97 vs 4.33 | 3.90 vs 3.88 | simulated streaming |
| modified beam search | 640ms | 6.10 vs 4.42 | 4.86 vs 3.91 | 4.51 vs 3.93 | 5.16 vs 5.62 | 5.34 vs 4.33 | chunk-wise |
| average of (d - f) | | 0.50 | 0.14 | -0.13 | -0.45 | 0.11 | |
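For intuition, here is a minimal sketch of the idea behind this metric. It is not the actual disfluent_recogs_to_fluent.py; the FILLERS set and the token handling are hypothetical stand-ins. Filler and partial-word tokens are dropped from the disfluent hypothesis, which is then scored against the fluent reference with a plain character error rate, giving the $\texttt{sub}_f$, $\texttt{ins}$, and $\texttt{del}_f$ counts of the formula above.

```python
# Hypothetical sketch; the real mapping lives in disfluent_recogs_to_fluent.py.
FILLERS = {"え", "えー", "あのー"}  # illustrative stand-in set of filler tokens

def drop_disfluencies(tokens: list[str]) -> str:
    """Map a disfluent hypothesis onto fluent-style text."""
    return "".join(t for t in tokens if t not in FILLERS)

def cer(ref: str, hyp: str) -> float:
    """Character error rate (sub + ins + del) / len(ref) via Levenshtein DP."""
    dp = list(range(len(hyp) + 1))  # distances for the empty-reference row
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                    # deletion
                dp[j - 1] + 1,                # insertion
                prev + (0 if r == h else 1),  # substitution or match
            )
            prev = cur
    return dp[-1] / max(len(ref), 1)

# Disfluent hypothesis scored against a fluent reference:
print(cer("この図は", drop_disfluencies(["えー", "この", "図", "は"])))  # 0.0
```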

Commands for disfluent training and decoding

The training command was:

```bash
./pruned_transducer_stateless7_streaming/train.py \
  --feedforward-dims  "1024,1024,2048,2048,1024" \
  --context-size 2 \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent \
  --max-duration 375 \
  --transcript-mode disfluent \
  --lang data/lang_char \
  --musan-dir /mnt/host/corpus/musan/musan/fbank
```

Padding with 30 frames at decoding time caused many insertions at the end of utterances, so --pad 4 is used here. (--decode-chunk-len is given in 10 ms feature frames, so 32 and 64 correspond to the 320 ms and 640 ms chunk sizes in the tables above.) The simulated streaming decoding command was:

```bash
for chunk in 64 32; do
    for m in greedy_search fast_beam_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2 \
            --epoch 30 \
            --avg 14 \
            --max-duration 250 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode disfluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2/github/sim_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --pad 4
    done
done
```

The streaming chunk-wise decoding command was:

```bash
for chunk in 64 32; do
    for m in greedy_search fast_beam_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/streaming_decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2 \
            --epoch 30 \
            --avg 14 \
            --max-duration 250 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode disfluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2/github/stream_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --num-decode-streams 40
    done
done
```

Commands for fluent training and decoding

The training command was:

```bash
./pruned_transducer_stateless7_streaming/train.py \
  --feedforward-dims  "1024,1024,2048,2048,1024" \
  --context-size 2 \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2 \
  --max-duration 375 \
  --transcript-mode fluent \
  --telegram-cred misc.ini \
  --lang data/lang_char \
  --manifest-dir $csj_fbank_dir \
  --musan-dir /mnt/host/corpus/musan/musan/fbank
```

The simulated streaming decoding command was:

```bash
for chunk in 64 32; do
    for m in greedy_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2 \
            --epoch 30 \
            --avg 19 \
            --max-duration 350 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode fluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_fluent_2/github/sim_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --pad 4
    done
    # Padding of 4 caused many deletions only in the fast_beam_search case.
    python pruned_transducer_stateless7_streaming/decode.py \
        --feedforward-dims  "1024,1024,2048,2048,1024" \
        --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2 \
        --epoch 30 \
        --avg 19 \
        --max-duration 350 \
        --decoding-method fast_beam_search \
        --manifest-dir /mnt/host/corpus/csj/fbank \
        --lang data/lang_char \
        --transcript-mode fluent \
        --res-dir pruned_transducer_stateless7_streaming/exp_fluent_2/github/sim_"$chunk"_fast_beam_search \
        --decode-chunk-len $chunk \
        --pad 30
done
```

The streaming chunk-wise decoding command was:

```bash
for chunk in 64 32; do
    for m in greedy_search fast_beam_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/streaming_decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2 \
            --epoch 30 \
            --avg 19 \
            --max-duration 250 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode fluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_fluent_2/github/stream_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --gpu 4 \
            --num-decode-streams 40
    done
done
```

@csukuangfj (Collaborator)

@teowenshen

Thanks! Is it ready for review?

@teowenshen (Contributor, Author) commented Feb 9, 2023

I am writing up RESULTS.md and README.md now, but will publish them once I manage to upload my results and model to HuggingFace. Also, I still couldn't wrap my head around the difference between streaming_decode.py and decode.py.

Other than that, this PR is ready.

EDIT: Sorry, I just realised I haven't adapted export.py. I will commit it soon.

@pkufool (Collaborator) commented Feb 9, 2023

> Also, I still couldn't wrap my head around the difference between streaming_decode.py and decode.py.

About the difference between streaming_decode.py and decode.py, please see the discussion here and the documents here.

@teowenshen (Contributor, Author)

Apologies! I meant that I couldn't figure out why the results of streaming_decode.py and decode.py have such a CER gap, since given the same chunk length and left context they should be almost the same. I have a more detailed write-up in #807, and have since retrained another model to compare.
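Roughly, decode.py simulates streaming with a single forward pass over the whole utterance using chunked attention masks, while streaming_decode.py feeds the model real chunks and carries cached states between calls. A sketch of the contrast, using a hypothetical model API (init_states and streaming_forward are assumed names, not necessarily icefall's):

```python
import torch

def simulated_streaming(model, feats: torch.Tensor) -> torch.Tensor:
    # decode.py style: one pass over (1, T, F) features; causality is
    # enforced inside the model by chunked attention masks.
    return model(feats)

def chunk_wise(model, feats: torch.Tensor, chunk_len: int) -> torch.Tensor:
    # streaming_decode.py style: real chunks, with cached convolution and
    # attention states carried across calls.
    states = model.init_states()
    outs = []
    for start in range(0, feats.size(1), chunk_len):
        out, states = model.streaming_forward(
            feats[:, start : start + chunk_len], states
        )
        outs.append(out)
    return torch.cat(outs, dim=1)
```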

@teowenshen (Contributor, Author)

I have retrained an early model with padding (30), so that decoding with padding 30 produces fewer insertions at the end of utterances. This improved the simulated streaming decode.py results, but the chunk-wise streaming_decode.py results are still bad.

How can I pad streaming_decode.py with 30 frames only, like decode.py?

@csukuangfj (Collaborator) left a review:

Thanks!

Left some minor comments.

Review comments were left on:
- egs/csj/ASR/RESULTS.md
- egs/csj/ASR/local/disfluent_recogs_to_fluent.py
- egs/csj/ASR/local/prepare_lang_char.py
- egs/csj/ASR/local/utils/tokenizer.py (two threads)
@teowenshen (Contributor, Author)

@csukuangfj I hope it's not too late, but I actually have much better results after padding both in training and decoding. I am updating the CER tables, the HuggingFace repo, and the script commands.

Code-wise, the only addition is a "--pad-feature" argument in train.py, so the existing bulk of code that you've reviewed is mostly the same. So sorry for the hassle! 🙏

@teowenshen (Contributor, Author) commented Feb 13, 2023

CERs

These CERs are from models trained with padding of 30 frames, introduced with the new --pad-feature argument:
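Conceptually, --pad-feature appends a fixed number of quiet frames to the end of each utterance's fbank features, both in training and in decoding. A minimal sketch, assuming a padding value of log(1e-10); the actual plumbing in train.py may differ:

```python
import math
import torch

LOG_EPS = math.log(1e-10)  # assumed log-fbank padding value

def pad_feature(features: torch.Tensor, num_frames: int = 30) -> torch.Tensor:
    """Append num_frames frames of near-silence to (T, F) fbank features."""
    tail = features.new_full((num_frames, features.size(1)), LOG_EPS)
    return torch.cat([features, tail], dim=0)
```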

Trained and evaluated on disfluent transcript

| decoding method | chunk size | eval1 | eval2 | eval3 | excluded | valid | model averaging | decoding mode |
|---|---|---|---|---|---|---|---|---|
| fast beam search | 320ms | 5.39 | 4.08 | 4.16 | 5.40 | 5.02 | --epoch 30 --avg 17 | simulated streaming |
| fast beam search | 320ms | 5.34 | 4.10 | 4.26 | 5.61 | 4.91 | --epoch 30 --avg 17 | chunk-wise |
| greedy search | 320ms | 5.43 | 4.14 | 4.31 | 5.48 | 4.88 | --epoch 30 --avg 17 | simulated streaming |
| greedy search | 320ms | 5.44 | 4.14 | 4.39 | 5.70 | 4.98 | --epoch 30 --avg 17 | chunk-wise |
| modified beam search | 320ms | 5.20 | 3.95 | 4.09 | 5.12 | 4.75 | --epoch 30 --avg 17 | simulated streaming |
| modified beam search | 320ms | 5.18 | 4.07 | 4.12 | 5.36 | 4.77 | --epoch 30 --avg 17 | chunk-wise |
| fast beam search | 640ms | 5.01 | 3.78 | 3.96 | 4.85 | 4.60 | --epoch 30 --avg 17 | simulated streaming |
| fast beam search | 640ms | 4.97 | 3.88 | 3.96 | 4.91 | 4.61 | --epoch 30 --avg 17 | chunk-wise |
| greedy search | 640ms | 5.02 | 3.84 | 4.14 | 5.02 | 4.59 | --epoch 30 --avg 17 | simulated streaming |
| greedy search | 640ms | 5.32 | 4.22 | 4.33 | 5.39 | 4.99 | --epoch 30 --avg 17 | chunk-wise |
| modified beam search | 640ms | 4.78 | 3.66 | 3.85 | 4.72 | 4.42 | --epoch 30 --avg 17 | simulated streaming |
| modified beam search | 640ms | 5.77 | 4.72 | 4.73 | 5.85 | 5.36 | --epoch 30 --avg 17 | chunk-wise |

Trained and evaluated on fluent transcript

| decoding method | chunk size | eval1 | eval2 | eval3 | excluded | valid | model averaging | decoding mode |
|---|---|---|---|---|---|---|---|---|
| fast beam search | 320ms | 4.19 | 3.63 | 3.77 | 4.43 | 4.09 | --epoch 30 --avg 12 | simulated streaming |
| fast beam search | 320ms | 4.06 | 3.55 | 3.66 | 4.70 | 4.04 | --epoch 30 --avg 12 | chunk-wise |
| greedy search | 320ms | 4.22 | 3.62 | 3.82 | 4.45 | 3.98 | --epoch 30 --avg 12 | simulated streaming |
| greedy search | 320ms | 4.13 | 3.61 | 3.85 | 4.67 | 4.05 | --epoch 30 --avg 12 | chunk-wise |
| modified beam search | 320ms | 4.02 | 3.43 | 3.62 | 4.43 | 3.81 | --epoch 30 --avg 12 | simulated streaming |
| modified beam search | 320ms | 3.97 | 3.43 | 3.59 | 4.99 | 3.88 | --epoch 30 --avg 12 | chunk-wise |
| fast beam search | 640ms | 3.80 | 3.31 | 3.55 | 4.16 | 3.90 | --epoch 30 --avg 12 | simulated streaming |
| fast beam search | 640ms | 3.81 | 3.34 | 3.46 | 4.58 | 3.85 | --epoch 30 --avg 12 | chunk-wise |
| greedy search | 640ms | 3.92 | 3.38 | 3.65 | 4.31 | 3.88 | --epoch 30 --avg 12 | simulated streaming |
| greedy search | 640ms | 3.98 | 3.38 | 3.64 | 4.54 | 4.01 | --epoch 30 --avg 12 | chunk-wise |
| modified beam search | 640ms | 3.72 | 3.26 | 3.39 | 4.10 | 3.65 | --epoch 30 --avg 12 | simulated streaming |
| modified beam search | 640ms | 3.78 | 3.32 | 3.45 | 4.81 | 3.81 | --epoch 30 --avg 12 | chunk-wise |

Comparing disfluent and fluent models

$$\texttt{CER}^{f}_d = \frac{\texttt{sub}_f + \texttt{ins} + \texttt{del}_f}{N_f}$$

This comparison evaluates the disfluent model on the fluent transcript (computed by disfluent_recogs_to_fluent.py), forgiving the disfluent model's mistakes on fillers and partial words. It is meant as an illustrative metric only, so that the disfluent and fluent models can be compared.

| decoding method | chunk size | eval1 (d vs f) | eval2 (d vs f) | eval3 (d vs f) | excluded (d vs f) | valid (d vs f) | decoding mode |
|---|---|---|---|---|---|---|---|
| fast beam search | 320ms | 4.54 vs 4.19 | 3.44 vs 3.63 | 3.56 vs 3.77 | 4.22 vs 4.43 | 4.22 vs 4.09 | simulated streaming |
| fast beam search | 320ms | 4.48 vs 4.06 | 3.41 vs 3.55 | 3.65 vs 3.66 | 4.26 vs 4.70 | 4.08 vs 4.04 | chunk-wise |
| greedy search | 320ms | 4.53 vs 4.22 | 3.48 vs 3.62 | 3.69 vs 3.82 | 4.38 vs 4.45 | 4.05 vs 3.98 | simulated streaming |
| greedy search | 320ms | 4.53 vs 4.13 | 3.46 vs 3.61 | 3.71 vs 3.85 | 4.48 vs 4.67 | 4.12 vs 4.05 | chunk-wise |
| modified beam search | 320ms | 4.45 vs 4.02 | 3.38 vs 3.43 | 3.57 vs 3.62 | 4.19 vs 4.43 | 4.04 vs 3.81 | simulated streaming |
| modified beam search | 320ms | 4.44 vs 3.97 | 3.47 vs 3.43 | 3.56 vs 3.59 | 4.28 vs 4.99 | 4.04 vs 3.88 | chunk-wise |
| fast beam search | 640ms | 4.14 vs 3.80 | 3.12 vs 3.31 | 3.38 vs 3.55 | 3.72 vs 4.16 | 3.81 vs 3.90 | simulated streaming |
| fast beam search | 640ms | 4.05 vs 3.81 | 3.23 vs 3.34 | 3.36 vs 3.46 | 3.65 vs 4.58 | 3.78 vs 3.85 | chunk-wise |
| greedy search | 640ms | 4.10 vs 3.92 | 3.17 vs 3.38 | 3.50 vs 3.65 | 3.87 vs 4.31 | 3.77 vs 3.88 | simulated streaming |
| greedy search | 640ms | 4.41 vs 3.98 | 3.56 vs 3.38 | 3.69 vs 3.64 | 4.26 vs 4.54 | 4.16 vs 4.01 | chunk-wise |
| modified beam search | 640ms | 4.00 vs 3.72 | 3.08 vs 3.26 | 3.33 vs 3.39 | 3.75 vs 4.10 | 3.71 vs 3.65 | simulated streaming |
| modified beam search | 640ms | 5.05 vs 3.78 | 4.22 vs 3.32 | 4.26 vs 3.45 | 5.02 vs 4.81 | 4.73 vs 3.81 | chunk-wise |
| average (d - f) | | 0.43 | -0.02 | -0.02 | -0.34 | 0.13 | |

Commands for disfluent training and decoding

The training command was:

```bash
./pruned_transducer_stateless7_streaming/train.py \
  --feedforward-dims  "1024,1024,2048,2048,1024" \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
  --max-duration 375 \
  --transcript-mode disfluent \
  --lang data/lang_char \
  --manifest-dir /mnt/host/corpus/csj/fbank \
  --pad-feature 30 \
  --musan-dir /mnt/host/corpus/musan/musan/fbank
```

The simulated streaming decoding command was:

```bash
for chunk in 64 32; do
    for m in greedy_search fast_beam_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
            --epoch 30 \
            --avg 17 \
            --max-duration 350 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode disfluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30/github/sim_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --pad-feature 30 \
            --gpu 0
    done
done
```

The streaming chunk-wise decoding command was:

```bash
for chunk in 64 32; do
    for m in greedy_search fast_beam_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/streaming_decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
            --epoch 30 \
            --avg 17 \
            --max-duration 350 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode disfluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30/github/stream_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --gpu 2 \
            --num-decode-streams 40
    done
done
```

Commands for fluent training and decoding

The training command was:

```bash
./pruned_transducer_stateless7_streaming/train.py \
  --feedforward-dims  "1024,1024,2048,2048,1024" \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \
  --max-duration 375 \
  --transcript-mode fluent \
  --lang data/lang_char \
  --manifest-dir /mnt/host/corpus/csj/fbank \
  --pad-feature 30 \
  --musan-dir /mnt/host/corpus/musan/musan/fbank
```

The simulated streaming decoding command was:

```bash
for chunk in 64 32; do
    for m in greedy_search fast_beam_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \
            --epoch 30 \
            --avg 12 \
            --max-duration 350 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode fluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30/github/sim_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --pad-feature 30 \
            --gpu 1
    done
done
```

The streaming chunk-wise decoding command was:

```bash
for chunk in 64 32; do
    for m in greedy_search fast_beam_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/streaming_decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \
            --epoch 30 \
            --avg 12 \
            --max-duration 350 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode fluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30/github/stream_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --gpu 3 \
            --num-decode-streams 40
    done
done
```

@teowenshen (Contributor, Author)

@csukuangfj Just to update that I have addressed your comments, and that this branch and the HuggingFace repo have been updated with the new results.

@csukuangfj (Collaborator)

@teowenshen Thank you very much!

Is it ready for merge?

@teowenshen (Contributor, Author)

@csukuangfj Yes. It is ready for merge from my end. Thanks!

@csukuangfj merged commit e63a8c2 into k2-fsa:master on Feb 13, 2023.
@csukuangfj (Collaborator)

@teowenshen

Here is a demo on iPhone using the model trained in this pull request.

It uses https://github.com/k2-fsa/sherpa-ncnn for deployment.

(Video attachment: 2023-02-15-sherpa-ncnn-streaming-zipformer-japanese-iPhone.mp4)

@csukuangfj (Collaborator) commented Feb 21, 2023

@teowenshen

Could you explain the differences between fluent and disfluent?

@teowenshen (Contributor, Author) commented Feb 21, 2023

Yes, sure. disfluent is the transcript mode that most other CSJ ASR recipes are based on. fluent is the mode in which disfluency elements, like fillers and partial words, are removed.

In the CSJ transcript, words are explicitly tagged to express a variety of information. You can refer to Table 3 of this paper to get an idea. In this recipe, the fluent model is trained on transcripts where all F-, D-, and D2-tags are removed, while all other tags are handled the same way as in disfluent mode (sketched roughly after the note below).

(Kaldi is trained on the disfluent transcript. ESPnet uses the disfluent transcript too, if I'm not mistaken. Basically, for academic comparison purposes, we can use the disfluent transcript mode. However, because CSJ's original utterances are much shorter than 10 s, this recipe's transcript is not exactly the same as Kaldi's, due to differences in the utterance concatenation algorithm.)
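As a toy illustration of the two transcript modes, assuming a simplified tag syntax like (F えー) for fillers and (D …) / (D2 …) for partial words; the recipe's actual parser handles much more:

```python
import re

# Simplified, hypothetical CSJ-style tags: (F ...) = filler,
# (D ...) / (D2 ...) = partial word. Real CSJ annotation is richer.
TAG = re.compile(r"\((?:F|D2?)\s+([^()]+)\)")

def to_disfluent(text: str) -> str:
    # Keep the tagged surface form; drop only the markup.
    return TAG.sub(r"\1", text)

def to_fluent(text: str) -> str:
    # Drop the tagged content entirely.
    return re.sub(r"\s+", " ", TAG.sub("", text)).strip()

s = "(F えー) この 図 は"
print(to_disfluent(s))  # えー この 図 は
print(to_fluent(s))     # この 図 は
```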

However, a problem I noticed early on while training the disfluent model is that many errors happen around fillers and partial words, and in fact those errors aren't very important in the actual text.

For example,

| Type | Disfluent | Fluent | Remarks |
|---|---|---|---|
| Filler | え (*->ー) こ の 図 は ... | こ の 図 は ... | Error in the spelling of the filler by the disfluent model. |
| Filler | え ー (*->っ) と ー | | Whole utterance consists of a filler. The disfluent model wrongly spelled the filler, while the fluent model correctly ignored the filler. |
| Partial word: Stuttering | (ね こ し ゅ ぼ ん->と な っ て し ま う) え ー (*->っ) と こ こ で ... | こ こ で ... | The disfluent model wrongly spelled the gibberish partial word and the filler, while the fluent model correctly ignored both the partial word and the filler. |
| Partial word: Speech correction | (に ほ ん->日 本) 難 易 度 ... | (*->日 本) 難 易 度 | Because the partial word itself is a full word, the disfluent model spelled it with Chinese characters although the transcript used phonetic hiragana. The fluent model unfortunately picked up on the partial word too. |

So, I uploaded the fluent model too because, in my opinion, it is more immediately useful. Plus, the training code is the same anyway.

@csukuangfj (Collaborator)

@teowenshen

Thanks for your detailed explanation!

@csukuangfj (Collaborator)

@teowenshen

Is the model at https://huggingface.co/TeoWenShen/icefall-asr-csj-pruned-transducer-stateless7-streaming-230208
suitable for commercial use?

If yes, could you add a README.md to it containing something like the following:

```
---
license: apache-2.0
---
```

You can find an example at
https://huggingface.co/csukuangfj/sherpa-ncnn-conv-emformer-transducer-2022-12-06/raw/main/README.md

@teowenshen (Contributor, Author)

I'm sorry, I have just clarified this: the models from this pull request are not available for commercial use.

I've taken the models down. Can you help to delete them from the demo page at https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition too, since it is no longer working?

Since the original HuggingFace link to the model has been deleted, I will send in a pull request to edit the README file.

Deepest apologies for the inconvenience! I will try to attempt a recipe for another openly available Japanese corpus if time permits.

@csukuangfj (Collaborator) commented Feb 22, 2023

@teowenshen
Also, can I make the exported models private and use them only to demonstrate k2-fsa?

@teowenshen (Contributor, Author)

> By the way, I have converted this model to sherpa-ncnn and sherpa-onnx, shall I also take them down?
>
> https://huggingface.co/csukuangfj/sherpa-ncnn-streaming-zipformer-ja-fluent-2023-02-14
> https://huggingface.co/csukuangfj/sherpa-ncnn-streaming-zipformer-ja-disfluent-2023-02-14

Yes, please help to take them down as well. Thanks!

> Also, can I make the exported models private and use them only to demonstrate k2-fsa?

I am really sorry, as it is not my place to permit any use of derivatives from CSJ.

I just want to say that I am truly on board with the open-source spirit of k2-fsa / NGK, and personally have nothing against the commercial use of models. Once again, really sorry for the inconvenience!

@csukuangfj (Collaborator)

@teowenshen
Thanks, I see. I am taking them down.

csukuangfj added a commit to csukuangfj/sherpa that referenced this pull request Feb 22, 2023
csukuangfj added a commit to k2-fsa/sherpa that referenced this pull request Feb 22, 2023
csukuangfj added a commit to csukuangfj/sherpa-ncnn that referenced this pull request Feb 22, 2023
csukuangfj added a commit to k2-fsa/sherpa-ncnn that referenced this pull request Feb 22, 2023
LeeTZhi pushed a commit to LeeTZhi/sherpa-ncnn that referenced this pull request Oct 27, 2023
@teowenshen deleted the csj_pts7stream branch on December 12, 2023.