WIP: Add timestamps for streaming ASR #119

Closed
wants to merge 3 commits into from

csukuangfj
Collaborator

@csukuangfj csukuangfj commented Sep 19, 2022

Use the model from k2-fsa/icefall#558 for testing.

git lfs install
git clone https://huggingface.co/csukuangfj/icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03

Start the server

export CUDA_VISIBLE_DEVICES=0

nn_encoder_filename=./icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/encoder_jit_trace-iter-468000-avg-16.pt
nn_decoder_filename=./icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/decoder_jit_trace-iter-468000-avg-16.pt
nn_joiner_filename=./icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/joiner_jit_trace-iter-468000-avg-16.pt

bpe_model_filename=./icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/data/lang_bpe_500/bpe.model

./sherpa/bin/lstm_transducer_stateless/streaming_server.py \
  --endpoint.rule1.must-contain-nonsilence=false \
  --endpoint.rule1.min-trailing-silence=5.0 \
  --endpoint.rule2.min-trailing-silence=2.0 \
  --endpoint.rule3.min-utterance-length=50.0 \
  --port 6006 \
  --decoding-method greedy_search \
  --max-batch-size 50 \
  --max-wait-ms 5 \
  --nn-pool-size 1 \
  --max-active-connections 10 \
  --nn-encoder-filename $nn_encoder_filename \
  --nn-decoder-filename $nn_decoder_filename \
  --nn-joiner-filename $nn_joiner_filename \
  --bpe-model-filename $bpe_model_filename

Start the client

# Choose one test wave. The client output shown below is for the second one.
# wave=./test_wavs/1089-134686-0001.wav
wave=./test_wavs/1221-135766-0002.wav

./sherpa/bin/pruned_stateless_emformer_rnnt2/streaming_client.py  \
  --server-port 6006 \
  $wave

Output from the client:

2022-09-19 13:08:23,757 INFO [streaming_client.py:93] Final result of segment 0: YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
2022-09-19 13:08:23,758 INFO [streaming_client.py:142] ./test_wavs/1221-135766-0002.wav
segment: 0
text: YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
timestamps: [0.56, 0.76, 1.04, 1.12, 1.36, 1.52, 1.76, 1.84, 1.8800000000000001, 
1.92, 2.0, 2.12, 2.24, 2.36, 2.56, 2.6, 2.72, 2.7600000000000002, 2.96, 3.0, 3.24, 
3.4, 3.44, 3.88, 4.2, 4.28, 4.32, 4.4, 4.48, 4.5600000000000005, 4.6000000000000005]
(token, time): [('_YE', 0.56), ('T', 0.76), ('_THE', 1.04), ('SE', 1.12), ('_THOUGHT', 1.36), ('S', 1.52),
 ('_A', 1.76), ('FF', 1.84), ('E', 1.8800000000000001), ('C', 1.92), ('TED', 2.0), ('_HE', 2.12), ('S', 2.24), 
('TER', 2.36), ('_P', 2.56), ('RY', 2.6), ('N', 2.72), ('NE', 2.7600000000000002), ('_', 2.96), ('LESS', 3.0), 
('_WITH', 3.24), ('_HO', 3.4), ('PE', 3.44), ('_THAN', 3.88), ('_A', 4.2), ('PP', 4.28), ('RE', 4.32), ('HE', 4.4), 
('N', 4.48), ('S', 4.5600000000000005), ('ION', 4.6000000000000005)]
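
The word-level comparisons in this thread use one time per word, while the client prints one time per BPE token. A minimal sketch of how the `(token, time)` pairs above could be grouped into words, assuming a token that starts with `_` (as printed in this log) opens a new word and the word takes the time of its first token. The helper name `tokens_to_words` is hypothetical and not part of sherpa:

```python
from typing import List, Tuple


def tokens_to_words(
    pairs: List[Tuple[str, float]], boundary: str = "_"
) -> List[Tuple[str, float]]:
    """Group (token, time) pairs into (word, start_time) pairs.

    Assumption: a token beginning with `boundary` starts a new word, and
    the word's time is the time of that first token.
    """
    words: List[List] = []
    for token, t in pairs:
        if token.startswith(boundary):
            # New word: drop the boundary marker, keep the emission time.
            words.append([token[len(boundary):], t])
        elif words:
            # Continuation piece: append to the current word.
            words[-1][0] += token
        else:
            # Leading piece without a marker: treat it as the first word.
            words.append([token, t])
    # Round away floating-point noise such as 1.8800000000000001.
    return [(w, round(t, 2)) for w, t in words]


pairs = [
    ("_YE", 0.56), ("T", 0.76), ("_THE", 1.04), ("SE", 1.12),
    ("_THOUGHT", 1.36), ("S", 1.52), ("_A", 1.76), ("FF", 1.84),
    ("E", 1.8800000000000001), ("C", 1.92), ("TED", 2.0), ("_HE", 2.12),
    ("S", 2.24), ("TER", 2.36), ("_P", 2.56), ("RY", 2.6), ("N", 2.72),
    ("NE", 2.7600000000000002), ("_", 2.96), ("LESS", 3.0), ("_WITH", 3.24),
    ("_HO", 3.4), ("PE", 3.44), ("_THAN", 3.88), ("_A", 4.2), ("PP", 4.28),
    ("RE", 4.32), ("HE", 4.4), ("N", 4.48), ("S", 4.5600000000000005),
    ("ION", 4.6000000000000005),
]

words = tokens_to_words(pairs)
for word, start in words:
    print(f"{word:<14}{start:.2f}")
```

Note that the standalone `_` token at 2.96 opens the word `LESS`, so `LESS` gets 2.96 rather than 3.0. If your token strings use a different word-boundary marker, pass it via `boundary`.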

@csukuangfj
Collaborator Author

Comparing the alignment with https://github.com/CorentinJ/librispeech-alignments

| word | CorentinJ/librispeech-alignments (s) | greedy_search (s) | delay (s) |
|---|---|---|---|
| AFTER | 0.36 | 0.56 | 0.56 - 0.36 = 0.20 |
| EARLY | 0.73 | 1.16 | 0.43 |
| NIGHTFALL | 1.04 | 1.60 | 0.56 |
| THE | 1.77 | 2.16 | 0.39 |
| YELLOW | 1.90 | 2.32 | 0.42 |
| LAMPS | 2.16 | 2.68 | 0.52 |
| WOULD | 2.59 | 3.12 | 0.53 |
| LIGHT | 2.76 | 3.28 | 0.52 |
| UP | 3.07 | 3.52 | 0.45 |
| HERE | 3.27 | 3.76 | 0.49 |
| AND | 3.52 | 3.96 | 0.44 |
| THERE | 3.66 | 4.24 | 0.58 |
| THE | 4.09 | 4.56 | 0.47 |
| SQUALID | 4.21 | 4.76 | 0.55 |
| QUARTER | 4.78 | 5.28 | 0.50 |
| OF | 5.31 | 5.72 | 0.41 |
| THE | 5.42 | 5.84 | 0.42 |
| BROTHELS | 5.50 | 6.00 | 0.50 |
| silence | 6.16-6.625 | N/A | N/A |

@csukuangfj
Collaborator Author

A second comparison using a different utterance:

| word | CorentinJ/librispeech-alignments (s) | greedy_search (s) | delay (s) |
|---|---|---|---|
| YET | 0.42 | 0.56 | 0.56 - 0.42 = 0.14 |
| THESE | 0.65 | 1.04 | 0.39 |
| THOUGHTS | 0.93 | 1.36 | 0.43 |
| AFFECTED | 1.26 | 1.76 | 0.50 |
| HESTER | 1.66 | 2.12 | 0.46 |
| PRYNNE | 2.02 | 2.56 | 0.54 |
| LESS | 2.46 | 2.96 | 0.50 |
| WITH | 2.83 | 3.24 | 0.41 |
| HOPE | 3.03 | 3.40 | 0.37 |
| silence | 3.48 | N/A | N/A |
| THAN | 3.55 | 3.88 | 0.33 |
| APPREHENSION | 3.76 | 4.20 | 0.44 |
| silence | 4.56 | N/A | N/A |
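
For reference, the per-word delay column is just the greedy_search time minus the reference time. A small sketch that recomputes it for this second utterance and summarizes it, with the numbers copied from the comparison above (silence rows skipped). The variable names are illustrative only:

```python
from statistics import mean

# (word, reference time in s, greedy_search time in s), silence rows omitted.
rows = [
    ("YET", 0.42, 0.56),
    ("THESE", 0.65, 1.04),
    ("THOUGHTS", 0.93, 1.36),
    ("AFFECTED", 1.26, 1.76),
    ("HESTER", 1.66, 2.12),
    ("PRYNNE", 2.02, 2.56),
    ("LESS", 2.46, 2.96),
    ("WITH", 2.83, 3.24),
    ("HOPE", 3.03, 3.40),
    ("THAN", 3.55, 3.88),
    ("APPREHENSION", 3.76, 4.20),
]

# Delay = hypothesis time - reference time, rounded to the 10 ms
# precision used in the comparison.
delays = [round(hyp - ref, 2) for _, ref, hyp in rows]

mean_delay = mean(delays)
print(f"min  delay: {min(delays):.2f} s")
print(f"max  delay: {max(delays):.2f} s")
print(f"mean delay: {mean_delay:.2f} s")
```

This only measures the offset between the two alignments. It says nothing about when the result actually reached the client.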

@csukuangfj
Collaborator Author

Unlike #52, the encoder model in this PR is an LSTM instead of a Conformer.

Also, the first token is no longer emitted on the first frame.

@danpovey
Collaborator

Cool!!
It might be nice at some point to have a way of computing average delays as the user would experience them, e.g. the delay between the times printed in our alignment and the time each token was actually output. That way, if we compute the delay from the reference alignment to our alignment, we can add the delay due to the latency of the algorithm to get the total delay.
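
A rough sketch of that decomposition, under loud assumptions: `emit_times` is a hypothetical measurement (the audio-clock time at which the client actually received each word, assuming the audio is streamed in real time), and the three values used for it below are made up purely for illustration. Nothing in this PR records them. The reference and greedy_search times are taken from the second comparison above:

```python
from typing import List, NamedTuple, Sequence


class Delay(NamedTuple):
    word: str
    alignment: float  # our alignment time - reference alignment time
    latency: float    # time the word was output - our alignment time
    total: float      # what the user experiences: alignment + latency


def decompose_delays(
    words: Sequence[str],
    ref_times: Sequence[float],
    hyp_times: Sequence[float],
    emit_times: Sequence[float],
) -> List[Delay]:
    """Split the user-perceived delay into alignment delay and latency."""
    result = []
    for word, ref, hyp, emit in zip(words, ref_times, hyp_times, emit_times):
        alignment = hyp - ref
        latency = emit - hyp
        result.append(
            Delay(word, round(alignment, 2), round(latency, 2),
                  round(alignment + latency, 2))
        )
    return result


words = ["YET", "THESE", "THOUGHTS"]
ref_times = [0.42, 0.65, 0.93]   # from the reference alignment
hyp_times = [0.56, 1.04, 1.36]   # from greedy_search timestamps
emit_times = [0.80, 1.30, 1.60]  # HYPOTHETICAL values, not measured

delays = decompose_delays(words, ref_times, hyp_times, emit_times)
for d in delays:
    print(f"{d.word:<10} alignment={d.alignment:.2f} "
          f"latency={d.latency:.2f} total={d.total:.2f}")

avg_total = sum(d.total for d in delays) / len(delays)
print(f"average total delay: {avg_total:.2f} s")
```

Getting real `emit_times` would require the client to timestamp each partial result as it arrives, which is not shown here.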

@ezerhouni
Collaborator

@csukuangfj What is missing in this PR?

@csukuangfj
Collaborator Author

I think I only made changes to lstm_transducer_stateless.

Other folders for streaming models have not been updated yet.

@ezerhouni
Collaborator

OK! Let me try to take care of it today.

@csukuangfj
Collaborator Author

You can use the changes from this PR. I am closing it now.

Thanks again!

@csukuangfj csukuangfj closed this Sep 30, 2022
@ezerhouni
Collaborator

@csukuangfj You mean I should create my own branch with your changes, right?

@csukuangfj
Collaborator Author

> @csukuangfj You mean I should create my own branch with your changes, right?

Yes, you can use whatever approach you think works best.
