Add streaming ASR with Emformer RNN-T #6
Wow, cool! Once you merge this PR, I will try to add a streaming conformer model.
LGTM
      decoder_out = model.ForwardDecoder(decoder_input.to(device)).squeeze(1);
    }
  }
  return decoder_out;
Why do we need to return decoder_out here? The only reason I can think of is to avoid an extra ForwardDecoder call. Are there any other reasons?
Returning decoder_out saves an extra decoder.forward() op.
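To make the saving concrete, here is a minimal sketch in plain C++ (no libtorch). ToyDecoder, ToyJoiner, GreedySearchChunk and CountDecoderCalls are hypothetical names invented for illustration and are not the API in this PR. The sketch assumes a greedy search that calls the decoder only when a non-blank token is emitted, and that carries decoder_out from one chunk to the next instead of recomputing it at the start of each chunk.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the decoder network. It only counts calls so
// that the saving is visible; it is NOT the RnntModel API from this PR.
struct ToyDecoder {
  int num_calls = 0;

  // Pretend "decoder output": just the last token converted to float.
  float Forward(int32_t last_token) {
    ++num_calls;
    return static_cast<float>(last_token);
  }
};

// Hypothetical joiner: returns the token to emit, or 0 for blank.
// For illustration only, odd frame values emit themselves.
inline int32_t ToyJoiner(int32_t frame, float /*decoder_out*/) {
  return (frame % 2 != 0) ? frame : 0;
}

// Greedy search over one chunk. decoder_out comes from the previous chunk
// and is returned (possibly updated) so that the next chunk can reuse it.
inline float GreedySearchChunk(ToyDecoder &decoder,
                               const std::vector<int32_t> &chunk,
                               float decoder_out,
                               std::vector<int32_t> *hyp) {
  assert(hyp != nullptr);
  for (int32_t frame : chunk) {
    int32_t token = ToyJoiner(frame, decoder_out);
    if (token != 0) {
      hyp->push_back(token);
      // Run the decoder only when a new non-blank token was emitted.
      decoder_out = decoder.Forward(token);
    }
  }
  return decoder_out;
}

// Decodes all chunks and reports how many decoder calls were made.
inline int CountDecoderCalls(const std::vector<std::vector<int32_t>> &chunks) {
  ToyDecoder decoder;
  std::vector<int32_t> hyp = {0};                   // blank as initial context
  float decoder_out = decoder.Forward(hyp.back());  // one initial call
  for (const auto &chunk : chunks) {
    decoder_out = GreedySearchChunk(decoder, chunk, decoder_out, &hyp);
  }
  return decoder.num_calls;
}
```

In this toy setup the number of decoder calls is one initial call plus one per emitted token. If decoder_out were not returned, each chunk would need one more call at its start to rebuild it from the last token of the hypothesis.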
For streaming decoding, the input chunk size is fixed and there are no paddings. We can figure out the encoder_out_len from encoder_out.
We can add it for fast_beam_search later if it turns out to be necessary.
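As a rough illustration of the point about lengths: with fixed-size chunks and no padding, every stream in the batch has the same number of valid frames, so the lengths can be read off the shape of encoder_out. The sketch below uses plain C++ and assumes encoder_out has shape (N, T, C); Shape3 and EncoderOutLens are hypothetical names, not part of this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of the shape of encoder_out: (N, T, C).
struct Shape3 {
  int64_t n;  // batch size (number of streams)
  int64_t t;  // number of output frames per chunk
  int64_t c;  // encoder output dimension
};

// Assumes fixed-size chunks with no padding, so all N streams have exactly
// T valid frames. Returns a length-N vector filled with T.
inline std::vector<int64_t> EncoderOutLens(const Shape3 &shape) {
  assert(shape.n >= 0 && shape.t >= 0);
  return std::vector<int64_t>(static_cast<std::size_t>(shape.n), shape.t);
}
```

If padded batches are ever decoded this way, the assumption no longer holds and the real lengths would have to be passed in explicitly.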
@@ -53,7 +53,7 @@ class RnntModel {
    * @param features A 3-D tensor of shape (N, T, C).
    * @param features_length A 1-D tensor of shape (N,) containing the number of
    * valid frames in `features`.
-   * @return Return a tuple containing two tensors:
+   * @return Return a pair containing two tensors:
I think we will need feature_lens in fast_beam_search, but we can add it when it is needed.
Here is a demo for this PR: https://www.youtube.com/watch?v=z7HgaZv5W0U
I have tested it and it works. Will upload a pretrained Emformer model later.
Note that the framework is quite general and it is easy to adapt to other kinds of stateless RNN-T models, not limited to Emformer RNN-T models.