
Draft of RNN-T decoding method #905

Closed
wants to merge 6 commits
Conversation

danpovey
Collaborator

Guys (especially @pkufool),

This is a draft of the core parts of the RNN-T decoding method. It supports streams that use
different decoding graphs, and aggregation and disaggregation of streams (to cope with
asynchronous input). As we discussed, I am limiting it to max_sym_per_frame=1, which
substantially simplifies the decoder.
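To illustrate why max_sym_per_frame=1 simplifies the decoder: each frame then needs exactly one joiner evaluation per stream, with no inner loop over emitted symbols. The following is a toy Python sketch of that idea, not the drafted C++ code; `greedy_decode` and its flat per-frame score input are hypothetical stand-ins, and decoder (prediction-network) state updates are omitted for brevity.

```python
# Toy sketch of greedy RNN-T decoding with max_sym_per_frame=1.
# Each frame: evaluate the joiner once, emit the argmax symbol if it
# is not blank, then advance to the next frame regardless.

BLANK = 0  # symbol 0 is assumed to be blank


def greedy_decode(joiner_scores):
    """joiner_scores: list of per-frame score lists over the vocabulary.
    Returns the emitted (non-blank) symbol sequence.  A real decoder
    would also update the prediction-network state after each emission."""
    hyp = []
    for frame_scores in joiner_scores:
        best = max(range(len(frame_scores)), key=lambda s: frame_scores[s])
        if best != BLANK:
            hyp.append(best)  # at most one symbol per frame
    return hyp
```

With more than one symbol allowed per frame, the loop body would itself loop until blank is emitted, which complicates batching across streams.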

This code is far from being able to compile or run, but all the nontrivial parts are drafted, so I
am reasonably confident that nothing major is missing. It will need the Unstack() function.
Please note that I have slightly changed (simplified) the extended interface of
SubsampleRaggedShape() relative to #900: it now optionally outputs a new2old array rather than a
Renumbering object.
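The new2old idea can be sketched in plain Python (this is an illustrative toy, not the k2 C++ interface; `subsample_ragged` and its arguments are hypothetical names). The point is that a single new-index-to-old-index map is often all a caller needs, rather than a full Renumbering object:

```python
# Subsample a ragged array (list of lists) with a keep predicate,
# returning the subsampled array plus `new2old`: for each kept element,
# its flat index in the original array.

def subsample_ragged(ragged, keep):
    """ragged: list of lists; keep(value) -> bool.
    Returns (new_ragged, new2old)."""
    new_ragged, new2old = [], []
    old_idx = 0  # flat index into the original elements
    for row in ragged:
        new_row = []
        for v in row:
            if keep(v):
                new_row.append(v)
                new2old.append(old_idx)
            old_idx += 1
        new_ragged.append(new_row)
    return new_ragged, new2old
```

A Renumbering object would additionally carry the old-to-new direction; returning only new2old keeps the common case cheap and the interface simpler.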

The code in array_of_ragged.h (only the interface is drafted) is general-purpose utility code that
could be used in a number of low-level operations; it substantially simplifies the interfaces of
this drafted code, so I thought it was worth adding. Much of its functionality is actually not
needed for this PR; it would be fine to implement only the needed parts and leave the rest as TODOs.
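As a rough illustration of the kind of utility this is (a speculative toy only; the actual array_of_ragged.h interface may differ, and the class and method names below are invented), a container holding several ragged arrays lets batched code iterate over them uniformly instead of passing parallel lists around:

```python
# Hypothetical stand-in for an "array of ragged arrays" container.
# Each ragged array is represented as a list of lists.

class ArrayOfRagged:
    """Holds several ragged arrays and exposes aggregate size queries,
    which simplifies code that allocates flat workspaces per array."""

    def __init__(self, raggeds):
        self.raggeds = raggeds

    def num_elements(self, i):
        # total element count of the i'th ragged array
        return sum(len(row) for row in self.raggeds[i])

    def tot_sizes(self):
        # per-array element counts
        return [self.num_elements(i) for i in range(len(self.raggeds))]
```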

Some thought will also be needed to decide how to design the Python interfaces. I hope that
@csukuangfj might be able to contribute here.

The overall vision is to create RNN-T acoustic models that can be decoded in real time with very
high concurrency (perhaps hundreds of streams). This would probably require a model topology that
is memory-efficient for decoding, e.g. replacing the transformer encoder with an LSTM encoder (I
hope that some of the work we are doing separately on teacher-student ideas might make it possible
to train the LSTM as a student of a better-generalizing transformer teacher).

To decode without a graph, I propose simply creating a "trivial" graph: a single state with a
self-loop for each symbol. I don't think this will cause a substantial slow-down, because the work
involved is tiny compared with the model's forward() computation.
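Such a trivial graph is easy to picture as a plain arc list. The sketch below is illustrative only (the function name and arc representation are invented, not the k2 interface); it builds one state with a self-loop per non-blank symbol:

```python
# Build the "trivial" decoding graph: a single state (state 0) with a
# self-loop labeled with each symbol 1..vocab_size-1 (symbol 0 is
# assumed to be blank/epsilon, which gets no arc).  Arcs are
# (src_state, dst_state, label) tuples.

def trivial_graph(vocab_size):
    return [(0, 0, sym) for sym in range(1, vocab_size)]
```

Every symbol sequence is accepted by this graph, so decoding against it is equivalent to graph-free decoding while reusing the graph-based code path.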

I am hoping that you guys will be able to do most of the work from this point.

@pkufool
Collaborator

pkufool commented Jan 22, 2022

@danpovey Did you miss some commits? I don't see any difference from #900.

@danpovey
Collaborator Author

Fixed.

@pkufool
Copy link
Collaborator

pkufool commented Mar 16, 2022

Closed via #926.

pkufool closed this on Mar 16, 2022