Guys (especially @pkufool),
This is a draft of the core parts of the RNN-T decoding method. It supports streams having different
graphs, and aggregation and disaggregation of streams (to cope with asynchronous input).
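
To make the aggregation/disaggregation idea concrete, here is a minimal plain-Python sketch. It is not the drafted k2 code: `aggregate`, `disaggregate` and the list-of-lists layout are hypothetical names chosen only for illustration. Per-stream items are concatenated into one flat batch plus a `row_splits` array, then cut back into per-stream lists afterwards.

```python
from typing import List, Sequence, Tuple


def aggregate(streams: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate per-stream items into one flat list plus row_splits."""
    flat: List[int] = []
    row_splits = [0]
    for items in streams:
        flat.extend(items)
        row_splits.append(len(flat))
    return flat, row_splits


def disaggregate(flat: Sequence[int], row_splits: Sequence[int]) -> List[List[int]]:
    """Inverse of aggregate(): cut the flat list back into per-stream lists."""
    return [list(flat[b:e]) for b, e in zip(row_splits[:-1], row_splits[1:])]


# Three streams with different numbers of active items (made-up data).
streams = [[10, 11], [20], [30, 31, 32]]
flat, row_splits = aggregate(streams)
restored = disaggregate(flat, row_splits)
```

Because the batch is rebuilt from whichever streams are ready, the set of streams passed to `aggregate` can differ from one call to the next.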
As we discussed, I am limiting it to max_sym_per_frame=1, which substantially simplifies the
decoder.
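
As a rough illustration of why max_sym_per_frame=1 simplifies things, here is a greedy, graph-free sketch in plain Python. It is not the drafted decoder (which does a search constrained by graphs); `greedy_decode`, `toy_scores` and the blank id of 0 are assumptions made for the example. The point is only that each frame contributes at most one symbol, so there is no inner loop over repeated emissions on the same frame.

```python
from typing import Callable, List, Sequence

BLANK = 0  # assumed blank symbol id


def greedy_decode(
    num_frames: int,
    log_probs_fn: Callable[[int, Sequence[int]], Sequence[float]],
) -> List[int]:
    """Greedy RNN-T search that emits at most one symbol per frame."""
    hyp: List[int] = []
    for t in range(num_frames):
        scores = log_probs_fn(t, hyp)
        best = max(range(len(scores)), key=lambda s: scores[s])
        if best != BLANK:
            hyp.append(best)
        # With max_sym_per_frame=1 we always move on to frame t+1 here,
        # whether or not a symbol was emitted.
    return hyp


def toy_scores(t: int, hyp: Sequence[int]) -> List[float]:
    # Stand-in for the joiner network: favours symbol (t % 3); 0 is blank.
    scores = [-5.0, -5.0, -5.0]
    scores[t % 3] = -0.1
    return scores


result = greedy_decode(6, toy_scores)
```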
This code is far from being able to compile or run, but all the nontrivial parts are drafted so I
am reasonably confident that there is nothing major missing. It will need the Unstack() function.
Please note that I have slightly changed (simplified) the extended interface of SubsampleRaggedShape() relative to #900:
it now optionally outputs a new2old array rather than a Renumbering object.
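
For readers unfamiliar with the term, a new2old array maps each kept element's new index to its index in the original array. The sketch below shows the idea in plain Python; `subsample_rows` is a hypothetical helper, not the signature of SubsampleRaggedShape(), and the `row_splits` layout is assumed for illustration.

```python
from typing import List, Sequence, Tuple


def subsample_rows(
    row_splits: Sequence[int], keep: Sequence[bool]
) -> Tuple[List[int], List[int]]:
    """Drop elements where keep[i] is False; return new row_splits and new2old."""
    # new2old[j] is the original index of the j-th surviving element.
    new2old = [i for i, k in enumerate(keep) if k]
    new_row_splits = [0]
    for b, e in zip(row_splits[:-1], row_splits[1:]):
        kept_in_row = sum(1 for k in keep[b:e] if k)
        new_row_splits.append(new_row_splits[-1] + kept_in_row)
    return new_row_splits, new2old


row_splits = [0, 2, 3, 6]  # three rows with 2, 1 and 3 elements
keep = [True, False, True, False, True, True]
values = ["a", "b", "c", "d", "e", "f"]

new_row_splits, new2old = subsample_rows(row_splits, keep)
# Any per-element array can be subsampled by indexing with new2old.
new_values = [values[i] for i in new2old]
```

A caller that only needs to gather per-element data can use new2old directly, which is the simplification over passing a full renumbering object around.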
The code (interface drafted) in array_of_ragged.h is some general-purpose utility code that can be used
in a bunch of low-level things; it substantially simplifies the interfaces of this drafted code, so I thought
it was worth adding. Much of its functionality is actually not needed for this PR; it would be OK to just
write the needed parts and leave the rest as TODOs.
Some thought will also be needed to decide how to write the Python interfaces. I hope
that @csukuangfj might be able to contribute here.
The overall vision is to be able to create RNN-T acoustic models that can be decoded in real time with very
high concurrency (maybe hundreds of streams). This would probably require a model topology that
is memory-efficient for decoding, e.g. replacing the transformer encoder with an LSTM encoder. (I hope that
some of the work we are separately doing with teacher-student ideas might make it possible to
train the LSTM as a student, with a better-generalizing transformer as the teacher.)
In order to decode without a graph, I propose just creating a "trivial" graph that has one state with a self-loop for
each symbol. I don't think this will cause a substantial slow-down, because the work involved is tiny compared
with the model's forward().
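
The "trivial" graph can be pictured with the plain-Python sketch below. It does not use the k2 API: the `(src, dst, label, score)` tuple layout, the helper names, and the assumption that symbols are numbered from 1 (with 0 reserved for blank) are all illustrative, and the sketch omits any final-state handling.

```python
from typing import List, Tuple

Arc = Tuple[int, int, int, float]  # (src_state, dst_state, label, score)


def trivial_graph_arcs(num_symbols: int) -> List[Arc]:
    """One state (0) with a zero-score self-loop for each symbol 1..num_symbols."""
    return [(0, 0, sym, 0.0) for sym in range(1, num_symbols + 1)]


def advance(arcs: List[Arc], state: int, symbol: int) -> List[Tuple[int, float]]:
    """Return (next_state, arc_score) for every arc leaving `state` with `symbol`."""
    return [
        (dst, score)
        for src, dst, label, score in arcs
        if src == state and label == symbol
    ]


arcs = trivial_graph_arcs(num_symbols=4)
# Every symbol is accepted, adds no score, and leaves us in state 0.
next_states = advance(arcs, state=0, symbol=3)
```

Since every symbol is allowed from the single state at zero cost, such a graph places no constraint on the output, and the graph lookup costs one arc per symbol.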
I am hoping that you guys will be able to do most of the work from this point.