Randomly combining intermediate layers in RNN-T training #229
Conversation
BTW, this will of course break compatibility with older models, so it may be necessary to introduce an option for it.
I can test it with the multiple datasets setup, which does converge on the 100h subset.
BTW, for things like this and the diagnostics, I'd really like to have them also applied to the pruned recipe.
* Copy files for editing.
* Add random combine from #229.
* Minor fixes.
* Pass model parameters from the command line.
* Fix warnings.
* Fix warnings.
* Update readme.
* Rename to avoid conflicts.
* Update results.
* Add CI for pruned_transducer_stateless5
* Typo fixes.
* Remove random combiner.
* Update decode.py and train.py to use periodically averaged models.
* Minor fixes.
* Revert to use random combiner.
* Update results.
* Minor fixes.
This PR demonstrates how to do something like "iterated loss" using intermediate layers, but with only one
loss-function evaluation. It does this by randomly interpolating combinations of different layers, with
linear "adapter layers" for all but the last layer, and with interpolation weights that differ per frame.
There is a significant WER improvement from this: with this setup I get 7.58/20.36 on test-clean-100 with greedy search, whereas with similar setups I was more usually getting around 8.xx/22.xx.
(It is hard to give an exact baseline because the baseline did not converge on 100 hours.)
I am hoping someone could test this with a current setup somehow.