Dynamic sparse embedding layer size for flexible incremental training #8232
Comments
I think it cannot be done in …
Totally correct. I meant doing it in …
I'm working on this issue.
@ka-bu assigned as reviewer.
@tttthomasssss This is a large issue broken down into 3 smaller issues (and PRs) that you can see linked above. I am already in the process of reviewing them (and close to getting the PRs merged), so I think we can skip adding another reviewer here.
@jupyterjazz Once this issue is complete, we should verify that the bug in #8496 no longer persists.
Description of Problem:
Currently, in order to account for new vocabulary items during incremental training, we create a buffer for extra vocabulary inside `CountVectorsFeaturizer`, because of which users need to specify a value for the `additional_vocabulary_size` option. This can be cumbersome to get right at the very first training run of your assistant. Moreover, once the buffer for extra vocabulary is exhausted, the user needs to re-train from scratch, which can be time consuming. Is there a more efficient approach for incremental training?

Overview of the Solution:
One alternative is to account for new vocabulary directly in the architecture of DIET/TED. Both architectures start with a sparse embedding layer which transforms the one-hot vectors of incoming tokens into embedding vectors. The input size of this sparse embedding layer can be computed on the fly during the graph-building stage, based on the length of the incoming sparse features. In every new incremental training run, if the vocabulary has grown, a new set of weights is initialized in the sparse embedding layer to account for the new vocabulary items.
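As a minimal sketch in plain TensorFlow/Keras (not the actual DIET/TED implementation; the class name `SparseEmbedding` and its details are illustrative), such a layer could defer choosing its input size until graph build time:

```python
import tensorflow as tf


class SparseEmbedding(tf.keras.layers.Layer):
    """Dense (embedding) layer applied to sparse one-hot features.

    The input size is not fixed up front: it is read from the incoming
    sparse feature signature when the layer is built, i.e. during graph
    building.
    """

    def __init__(self, embedding_dim: int, **kwargs) -> None:
        super().__init__(**kwargs)
        self.embedding_dim = embedding_dim

    def build(self, input_shape) -> None:
        # input_shape[-1] is the current length of the sparse feature
        # vector, i.e. the vocabulary size seen by the featurizer so far.
        self.kernel = self.add_weight(
            name="kernel",
            shape=(int(input_shape[-1]), self.embedding_dim),
            initializer="glorot_uniform",
            trainable=True,
        )
        super().build(input_shape)

    def call(self, inputs: tf.SparseTensor) -> tf.Tensor:
        # Multiplying one-hot sparse features with the kernel amounts to
        # an embedding lookup plus sum over the active vocabulary items.
        return tf.sparse.sparse_dense_matmul(inputs, self.kernel)
```

Because the kernel shape comes from `input_shape`, the same model code works regardless of how many vocabulary items the featurizer produced in a given run.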
For example, if the length of the incoming sparse feature vector is 50 in the first run, the input size of the sparse embedding layer will be 50. In the next fine-tuning run, assuming the length has increased by 20, the pre-trained weights corresponding to the existing first 50 vocabulary items will be loaded and set appropriately, and an extra set of weights will be initialized for the 20 new dimensions of the sparse feature vector.
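Continuing the sketch above, the weight handling for a fine-tuning run could look roughly like the following; `expand_sparse_embedding_kernel` is a hypothetical helper for illustration, not an existing Rasa function:

```python
import numpy as np
import tensorflow as tf


def expand_sparse_embedding_kernel(
    pretrained_kernel: np.ndarray, new_input_size: int, seed: int = 42
) -> np.ndarray:
    """Grow a (old_vocab_size, embedding_dim) kernel to new_input_size rows.

    Rows for the previously seen vocabulary keep their pre-trained values;
    rows for new vocabulary items are freshly initialized.
    """
    old_input_size, embedding_dim = pretrained_kernel.shape
    if new_input_size < old_input_size:
        raise ValueError("The vocabulary is not expected to shrink between runs.")
    if new_input_size == old_input_size:
        return pretrained_kernel

    initializer = tf.keras.initializers.GlorotUniform(seed=seed)
    new_rows = initializer(
        shape=(new_input_size - old_input_size, embedding_dim)
    ).numpy()
    return np.concatenate([pretrained_kernel, new_rows], axis=0).astype(np.float32)
```

In the 50 to 70 example above, the first 50 rows are copied verbatim from the previous run and 20 new rows are freshly initialized.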
Things to note:

- Some of this logic will still live inside `CountVectorsFeaturizer`, but it is a detail for the user which they don't have to worry about.
- The model should be able to `train` with the existing sparse embedding layer after the previous weights are already loaded in `load` (see the sketch after this list).
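To make the note about `train` and `load` concrete, here is one possible, purely illustrative load-time flow that reuses the two sketches above; the file name and sizes are made up:

```python
import numpy as np

# Kernel saved by the previous training run (hypothetical file name).
old_kernel = np.load("sparse_embedding_kernel.npy")   # e.g. shape (50, 128)
new_feature_size = 70                                 # current sparse feature length

# Build the layer at the new, larger input size ...
layer = SparseEmbedding(embedding_dim=old_kernel.shape[1])
layer.build(input_shape=(None, new_feature_size))

# ... and overwrite its kernel: first 50 rows pre-trained, last 20 fresh.
layer.kernel.assign(expand_sparse_embedding_kernel(old_kernel, new_feature_size))

# Fine-tuning can now continue training with the enlarged layer.
```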
Open Questions:
Experiments should be run to compare the fine-tuning performance of the proposed approach vs. the existing approach. Theoretically, there shouldn't be a big difference in performance.