Dynamic sparse embedding layer size for flexible incremental training #8232
Comments
I think it cannot be done in …
Totally correct. I meant doing it in …
I'm working on this issue.
@ka-bu assigned as reviewer.
@tttthomasssss This is a large issue broken down into 3 smaller issues (and PRs) that you can see linked above. I am already in the process of reviewing them (and close to getting the PRs merged), so I think we can skip adding another reviewer here.
@jupyterjazz Once this issue is complete, we should verify that the bug in #8496 no longer persists.
Description of Problem:
Currently, in order to account for new vocabulary items during incremental training, we create a buffer for extra vocabulary inside `CountVectorsFeaturizer`, because of which users need to specify a value for the `additional_vocabulary_size` option. This can be cumbersome to get right at the very first training run of your assistant. Moreover, once the buffer for extra vocabulary is exhausted, the user needs to re-train from scratch, which can be time consuming. Is there a more efficient approach for incremental training?

Overview of the Solution:
One alternative is to account for new vocabulary directly in the architecture of DIET/TED. Both architectures start with a sparse embedding layer which transforms the one-hot vectors of incoming tokens into embedding vectors. The input size of this sparse embedding layer can be computed on the fly during the graph-building stage, based on the length of the incoming sparse features. In every new incremental training run, if the vocabulary has grown, a new set of weights is initialized in the sparse embedding layer to account for the new vocabulary items.
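As a minimal sketch in plain TensorFlow/Keras (not the actual DIET/TED implementation; the class name `SparseEmbedding` and its details are illustrative), such a layer could defer choosing its input size until graph build time:

```python
import tensorflow as tf


class SparseEmbedding(tf.keras.layers.Layer):
    """Dense (embedding) layer applied to sparse one-hot features.

    The input size is not fixed up front: it is read from the incoming
    sparse feature signature when the layer is built, i.e. during graph
    building.
    """

    def __init__(self, embedding_dim: int, **kwargs) -> None:
        super().__init__(**kwargs)
        self.embedding_dim = embedding_dim

    def build(self, input_shape) -> None:
        # input_shape[-1] is the current length of the sparse feature
        # vector, i.e. the vocabulary size seen by the featurizer so far.
        self.kernel = self.add_weight(
            name="kernel",
            shape=(int(input_shape[-1]), self.embedding_dim),
            initializer="glorot_uniform",
            trainable=True,
        )
        super().build(input_shape)

    def call(self, inputs: tf.SparseTensor) -> tf.Tensor:
        # Multiplying one-hot sparse features with the kernel amounts to
        # an embedding lookup plus sum over the active vocabulary items.
        return tf.sparse.sparse_dense_matmul(inputs, self.kernel)
```

Because the kernel shape comes from `input_shape`, the same model code works regardless of how many vocabulary items the featurizer produced in a given run.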
For example, if the length of the incoming sparse feature vector is 50 in the first run, the input size of the sparse embedding layer will be 50. In the next fine-tuning run, assuming the length has increased by 20, the pre-trained weights corresponding to the existing first 50 vocabulary items will be loaded and set appropriately, and an extra set of weights will be initialized for the 20 new dimensions of the sparse feature vector.
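Continuing the sketch above, the weight handling for a fine-tuning run could look roughly like the following; `expand_sparse_embedding_kernel` is a hypothetical helper for illustration, not an existing Rasa function:

```python
import numpy as np
import tensorflow as tf


def expand_sparse_embedding_kernel(
    pretrained_kernel: np.ndarray, new_input_size: int, seed: int = 42
) -> np.ndarray:
    """Grow a (old_vocab_size, embedding_dim) kernel to new_input_size rows.

    Rows for the previously seen vocabulary keep their pre-trained values;
    rows for new vocabulary items are freshly initialized.
    """
    old_input_size, embedding_dim = pretrained_kernel.shape
    if new_input_size < old_input_size:
        raise ValueError("The vocabulary is not expected to shrink between runs.")
    if new_input_size == old_input_size:
        return pretrained_kernel

    initializer = tf.keras.initializers.GlorotUniform(seed=seed)
    new_rows = initializer(
        shape=(new_input_size - old_input_size, embedding_dim)
    ).numpy()
    return np.concatenate([pretrained_kernel, new_rows], axis=0).astype(np.float32)
```

In the 50 to 70 example above, the first 50 rows are copied verbatim from the previous run and 20 new rows are freshly initialized.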
Things to note:

- Some of this logic will still live inside `CountVectorsFeaturizer`, but it is a detail for the user which they don't have to worry about.
- The model should be able to `train` with the existing sparse embedding layer after the previous weights are already loaded in `load` (see the sketch after this list).
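To make the note about `train` and `load` concrete, here is one possible, purely illustrative load-time flow that reuses the two sketches above; the file name and sizes are made up:

```python
import numpy as np

# Kernel saved by the previous training run (hypothetical file name).
old_kernel = np.load("sparse_embedding_kernel.npy")   # e.g. shape (50, 128)
new_feature_size = 70                                 # current sparse feature length

# Build the layer at the new, larger input size ...
layer = SparseEmbedding(embedding_dim=old_kernel.shape[1])
layer.build(input_shape=(None, new_feature_size))

# ... and overwrite its kernel: first 50 rows pre-trained, last 20 fresh.
layer.kernel.assign(expand_sparse_embedding_kernel(old_kernel, new_feature_size))

# Fine-tuning can now continue training with the enlarged layer.
```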
Open Questions:
Experiments should be run to compare the fine-tuning performance of the proposed approach vs. the existing approach. Theoretically, there shouldn't be a big difference in performance.