
Incremental training #6971

Closed
tabergma opened this issue Oct 8, 2020 · 20 comments · Fixed by #7498
Assignees
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR

Comments

@tabergma
Contributor

tabergma commented Oct 8, 2020

Description of Problem:
Once a model is trained, it cannot be updated: it is not possible to continue training the model on new data that comes in. Instead, the model needs to be retrained from scratch, which takes a lot of time.

Overview of the Solution:
It should be possible to load a model from a previous checkpoint and continue training with new data added.

@tabergma tabergma added type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR area:rasa-oss 🎡 Anything related to the open source Rasa framework labels Oct 8, 2020
@evgeniiaraz
Contributor

@tabergma how urgent is this one?

@tabergma
Contributor Author

@evgeniiaraz We want to tackle this issue this quarter. @dakshvar22 is leading this topic. Why do you ask?

@evgeniiaraz
Contributor

evgeniiaraz commented Oct 22, 2020

@tabergma I wanted to work on it to keep in shape :) but if it is urgent, I'll pick something non-essential

@dakshvar22
Contributor

dakshvar22 commented Nov 7, 2020

Based on the discussion in the document, here are more fine-grained implementation tasks that are needed -

Changes to CLI and rasa/train.py

  1. Add a parameter to rasa train called finetune_previous_model which starts training in finetuning mode.
  2. Add a parameter to rasa train called finetune_model_path which lets you specify the path to a previous model that should be finetuned (see the CLI sketch after this list).
  3. rasa.train_async_internal should be refactored to check whether training should proceed in finetuning mode (i.e. finetune_previous_model is set to True). If yes, it should then check whether finetuning is actually possible (see the doc for the constraints).
  4. rasa.nlu.train should be refactored to create the Trainer object in fine-tune mode, which means each component should be loaded from the model to be finetuned. This will involve building the pipeline similarly to how it is built during inference, i.e. when rasa shell or rasa test is run.
  5. rasa.core.train should be refactored in the same way.
  6. Add telemetry event.
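
To make the first two items concrete, here is a minimal sketch of how such flags could be wired up with argparse. This is only an illustration: the flag names follow the issue text and the example model path is made up, so it is not necessarily what the final rasa CLI will look like.

```python
import argparse

# Hypothetical flags mirroring the proposed finetune_previous_model and
# finetune_model_path parameters; not the final rasa CLI.
parser = argparse.ArgumentParser(prog="rasa train")
parser.add_argument(
    "--finetune-previous-model",
    action="store_true",
    help="Start training in finetuning mode instead of training from scratch.",
)
parser.add_argument(
    "--finetune-model-path",
    type=str,
    default=None,
    help="Path to a previously trained model that should be finetuned.",
)

# Example invocation (the model path is a placeholder).
args = parser.parse_args(
    ["--finetune-previous-model", "--finetune-model-path", "models/previous-model.tar.gz"]
)
print(args.finetune_previous_model, args.finetune_model_path)
```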

Changes to ML components

CountVectorsFeaturizer(CVF)

  1. Add a parameter max_additional_vocabulary_size which lets users specify the additional buffer size that CVF should keep to accommodate new vocabulary tokens during fine-tuning.
  2. _train_with_independent_vocab should be refactored to construct the vocabulary with the additional buffer specified above. Things to keep in mind here -
  • When a new training cycle is triggered, the ordering of existing vocabulary tokens should not be changed and the new vocabulary tokens should only occupy the empty slots in the vocabulary.
  • If the vocabulary size of CVF is exhausted, we should continue training, but warn the user that the vocabulary is exhausted and treat the new tokens that overflow as OOV tokens. At this point, the user should also be informed about the total vocabulary size of their dataset and be prompted to retrain with the full vocabulary. (A toy vocabulary-buffer sketch follows this list.)
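
As a rough illustration of the buffer idea (a toy sketch, not Rasa's actual CountVectorsFeaturizer code; all names below are hypothetical), the vocabulary could reserve placeholder slots that new tokens take over during fine-tuning, so existing indices never move:

```python
from typing import Dict, List

BUFFER_PREFIX = "__buffer_slot_"

def build_vocab(tokens: List[str], max_additional_vocabulary_size: int) -> Dict[str, int]:
    """Assign indices to known tokens and reserve empty buffer slots at the end."""
    vocab = {tok: idx for idx, tok in enumerate(dict.fromkeys(tokens))}
    for i in range(max_additional_vocabulary_size):
        vocab[f"{BUFFER_PREFIX}{i}"] = len(vocab)
    return vocab

def extend_vocab(vocab: Dict[str, int], new_tokens: List[str]) -> None:
    """New tokens fill empty buffer slots; indices of existing tokens are untouched."""
    free_slots = [key for key in vocab if key.startswith(BUFFER_PREFIX)]
    for tok in dict.fromkeys(new_tokens):
        if tok in vocab:
            continue
        if not free_slots:
            # Buffer exhausted: keep training but treat the token as OOV and
            # prompt the user to retrain with the full vocabulary.
            print(f"Vocabulary buffer exhausted; treating '{tok}' as OOV")
            continue
        slot = free_slots.pop(0)
        vocab[tok] = vocab.pop(slot)

vocab = build_vocab(["book", "a", "flight"], max_additional_vocabulary_size=2)
extend_vocab(vocab, ["cancel", "my", "booking"])  # "booking" overflows -> OOV
print(vocab)  # {'book': 0, 'a': 1, 'flight': 2, 'cancel': 3, 'my': 4}
```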

DIETClassifier, ResponseSelector and TEDPolicy

  1. load() should be refactored to load the models with weights in training mode and not in prediction mode. Currently, _load_model() builds the TF graph in predict mode, which should be changed if the classifier is being loaded for finetuning. So instead of calling _get_tf_call_model_function(), _get_tf_train_functions() should be reused to build the graph for training (a generic load-and-continue-training sketch follows this list).
  2. Make sure the signature of RasaModelData in finetune mode is the same as what is constructed during training from scratch.
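
For the first item, the general pattern is "rebuild the graph in training mode, restore the previous weights, and keep fitting". A generic Keras sketch of that pattern (not Rasa's actual _load_model() / RasaModel code; architecture, shapes and file names are invented):

```python
import numpy as np
import tensorflow as tf

def build_model() -> tf.keras.Model:
    # Toy architecture standing in for DIET/TED; shapes are arbitrary.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(100,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])

# Initial training from scratch.
model = build_model()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
x_old = np.random.rand(64, 100).astype("float32")
y_old = np.random.randint(0, 3, size=64)
model.fit(x_old, y_old, epochs=2, verbose=0)
model.save_weights("previous_model.weights.h5")

# Finetuning: same architecture compiled for *training*, weights restored,
# then training continues on the combined old + new data.
finetune_model = build_model()
finetune_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
finetune_model.load_weights("previous_model.weights.h5")
x_new = np.random.rand(16, 100).astype("float32")
y_new = np.random.randint(0, 3, size=16)
finetune_model.fit(np.concatenate([x_old, x_new]),
                   np.concatenate([y_old, y_new]),
                   epochs=1, verbose=0)
```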

@dakshvar22
Contributor

dakshvar22 commented Nov 9, 2020

A working version (very much a draft) of the above steps is implemented on this branch. From early observations, this is what needs to be improved or additionally done to make it mergeable as a feature:

  1. Ability to specify a model path to fine-tune from in the CLI.
  2. Implement checks here to see whether the previous model is compatible with the currently specified configuration for fine-tuning, e.g. all parameters of the two configurations should be the same except the number of training epochs (a toy compatibility check is sketched after this list).
  3. The above working version loads the pipeline in fine-tune mode only for NLU; the same still needs to be done for the Core pipeline. The refactoring needed inside TEDPolicy is straightforward and identical to what is done for DIETClassifier. What still needs to be implemented is loading the Agent instance with the old model in fine-tune mode.
  4. While loading the NLU pipeline, the config of the loaded model is currently passed to the components, which means that if I change the number of epochs in my new configuration, it is not used by the component. This will need to be refactored.
  5. Make sure fine-tuning is possible for rasa train nlu and rasa train core as well. Currently it works for rasa train.
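
A toy version of check 2 from the list above (hypothetical helpers, not the actual implementation): two configurations count as compatible for fine-tuning if they are identical except for the number of epochs.

```python
from typing import Any, Dict, List

ALLOWED_TO_DIFFER = {"epochs"}

def components_compatible(old: Dict[str, Any], new: Dict[str, Any]) -> bool:
    keys = set(old) | set(new)
    return all(key in ALLOWED_TO_DIFFER or old.get(key) == new.get(key) for key in keys)

def pipelines_compatible(old: List[Dict[str, Any]], new: List[Dict[str, Any]]) -> bool:
    # Same components in the same order, differing only in allowed keys.
    return len(old) == len(new) and all(
        components_compatible(o, n) for o, n in zip(old, new)
    )

old_pipeline = [{"name": "CountVectorsFeaturizer"}, {"name": "DIETClassifier", "epochs": 100}]
new_pipeline = [{"name": "CountVectorsFeaturizer"}, {"name": "DIETClassifier", "epochs": 30}]
print(pipelines_compatible(old_pipeline, new_pipeline))  # True
```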

Of course docs, code quality and tests also need to be added.

@wochinge
Contributor

Next steps based on the call with @dakshvar22 @joejuzl

  1. Create engineering issues from this (should be around 2-3 issues 🤔 )
  2. Get started with the engineering issues in the week of November 23rd

Other things to keep in mind:

  • Can we branch off master or does it make more sense to branch off the e2e branch?

@wochinge wochinge self-assigned this Nov 13, 2020
@dakshvar22
Contributor

dakshvar22 commented Nov 13, 2020

I ran some initial experiments using the working version on this branch -

Setup

Data: Financial Bot NLU data, split into an 80:20 train/test split. The train split is further divided 80:20 into two sets. The first set is used to train an initial model from scratch; the second set is used to finetune that first model. Consider the second set as new annotations that a user added to their training data.

Size of Set 1: 233
Size of Set 2: 59
Size of held-out test set: 73

Training: We train the first model from scratch for 100 epochs. Then add the second set to the training data and further train the first model for 30 more epochs.
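
For reference, the split sizes above can be reproduced roughly like this (assuming sklearn's train_test_split and a total of 365 examples, which is inferred from the numbers in this comment; the actual experiment may have split the data differently):

```python
from sklearn.model_selection import train_test_split

examples = list(range(365))  # stand-ins for the Financial Bot NLU examples

train, test = train_test_split(examples, test_size=0.2, random_state=42)
set_1, set_2 = train_test_split(train, test_size=0.2, random_state=42)

print(len(set_1), len(set_2), len(test))  # 233 59 73
```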

Config

Note: Finetuning is done by mixing the new data with the old data and then training on batches from the combined data.
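
In other words, something like this toy setup (not the actual training loop):

```python
import random

old_data = [f"old_example_{i}" for i in range(233)]  # Set 1
new_data = [f"new_example_{i}" for i in range(59)]   # Set 2 (new annotations)

combined = old_data + new_data
random.shuffle(combined)

batch_size = 64
batches = [combined[i:i + batch_size] for i in range(0, len(combined), batch_size)]
print(len(batches))  # 5 batches drawn from the combined data
```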

Results:

| Initial model | Training data | Number of epochs | Intent F1 (held-out test set) | Entity F1 (held-out test set) | Time for training |
| --- | --- | --- | --- | --- | --- |
| Randomly initialized | Set 1 | 100 | 0.753 | 0.9 | 48s |
| Model trained on set 1 | Set 1 + Set 2 | 30 | 0.861 | 0.927 | 16s |
| Randomly initialized | Set 1 + Set 2 | 130 | 0.876 | 0.911 | 1 min 16s |

@dakshvar22
Contributor

dakshvar22 commented Nov 15, 2020

Experiments on Sara data -

Size of Set 1: 3166
Size of Set 2: 792
Size of held-out test set: 990

Config

Note: additional_vocabulary_size was set to 1000 for the char-based CVF and 100 for the word-based CVF.

Results:

| Initial model | Training data | Number of epochs | Intent F1 | Entity F1 | Response F1 | Time for training |
| --- | --- | --- | --- | --- | --- | --- |
| Randomly initialized | Set 1 | 40 | 0.789 | 0.832 | 0.927 | 4m 10s |
| Model trained on set 1 | Set 1 + Set 2 | 10 | 0.823 | 0.861 | 0.935 | 1m 39s |
| Randomly initialized | Set 1 + Set 2 | 50 | 0.818 | 0.854 | 0.938 | 6m 2s |

@wochinge
Contributor

@dakshvar22 Do I understand correctly that incremental training is, in total, faster than training everything at once? This seems somewhat counterintuitive to me, as I'd expect overhead from loading training data, pipelines, etc.

@dakshvar22
Contributor

@wochinge The times mentioned above are the times to train DIETClassifier alone and do not include the pipeline and training data loading time. We should measure that too, but it would be much smaller in comparison to the time required to train DIETClassifier for an additional 40/70 epochs as shown in the examples above.

@wochinge
Contributor

Thanks for clarifying! Even if we measure DIETClassifier on its own - shouldn't the total time of the incremental training be greater than training everything in one go?

@dakshvar22
Contributor

@wochinge The small overhead (11s) that you see when training in one go is due to the increase in input feature vector size and hence bigger matrix multiplications. The first two experiments on Sara data have an input feature vector of size 11752 (actual vocabulary size + buffer added). The third experiment has an input feature vector of size 12752 (actual vocabulary size + buffer added). The additional 1000 dimensions are present because the model is trained from scratch and hence new buffer space is added in CountVectorsFeaturizer. I ran an additional experiment to validate this with additional_vocabulary_size set to 0 in CountVectorsFeaturizer, and the training times were then comparable, with a small stochastic difference (±2 secs) on either side. Does that help clarify?
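
As a back-of-envelope check (assuming the per-step cost of the input layer scales roughly linearly with the input feature dimension):

```python
finetune_dim = 11752  # vocabulary + buffer in the two finetuning runs
scratch_dim = 12752   # vocabulary + fresh buffer when training from scratch

overhead = scratch_dim / finetune_dim - 1
print(f"~{overhead:.1%} larger input layer per training step")  # ~8.5%
```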

@wochinge
Contributor

Thanks a lot for digging into and clarifying this! 🙌

@wochinge
Contributor

I had a short look at the e2e branch, and at least for the engineering changes we don't need to branch off e2e. However, DIETClassifier has huge changes there, @dakshvar22, so you probably want to branch off e2e for your changes. What do you think?

@dakshvar22
Contributor

@wochinge The only change we need for incremental training inside DIETClassifier is a change in the load method, which isn't touched on e2e. So we should be fine branching off master. I'd like to decouple it from e2e as much as possible.

@dakshvar22
Contributor

@wochinge @joejuzl Created a shared branch named continuous_training for us to merge our respective PRs into.

@wochinge
Contributor

wochinge commented Dec 2, 2020

@dakshvar22 cc @joejuzl Can we finetune a Core model when NLU was finetuned previously? Or do we have to train Core from scratch, as the featurization of messages will change?

@dakshvar22
Contributor

Not sure I understand the case completely. Do you mean that rasa train nlu --finetune was run and then rasa train core --finetune was run?

@wochinge
Contributor

wochinge commented Dec 2, 2020

  1. We run rasa train --finetune
  2. NLU model is finetuned
  3. Do we now finetune the core model or do we train it from scratch?

@dakshvar22
Contributor

Ohh, we can finetune the Core model as long as we stay inside our current constraints, i.e. no change to labels (intents, actions, slots, entities, etc.). Why do you think we would need to train it from scratch?
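
As a toy illustration of that constraint (hypothetical helper, not Rasa's actual validation code), the check boils down to comparing the label sets of the old and new domain:

```python
from typing import Dict, Set

LABEL_KEYS = ("intents", "actions", "slots", "entities")

def labels_unchanged(old_domain: Dict[str, Set[str]], new_domain: Dict[str, Set[str]]) -> bool:
    return all(old_domain.get(k, set()) == new_domain.get(k, set()) for k in LABEL_KEYS)

old = {"intents": {"greet", "goodbye"}, "actions": {"utter_greet"}, "slots": set(), "entities": set()}
new = {"intents": {"greet", "goodbye"}, "actions": {"utter_greet"}, "slots": set(), "entities": set()}
print(labels_unchanged(old, new))  # True -> finetuning the Core model is allowed
```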

@joejuzl joejuzl added this to the 2.2 Rasa Open Source milestone Dec 4, 2020