Incremental Training inside Rasa Open Source #7498

Merged · 80 commits · Dec 15, 2020
Changes from 72 commits
Commits (80)
60e0ac1
Add functionality to check if a model is fine-tunable
joejuzl Dec 1, 2020
14093aa
add params for finetuning
wochinge Nov 23, 2020
f16e689
simplify temporary directory creation
wochinge Nov 23, 2020
334c607
fix usage of deprecated `asyncio.coroutine`
wochinge Nov 23, 2020
71ffac4
load potential model for finetuning
wochinge Nov 23, 2020
2b4b6a9
fix types
wochinge Nov 24, 2020
687ff5b
pass in `Agent` / `Interpreter` to finetune
wochinge Nov 24, 2020
b76d8ea
add docstrings
wochinge Nov 24, 2020
2e245a9
use faster model instead of moodbot model
wochinge Nov 25, 2020
5cfe28e
improve performance for getting model to finetune
wochinge Nov 25, 2020
bff637b
load model from directory and polish
wochinge Nov 25, 2020
25b6190
move test module to correct location
wochinge Nov 25, 2020
0ad6739
test edge cases of `get_models_for_finetuning`
wochinge Nov 25, 2020
9d51715
add docstrings
wochinge Nov 25, 2020
3f69d45
undo not necessary changes
wochinge Nov 25, 2020
08fa050
improve phrasing
wochinge Nov 25, 2020
7527342
use absolute import
wochinge Nov 25, 2020
b5f2037
add documentation
wochinge Nov 25, 2020
2dff126
add telemetry
wochinge Nov 25, 2020
3860eba
Also return correctly loaded agent
wochinge Nov 25, 2020
7b33f58
fix typos / phrasing
wochinge Dec 2, 2020
6294644
describe `
wochinge Dec 2, 2020
6415072
use `True` instead of weird string
wochinge Dec 2, 2020
ab250f7
simplify by using helper to mock `async` things
wochinge Dec 2, 2020
32d739c
de-duplicate tests
wochinge Dec 2, 2020
dfebe10
refactor model loading
wochinge Dec 2, 2020
b37bde8
move debug message to correct location
wochinge Dec 2, 2020
70094e9
debug CI
wochinge Dec 2, 2020
b859861
remove unused param
wochinge Dec 2, 2020
597225d
unpack tuples for older python versions
wochinge Dec 2, 2020
0a2f24b
use `AsyncMock` class instead of helper function
wochinge Dec 2, 2020
08d7526
make tests faster
wochinge Dec 3, 2020
1985e6c
comments
joejuzl Dec 3, 2020
f323617
check what is unequal
wochinge Dec 3, 2020
770e625
copy domain to avoid tests interfering with each other
wochinge Dec 3, 2020
b461545
use session scoped fixture
wochinge Dec 3, 2020
0a53fd9
use function scoped domain in nlg
wochinge Dec 3, 2020
d5144a6
Merge pull request #7358 from RasaHQ/incremental-training-cli
wochinge Dec 3, 2020
a3b116d
use dicts in tests
joejuzl Dec 4, 2020
06d41e5
Merge remote-tracking branch 'origin/continuous_training' into 7330/c…
joejuzl Dec 4, 2020
c7d776b
doc
joejuzl Dec 4, 2020
a52e19d
Load NLU model in fine-tune mode with updated config
joejuzl Dec 4, 2020
25c23ca
Test
joejuzl Dec 4, 2020
9debc86
Handle no pipeline of policies in config
joejuzl Dec 7, 2020
6260d4c
PR comments
joejuzl Dec 7, 2020
638008f
PR comments
joejuzl Dec 7, 2020
496b47b
wip
joejuzl Dec 7, 2020
d3c2b67
Use default epochs if not provided
joejuzl Dec 7, 2020
d9f563a
Name change
joejuzl Dec 7, 2020
36fc2b6
Merge pull request #7456 from RasaHQ/7329/load_models_in_finetune_mod…
joejuzl Dec 7, 2020
598dced
Merge branch 'master' into continuous_training
dakshvar22 Dec 7, 2020
b4e677c
Merge pull request #7427 from RasaHQ/7330/check_if_model_is_fine-tunable
joejuzl Dec 8, 2020
d1f4b10
Merge branch 'master' into continuous_training
dakshvar22 Dec 8, 2020
ca76810
Docs for incremental training (#7469)
dakshvar22 Dec 8, 2020
241a075
#7329 load models in finetune mode core (#7458)
joejuzl Dec 9, 2020
3363971
merge
dakshvar22 Dec 9, 2020
02e7341
add changelog
dakshvar22 Dec 9, 2020
fb0b167
add lines of advice to docs
dakshvar22 Dec 9, 2020
f31d268
Merge branch 'master' into continuous_training
dakshvar22 Dec 9, 2020
d2d0a94
Update docs/docs/command-line-interface.mdx
dakshvar22 Dec 11, 2020
64e4342
changelog to reflect issue number
dakshvar22 Dec 11, 2020
6505351
Merge branch 'continuous_training' of github.com:RasaHQ/rasa into con…
dakshvar22 Dec 11, 2020
5d4e466
Integrate the finetune fingerprint checks into the train commands. (#…
joejuzl Dec 11, 2020
cd680e3
Merge remote-tracking branch 'origin/master' into continuous_training
joejuzl Dec 11, 2020
e1a71ef
Get ML components ready for incremental training (#7419)
dakshvar22 Dec 11, 2020
3fe63be
Merge branch 'continuous_training' of github.com:RasaHQ/rasa into con…
joejuzl Dec 11, 2020
d8c64dd
Merge branch 'continuous_training' of github.com:RasaHQ/rasa into con…
joejuzl Dec 11, 2020
df926ec
fix regex test
dakshvar22 Dec 11, 2020
e18640d
Fix min version test
joejuzl Dec 11, 2020
a78c71c
Merge branch 'continuous_training' of github.com:RasaHQ/rasa into con…
joejuzl Dec 11, 2020
4a0a642
fix regex tests
dakshvar22 Dec 11, 2020
903d25a
Add migration guide for policies (#7522)
joejuzl Dec 11, 2020
5a1d75e
review comments
dakshvar22 Dec 13, 2020
5ee6a58
add kwarg
dakshvar22 Dec 14, 2020
977de59
Merge remote-tracking branch 'origin/master' into continuous_training
joejuzl Dec 14, 2020
235b5f9
Mark incremental training experimental (#7543)
dakshvar22 Dec 14, 2020
3ce26e5
Add basic cli test, and stop train test from mocking train methods
joejuzl Dec 14, 2020
45f7171
Merge branch 'continuous_training' of github.com:RasaHQ/rasa into con…
joejuzl Dec 14, 2020
08726d0
increase timeout for fintuning tests
dakshvar22 Dec 14, 2020
bfedf85
Merge branch 'master' into continuous_training
wochinge Dec 14, 2020
14 changes: 14 additions & 0 deletions changelog/6971.feature.md
@@ -0,0 +1,14 @@
Incremental training of models in a pipeline is now supported.

If you have added new NLU training examples or new stories/rules for
the dialogue manager, you don't need to train the pipeline from scratch.
Instead, you can initialize the pipeline with a previously trained model
and continue finetuning it on the complete dataset, which now includes the
new training examples. To do so, use `rasa train --finetune`. For a more
detailed explanation of the command, check out the docs on [incremental
training](./command-line-interface.mdx#incremental-training).

Added a configuration parameter `additional_vocabulary_size` to
[`CountVectorsFeaturizer`](./components.mdx#countvectorsfeaturizer)
and `number_additional_patterns` to [`RegexFeaturizer`](./components.mdx#regexfeaturizer).
These parameters are useful to configure when using incremental training for your pipelines.
2 changes: 2 additions & 0 deletions changelog/7458.removal.md
@@ -0,0 +1,2 @@
Interfaces for `Policy.__init__` and `Policy.load` have changed.
See [migration guide](./migration-guide.mdx#rasa-21-to-rasa-22) for details.
2 changes: 2 additions & 0 deletions data/test_domains/default_with_slots_and_no_actions.yml
@@ -1,3 +1,5 @@
version: "2.0"

# all hashtags are comments :)
intents:
- greet
45 changes: 45 additions & 0 deletions docs/docs/command-line-interface.mdx
@@ -85,6 +85,51 @@ The following arguments can be used to configure the training process:
```text [rasa train --help]
```

### Incremental training

In order to improve the performance of an assistant, it's helpful to practice [CDD](./conversation-driven-development.mdx)
and add new training examples based on how your users have talked to your assistant. You can use `rasa train --finetune`
to initialize the pipeline with an already trained model and further finetune it on the
new training dataset that includes the additional training examples. This will help reduce the
training time of the new model.

By default, the command picks up the latest model in the `models/` directory. If you have a specific model
that you want to improve, you can specify its path by
running `rasa train --finetune <path to model to finetune>`. Finetuning a model usually
requires fewer epochs to train machine learning components like `DIETClassifier`, `ResponseSelector` and `TEDPolicy` compared to training from scratch.
Either use a model configuration for finetuning
which defines fewer epochs than before, or use the flag
`--epoch-fraction`. `--epoch-fraction` will use a fraction of the epochs specified for each machine learning component
in the model configuration file. For example, if `DIETClassifier` is configured to use 100 epochs,
specifying `--epoch-fraction 0.5` will only use 50 epochs for finetuning.

You can also finetune an NLU-only or dialogue management-only model by using
`rasa train nlu --finetune` and `rasa train core --finetune` respectively.
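
For example, a typical finetuning workflow could look like the following sketch
(the commands use only the flags described above; the model filename is just a placeholder
for whatever archive sits in your `models/` directory):

```bash
# Finetune the latest model in models/ on the updated training data
rasa train --finetune

# Finetune a specific model, running each ML component for half of its configured epochs
# (placeholder filename)
rasa train --finetune models/20201210-120000.tar.gz --epoch-fraction 0.5

# Finetune only the NLU pipeline, or only the dialogue policies
rasa train nlu --finetune
rasa train core --finetune
```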

To be able to fine-tune a model, the following conditions must be met:

1. The configuration supplied should be exactly the same as the
configuration used to train the model which is being finetuned.
The only parameter that you can change is `epochs` for the individual machine learning components and policies,
as illustrated in the example after this list.

2. The set of labels (intents, actions, entities and slots) for which the base model is trained
should be exactly the same as the ones present in the training data used for finetuning. This
means that you cannot add new intent, action, entity or slot labels to your training data
during incremental training. You can still add new training examples for each of the existing
labels. If you have added/removed labels in the training data, the pipeline needs to be trained
from scratch.

3. The model to be finetuned was trained with a Rasa version at or above the `MINIMUM_COMPATIBLE_VERSION` of the currently installed Rasa version.
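
For example, to satisfy the first condition, a finetuning configuration could keep every component
identical to the base model's configuration and only lower the epochs (the values here are illustrative):

```yaml-rasa
pipeline:
  # ... all other components exactly as in the base model's configuration ...
  - name: DIETClassifier
    epochs: 50    # reduced from the 100 epochs used to train the base model
```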

Check out the docs for [`CountVectorsFeaturizer`](./components.mdx#countvectorsfeaturizer) and
[`RegexFeaturizer`](./components.mdx#regexfeaturizer) to understand how to configure them appropriately for incremental training.

:::note
Finetuned models are expected to be on par with the performance of models trained from scratch. However,
make sure to train your pipelines from scratch frequently to avoid running out of additional
vocabulary slots for the models.
:::

## rasa interactive

You can [use Rasa X in local mode](https://rasa.com/docs/rasa-x) to do interactive learning in a UI,
147 changes: 101 additions & 46 deletions docs/docs/components.mdx
@@ -836,6 +836,28 @@ Note: The `feature-dimension` for sequence and sentence features does not have t
"use_word_boundaries": True
```

**Configuring for incremental training**

To ensure that `sparse_features` are of fixed size during
[incremental training](./command-line-interface.mdx#incremental-training), the
component should be configured to account for additional patterns that may be
added to the training data in the future. To do so, configure the `number_additional_patterns`
parameter while training the base model from scratch:

```yaml-rasa {3}
pipeline:
- name: RegexFeaturizer
number_additional_patterns: 10
```

If not configured by the user, the component will account for a minimum of 10 additional
patterns and a maximum of twice the number of patterns currently
present in the training data (including lookup tables and regex patterns).
Once the component runs out of additional pattern slots
during incremental training, the new patterns are dropped
and not considered during featurization. At this point, it is advisable
to train a new model from scratch.


### CountVectorsFeaturizer

@@ -960,58 +982,91 @@ Note: The `feature-dimension` for sequence and sentence features does not have t
"use_shared_vocab": False
```

**Configuring for incremental training**

To ensure that `sparse_features` are of fixed size during
[incremental training](./command-line-interface.mdx#incremental-training), the
component should be configured to account for additional vocabulary tokens
that may be added as part of new training examples in the future.
To do so, configure the `additional_vocabulary_size` parameter while training the base model from scratch:

```yaml-rasa {3-6}
pipeline:
- name: CountVectorsFeaturizer
additional_vocabulary_size:
text: 1000
response: 1000
action_text: 1000
```

As in the above example, you can define additional vocabulary size for each of
`text` (user messages), `response` (bot responses used by `ResponseSelector`) and
`action_text` (bot responses not used by `ResponseSelector`). If you are building a shared
vocabulary (`use_shared_vocab=True`), you only need to define a value for the `text` attribute.
If any of the attributes is not configured by the user, the component will
account for a minimum of 1000 additional vocabulary slots and a maximum of half the current vocabulary size.
Once the component runs out of additional available vocabulary slots
during incremental training, the new vocabulary tokens are dropped
and not considered during featurization. At this point, it is advisable
to train a new model from scratch.


The above configuration parameters are the ones you should configure to fit your model to your data.
However, additional parameters exist that can be adapted.

<details><summary>More configurable parameters</summary>

```
+---------------------------+-------------------------+--------------------------------------------------------------+
| Parameter | Default Value | Description |
+===========================+=========================+==============================================================+
| use_shared_vocab | False | If set to 'True' a common vocabulary is used for labels |
| | | and user message. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| analyzer | word | Whether the features should be made of word n-gram or |
| | | character n-grams. Option 'char_wb' creates character |
| | | n-grams only from text inside word boundaries; |
| | | n-grams at the edges of words are padded with space. |
| | | Valid values: 'word', 'char', 'char_wb'. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| strip_accents | None | Remove accents during the pre-processing step. |
| | | Valid values: 'ascii', 'unicode', 'None'. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| stop_words | None | A list of stop words to use. |
| | | Valid values: 'english' (uses an internal list of |
| | | English stop words), a list of custom stop words, or |
| | | 'None'. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| min_df | 1 | When building the vocabulary ignore terms that have a |
| | | document frequency strictly lower than the given threshold. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_df | 1 | When building the vocabulary ignore terms that have a |
| | | document frequency strictly higher than the given threshold |
| | | (corpus-specific stop words). |
+---------------------------+-------------------------+--------------------------------------------------------------+
| min_ngram | 1 | The lower boundary of the range of n-values for different |
| | | word n-grams or char n-grams to be extracted. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_ngram | 1 | The upper boundary of the range of n-values for different |
| | | word n-grams or char n-grams to be extracted. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_features | None | If not 'None', build a vocabulary that only consider the top |
| | | max_features ordered by term frequency across the corpus. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| lowercase | True | Convert all characters to lowercase before tokenizing. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| OOV_token | None | Keyword for unseen words. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| OOV_words | [] | List of words to be treated as 'OOV_token' during training. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| alias | CountVectorFeaturizer | Alias name of featurizer. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| use_lemma | True | Use the lemma of words for featurization. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| additional_vocabulary_size| text: 1000 | Size of additional vocabulary to account for incremental |
| | response: 1000 | training while training a model from scratch |
| | action_text: 1000 | |
+---------------------------+-------------------------+--------------------------------------------------------------+
```

</details>
9 changes: 9 additions & 0 deletions docs/docs/migration-guide.mdx
@@ -10,6 +10,15 @@ description: |
This page contains information about changes between major versions and
how you can migrate from one version to another.

## Rasa 2.1 to Rasa 2.2

### Policies

[Policies](./policies.mdx) now require a `**kwargs` argument in their constructor and `load` method.
Policies without `**kwargs` will be supported until Rasa version `3.0.0`.
However, when using [incremental training](./command-line-interface.mdx#incremental-training)
`**kwargs` **must** be included.
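
As a minimal sketch of what this change implies for a custom policy (the class name and the way
arguments are handled here are illustrative, not the exact Rasa API; check the `Policy` base class
of your installed Rasa version for the precise signatures):

```python
from typing import Any, Text

from rasa.core.policies.policy import Policy


class MyCustomPolicy(Policy):
    def __init__(self, priority: int = 1, **kwargs: Any) -> None:
        # Accept **kwargs so Rasa can pass additional arguments
        # (for example during incremental training) without breaking.
        super().__init__(priority=priority, **kwargs)

    @classmethod
    def load(cls, path: Text, **kwargs: Any) -> "MyCustomPolicy":
        # Loading persisted parameters from `path` is omitted in this sketch;
        # **kwargs is likewise required so finetuning arguments can be passed through.
        return cls(**kwargs)
```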

## Rasa 2.0 to Rasa 2.1

### Deprecations
4 changes: 4 additions & 0 deletions docs/docs/telemetry/events.json
@@ -88,6 +88,10 @@
"num_regexes": {
"type": "integer",
"description": "Total number of regexes defined."
},
"is_finetuning": {
"type": "boolean",
"description": "True if a model is trained by finetuning an existing model."
}
},
"additionalProperties": false,