
Remove library specific tokenizers #108 #7027

Merged
merged 41 commits into from
Nov 10, 2020
Changes from 28 commits
41 commits
d3869f7
Removed LanguageModelTokenizer. Logic moved into HFTransformersNLP, L…
Oct 14, 2020
e4cc85b
Updated Components doc to reflect deprecation of LanguageModelTokenizer
Oct 14, 2020
955312d
Updated Components doc to reflect decoupling of LMFeaturizer and LMTo…
Oct 14, 2020
2466515
Reformatted lm_tokenizer and hf_transformers
Oct 15, 2020
a7fcffb
Removed ConveRTTokenizer and moved its logic into ConveRTFeaturizer. …
Oct 15, 2020
adf3c31
Changed ConveRT model url back to official (broken) poly ai url
Oct 15, 2020
df92bec
Updated documentation to reflect deprecation of ConveRTTokenizer
Oct 15, 2020
ddea518
Moved all featurizer and tokenizer logic from HFTransformersNLP to La…
Oct 25, 2020
1705813
Adjusted warning, removed incorrect reference to MIGRATION_DOCS
Oct 25, 2020
2ec801a
Updated docs to reflect deprecation of HFTransformersNLP
Oct 25, 2020
61a3a63
Merge branch 'master' into iss-res-108
koernerfelicia Oct 26, 2020
ef85054
Adjusted the docstrings a little
Oct 26, 2020
0eca21a
Merge branch 'iss-res-108' of github.com:RasaHQ/rasa into iss-res-108
Oct 26, 2020
5abfe12
Adjusted to reflect code review comments
Oct 28, 2020
19d1c14
Fixed some deepsource errors
Oct 28, 2020
5557cfb
Update docstring for rasa/nlu/featurizers/dense_featurizer/convert_fe…
koernerfelicia Oct 29, 2020
d6468cb
Update warning about use of deprecated HFTransformersNLP in LMFeaturizer
koernerfelicia Oct 29, 2020
717c7cf
Update docstring in LMFeaturizer
koernerfelicia Oct 29, 2020
a725f0c
merge master, move url validation for ConveRT from tokenizer to Featu…
dakshvar22 Oct 30, 2020
080a823
Merge branch 'master' into iss-res-108
dakshvar22 Oct 30, 2020
b9ec75e
add changes to migration guide
dakshvar22 Oct 30, 2020
d166473
Merge branch 'iss-res-108' of github.com:RasaHQ/rasa into iss-res-108
dakshvar22 Oct 30, 2020
04590af
make linter happy
dakshvar22 Nov 2, 2020
f75308a
fix pytests
dakshvar22 Nov 2, 2020
5202a18
Add check for HFTransformersNLP in pipeline, prevent model loading in…
Nov 4, 2020
239d196
Add language check to create method in LanguageModelFeaturizer
Nov 5, 2020
58f0add
Put info for deprecated components back into docs, with deprecation w…
Nov 5, 2020
2688fbe
Merge branch 'master' into iss-res-108
Nov 6, 2020
917efbf
Apply suggestions from code review
koernerfelicia Nov 6, 2020
8fcd715
Apply suggestions from code review
Nov 6, 2020
066091b
Fix typo
Nov 6, 2020
e6bc2a8
Merge branch 'master' into iss-res-108
koernerfelicia Nov 6, 2020
429bd2d
Merge branch 'master' into iss-res-108
koernerfelicia Nov 9, 2020
1af7b9e
Fix pytests
Nov 9, 2020
4d9e3d9
Merge branch 'iss-res-108' of github.com:RasaHQ/rasa into iss-res-108
Nov 9, 2020
37c92d2
Use create instead of constructor for tests
Nov 9, 2020
a82adfb
Fix pytest
Nov 10, 2020
5443f89
Some deepsource checks and hopefully actually fix pytests
Nov 10, 2020
b02ed91
Reformat tests again
Nov 10, 2020
b548deb
Overloaded load method so that we do not call the constructor for LMF…
Nov 10, 2020
05847b9
Merge branch 'master' into iss-res-108
koernerfelicia Nov 10, 2020
6 changes: 6 additions & 0 deletions changelog/7027.improvement.md
@@ -0,0 +1,6 @@
Remove dependency between `ConveRTTokenizer` and `ConveRTFeaturizer`. The `ConveRTTokenizer` is now deprecated, and the
`ConveRTFeaturizer` can be used with any other `Tokenizer`.

Remove dependency between `HFTransformersNLP`, `LanguageModelTokenizer`, and `LanguageModelFeaturizer`. Both
`HFTransformersNLP` and `LanguageModelTokenizer` are now deprecated. `LanguageModelFeaturizer` implements the behavior
of the stack and can be used with any other `Tokenizer`.
168 changes: 107 additions & 61 deletions docs/docs/components.mdx
@@ -139,82 +139,86 @@ word vectors in your pipeline.

### HFTransformersNLP

:::caution Deprecated
The `HFTransformersNLP` is deprecated and will be removed in a future release. The [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer)
now implements its behavior.
:::

* **Short**

HuggingFace's Transformers based pre-trained language model initializer



* **Outputs**

Nothing



* **Requires**

Nothing



* **Description**

Initializes specified pre-trained language model from HuggingFace's [Transformers library](https://huggingface.co/transformers/). The component applies language model specific tokenization and
featurization to compute sequence and sentence level representations for each example in the training data.
Include [LanguageModelTokenizer](./components.mdx#languagemodeltokenizer) and [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) to utilize the output of this
component for downstream NLU models.

:::note
To use `HFTransformersNLP` component, install Rasa Open Source with `pip3 install rasa[transformers]`.

:::



* **Configuration**

You should specify what language model to load via the parameter `model_name`. See the below table for the
available language models.
Additionally, you can also specify the architecture variation of the chosen language model by specifying the
parameter `model_weights`.
The full list of supported architectures can be found in the
[HuggingFace documentation](https://huggingface.co/transformers/pretrained_models.html).
If left empty, it uses the default model architecture that the original Transformers library loads (see table below).

```
+----------------+--------------+-------------------------+
| Language Model | Parameter | Default value for |
| | "model_name" | "model_weights" |
+----------------+--------------+-------------------------+
| BERT | bert | rasa/LaBSE |
+----------------+--------------+-------------------------+
| GPT | gpt | openai-gpt |
+----------------+--------------+-------------------------+
| GPT-2 | gpt2 | gpt2 |
+----------------+--------------+-------------------------+
| XLNet | xlnet | xlnet-base-cased |
+----------------+--------------+-------------------------+
| DistilBERT | distilbert | distilbert-base-uncased |
+----------------+--------------+-------------------------+
| RoBERTa | roberta | roberta-base |
+----------------+--------------+-------------------------+
```

The following configuration loads the language model BERT:

```yaml-rasa
pipeline:
- name: HFTransformersNLP
# Name of the language model to use
model_name: "bert"
# Pre-Trained weights to be loaded
model_weights: "rasa/LaBSE"

# An optional path to a specific directory to download and cache the pre-trained model weights.
# The `default` cache_dir is the same as https://huggingface.co/transformers/serialization.html#cache-directory .
cache_dir: null
```


## Tokenizers
@@ -406,6 +410,10 @@ word vectors in your pipeline.

### ConveRTTokenizer

:::caution Deprecated
The `ConveRTTokenizer` is deprecated and will be removed in a future release. The [ConveRTFeaturizer](./components.mdx#convertfeaturizer)
now implements its behavior. Any [tokenizer](./components.mdx#tokenizers) can be used in its place.
:::

* **Short**

@@ -466,6 +474,10 @@ word vectors in your pipeline.

### LanguageModelTokenizer

:::caution Deprecated
The `LanguageModelTokenizer` is deprecated and will be removed in a future release. The [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer)
now implements its behavior.
:::

* **Short**

@@ -644,7 +656,7 @@ Note: The `feature-dimension` for sequence and sentence features does not have t

* **Requires**

[ConveRTTokenizer](./components.mdx#converttokenizer)
`tokens`



@@ -667,7 +679,7 @@ Note: The `feature-dimension` for sequence and sentence features does not have t
:::

:::note
To use `ConveRTTokenizer`, install Rasa Open Source with `pip3 install rasa[convert]`.
To use `ConveRTFeaturizer`, install Rasa Open Source with `pip3 install rasa[convert]`.

:::

@@ -698,7 +710,7 @@ Note: The `feature-dimension` for sequence and sentence features does not have t

* **Requires**

[HFTransformersNLP](./components.mdx#hftransformersnlp) and [LanguageModelTokenizer](./components.mdx#languagemodeltokenizer)
`tokens`.



@@ -711,8 +723,7 @@ Note: The `feature-dimension` for sequence and sentence features does not have t
* **Description**

Creates features for entity extraction, intent classification, and response selection.
Uses the pre-trained language model specified in upstream [HFTransformersNLP](./components.mdx#hftransformersnlp) component to compute vector
representations of input text.
Uses the pre-trained language model to compute vector representations of input text.

:::note
Please make sure that you use a language model which is pre-trained on the same language corpus as that of your
@@ -724,14 +735,49 @@ Note: The `feature-dimension` for sequence and sentence features does not have t

* **Configuration**

Include [HFTransformersNLP](./components.mdx#hftransformersnlp) and [LanguageModelTokenizer](./components.mdx#languagemodeltokenizer) components before this component. Use
[LanguageModelTokenizer](./components.mdx#languagemodeltokenizer) to ensure tokens are correctly set for all components throughout the pipeline.
Include a [Tokenizer](./components.mdx#tokenizers) component before this component.

You should specify what language model to load via the parameter `model_name`. See the below table for the
available language models.
Additionally, you can also specify the architecture variation of the chosen language model by specifying the
parameter `model_weights`.
The full list of supported architectures can be found in the
[HuggingFace documentation](https://huggingface.co/transformers/pretrained_models.html).
If left empty, it uses the default model architecture that the original Transformers library loads (see table below).

```
+----------------+--------------+-------------------------+
| Language Model | Parameter | Default value for |
| | "model_name" | "model_weights" |
+----------------+--------------+-------------------------+
| BERT | bert | rasa/LaBSE |
+----------------+--------------+-------------------------+
| GPT | gpt | openai-gpt |
+----------------+--------------+-------------------------+
| GPT-2 | gpt2 | gpt2 |
+----------------+--------------+-------------------------+
| XLNet | xlnet | xlnet-base-cased |
+----------------+--------------+-------------------------+
| DistilBERT | distilbert | distilbert-base-uncased |
+----------------+--------------+-------------------------+
| RoBERTa | roberta | roberta-base |
+----------------+--------------+-------------------------+
```

The following configuration loads the language model BERT:

```yaml-rasa
pipeline:
- name: LanguageModelFeaturizer
# Name of the language model to use
model_name: "bert"
# Pre-Trained weights to be loaded
model_weights: "rasa/LaBSE"

# An optional path to a specific directory to download and cache the pre-trained model weights.
# The `default` cache_dir is the same as https://huggingface.co/transformers/serialization.html#cache-directory .
cache_dir: null
```

### RegexFeaturizer

28 changes: 28 additions & 0 deletions docs/docs/migration-guide.mdx
@@ -10,6 +10,34 @@ description: |
This page contains information about changes between major versions and
how you can migrate from one version to another.

## Rasa 2.0 to Rasa 2.1

### Deprecations

`ConveRTTokenizer` is now deprecated. [ConveRTFeaturizer](./components.mdx#convertfeaturizer) now implements
its behaviour. To migrate, replace `ConveRTTokenizer` with any other tokenizer, e.g.:

```yaml
pipeline:
- name: WhitespaceTokenizer
- name: ConveRTFeaturizer
model_url: <Remote/Local path to model files>
...
```
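For comparison, a sketch of what a pre-2.1 pipeline using the deprecated tokenizer may have looked like (illustrative only; the `model_url` value is a placeholder, and any further components are elided):

```yaml
pipeline:
  - name: ConveRTTokenizer        # deprecated: its logic now lives in ConveRTFeaturizer
    model_url: <Remote/Local path to model files>
  - name: ConveRTFeaturizer
  ...
```

After migration, the `model_url` parameter moves to `ConveRTFeaturizer`, and any tokenizer (such as `WhitespaceTokenizer`) supplies the `tokens` it requires.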

`HFTransformersNLP` and `LanguageModelTokenizer` components are now deprecated.
[LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) now implements their behaviour.
To migrate, replace both of the above components with any tokenizer and specify the model architecture and model weights
as part of `LanguageModelFeaturizer`, e.g.:

```yaml
pipeline:
- name: WhitespaceTokenizer
- name: LanguageModelFeaturizer
model_name: "bert"
model_weights: "rasa/LaBSE"
...
```
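For comparison, a sketch of the deprecated pre-2.1 setup, in which the featurizer depended on the `HFTransformersNLP` and `LanguageModelTokenizer` components (illustrative only; further components are elided):

```yaml
pipeline:
  - name: HFTransformersNLP        # deprecated: loaded the pre-trained language model
    model_name: "bert"
    model_weights: "rasa/LaBSE"
  - name: LanguageModelTokenizer   # deprecated: any standard tokenizer now provides tokens
  - name: LanguageModelFeaturizer
  ...
```

After migration, `model_name` and `model_weights` are configured directly on `LanguageModelFeaturizer`, which implements the behaviour of the whole stack.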

## Rasa 1.10 to Rasa 2.0

3 changes: 0 additions & 3 deletions rasa/nlu/constants.py
@@ -63,9 +63,6 @@
rasa.shared.nlu.constants.INTENT_RESPONSE_KEY: "intent_response_key_tokens",
}

TOKENS = "tokens"
TOKEN_IDS = "token_ids"

SEQUENCE_FEATURES = "sequence_features"
SENTENCE_FEATURES = "sentence_features"
