Merge pull request #9120 from RasaHQ/deprecate-tokenizers
Remove the deprecated tokenizers `ConveRTTokenizer` and `LanguageModelTokenizer` and the `HFTransformersNLP` component.
Chris Kedzie authored Jul 22, 2021
2 parents 7c2206d + 18ee772 commit 2039cef
Showing 15 changed files with 13 additions and 1,554 deletions.
10 changes: 5 additions & 5 deletions CHANGELOG.mdx
@@ -1321,7 +1321,7 @@ https://github.com/RasaHQ/rasa/tree/main/changelog/ . -->


### Bugfixes
- [#7089](https://github.com/rasahq/rasa/issues/7089): Fix [ConveRTTokenizer](components.mdx#converttokenizer) failing because of wrong model URL by making the `model_url` parameter of `ConveRTTokenizer` mandatory.
- [#7089](https://github.com/rasahq/rasa/issues/7089): Fix `ConveRTTokenizer` failing because of wrong model URL by making the `model_url` parameter of `ConveRTTokenizer` mandatory.

Since the ConveRT model was taken [offline](https://github.com/RasaHQ/rasa/issues/6806), we can no longer use
the earlier public URL of the model. Additionally, since the licence for the model is unknown,
@@ -2362,7 +2362,7 @@ https://github.com/RasaHQ/rasa/tree/main/changelog/ . -->

* [#5006](https://github.com/rasahq/rasa/issues/5006): Channel `hangouts` for Rasa integration with Google Hangouts Chat is now supported out-of-the-box.

* [#5389](https://github.com/rasahq/rasa/issues/5389): Add an optional path to a specific directory to download and cache the pre-trained model weights for [HFTransformersNLP](./components.mdx#hftransformersnlp).
* [#5389](https://github.com/rasahq/rasa/issues/5389): Add an optional path to a specific directory to download and cache the pre-trained model weights for `HFTransformersNLP`.

* [#5422](https://github.com/rasahq/rasa/issues/5422): Add options `tensorboard_log_directory` and `tensorboard_log_level` to `EmbeddingIntentClassifier`,
`DIETClassifier`, `ResponseSelector`, `EmbeddingPolicy` and `TEDPolicy`.
@@ -2529,10 +2529,10 @@ https://github.com/RasaHQ/rasa/tree/main/changelog/ . -->

* [#5187](https://github.com/rasahq/rasa/issues/5187): Integrate language models from HuggingFace's [Transformers](https://github.com/huggingface/transformers) Library.

Add a new NLP component [HFTransformersNLP](./components.mdx#hftransformersnlp) which tokenizes and featurizes incoming messages using a specified
Add a new NLP component `HFTransformersNLP` which tokenizes and featurizes incoming messages using a specified
pre-trained model with the Transformers library as the backend.
Add [LanguageModelTokenizer](./components.mdx#languagemodeltokenizer) and [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) which use the information from
[HFTransformersNLP](./components.mdx#hftransformersnlp) and set it correctly on the message object.
Add `LanguageModelTokenizer` and [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) which use the information from
`HFTransformersNLP` and set it correctly on the message object.
Language models currently supported: BERT, OpenAIGPT, GPT-2, XLNet, DistilBert, RoBERTa.

* [#5225](https://github.com/rasahq/rasa/issues/5225): Added a new CLI command `rasa export` to publish tracker events from a persistent
1 change: 1 addition & 0 deletions changelog/8881.removal.md
@@ -0,0 +1 @@
Follow through on deprecation warnings and remove code, tests, and docs for `ConveRTTokenizer`, `LanguageModelTokenizer` and `HFTransformersNLP`.
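
For context, the deprecation notices in the docs below point to using a regular tokenizer together with `LanguageModelFeaturizer` instead of the removed `HFTransformersNLP` + `LanguageModelTokenizer` pair. A minimal sketch of such a migrated pipeline, assuming the Rasa 2.x `LanguageModelFeaturizer` options shown in the existing docs (`model_name`, `model_weights`), might look like this; the classifier entry is only illustrative:

```yaml-rasa
pipeline:
- name: WhitespaceTokenizer
- name: LanguageModelFeaturizer
  # Name of the language model to use
  model_name: "bert"
  # Pre-trained weights to be loaded
  model_weights: "rasa/LaBSE"
# Illustrative downstream intent classifier
- name: DIETClassifier
  epochs: 100
```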
193 changes: 0 additions & 193 deletions docs/docs/components.mdx
@@ -130,97 +130,6 @@ word vectors in your pipeline.
attach spaCy models that you've trained yourself.


### HFTransformersNLP

:::caution Deprecated
The `HFTransformersNLP` component is deprecated and will be removed in a future release. The [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer)
now implements its behavior.
:::

* **Short**

HuggingFace's Transformers based pre-trained language model initializer



* **Outputs**

Nothing



* **Requires**

Nothing



* **Description**

Initializes a specified pre-trained language model from HuggingFace's [Transformers library](https://huggingface.co/transformers/). The component applies language-model-specific tokenization and
featurization to compute sequence- and sentence-level representations for each example in the training data.
Include [LanguageModelTokenizer](./components.mdx#languagemodeltokenizer) and [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) to utilize the output of this
component for downstream NLU models.

:::note
To use the `HFTransformersNLP` component, install Rasa Open Source with `pip3 install rasa[transformers]`.

:::



* **Configuration**

You should specify which language model to load via the `model_name` parameter. See the table below for the
available language models.
Additionally, you can specify the architecture variation of the chosen language model via the
`model_weights` parameter.
The full list of supported architectures can be found in the
[HuggingFace documentation](https://huggingface.co/transformers/pretrained_models.html).
If left empty, it uses the default model architecture that the original Transformers library loads (see the table below).

```
+----------------+--------------+-------------------------+
| Language Model | Parameter | Default value for |
| | "model_name" | "model_weights" |
+----------------+--------------+-------------------------+
| BERT | bert | rasa/LaBSE |
+----------------+--------------+-------------------------+
| GPT | gpt | openai-gpt |
+----------------+--------------+-------------------------+
| GPT-2 | gpt2 | gpt2 |
+----------------+--------------+-------------------------+
| XLNet | xlnet | xlnet-base-cased |
+----------------+--------------+-------------------------+
| DistilBERT | distilbert | distilbert-base-uncased |
+----------------+--------------+-------------------------+
| RoBERTa | roberta | roberta-base |
+----------------+--------------+-------------------------+
```

The following configuration loads the language model BERT:

```yaml-rasa
pipeline:
- name: HFTransformersNLP
  # Name of the language model to use
  model_name: "bert"
  # Pre-Trained weights to be loaded
  model_weights: "rasa/LaBSE"
  # An optional path to a directory from which
  # to load pre-trained model weights.
  # If the requested model is not found in the
  # directory, it will be downloaded and
  # cached in this directory for future use.
  # The default value of `cache_dir` can be
  # set using the environment variable
  # `TRANSFORMERS_CACHE`, as per the
  # Transformers library.
  cache_dir: null
```


## Tokenizers

Tokenizers split text into tokens.
@@ -428,108 +337,6 @@ now implements its behavior.
```

### ConveRTTokenizer

:::caution Deprecated
The `ConveRTTokenizer` is deprecated and will be removed in a future release. The [ConveRTFeaturizer](./components.mdx#convertfeaturizer)
should now be used with any other [tokenizer](./components.mdx#tokenizers), for example [WhitespaceTokenizer](./components.mdx#whitespacetokenizer).
:::

* **Short**

Tokenizer using the [ConveRT](https://github.com/PolyAI-LDN/polyai-models#convert) model.

* **Outputs**

`tokens` for user messages, responses (if present), and intents (if specified)

* **Requires**

Nothing

* **Description**

Creates tokens using the ConveRT tokenizer.

:::note
Since the `ConveRT` model is trained only on an English corpus of conversations, this tokenizer should only
be used if your training data is in English.
:::

:::note
To use the `ConveRTTokenizer`, install Rasa Open Source with `pip3 install rasa[convert]`.
:::

* **Configuration**

```yaml-rasa
pipeline:
- name: "ConveRTTokenizer"
  # Flag to check whether to split intents
  "intent_tokenization_flag": False
  # Symbol on which intent should be split
  "intent_split_symbol": "_"
  # Regular expression to detect tokens
  "token_pattern": None
  # Remote URL/local directory of model files (required)
  "model_url": None
```
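
Per the deprecation notice above, this configuration is replaced by any regular tokenizer plus `ConveRTFeaturizer`. A rough sketch of the migrated configuration, assuming `ConveRTFeaturizer` takes the same (required) `model_url` parameter, might be:

```yaml-rasa
pipeline:
- name: "WhitespaceTokenizer"
- name: "ConveRTFeaturizer"
  # Remote URL/local directory of model files (required);
  # the original public ConveRT URL is offline, so point this
  # at your own copy of the model
  "model_url": "<remote URL or local directory of ConveRT model files>"
```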

### LanguageModelTokenizer

:::caution Deprecated
The `LanguageModelTokenizer` is deprecated and will be removed in a future release. The [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer)
should now be used with any other [tokenizer](./components.mdx#tokenizers), for example [WhitespaceTokenizer](./components.mdx#whitespacetokenizer).
:::

* **Short**

Tokenizer from pre-trained language models

* **Outputs**

`tokens` for user messages, responses (if present), and intents (if specified)

* **Requires**

[HFTransformersNLP](./components.mdx#hftransformersnlp)

* **Description**

Creates tokens using the pre-trained language model specified in the upstream [HFTransformersNLP](./components.mdx#hftransformersnlp) component.

* **Configuration**

```yaml-rasa
pipeline:
- name: "LanguageModelTokenizer"
  # Flag to check whether to split intents
  "intent_tokenization_flag": False
  # Symbol on which intent should be split
  "intent_split_symbol": "_"
```

## Featurizers
Text featurizers are divided into two different categories: sparse featurizers and dense featurizers.
2 changes: 1 addition & 1 deletion docs/docs/tuning-your-model.mdx
@@ -230,7 +230,7 @@ for both is highly likely to be the same. This is also useful if you don't have

An alternative to [ConveRTFeaturizer](./components.mdx#convertfeaturizer) is the [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) which uses pre-trained language
models such as BERT, GPT-2, etc. to extract similar contextual vector representations for the complete sentence. See
[HFTransformersNLP](./components.mdx#hftransformersnlp) for a full list of supported language models.
[LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) for a full list of supported language models.

If your training data is not in English you can also use a different variant of a language model which
is pre-trained in the language specific to your training data.
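
As an illustration of that last point, the `model_weights` of a `LanguageModelFeaturizer` entry can point at a language-specific checkpoint; the weights name below is only an example and is not prescribed by this change:

```yaml-rasa
pipeline:
- name: "WhitespaceTokenizer"
- name: "LanguageModelFeaturizer"
  model_name: "bert"
  # Example non-English checkpoint; choose one that matches
  # the language of your training data
  model_weights: "bert-base-german-cased"
```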
66 changes: 1 addition & 65 deletions rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py
@@ -21,7 +21,6 @@
NO_LENGTH_RESTRICTION,
NUMBER_OF_SUB_TOKENS,
TOKENS_NAMES,
LANGUAGE_MODEL_DOCS,
)
from rasa.shared.nlu.constants import (
TEXT,
@@ -71,19 +70,14 @@ def __init__(
self,
component_config: Optional[Dict[Text, Any]] = None,
skip_model_load: bool = False,
hf_transformers_loaded: bool = False,
) -> None:
"""Initializes LanguageModelFeaturizer with the specified model.
Args:
component_config: Configuration for the component.
skip_model_load: Skip loading the model for pytests.
hf_transformers_loaded: Skip loading of model and metadata, use
HFTransformers output instead.
"""
super(LanguageModelFeaturizer, self).__init__(component_config)
if hf_transformers_loaded:
return
self._load_model_metadata()
self._load_model_instance(skip_model_load)

@@ -95,52 +89,7 @@ def create(
if not cls.can_handle_language(language):
# check failed
raise UnsupportedLanguageError(cls.name, language)
# TODO: remove this when HFTransformersNLP is removed for good
if isinstance(config, Metadata):
hf_transformers_loaded = "HFTransformersNLP" in [
c["name"] for c in config.metadata["pipeline"]
]
else:
hf_transformers_loaded = "HFTransformersNLP" in config.component_names
return cls(component_config, hf_transformers_loaded=hf_transformers_loaded)

@classmethod
def load(
cls,
meta: Dict[Text, Any],
model_dir: Text,
model_metadata: Optional["Metadata"] = None,
cached_component: Optional["Component"] = None,
**kwargs: Any,
) -> "Component":
"""Load this component from file.
After a component has been trained, it will be persisted by
calling `persist`. When the pipeline gets loaded again,
this component needs to be able to restore itself.
Components can rely on any context attributes that are
created by :meth:`components.Component.create`
calls to components previous to this one.
This method differs from the parent method only in that it calls create
rather than the constructor if the component is not found. This is to
trigger the check for HFTransformersNLP and the method can be removed
when HFTransformersNLP is removed.
Args:
meta: Any configuration parameter related to the model.
model_dir: The directory to load the component from.
model_metadata: The model's :class:`rasa.nlu.model.Metadata`.
cached_component: The cached component.
Returns:
the loaded component
"""
# TODO: remove this when HFTransformersNLP is removed for good
if cached_component:
return cached_component

return cls.create(meta, model_metadata)
return cls(component_config)

def _load_model_metadata(self) -> None:
"""Load the metadata for the specified model and sets these properties.
@@ -744,19 +693,6 @@ def _get_docs_for_batch(
Returns:
List of language model docs for each message in batch.
"""
hf_transformers_doc = batch_examples[0].get(LANGUAGE_MODEL_DOCS[attribute])
if hf_transformers_doc:
# This should only be the case if the deprecated
# HFTransformersNLP component is used in the pipeline
# TODO: remove this when HFTransformersNLP is removed for good
logging.debug(
f"'{LANGUAGE_MODEL_DOCS[attribute]}' set: this "
f"indicates you're using the deprecated component "
f"HFTransformersNLP, please remove it from your "
f"pipeline."
)
return [ex.get(LANGUAGE_MODEL_DOCS[attribute]) for ex in batch_examples]

batch_tokens, batch_token_ids = self._get_token_ids_for_batch(
batch_examples, attribute
)
6 changes: 0 additions & 6 deletions rasa/nlu/registry.py
@@ -34,15 +34,12 @@
from rasa.nlu.featurizers.sparse_featurizer.regex_featurizer import RegexFeaturizer
from rasa.nlu.model import Metadata
from rasa.nlu.selectors.response_selector import ResponseSelector
from rasa.nlu.tokenizers.convert_tokenizer import ConveRTTokenizer
from rasa.nlu.tokenizers.jieba_tokenizer import JiebaTokenizer
from rasa.nlu.tokenizers.mitie_tokenizer import MitieTokenizer
from rasa.nlu.tokenizers.spacy_tokenizer import SpacyTokenizer
from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer
from rasa.nlu.tokenizers.lm_tokenizer import LanguageModelTokenizer
from rasa.nlu.utils.mitie_utils import MitieNLP
from rasa.nlu.utils.spacy_utils import SpacyNLP
from rasa.nlu.utils.hugging_face.hf_transformers import HFTransformersNLP
from rasa.shared.exceptions import RasaException
import rasa.shared.utils.common
import rasa.shared.utils.io
@@ -61,14 +58,11 @@
# utils
SpacyNLP,
MitieNLP,
HFTransformersNLP,
# tokenizers
MitieTokenizer,
SpacyTokenizer,
WhitespaceTokenizer,
ConveRTTokenizer,
JiebaTokenizer,
LanguageModelTokenizer,
# extractors
SpacyEntityExtractor,
MitieEntityExtractor,