Merge pull request #9120 from RasaHQ/deprecate-tokenizers
Remove the deprecated tokenizers `ConveRTTokenizer` and `LanguageModelTokenizer` and the `HFTransformersNLP` component.
Chris Kedzie authored Jul 22, 2021
2 parents 7c2206d + 18ee772 commit 2039cef
Showing 15 changed files with 13 additions and 1,554 deletions.
10 changes: 5 additions & 5 deletions CHANGELOG.mdx
@@ -1321,7 +1321,7 @@ https://github.com/RasaHQ/rasa/tree/main/changelog/ . -->


### Bugfixes
- [#7089](https://github.com/rasahq/rasa/issues/7089): Fix [ConveRTTokenizer](components.mdx#converttokenizer) failing because of wrong model URL by making the `model_url` parameter of `ConveRTTokenizer` mandatory.
- [#7089](https://github.com/rasahq/rasa/issues/7089): Fix `ConveRTTokenizer` failing because of wrong model URL by making the `model_url` parameter of `ConveRTTokenizer` mandatory.

Since the ConveRT model was taken [offline](https://github.com/RasaHQ/rasa/issues/6806), we can no longer use
the earlier public URL of the model. Additionally, since the licence for the model is unknown,
@@ -2362,7 +2362,7 @@ https://github.com/RasaHQ/rasa/tree/main/changelog/ . -->

* [#5006](https://github.com/rasahq/rasa/issues/5006): Channel `hangouts` for Rasa integration with Google Hangouts Chat is now supported out-of-the-box.

* [#5389](https://github.com/rasahq/rasa/issues/5389): Add an optional path to a specific directory to download and cache the pre-trained model weights for [HFTransformersNLP](./components.mdx#hftransformersnlp).
* [#5389](https://github.com/rasahq/rasa/issues/5389): Add an optional path to a specific directory to download and cache the pre-trained model weights for `HFTransformersNLP`.

* [#5422](https://github.com/rasahq/rasa/issues/5422): Add options `tensorboard_log_directory` and `tensorboard_log_level` to `EmbeddingIntentClassifier`,
`DIETClassifier`, `ResponseSelector`, `EmbeddingPolicy` and `TEDPolicy`.
@@ -2529,10 +2529,10 @@ https://github.com/RasaHQ/rasa/tree/main/changelog/ . -->

* [#5187](https://github.com/rasahq/rasa/issues/5187): Integrate language models from HuggingFace's [Transformers](https://github.com/huggingface/transformers) Library.

Add a new NLP component [HFTransformersNLP](./components.mdx#hftransformersnlp) which tokenizes and featurizes incoming messages using a specified
Add a new NLP component `HFTransformersNLP` which tokenizes and featurizes incoming messages using a specified
pre-trained model with the Transformers library as the backend.
Add [LanguageModelTokenizer](./components.mdx#languagemodeltokenizer) and [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) which use the information from
[HFTransformersNLP](./components.mdx#hftransformersnlp) and set it correctly on the message object.
Add `LanguageModelTokenizer` and [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) which use the information from
`HFTransformersNLP` and set it correctly on the message object.
Language models currently supported: BERT, OpenAIGPT, GPT-2, XLNet, DistilBert, RoBERTa.

* [#5225](https://github.com/rasahq/rasa/issues/5225): Added a new CLI command `rasa export` to publish tracker events from a persistent
1 change: 1 addition & 0 deletions changelog/8881.removal.md
@@ -0,0 +1 @@
Follow through on deprecation warnings and remove code, tests, and docs for `ConveRTTokenizer`, `LanguageModelTokenizer` and `HFTransformersNLP`.
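
For context, the deprecation notices in the docs below point to using a regular tokenizer together with `LanguageModelFeaturizer` instead of the removed `HFTransformersNLP` + `LanguageModelTokenizer` pair. A minimal sketch of such a migrated pipeline, assuming the Rasa 2.x `LanguageModelFeaturizer` options shown in the existing docs (`model_name`, `model_weights`), might look like this; the classifier entry is only illustrative:

```yaml-rasa
pipeline:
- name: WhitespaceTokenizer
- name: LanguageModelFeaturizer
  # Name of the language model to use
  model_name: "bert"
  # Pre-trained weights to be loaded
  model_weights: "rasa/LaBSE"
# Illustrative downstream intent classifier
- name: DIETClassifier
  epochs: 100
```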
193 changes: 0 additions & 193 deletions docs/docs/components.mdx
@@ -130,97 +130,6 @@ word vectors in your pipeline.
attach spaCy models that you've trained yourself.


### HFTransformersNLP

:::caution Deprecated
The `HFTransformersNLP` component is deprecated and will be removed in a future release. The [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer)
now implements its behavior.
:::

* **Short**

HuggingFace's Transformers based pre-trained language model initializer



* **Outputs**

Nothing



* **Requires**

Nothing



* **Description**

Initializes a specified pre-trained language model from HuggingFace's [Transformers library](https://huggingface.co/transformers/). The component applies language-model-specific tokenization and
featurization to compute sequence- and sentence-level representations for each example in the training data.
Include [LanguageModelTokenizer](./components.mdx#languagemodeltokenizer) and [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) to utilize the output of this
component for downstream NLU models.

:::note
To use the `HFTransformersNLP` component, install Rasa Open Source with `pip3 install rasa[transformers]`.

:::



* **Configuration**

You should specify which language model to load via the `model_name` parameter. See the table below for the
available language models.
Additionally, you can specify the architecture variation of the chosen language model via the
`model_weights` parameter.
The full list of supported architectures can be found in the
[HuggingFace documentation](https://huggingface.co/transformers/pretrained_models.html).
If left empty, it uses the default model architecture that the original Transformers library loads (see the table below).

```
+----------------+--------------+-------------------------+
| Language Model | Parameter | Default value for |
| | "model_name" | "model_weights" |
+----------------+--------------+-------------------------+
| BERT | bert | rasa/LaBSE |
+----------------+--------------+-------------------------+
| GPT | gpt | openai-gpt |
+----------------+--------------+-------------------------+
| GPT-2 | gpt2 | gpt2 |
+----------------+--------------+-------------------------+
| XLNet | xlnet | xlnet-base-cased |
+----------------+--------------+-------------------------+
| DistilBERT | distilbert | distilbert-base-uncased |
+----------------+--------------+-------------------------+
| RoBERTa | roberta | roberta-base |
+----------------+--------------+-------------------------+
```

The following configuration loads the language model BERT:

```yaml-rasa
pipeline:
- name: HFTransformersNLP
  # Name of the language model to use
  model_name: "bert"
  # Pre-Trained weights to be loaded
  model_weights: "rasa/LaBSE"
  # An optional path to a directory from which
  # to load pre-trained model weights.
  # If the requested model is not found in the
  # directory, it will be downloaded and
  # cached in this directory for future use.
  # The default value of `cache_dir` can be
  # set using the environment variable
  # `TRANSFORMERS_CACHE`, as per the
  # Transformers library.
  cache_dir: null
```


## Tokenizers

Tokenizers split text into tokens.
@@ -428,108 +337,6 @@ now implements its behavior.
```

### ConveRTTokenizer

:::caution Deprecated
The `ConveRTTokenizer` is deprecated and will be removed in a future release. The [ConveRTFeaturizer](./components.mdx#convertfeaturizer)
should now be used with any other [tokenizer](./components.mdx#tokenizers), for example [WhitespaceTokenizer](./components.mdx#whitespacetokenizer).
:::

* **Short**

Tokenizer using the [ConveRT](https://github.com/PolyAI-LDN/polyai-models#convert) model.

* **Outputs**

`tokens` for user messages, responses (if present), and intents (if specified)

* **Requires**

Nothing

* **Description**

Creates tokens using the ConveRT tokenizer.

:::note
Since the `ConveRT` model is trained only on an English corpus of conversations, this tokenizer should only
be used if your training data is in English.
:::

:::note
To use the `ConveRTTokenizer`, install Rasa Open Source with `pip3 install rasa[convert]`.
:::

* **Configuration**

```yaml-rasa
pipeline:
- name: "ConveRTTokenizer"
  # Flag to check whether to split intents
  "intent_tokenization_flag": False
  # Symbol on which intent should be split
  "intent_split_symbol": "_"
  # Regular expression to detect tokens
  "token_pattern": None
  # Remote URL/local directory of model files (required)
  "model_url": None
```
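
Per the deprecation notice above, this configuration is replaced by any regular tokenizer plus `ConveRTFeaturizer`. A rough sketch of the migrated configuration, assuming `ConveRTFeaturizer` takes the same (required) `model_url` parameter, might be:

```yaml-rasa
pipeline:
- name: "WhitespaceTokenizer"
- name: "ConveRTFeaturizer"
  # Remote URL/local directory of model files (required);
  # the original public ConveRT URL is offline, so point this
  # at your own copy of the model
  "model_url": "<remote URL or local directory of ConveRT model files>"
```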

### LanguageModelTokenizer

:::caution Deprecated
The `LanguageModelTokenizer` is deprecated and will be removed in a future release. The [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer)
should now be used with any other [tokenizer](./components.mdx#tokenizers), for example [WhitespaceTokenizer](./components.mdx#whitespacetokenizer).
:::

* **Short**

Tokenizer from pre-trained language models

* **Outputs**

`tokens` for user messages, responses (if present), and intents (if specified)

* **Requires**

[HFTransformersNLP](./components.mdx#hftransformersnlp)

* **Description**

Creates tokens using the pre-trained language model specified in the upstream [HFTransformersNLP](./components.mdx#hftransformersnlp) component.

* **Configuration**

```yaml-rasa
pipeline:
- name: "LanguageModelTokenizer"
  # Flag to check whether to split intents
  "intent_tokenization_flag": False
  # Symbol on which intent should be split
  "intent_split_symbol": "_"
```

## Featurizers
Text featurizers are divided into two different categories: sparse featurizers and dense featurizers.
2 changes: 1 addition & 1 deletion docs/docs/tuning-your-model.mdx
@@ -230,7 +230,7 @@ for both is highly likely to be the same. This is also useful if you don't have

An alternative to [ConveRTFeaturizer](./components.mdx#convertfeaturizer) is the [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) which uses pre-trained language
models such as BERT, GPT-2, etc. to extract similar contextual vector representations for the complete sentence. See
[HFTransformersNLP](./components.mdx#hftransformersnlp) for a full list of supported language models.
[LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) for a full list of supported language models.

If your training data is not in English you can also use a different variant of a language model which
is pre-trained in the language specific to your training data.
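
As an illustration of that last point, the `model_weights` of a `LanguageModelFeaturizer` entry can point at a language-specific checkpoint; the weights name below is only an example and is not prescribed by this change:

```yaml-rasa
pipeline:
- name: "WhitespaceTokenizer"
- name: "LanguageModelFeaturizer"
  model_name: "bert"
  # Example non-English checkpoint; choose one that matches
  # the language of your training data
  model_weights: "bert-base-german-cased"
```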
66 changes: 1 addition & 65 deletions rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py
@@ -21,7 +21,6 @@
NO_LENGTH_RESTRICTION,
NUMBER_OF_SUB_TOKENS,
TOKENS_NAMES,
LANGUAGE_MODEL_DOCS,
)
from rasa.shared.nlu.constants import (
TEXT,
@@ -71,19 +70,14 @@ def __init__(
self,
component_config: Optional[Dict[Text, Any]] = None,
skip_model_load: bool = False,
hf_transformers_loaded: bool = False,
) -> None:
"""Initializes LanguageModelFeaturizer with the specified model.
Args:
component_config: Configuration for the component.
skip_model_load: Skip loading the model for pytests.
hf_transformers_loaded: Skip loading of model and metadata, use
HFTransformers output instead.
"""
super(LanguageModelFeaturizer, self).__init__(component_config)
if hf_transformers_loaded:
return
self._load_model_metadata()
self._load_model_instance(skip_model_load)

@@ -95,52 +89,7 @@ def create(
if not cls.can_handle_language(language):
# check failed
raise UnsupportedLanguageError(cls.name, language)
# TODO: remove this when HFTransformersNLP is removed for good
if isinstance(config, Metadata):
hf_transformers_loaded = "HFTransformersNLP" in [
c["name"] for c in config.metadata["pipeline"]
]
else:
hf_transformers_loaded = "HFTransformersNLP" in config.component_names
return cls(component_config, hf_transformers_loaded=hf_transformers_loaded)

@classmethod
def load(
cls,
meta: Dict[Text, Any],
model_dir: Text,
model_metadata: Optional["Metadata"] = None,
cached_component: Optional["Component"] = None,
**kwargs: Any,
) -> "Component":
"""Load this component from file.
After a component has been trained, it will be persisted by
calling `persist`. When the pipeline gets loaded again,
this component needs to be able to restore itself.
Components can rely on any context attributes that are
created by :meth:`components.Component.create`
calls to components previous to this one.
This method differs from the parent method only in that it calls create
rather than the constructor if the component is not found. This is to
trigger the check for HFTransformersNLP and the method can be removed
when HFTransformersNLP is removed.
Args:
meta: Any configuration parameter related to the model.
model_dir: The directory to load the component from.
model_metadata: The model's :class:`rasa.nlu.model.Metadata`.
cached_component: The cached component.
Returns:
the loaded component
"""
# TODO: remove this when HFTransformersNLP is removed for good
if cached_component:
return cached_component

return cls.create(meta, model_metadata)
return cls(component_config)

def _load_model_metadata(self) -> None:
"""Load the metadata for the specified model and sets these properties.
@@ -744,19 +693,6 @@ def _get_docs_for_batch(
Returns:
List of language model docs for each message in batch.
"""
hf_transformers_doc = batch_examples[0].get(LANGUAGE_MODEL_DOCS[attribute])
if hf_transformers_doc:
# This should only be the case if the deprecated
# HFTransformersNLP component is used in the pipeline
# TODO: remove this when HFTransformersNLP is removed for good
logging.debug(
f"'{LANGUAGE_MODEL_DOCS[attribute]}' set: this "
f"indicates you're using the deprecated component "
f"HFTransformersNLP, please remove it from your "
f"pipeline."
)
return [ex.get(LANGUAGE_MODEL_DOCS[attribute]) for ex in batch_examples]

batch_tokens, batch_token_ids = self._get_token_ids_for_batch(
batch_examples, attribute
)
6 changes: 0 additions & 6 deletions rasa/nlu/registry.py
@@ -34,15 +34,12 @@
from rasa.nlu.featurizers.sparse_featurizer.regex_featurizer import RegexFeaturizer
from rasa.nlu.model import Metadata
from rasa.nlu.selectors.response_selector import ResponseSelector
from rasa.nlu.tokenizers.convert_tokenizer import ConveRTTokenizer
from rasa.nlu.tokenizers.jieba_tokenizer import JiebaTokenizer
from rasa.nlu.tokenizers.mitie_tokenizer import MitieTokenizer
from rasa.nlu.tokenizers.spacy_tokenizer import SpacyTokenizer
from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer
from rasa.nlu.tokenizers.lm_tokenizer import LanguageModelTokenizer
from rasa.nlu.utils.mitie_utils import MitieNLP
from rasa.nlu.utils.spacy_utils import SpacyNLP
from rasa.nlu.utils.hugging_face.hf_transformers import HFTransformersNLP
from rasa.shared.exceptions import RasaException
import rasa.shared.utils.common
import rasa.shared.utils.io
@@ -61,14 +58,11 @@
# utils
SpacyNLP,
MitieNLP,
HFTransformersNLP,
# tokenizers
MitieTokenizer,
SpacyTokenizer,
WhitespaceTokenizer,
ConveRTTokenizer,
JiebaTokenizer,
LanguageModelTokenizer,
# extractors
SpacyEntityExtractor,
MitieEntityExtractor,