diff --git a/changelog/7027.improvement.md b/changelog/7027.improvement.md new file mode 100644 index 000000000000..baaa4813790e --- /dev/null +++ b/changelog/7027.improvement.md @@ -0,0 +1,6 @@ +Remove dependency between `ConveRTTokenizer` and `ConveRTFeaturizer`. The `ConveRTTokenizer` is now deprecated, and the +`ConveRTFeaturizer` can be used with any other `Tokenizer`. + +Remove dependency between `HFTransformersNLP`, `LanguageModelTokenizer`, and `LanguageModelFeaturizer`. Both +`HFTransformersNLP` and `LanguageModelTokenizer` are now deprecated. `LanguageModelFeaturizer` implements the behavior +of the stack and can be used with any other `Tokenizer`. diff --git a/docs/docs/components.mdx b/docs/docs/components.mdx index c3cf7a003e24..62e308ab6c24 100644 --- a/docs/docs/components.mdx +++ b/docs/docs/components.mdx @@ -139,6 +139,10 @@ word vectors in your pipeline. ### HFTransformersNLP +:::caution Deprecated +The `HFTransformersNLP` is deprecated and will be removed in a future release. The [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) +now implements its behavior. +::: * **Short** @@ -406,6 +410,10 @@ word vectors in your pipeline. ### ConveRTTokenizer +:::caution Deprecated +The `ConveRTTokenizer` is deprecated and will be removed in a future release. The [ConveRTFeaturizer](./components.mdx#convertfeaturizer) +now implements its behavior. Any [tokenizer](./components.mdx#tokenizers) can be used in its place. +::: * **Short** @@ -466,42 +474,46 @@ word vectors in your pipeline. ### LanguageModelTokenizer +:::caution Deprecated +The `LanguageModelTokenizer` is deprecated and will be removed in a future release. The [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) +now implements its behavior. Any [tokenizer](./components.mdx#tokenizers) can be used in its place. +::: - * **Short** +* **Short** - Tokenizer from pre-trained language models +Tokenizer from pre-trained language models - * **Outputs** +* **Outputs** - `tokens` for user messages, responses (if present), and intents (if specified) +`tokens` for user messages, responses (if present), and intents (if specified) - * **Requires** +* **Requires** - [HFTransformersNLP](./components.mdx#hftransformersnlp) +[HFTransformersNLP](./components.mdx#hftransformersnlp) - * **Description** +* **Description** - Creates tokens using the pre-trained language model specified in upstream [HFTransformersNLP](./components.mdx#hftransformersnlp) component. - Must be used whenever the [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) is used. +Creates tokens using the pre-trained language model specified in upstream [HFTransformersNLP](./components.mdx#hftransformersnlp) component. +Must be used whenever the [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) is used. 
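+
+A minimal sketch of the replacement setup (mirroring the migration guide in this change): any
+[tokenizer](./components.mdx#tokenizers), for example the `WhitespaceTokenizer`, can be combined with the
+[LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer). The `"bert"` / `"rasa/LaBSE"` values below are
+the defaults listed in the table further down and are shown here only for illustration:
+
+```yaml-rasa
+pipeline:
+  - name: WhitespaceTokenizer
+  - name: LanguageModelFeaturizer
+    # Name of the language model to use
+    model_name: "bert"
+    # Pre-trained weights to be loaded
+    model_weights: "rasa/LaBSE"
+```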
- * **Configuration** +* **Configuration** - ```yaml-rasa - pipeline: - - name: "LanguageModelTokenizer" - # Flag to check whether to split intents - "intent_tokenization_flag": False - # Symbol on which intent should be split - "intent_split_symbol": "_" - ``` +```yaml-rasa +pipeline: +- name: "LanguageModelTokenizer" + # Flag to check whether to split intents + "intent_tokenization_flag": False + # Symbol on which intent should be split + "intent_split_symbol": "_" +``` ## Featurizers @@ -644,7 +656,7 @@ Note: The `feature-dimension` for sequence and sentence features does not have t * **Requires** - [ConveRTTokenizer](./components.mdx#converttokenizer) + `tokens` @@ -667,7 +679,7 @@ Note: The `feature-dimension` for sequence and sentence features does not have t ::: :::note - To use `ConveRTTokenizer`, install Rasa Open Source with `pip3 install rasa[convert]`. + To use `ConveRTFeaturizer`, install Rasa Open Source with `pip3 install rasa[convert]`. ::: @@ -698,7 +710,7 @@ Note: The `feature-dimension` for sequence and sentence features does not have t * **Requires** - [HFTransformersNLP](./components.mdx#hftransformersnlp) and [LanguageModelTokenizer](./components.mdx#languagemodeltokenizer) + `tokens`. @@ -711,8 +723,7 @@ Note: The `feature-dimension` for sequence and sentence features does not have t * **Description** Creates features for entity extraction, intent classification, and response selection. - Uses the pre-trained language model specified in upstream [HFTransformersNLP](./components.mdx#hftransformersnlp) component to compute vector - representations of input text. + Uses a pre-trained language model to compute vector representations of input text. :::note Please make sure that you use a language model which is pre-trained on the same language corpus as that of your @@ -724,14 +735,49 @@ Note: The `feature-dimension` for sequence and sentence features does not have t * **Configuration** - Include [HFTransformersNLP](./components.mdx#hftransformersnlp) and [LanguageModelTokenizer](./components.mdx#languagemodeltokenizer) components before this component. Use - [LanguageModelTokenizer](./components.mdx#languagemodeltokenizer) to ensure tokens are correctly set for all components throughout the pipeline. + Include a [Tokenizer](./components.mdx#tokenizers) component before this component. + + You should specify what language model to load via the parameter `model_name`. See the below table for the + available language models. + Additionally, you can also specify the architecture variation of the chosen language model by specifying the + parameter `model_weights`. + The full list of supported architectures can be found in the + [HuggingFace documentation](https://huggingface.co/transformers/pretrained_models.html). + If left empty, it uses the default model architecture that original Transformers library loads (see table below). 
+ + ``` + +----------------+--------------+-------------------------+ + | Language Model | Parameter | Default value for | + | | "model_name" | "model_weights" | + +----------------+--------------+-------------------------+ + | BERT | bert | rasa/LaBSE | + +----------------+--------------+-------------------------+ + | GPT | gpt | openai-gpt | + +----------------+--------------+-------------------------+ + | GPT-2 | gpt2 | gpt2 | + +----------------+--------------+-------------------------+ + | XLNet | xlnet | xlnet-base-cased | + +----------------+--------------+-------------------------+ + | DistilBERT | distilbert | distilbert-base-uncased | + +----------------+--------------+-------------------------+ + | RoBERTa | roberta | roberta-base | + +----------------+--------------+-------------------------+ + ``` + + The following configuration loads the language model BERT: ```yaml-rasa pipeline: - - name: "LanguageModelFeaturizer" - ``` + - name: LanguageModelFeaturizer + # Name of the language model to use + model_name: "bert" + # Pre-Trained weights to be loaded + model_weights: "rasa/LaBSE" + # An optional path to a specific directory to download and cache the pre-trained model weights. + # The `default` cache_dir is the same as https://huggingface.co/transformers/serialization.html#cache-directory . + cache_dir: null + ``` ### RegexFeaturizer diff --git a/docs/docs/migration-guide.mdx b/docs/docs/migration-guide.mdx index 3ab49ab105a1..eb01641c49a5 100644 --- a/docs/docs/migration-guide.mdx +++ b/docs/docs/migration-guide.mdx @@ -10,6 +10,34 @@ description: | This page contains information about changes between major versions and how you can migrate from one version to another. +## Rasa 2.0 to Rasa 2.1 + +### Deprecations + +`ConveRTTokenizer` is now deprecated. [ConveRTFeaturizer](./components.mdx#convertfeaturizer) now implements +its behaviour. To migrate, replace `ConveRTTokenizer` with any other tokenizer, for e.g.: + +```yaml +pipeline: + - name: WhitespaceTokenizer + - name: ConveRTFeaturizer + model_url: + ... +``` + +`HFTransformersNLP` and `LanguageModelTokenizer` components are now deprecated. +[LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) now implements their behaviour. +To migrate, replace both the above components with any tokenizer and specify the model architecture and model weights +as part of `LanguageModelFeaturizer`, for e.g.: + +```yaml +pipeline: + - name: WhitespaceTokenizer + - name: LanguageModelFeaturizer + model_name: "bert" + model_weights: "rasa/LaBSE" + ... 
+``` ## Rasa 1.10 to Rasa 2.0 diff --git a/rasa/nlu/constants.py b/rasa/nlu/constants.py index 49e0978b075b..14297822acb3 100644 --- a/rasa/nlu/constants.py +++ b/rasa/nlu/constants.py @@ -63,9 +63,6 @@ rasa.shared.nlu.constants.INTENT_RESPONSE_KEY: "intent_response_key_tokens", } -TOKENS = "tokens" -TOKEN_IDS = "token_ids" - SEQUENCE_FEATURES = "sequence_features" SENTENCE_FEATURES = "sentence_features" diff --git a/rasa/nlu/featurizers/dense_featurizer/convert_featurizer.py b/rasa/nlu/featurizers/dense_featurizer/convert_featurizer.py index 9d65e3ef3460..e24c82d27219 100644 --- a/rasa/nlu/featurizers/dense_featurizer/convert_featurizer.py +++ b/rasa/nlu/featurizers/dense_featurizer/convert_featurizer.py @@ -2,11 +2,14 @@ from typing import Any, Dict, List, NoReturn, Optional, Text, Tuple, Type from tqdm import tqdm +import os import rasa.shared.utils.io -from rasa.nlu.tokenizers.convert_tokenizer import ConveRTTokenizer +import rasa.core.utils +from rasa.utils import common +from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer +from rasa.nlu.model import Metadata from rasa.shared.constants import DOCS_URL_COMPONENTS -from rasa.nlu.tokenizers.tokenizer import Token from rasa.nlu.components import Component from rasa.nlu.featurizers.featurizer import DenseFeaturizer from rasa.shared.nlu.training_data.features import Features @@ -17,8 +20,16 @@ DENSE_FEATURIZABLE_ATTRIBUTES, FEATURIZER_CLASS_ALIAS, TOKENS_NAMES, + NUMBER_OF_SUB_TOKENS, ) -from rasa.shared.nlu.constants import TEXT, FEATURE_TYPE_SENTENCE, FEATURE_TYPE_SEQUENCE +from rasa.shared.nlu.constants import ( + TEXT, + FEATURE_TYPE_SENTENCE, + FEATURE_TYPE_SEQUENCE, + ACTION_TEXT, +) +from rasa.exceptions import RasaException +import rasa.nlu.utils import numpy as np import tensorflow as tf @@ -26,6 +37,16 @@ logger = logging.getLogger(__name__) +# URL to the old remote location of the model which +# users might use. The model is no longer hosted here. +ORIGINAL_TF_HUB_MODULE_URL = ( + "https://github.com/PolyAI-LDN/polyai-models/releases/download/v1.0/model.tar.gz" +) + +# Warning: This URL is only intended for running pytests on ConveRT +# related components. This URL should not be allowed to be used by the user. +RESTRICTED_ACCESS_URL = "https://storage.googleapis.com/continuous-integration-model-storage/convert_tf2.tar.gz" + class ConveRTFeaturizer(DenseFeaturizer): """Featurizer using ConveRT model. @@ -35,22 +56,135 @@ class ConveRTFeaturizer(DenseFeaturizer): for dense featurizable attributes of each message object. """ + defaults = { + # Remote URL/Local path to model files + "model_url": None + } + @classmethod def required_components(cls) -> List[Type[Component]]: - return [ConveRTTokenizer] + """Components that should be included in the pipeline before this component.""" + return [Tokenizer] @classmethod def required_packages(cls) -> List[Text]: + """Packages needed to be installed.""" return ["tensorflow_text", "tensorflow_hub"] def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None: + """Initializes ConveRTFeaturizer with the model and different + encoding signatures. + Args: + component_config: Configuration for the component. 
+ """ super(ConveRTFeaturizer, self).__init__(component_config) + self.model_url = self._get_validated_model_url() + + self.module = train_utils.load_tf_hub_model(self.model_url) + + self.tokenize_signature = self._get_signature("tokenize", self.module) + self.sequence_encoding_signature = self._get_signature( + "encode_sequence", self.module + ) + self.sentence_encoding_signature = self._get_signature("default", self.module) @staticmethod - def __get_signature(signature: Text, module: Any) -> NoReturn: - """Retrieve a signature from a (hopefully loaded) TF model.""" + def _validate_model_files_exist(model_directory: Text) -> None: + """Check if essential model files exist inside the model_directory. + + Args: + model_directory: Directory to investigate + """ + files_to_check = [ + os.path.join(model_directory, "saved_model.pb"), + os.path.join(model_directory, "variables/variables.index"), + os.path.join(model_directory, "variables/variables.data-00001-of-00002"), + os.path.join(model_directory, "variables/variables.data-00000-of-00002"), + ] + for file_path in files_to_check: + if not os.path.exists(file_path): + raise RasaException( + f"""File {file_path} does not exist. + Re-check the files inside the directory {model_directory}. + It should contain the following model + files - [{", ".join(files_to_check)}]""" + ) + + def _get_validated_model_url(self) -> Text: + """Validates the specified `model_url` parameter. + + The `model_url` parameter cannot be left empty. It can either + be set to a remote URL where the model is hosted or it can be + a path to a local directory. + + Returns: + Validated path to model + """ + model_url = self.component_config.get("model_url", None) + + if not model_url: + raise RasaException( + f"""Parameter "model_url" was not specified in the configuration + of "{ConveRTFeaturizer.__name__}". It is mandatory to pass a value for this parameter. + You can either use a community hosted URL of the model + or if you have a local copy of the model, pass the + path to the directory containing the model files.""" + ) + + if model_url == ORIGINAL_TF_HUB_MODULE_URL: + # Can't use the originally hosted URL + raise RasaException( + f"""Parameter "model_url" of "{ConveRTFeaturizer.__name__}" was + set to "{model_url}" which does not contain the model any longer. + You can either use a community hosted URL or if you have a + local copy of the model, pass the path to the directory + containing the model files.""" + ) + + if model_url == RESTRICTED_ACCESS_URL: + # Can't use the URL that is reserved for tests only + raise RasaException( + f"""Parameter "model_url" of "{ConveRTFeaturizer.__name__}" was + set to "{model_url}" which is strictly reserved for pytests of Rasa Open Source only. + Due to licensing issues you are not allowed to use the model from this URL. + You can either use a community hosted URL or if you have a + local copy of the model, pass the path to the directory + containing the model files.""" + ) + + if os.path.isfile(model_url): + # Definitely invalid since the specified path should be a directory + raise RasaException( + f"""Parameter "model_url" of "{ConveRTFeaturizer.__name__}" was + set to the path of a file which is invalid. You + can either use a community hosted URL or if you have a + local copy of the model, pass the path to the directory + containing the model files.""" + ) + + if rasa.nlu.utils.is_url(model_url): + return model_url + + if os.path.isdir(model_url): + # Looks like a local directory. 
Inspect the directory + # to see if model files exist. + self._validate_model_files_exist(model_url) + # Convert the path to an absolute one since + # TFHUB doesn't like relative paths + return os.path.abspath(model_url) + + raise RasaException( + f"""{model_url} is neither a valid remote URL nor a local directory. + You can either use a community hosted URL or if you have a + local copy of the model, pass the path to + the directory containing the model files.""" + ) + + @staticmethod + def _get_signature(signature: Text, module: Any) -> NoReturn: + """Retrieve a signature from a (hopefully loaded) TF model.""" if not module: raise Exception( "ConveRTFeaturizer needs a proper loaded tensorflow module when used. " @@ -60,39 +194,34 @@ def __get_signature(signature: Text, module: Any) -> NoReturn: return module.signatures[signature] def _compute_features( - self, batch_examples: List[Message], module: Any, attribute: Text = TEXT + self, batch_examples: List[Message], attribute: Text = TEXT ) -> Tuple[np.ndarray, np.ndarray]: - - sentence_encodings = self._compute_sentence_encodings( - batch_examples, module, attribute - ) + sentence_encodings = self._compute_sentence_encodings(batch_examples, attribute) ( sequence_encodings, number_of_tokens_in_sentence, - ) = self._compute_sequence_encodings(batch_examples, module, attribute) + ) = self._compute_sequence_encodings(batch_examples, attribute) return self._get_features( sentence_encodings, sequence_encodings, number_of_tokens_in_sentence ) def _compute_sentence_encodings( - self, batch_examples: List[Message], module: Any, attribute: Text = TEXT + self, batch_examples: List[Message], attribute: Text = TEXT ) -> np.ndarray: # Get text for attribute of each example batch_attribute_text = [ex.get(attribute) for ex in batch_examples] - sentence_encodings = self._sentence_encoding_of_text( - batch_attribute_text, module - ) + sentence_encodings = self._sentence_encoding_of_text(batch_attribute_text) # convert them to a sequence of 1 return np.reshape(sentence_encodings, (len(batch_examples), 1, -1)) def _compute_sequence_encodings( - self, batch_examples: List[Message], module: Any, attribute: Text = TEXT + self, batch_examples: List[Message], attribute: Text = TEXT ) -> Tuple[np.ndarray, List[int]]: list_of_tokens = [ - example.get(TOKENS_NAMES[attribute]) for example in batch_examples + self.tokenize(example, attribute) for example in batch_examples ] number_of_tokens_in_sentence = [ @@ -103,7 +232,7 @@ def _compute_sequence_encodings( # the returned embeddings from ConveRT matches the length of the tokens # (including sub-tokens) tokenized_texts = self._tokens_to_text(list_of_tokens) - token_features = self._sequence_encoding_of_text(tokenized_texts, module) + token_features = self._sequence_encoding_of_text(tokenized_texts) # ConveRT might split up tokens into sub-tokens # take the mean of the sub-token vectors and use that as the token vector @@ -120,7 +249,6 @@ def _get_features( number_of_tokens_in_sentence: List[int], ) -> Tuple[np.ndarray, np.ndarray]: """Get the sequence and sentence features.""" - sentence_embeddings = [] sequence_embeddings = [] @@ -138,8 +266,9 @@ def _get_features( def _tokens_to_text(list_of_tokens: List[List[Token]]) -> List[Text]: """Convert list of tokens to text. 
- Add a whitespace between two tokens if the end value of the first tokens is - not the same as the end value of the second token.""" + Add a whitespace between two tokens if the end value of the first tokens + is not the same as the end value of the second token. + """ texts = [] for tokens in list_of_tokens: text = "" @@ -154,23 +283,31 @@ def _tokens_to_text(list_of_tokens: List[List[Token]]) -> List[Text]: return texts - def _sentence_encoding_of_text(self, batch: List[Text], module: Any) -> np.ndarray: - signature = self.__get_signature("default", module) - return signature(tf.convert_to_tensor(batch))["default"].numpy() + def _sentence_encoding_of_text(self, batch: List[Text]) -> np.ndarray: - def _sequence_encoding_of_text(self, batch: List[Text], module: Any) -> np.ndarray: - signature = self.__get_signature("encode_sequence", module) + return self.sentence_encoding_signature(tf.convert_to_tensor(batch))[ + "default" + ].numpy() - return signature(tf.convert_to_tensor(batch))["sequence_encoding"].numpy() + def _sequence_encoding_of_text(self, batch: List[Text]) -> np.ndarray: + + return self.sequence_encoding_signature(tf.convert_to_tensor(batch))[ + "sequence_encoding" + ].numpy() def train( self, training_data: TrainingData, config: Optional[RasaNLUModelConfig] = None, - *, - tf_hub_module: Any = None, **kwargs: Any, ) -> None: + """Featurize all message attributes in the training data with the ConveRT model. + + Args: + training_data: Training data to be featurized + config: Pipeline configuration + **kwargs: Any other arguments. + """ if config is not None and config.language != "en": rasa.shared.utils.io.raise_warning( f"Since ``ConveRT`` model is trained only on an english " @@ -203,7 +340,7 @@ def train( ( batch_sequence_features, batch_sentence_features, - ) = self._compute_features(batch_examples, tf_hub_module, attribute) + ) = self._compute_features(batch_examples, attribute) self._set_features( batch_examples, @@ -212,14 +349,17 @@ def train( attribute, ) - def process( - self, message: Message, *, tf_hub_module: Any = None, **kwargs: Any - ) -> None: + def process(self, message: Message, **kwargs: Any) -> None: + """Featurize an incoming message with the ConveRT model. - for attribute in DENSE_FEATURIZABLE_ATTRIBUTES: + Args: + message: Message to be featurized + **kwargs: Any other arguments. + """ + for attribute in {TEXT, ACTION_TEXT}: if message.get(attribute): sequence_features, sentence_features = self._compute_features( - [message], tf_hub_module, attribute=attribute + [message], attribute=attribute ) self._set_features( @@ -249,3 +389,61 @@ def _set_features( self.component_config[FEATURIZER_CLASS_ALIAS], ) example.add_features(_sentence_features) + + @classmethod + def cache_key( + cls, component_meta: Dict[Text, Any], model_metadata: Metadata + ) -> Optional[Text]: + """Cache the component for future use. + + Args: + component_meta: configuration for the component. + model_metadata: configuration for the whole pipeline. + + Returns: key of the cache for future retrievals. 
+ """ + _config = common.update_existing_keys(cls.defaults, component_meta) + return f"{cls.name}-{rasa.core.utils.get_dict_hash(_config)}" + + def provide_context(self) -> Dict[Text, Any]: + """Store the model in pipeline context for future use.""" + return {"tf_hub_module": self.module} + + def _tokenize(self, sentence: Text) -> Any: + + return self.tokenize_signature(tf.convert_to_tensor([sentence]))[ + "default" + ].numpy() + + def tokenize(self, message: Message, attribute: Text) -> List[Token]: + """Tokenize the text using the ConveRT model. + + ConveRT adds a special char in front of (some) words and splits words into + sub-words. To ensure the entity start and end values matches the token values, + reuse the tokens that are already assigned to the message. If individual tokens + are split up into multiple tokens, add this information to the + respected tokens. + """ + tokens_in = message.get(TOKENS_NAMES[attribute]) + + tokens_out = [] + + for token in tokens_in: + # use ConveRT model to tokenize the text + split_token_strings = self._tokenize(token.text)[0] + + # clean tokens (remove special chars and empty tokens) + split_token_strings = self._clean_tokens(split_token_strings) + + token.set(NUMBER_OF_SUB_TOKENS, len(split_token_strings)) + + tokens_out.append(token) + + message.set(TOKENS_NAMES[attribute], tokens_out) + return tokens_out + + @staticmethod + def _clean_tokens(tokens: List[bytes]) -> List[Text]: + """Encode tokens and remove special char added by ConveRT.""" + tokens = [string.decode("utf-8").replace("﹏", "") for string in tokens] + return [string for string in tokens if string] diff --git a/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py b/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py index d0bea59d1c78..4583dcd6fad1 100644 --- a/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py +++ b/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py @@ -1,33 +1,776 @@ -from typing import Any, Optional, Text, List, Type +import numpy as np +import logging +from typing import Any, Optional, Text, List, Type, Dict, Tuple + +import rasa.core.utils from rasa.nlu.config import RasaNLUModelConfig -from rasa.nlu.components import Component +from rasa.nlu.components import Component, UnsupportedLanguageError from rasa.nlu.featurizers.featurizer import DenseFeaturizer +from rasa.nlu.model import Metadata from rasa.shared.nlu.training_data.features import Features -from rasa.nlu.utils.hugging_face.hf_transformers import HFTransformersNLP -from rasa.nlu.tokenizers.lm_tokenizer import LanguageModelTokenizer +from rasa.nlu.tokenizers.tokenizer import Tokenizer, Token from rasa.shared.nlu.training_data.training_data import TrainingData from rasa.shared.nlu.training_data.message import Message from rasa.nlu.constants import ( - LANGUAGE_MODEL_DOCS, DENSE_FEATURIZABLE_ATTRIBUTES, SEQUENCE_FEATURES, SENTENCE_FEATURES, FEATURIZER_CLASS_ALIAS, + NO_LENGTH_RESTRICTION, + NUMBER_OF_SUB_TOKENS, + TOKENS_NAMES, + LANGUAGE_MODEL_DOCS, +) +from rasa.shared.nlu.constants import ( + TEXT, + FEATURE_TYPE_SENTENCE, + FEATURE_TYPE_SEQUENCE, + ACTION_TEXT, ) -from rasa.shared.nlu.constants import TEXT, FEATURE_TYPE_SENTENCE, FEATURE_TYPE_SEQUENCE +from rasa.utils import train_utils + +MAX_SEQUENCE_LENGTHS = { + "bert": 512, + "gpt": 512, + "gpt2": 512, + "xlnet": NO_LENGTH_RESTRICTION, + "distilbert": 512, + "roberta": 512, +} + +logger = logging.getLogger(__name__) class LanguageModelFeaturizer(DenseFeaturizer): - """Featurizer using transformer based language models. 
+ """Featurizer using transformer-based language models. - Uses the output of HFTransformersNLP component to set the sequence and sentence - level representations for dense featurizable attributes of each message object. + The transformers(https://github.com/huggingface/transformers) library + is used to load pre-trained language models like BERT, GPT-2, etc. + The component also tokenizes and featurizes dense featurizable attributes of + each message. """ + defaults = { + # name of the language model to load. + "model_name": "bert", + # Pre-Trained weights to be loaded(string) + "model_weights": None, + # an optional path to a specific directory to download + # and cache the pre-trained model weights. + "cache_dir": None, + } + @classmethod def required_components(cls) -> List[Type[Component]]: - return [HFTransformersNLP, LanguageModelTokenizer] + """Packages needed to be installed.""" + return [Tokenizer] + + def __init__( + self, + component_config: Optional[Dict[Text, Any]] = None, + skip_model_load: bool = False, + hf_transformers_loaded: bool = False, + ) -> None: + """Initializes LanguageModelFeaturizer with the specified model. + + Args: + component_config: Configuration for the component. + skip_model_load: Skip loading the model for pytests. + hf_transformers_loaded: Skip loading of model and metadata, use + HFTransformers output instead. + """ + super(LanguageModelFeaturizer, self).__init__(component_config) + if hf_transformers_loaded: + return + self._load_model_metadata() + self._load_model_instance(skip_model_load) + + @classmethod + def create( + cls, component_config: Dict[Text, Any], config: RasaNLUModelConfig + ) -> "DenseFeaturizer": + language = config.language + if not cls.can_handle_language(language): + # check failed + raise UnsupportedLanguageError(cls.name, language) + # TODO: remove this when HFTransformersNLP is removed for good + if isinstance(config, Metadata): + hf_transformers_loaded = "HFTransformersNLP" in [ + c["name"] for c in config.metadata["pipeline"] + ] + else: + hf_transformers_loaded = "HFTransformersNLP" in config.component_names + return cls(component_config, hf_transformers_loaded=hf_transformers_loaded) + + @classmethod + def load( + cls, + meta: Dict[Text, Any], + model_dir: Optional[Text] = None, + model_metadata: Optional["Metadata"] = None, + cached_component: Optional["Component"] = None, + **kwargs: Any, + ) -> "Component": + """Load this component from file. + + After a component has been trained, it will be persisted by + calling `persist`. When the pipeline gets loaded again, + this component needs to be able to restore itself. + Components can rely on any context attributes that are + created by :meth:`components.Component.create` + calls to components previous to this one. + + This method differs from the parent method only in that it calls create + rather than the constructor if the component is not found. This is to + trigger the check for HFTransformersNLP and the method can be removed + when HFTRansformersNLP is removed. + + Args: + meta: Any configuration parameter related to the model. + model_dir: The directory to load the component from. + model_metadata: The model's :class:`rasa.nlu.model.Metadata`. + cached_component: The cached component. 
+ + Returns: + the loaded component + """ + # TODO: remove this when HFTransformersNLP is removed for good + if cached_component: + return cached_component + + return cls.create(meta, model_metadata) + + def _load_model_metadata(self) -> None: + """Load the metadata for the specified model and sets these properties. + + This includes the model name, model weights, cache directory and the + maximum sequence length the model can handle. + """ + from rasa.nlu.utils.hugging_face.registry import ( + model_class_dict, + model_weights_defaults, + ) + + self.model_name = self.component_config["model_name"] + + if self.model_name not in model_class_dict: + raise KeyError( + f"'{self.model_name}' not a valid model name. Choose from " + f"{str(list(model_class_dict.keys()))} or create" + f"a new class inheriting from this class to support your model." + ) + + self.model_weights = self.component_config["model_weights"] + self.cache_dir = self.component_config["cache_dir"] + + if not self.model_weights: + logger.info( + f"Model weights not specified. Will choose default model " + f"weights: {model_weights_defaults[self.model_name]}" + ) + self.model_weights = model_weights_defaults[self.model_name] + + self.max_model_sequence_length = MAX_SEQUENCE_LENGTHS[self.model_name] + + def _load_model_instance(self, skip_model_load: bool) -> None: + """Try loading the model instance. + + Args: + skip_model_load: Skip loading the model instances to save time. This + should be True only for pytests + """ + if skip_model_load: + # This should be True only during pytests + return + + from rasa.nlu.utils.hugging_face.registry import ( + model_class_dict, + model_tokenizer_dict, + ) + + logger.debug(f"Loading Tokenizer and Model for {self.model_name}") + + self.tokenizer = model_tokenizer_dict[self.model_name].from_pretrained( + self.model_weights, cache_dir=self.cache_dir + ) + self.model = model_class_dict[self.model_name].from_pretrained( + self.model_weights, cache_dir=self.cache_dir + ) + + # Use a universal pad token since all transformer architectures do not have a + # consistent token. Instead of pad_token_id we use unk_token_id because + # pad_token_id is not set for all architectures. We can't add a new token as + # well since vocabulary resizing is not yet supported for TF classes. + # Also, this does not hurt the model predictions since we use an attention mask + # while feeding input. + self.pad_token_id = self.tokenizer.unk_token_id + + @classmethod + def cache_key( + cls, component_meta: Dict[Text, Any], model_metadata: Metadata + ) -> Optional[Text]: + """Cache the component for future use. + + Args: + component_meta: configuration for the component. + model_metadata: configuration for the whole pipeline. + + Returns: key of the cache for future retrievals. + """ + weights = component_meta.get("model_weights") or {} + + return ( + f"{cls.name}-{component_meta.get('model_name')}-" + f"{rasa.core.utils.get_dict_hash(weights)}" + ) + + @classmethod + def required_packages(cls) -> List[Text]: + """Packages needed to be installed.""" + return ["transformers"] + + def _lm_tokenize(self, text: Text) -> Tuple[List[int], List[Text]]: + """Pass the text through the tokenizer of the language model. + + Args: + text: Text to be tokenized. + + Returns: List of token ids and token strings. 
+ """ + split_token_ids = self.tokenizer.encode(text, add_special_tokens=False) + + split_token_strings = self.tokenizer.convert_ids_to_tokens(split_token_ids) + + return split_token_ids, split_token_strings + + def _add_lm_specific_special_tokens( + self, token_ids: List[List[int]] + ) -> List[List[int]]: + """Add language model specific special tokens which were used during + their training. + + Args: + token_ids: List of token ids for each example in the batch. + + Returns: Augmented list of token ids for each example in the batch. + """ + from rasa.nlu.utils.hugging_face.registry import ( + model_special_tokens_pre_processors, + ) + + augmented_tokens = [ + model_special_tokens_pre_processors[self.model_name](example_token_ids) + for example_token_ids in token_ids + ] + return augmented_tokens + + def _lm_specific_token_cleanup( + self, split_token_ids: List[int], token_strings: List[Text] + ) -> Tuple[List[int], List[Text]]: + """Clean up special chars added by tokenizers of language models. + + Many language models add a special char in front/back of (some) words. We clean + up those chars as they are not + needed once the features are already computed. + + Args: + split_token_ids: List of token ids received as output from the language + model specific tokenizer. + token_strings: List of token strings received as output from the language + model specific tokenizer. + + Returns: Cleaned up token ids and token strings. + """ + from rasa.nlu.utils.hugging_face.registry import model_tokens_cleaners + + return model_tokens_cleaners[self.model_name](split_token_ids, token_strings) + + def _post_process_sequence_embeddings( + self, sequence_embeddings: np.ndarray + ) -> Tuple[np.ndarray, np.ndarray]: + """Compute sentence and sequence level representations for relevant tokens. + + Args: + sequence_embeddings: Sequence level dense features received as output from + language model. + + Returns: Sentence and sequence level representations. + """ + from rasa.nlu.utils.hugging_face.registry import ( + model_embeddings_post_processors, + ) + + sentence_embeddings = [] + post_processed_sequence_embeddings = [] + + for example_embedding in sequence_embeddings: + ( + example_sentence_embedding, + example_post_processed_embedding, + ) = model_embeddings_post_processors[self.model_name](example_embedding) + + sentence_embeddings.append(example_sentence_embedding) + post_processed_sequence_embeddings.append(example_post_processed_embedding) + + return ( + np.array(sentence_embeddings), + np.array(post_processed_sequence_embeddings), + ) + + def _tokenize_example( + self, message: Message, attribute: Text + ) -> Tuple[List[Token], List[int]]: + """Tokenize a single message example. + + Many language models add a special char in front of (some) words and split + words into sub-words. To ensure the entity start and end values matches the + token values, use the tokens produced by the Tokenizer component. If + individual tokens are split up into multiple tokens, we add this information + to the respected token. + + Args: + message: Single message object to be processed. + attribute: Property of message to be processed, one of ``TEXT`` or + ``RESPONSE``. + + Returns: List of token strings and token ids for the corresponding + attribute of the message. 
+ """ + tokens_in = message.get(TOKENS_NAMES[attribute]) + tokens_out = [] + + token_ids_out = [] + + for token in tokens_in: + # use lm specific tokenizer to further tokenize the text + split_token_ids, split_token_strings = self._lm_tokenize(token.text) + + (split_token_ids, split_token_strings) = self._lm_specific_token_cleanup( + split_token_ids, split_token_strings + ) + + token_ids_out += split_token_ids + + token.set(NUMBER_OF_SUB_TOKENS, len(split_token_strings)) + + tokens_out.append(token) + + return tokens_out, token_ids_out + + def _get_token_ids_for_batch( + self, batch_examples: List[Message], attribute: Text + ) -> Tuple[List[List[Token]], List[List[int]]]: + """Compute token ids and token strings for each example in batch. + + A token id is the id of that token in the vocabulary of the language model. + + Args: + batch_examples: Batch of message objects for which tokens need to be + computed. + attribute: Property of message to be processed, one of ``TEXT`` or + ``RESPONSE``. + + Returns: List of token strings and token ids for each example in the batch. + """ + batch_token_ids = [] + batch_tokens = [] + for example in batch_examples: + + example_tokens, example_token_ids = self._tokenize_example( + example, attribute + ) + batch_tokens.append(example_tokens) + batch_token_ids.append(example_token_ids) + + return batch_tokens, batch_token_ids + + @staticmethod + def _compute_attention_mask( + actual_sequence_lengths: List[int], max_input_sequence_length: int + ) -> np.ndarray: + """Compute a mask for padding tokens. + + This mask will be used by the language model so that it does not attend to + padding tokens. + + Args: + actual_sequence_lengths: List of length of each example without any + padding. + max_input_sequence_length: Maximum length of a sequence that will be + present in the input batch. This is + after taking into consideration the maximum input sequence the model + can handle. Hence it can never be + greater than self.max_model_sequence_length in case the model + applies length restriction. + + Returns: Computed attention mask, 0 for padding and 1 for non-padding + tokens. + """ + attention_mask = [] + + for actual_sequence_length in actual_sequence_lengths: + # add 1s for present tokens, fill up the remaining space up to max + # sequence length with 0s (non-existing tokens) + padded_sequence = [1] * min( + actual_sequence_length, max_input_sequence_length + ) + [0] * ( + max_input_sequence_length + - min(actual_sequence_length, max_input_sequence_length) + ) + attention_mask.append(padded_sequence) + + attention_mask = np.array(attention_mask).astype(np.float32) + return attention_mask + + def _extract_sequence_lengths( + self, batch_token_ids: List[List[int]] + ) -> Tuple[List[int], int]: + """Extracts the sequence length for each example and maximum sequence length. + + Args: + batch_token_ids: List of token ids for each example in the batch. + + Returns: + Tuple consisting of: the actual sequence lengths for each example, + and the maximum input sequence length (taking into account the + maximum sequence length that the model can handle. 
+ """ + # Compute max length across examples + max_input_sequence_length = 0 + actual_sequence_lengths = [] + + for example_token_ids in batch_token_ids: + sequence_length = len(example_token_ids) + actual_sequence_lengths.append(sequence_length) + max_input_sequence_length = max( + max_input_sequence_length, len(example_token_ids) + ) + + # Take into account the maximum sequence length the model can handle + max_input_sequence_length = ( + max_input_sequence_length + if self.max_model_sequence_length == NO_LENGTH_RESTRICTION + else min(max_input_sequence_length, self.max_model_sequence_length) + ) + + return actual_sequence_lengths, max_input_sequence_length + + def _add_padding_to_batch( + self, batch_token_ids: List[List[int]], max_sequence_length_model: int + ) -> List[List[int]]: + """Add padding so that all examples in the batch are of the same length. + + Args: + batch_token_ids: Batch of examples where each example is a non-padded list + of token ids. + max_sequence_length_model: Maximum length of any input sequence in the batch + to be fed to the model. + + Returns: + Padded batch with all examples of the same length. + """ + padded_token_ids = [] + + # Add padding according to max_sequence_length + # Some models don't contain pad token, we use unknown token as padding token. + # This doesn't affect the computation since we compute an attention mask + # anyways. + for example_token_ids in batch_token_ids: + + # Truncate any longer sequences so that they can be fed to the model + if len(example_token_ids) > max_sequence_length_model: + example_token_ids = example_token_ids[:max_sequence_length_model] + + padded_token_ids.append( + example_token_ids + + [self.pad_token_id] + * (max_sequence_length_model - len(example_token_ids)) + ) + return padded_token_ids + + @staticmethod + def _extract_nonpadded_embeddings( + embeddings: np.ndarray, actual_sequence_lengths: List[int] + ) -> np.ndarray: + """Extract embeddings for actual tokens. + + Use pre-computed non-padded lengths of each example to extract embeddings + for non-padding tokens. + + Args: + embeddings: sequence level representations for each example of the batch. + actual_sequence_lengths: non-padded lengths of each example of the batch. + + Returns: + Sequence level embeddings for only non-padding tokens of the batch. + """ + nonpadded_sequence_embeddings = [] + for index, embedding in enumerate(embeddings): + unmasked_embedding = embedding[: actual_sequence_lengths[index]] + nonpadded_sequence_embeddings.append(unmasked_embedding) + + return np.array(nonpadded_sequence_embeddings) + + def _compute_batch_sequence_features( + self, batch_attention_mask: np.ndarray, padded_token_ids: List[List[int]] + ) -> np.ndarray: + """Feed the padded batch to the language model. + + Args: + batch_attention_mask: Mask of 0s and 1s which indicate whether the token + is a padding token or not. + padded_token_ids: Batch of token ids for each example. The batch is padded + and hence can be fed at once. + + Returns: + Sequence level representations from the language model. 
+ """ + model_outputs = self.model( + np.array(padded_token_ids), attention_mask=np.array(batch_attention_mask) + ) + + # sequence hidden states is always the first output from all models + sequence_hidden_states = model_outputs[0] + + sequence_hidden_states = sequence_hidden_states.numpy() + return sequence_hidden_states + + def _validate_sequence_lengths( + self, + actual_sequence_lengths: List[int], + batch_examples: List[Message], + attribute: Text, + inference_mode: bool = False, + ) -> None: + """Validate if sequence lengths of all inputs are less the max sequence + length the model can handle. + + This method should throw an error during training, whereas log a debug + message during inference if any of the input examples have a length + greater than maximum sequence length allowed. + + Args: + actual_sequence_lengths: original sequence length of all inputs + batch_examples: all message instances in the batch + attribute: attribute of message object to be processed + inference_mode: Whether this is during training or during inferencing + """ + if self.max_model_sequence_length == NO_LENGTH_RESTRICTION: + # There is no restriction on sequence length from the model + return + + for sequence_length, example in zip(actual_sequence_lengths, batch_examples): + if sequence_length > self.max_model_sequence_length: + if not inference_mode: + raise RuntimeError( + f"The sequence length of '{example.get(attribute)[:20]}...' " + f"is too long({sequence_length} tokens) for the " + f"model chosen {self.model_name} which has a maximum " + f"sequence length of {self.max_model_sequence_length} tokens. Either " + f"shorten the message or use a model which has no " + f"restriction on input sequence length like XLNet." + ) + logger.debug( + f"The sequence length of '{example.get(attribute)[:20]}...' " + f"is too long({sequence_length} tokens) for the " + f"model chosen {self.model_name} which has a maximum " + f"sequence length of {self.max_model_sequence_length} tokens. " + f"Downstream model predictions may be affected because of this." + ) + + def _add_extra_padding( + self, sequence_embeddings: np.ndarray, actual_sequence_lengths: List[int] + ) -> np.ndarray: + """Add extra zero padding to match the original sequence length. + + This is only done if the input was truncated during the batch + preparation of input for the model. + Args: + sequence_embeddings: Embeddings returned from the model + actual_sequence_lengths: original sequence length of all inputs + + Returns: + Modified sequence embeddings with padding if necessary + """ + if self.max_model_sequence_length == NO_LENGTH_RESTRICTION: + # No extra padding needed because there wouldn't have been any + # truncation in the first place + return sequence_embeddings + + reshaped_sequence_embeddings = [] + for index, embedding in enumerate(sequence_embeddings): + embedding_size = embedding.shape[-1] + if actual_sequence_lengths[index] > self.max_model_sequence_length: + embedding = np.concatenate( + [ + embedding, + np.zeros( + ( + actual_sequence_lengths[index] + - self.max_model_sequence_length, + embedding_size, + ), + dtype=np.float32, + ), + ] + ) + reshaped_sequence_embeddings.append(embedding) + + return np.array(reshaped_sequence_embeddings) + + def _get_model_features_for_batch( + self, + batch_token_ids: List[List[int]], + batch_tokens: List[List[Token]], + batch_examples: List[Message], + attribute: Text, + inference_mode: bool = False, + ) -> Tuple[np.ndarray, np.ndarray]: + """Compute dense features of each example in the batch. 
+ + We first add the special tokens corresponding to each language model. Next, we + add appropriate padding and compute a mask for that padding so that it doesn't + affect the feature computation. The padded batch is next fed to the language + model and token level embeddings are computed. Using the pre-computed mask, + embeddings for non-padding tokens are extracted and subsequently sentence + level embeddings are computed. + + Args: + batch_token_ids: List of token ids of each example in the batch. + batch_tokens: List of token objects for each example in the batch. + batch_examples: List of examples in the batch. + attribute: attribute of the Message object to be processed. + inference_mode: Whether the call is during training or during inference. + + Returns: + Sentence and token level dense representations. + """ + # Let's first add tokenizer specific special tokens to all examples + batch_token_ids_augmented = self._add_lm_specific_special_tokens( + batch_token_ids + ) + + # Compute sequence lengths for all examples + ( + actual_sequence_lengths, + max_input_sequence_length, + ) = self._extract_sequence_lengths(batch_token_ids_augmented) + + # Validate that all sequences can be processed based on their sequence + # lengths and the maximum sequence length the model can handle + self._validate_sequence_lengths( + actual_sequence_lengths, batch_examples, attribute, inference_mode + ) + + # Add padding so that whole batch can be fed to the model + padded_token_ids = self._add_padding_to_batch( + batch_token_ids_augmented, max_input_sequence_length + ) + + # Compute attention mask based on actual_sequence_length + batch_attention_mask = self._compute_attention_mask( + actual_sequence_lengths, max_input_sequence_length + ) + + # Get token level features from the model + sequence_hidden_states = self._compute_batch_sequence_features( + batch_attention_mask, padded_token_ids + ) + + # Extract features for only non-padding tokens + sequence_nonpadded_embeddings = self._extract_nonpadded_embeddings( + sequence_hidden_states, actual_sequence_lengths + ) + + # Extract sentence level and post-processed features + ( + sentence_embeddings, + sequence_embeddings, + ) = self._post_process_sequence_embeddings(sequence_nonpadded_embeddings) + + # Pad zeros for examples which were truncated in inference mode. 
+ # This is intentionally done after sentence embeddings have been + # extracted so that they are not affected + sequence_embeddings = self._add_extra_padding( + sequence_embeddings, actual_sequence_lengths + ) + + # shape of matrix for all sequence embeddings + batch_dim = len(sequence_embeddings) + seq_dim = max(e.shape[0] for e in sequence_embeddings) + feature_dim = sequence_embeddings[0].shape[1] + shape = (batch_dim, seq_dim, feature_dim) + + # align features with tokens so that we have just one vector per token + # (don't include sub-tokens) + sequence_embeddings = train_utils.align_token_features( + batch_tokens, sequence_embeddings, shape + ) + + # sequence_embeddings is a padded numpy array + # remove the padding, keep just the non-zero vectors + sequence_final_embeddings = [] + for embeddings, tokens in zip(sequence_embeddings, batch_tokens): + sequence_final_embeddings.append(embeddings[: len(tokens)]) + sequence_final_embeddings = np.array(sequence_final_embeddings) + + return sentence_embeddings, sequence_final_embeddings + + def _get_docs_for_batch( + self, + batch_examples: List[Message], + attribute: Text, + inference_mode: bool = False, + ) -> List[Dict[Text, Any]]: + """Compute language model docs for all examples in the batch. + + Args: + batch_examples: Batch of message objects for which language model docs + need to be computed. + attribute: Property of message to be processed, one of ``TEXT`` or + ``RESPONSE``. + inference_mode: Whether the call is during inference or during training. + + + Returns: + List of language model docs for each message in batch. + """ + hf_transformers_doc = batch_examples[0].get(LANGUAGE_MODEL_DOCS[attribute]) + if hf_transformers_doc: + # This should only be the case if the deprecated + # HFTransformersNLP component is used in the pipeline + # TODO: remove this when HFTransformersNLP is removed for good + logging.debug( + f"'{LANGUAGE_MODEL_DOCS[attribute]}' set: this " + f"indicates you're using the deprecated component " + f"HFTransformersNLP, please remove it from your " + f"pipeline." + ) + return [ex.get(LANGUAGE_MODEL_DOCS[attribute]) for ex in batch_examples] + + batch_tokens, batch_token_ids = self._get_token_ids_for_batch( + batch_examples, attribute + ) + + ( + batch_sentence_features, + batch_sequence_features, + ) = self._get_model_features_for_batch( + batch_token_ids, batch_tokens, batch_examples, attribute, inference_mode + ) + + # A doc consists of + # {'sequence_features': ..., 'sentence_features': ...} + batch_docs = [] + for index in range(len(batch_examples)): + doc = { + SEQUENCE_FEATURES: batch_sequence_features[index], + SENTENCE_FEATURES: np.reshape(batch_sentence_features[index], (1, -1)), + } + batch_docs.append(doc) + + return batch_docs def train( self, @@ -35,32 +778,61 @@ def train( config: Optional[RasaNLUModelConfig] = None, **kwargs: Any, ) -> None: + """Compute tokens and dense features for each message in training data. - for example in training_data.training_examples: - for attribute in DENSE_FEATURIZABLE_ATTRIBUTES: - self._set_lm_features(example, attribute) - - def _get_doc(self, message: Message, attribute: Text) -> Any: - """ - Get the language model doc. A doc consists of - {'token_ids': ..., 'tokens': ..., - 'sequence_features': ..., 'sentence_features': ...} + Args: + training_data: NLU training data to be tokenized and featurized + config: NLU pipeline config consisting of all components. 
""" - return message.get(LANGUAGE_MODEL_DOCS[attribute]) + batch_size = 64 - def process(self, message: Message, **kwargs: Any) -> None: - """Sets the dense features from the language model doc to the incoming - message.""" for attribute in DENSE_FEATURIZABLE_ATTRIBUTES: - self._set_lm_features(message, attribute) - def _set_lm_features(self, message: Message, attribute: Text = TEXT) -> None: - """Adds the precomputed word vectors to the messages features.""" - doc = self._get_doc(message, attribute) + non_empty_examples = list( + filter(lambda x: x.get(attribute), training_data.training_examples) + ) - if doc is None: - return + batch_start_index = 0 + + while batch_start_index < len(non_empty_examples): + + batch_end_index = min( + batch_start_index + batch_size, len(non_empty_examples) + ) + # Collect batch examples + batch_messages = non_empty_examples[batch_start_index:batch_end_index] + + # Construct a doc with relevant features + # extracted(tokens, dense_features) + batch_docs = self._get_docs_for_batch(batch_messages, attribute) + + for index, ex in enumerate(batch_messages): + self._set_lm_features(batch_docs[index], ex, attribute) + batch_start_index += batch_size + def process(self, message: Message, **kwargs: Any) -> None: + """Process an incoming message by computing its tokens and dense features. + + Args: + message: Incoming message object + """ + # process of all featurizers operates only on TEXT and ACTION_TEXT attributes, + # because all other attributes are labels which are featurized during training + # and their features are stored by the model itself. + for attribute in {TEXT, ACTION_TEXT}: + if message.get(attribute): + self._set_lm_features( + self._get_docs_for_batch( + [message], attribute=attribute, inference_mode=True + )[0], + message, + attribute, + ) + + def _set_lm_features( + self, doc: Dict[Text, Any], message: Message, attribute: Text = TEXT + ) -> None: + """Adds the precomputed word vectors to the messages features.""" sequence_features = doc[SEQUENCE_FEATURES] sentence_features = doc[SENTENCE_FEATURES] diff --git a/rasa/nlu/tokenizers/convert_tokenizer.py b/rasa/nlu/tokenizers/convert_tokenizer.py index a2b4857732f1..369753791960 100644 --- a/rasa/nlu/tokenizers/convert_tokenizer.py +++ b/rasa/nlu/tokenizers/convert_tokenizer.py @@ -1,210 +1,28 @@ -from typing import Any, Dict, List, Optional, Text +from typing import Dict, Text, Any -from rasa.core.utils import get_dict_hash -from rasa.nlu.constants import NUMBER_OF_SUB_TOKENS -from rasa.nlu.model import Metadata -from rasa.nlu.tokenizers.tokenizer import Token +import rasa.shared.utils.io +from rasa.nlu.tokenizers.tokenizer import Tokenizer from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer -from rasa.shared.nlu.training_data.message import Message -from rasa.utils import common -import rasa.nlu.utils -import rasa.utils.train_utils as train_utils -from rasa.exceptions import RasaException -import tensorflow as tf -import os - - -# URL to the old remote location of the model which -# users might use. The model is no longer hosted here. -ORIGINAL_TF_HUB_MODULE_URL = ( - "https://github.com/PolyAI-LDN/polyai-models/releases/download/v1.0/model.tar.gz" -) - -# Warning: This URL is only intended for running pytests on ConveRT -# related components. This URL should not be allowed to be used by the user. 
-RESTRICTED_ACCESS_URL = "https://storage.googleapis.com/continuous-integration-model-storage/convert_tf2.tar.gz" class ConveRTTokenizer(WhitespaceTokenizer): - """Tokenizer using ConveRT model. + """This tokenizer is deprecated and will be removed in the future. - Loads the ConveRT(https://github.com/PolyAI-LDN/polyai-models#convert) - model from TFHub and computes sub-word tokens for dense - featurizable attributes of each message object. + The ConveRTFeaturizer component now sets the sub-token information + for dense featurizable attributes of each message object. """ - defaults = { - # Flag to check whether to split intents - "intent_tokenization_flag": False, - # Symbol on which intent should be split - "intent_split_symbol": "_", - # Regular expression to detect tokens - "token_pattern": None, - # Remote URL/Local path to model files - "model_url": None, - } - def __init__(self, component_config: Dict[Text, Any] = None) -> None: - """Construct a new tokenizer using the WhitespaceTokenizer framework. + """Initializes ConveRTTokenizer with the ConveRT model. Args: - component_config: User configuration for the component + component_config: Configuration for the component. """ super().__init__(component_config) - - self.model_url = self._get_validated_model_url() - - self.module = train_utils.load_tf_hub_model(self.model_url) - - self.tokenize_signature = self.module.signatures["tokenize"] - - @staticmethod - def _validate_model_files_exist(model_directory: Text) -> None: - """Check if essential model files exist inside the model_directory. - - Args: - model_directory: Directory to investigate - """ - files_to_check = [ - os.path.join(model_directory, "saved_model.pb"), - os.path.join(model_directory, "variables/variables.index"), - os.path.join(model_directory, "variables/variables.data-00001-of-00002"), - os.path.join(model_directory, "variables/variables.data-00000-of-00002"), - ] - - for file_path in files_to_check: - if not os.path.exists(file_path): - raise RasaException( - f"""File {file_path} does not exist. - Re-check the files inside the directory {model_directory}. - It should contain the following model - files - [{", ".join(files_to_check)}]""" - ) - - def _get_validated_model_url(self) -> Text: - """Validates the specified `model_url` parameter. - - The `model_url` parameter cannot be left empty. It can either - be set to a remote URL where the model is hosted or it can be - a path to a local directory. - - Returns: - Validated path to model - """ - model_url = self.component_config.get("model_url", None) - - if not model_url: - raise RasaException( - f"""Parameter "model_url" was not specified in the configuration - of "{ConveRTTokenizer.__name__}". - You can either use a community hosted URL of the model - or if you have a local copy of the model, pass the - path to the directory containing the model files.""" - ) - - if model_url == ORIGINAL_TF_HUB_MODULE_URL: - # Can't use the originally hosted URL - raise RasaException( - f"""Parameter "model_url" of "{ConveRTTokenizer.__name__}" was - set to "{model_url}" which does not contain the model any longer. 
- You can either use a community hosted URL or if you have a - local copy of the model, pass the path to the directory - containing the model files.""" - ) - - if model_url == RESTRICTED_ACCESS_URL: - # Can't use the URL that is reserved for tests only - raise RasaException( - f"""Parameter "model_url" of "{ConveRTTokenizer.__name__}" was - set to "{model_url}" which is strictly reserved for pytests of Rasa Open Source only. - Due to licensing issues you are not allowed to use the model from this URL. - You can either use a community hosted URL or if you have a - local copy of the model, pass the path to the directory - containing the model files.""" - ) - - if os.path.isfile(model_url): - # Definitely invalid since the specified path should be a directory - raise RasaException( - f"""Parameter "model_url" of "{ConveRTTokenizer.__name__}" was - set to the path of a file which is invalid. You - can either use a community hosted URL or if you have a - local copy of the model, pass the path to the directory - containing the model files.""" - ) - - if rasa.nlu.utils.is_url(model_url): - return model_url - - if os.path.isdir(model_url): - # Looks like a local directory. Inspect the directory - # to see if model files exist. - self._validate_model_files_exist(model_url) - # Convert the path to an absolute one since - # TFHUB doesn't like relative paths - return os.path.abspath(model_url) - - raise RasaException( - f"""{model_url} is neither a valid remote URL nor a local directory. - You can either use a community hosted URL or if you have a - local copy of the model, pass the path to - the directory containing the model files.""" + rasa.shared.utils.io.raise_warning( + f"'{self.__class__.__name__}' is deprecated and " + f"will be removed in the future. " + f"It is recommended to use the '{WhitespaceTokenizer.__name__}' or " + f"another {Tokenizer.__name__} instead.", + category=DeprecationWarning, ) - - @classmethod - def cache_key( - cls, component_meta: Dict[Text, Any], model_metadata: Metadata - ) -> Optional[Text]: - """Cache the component for future use. - - Args: - component_meta: configuration for the component. - model_metadata: configuration for the whole pipeline. - - Returns: key of the cache for future retrievals. - """ - _config = common.update_existing_keys(cls.defaults, component_meta) - return f"{cls.name}-{get_dict_hash(_config)}" - - def provide_context(self) -> Dict[Text, Any]: - return {"tf_hub_module": self.module} - - def _tokenize(self, sentence: Text) -> Any: - - return self.tokenize_signature(tf.convert_to_tensor([sentence]))[ - "default" - ].numpy() - - def tokenize(self, message: Message, attribute: Text) -> List[Token]: - """Tokenize the text using the ConveRT model. - ConveRT adds a special char in front of (some) words and splits words into - sub-words. To ensure the entity start and end values matches the token values, - tokenize the text first using the whitespace tokenizer. If individual tokens - are split up into multiple tokens, add this information to the - respected tokens. 
- """ - - # perform whitespace tokenization - tokens_in = super().tokenize(message, attribute) - - tokens_out = [] - - for token in tokens_in: - # use ConveRT model to tokenize the text - split_token_strings = self._tokenize(token.text)[0] - - # clean tokens (remove special chars and empty tokens) - split_token_strings = self._clean_tokens(split_token_strings) - - token.set(NUMBER_OF_SUB_TOKENS, len(split_token_strings)) - - tokens_out.append(token) - - return tokens_out - - @staticmethod - def _clean_tokens(tokens: List[bytes]) -> List[Text]: - """Encode tokens and remove special char added by ConveRT.""" - - tokens = [string.decode("utf-8").replace("﹏", "") for string in tokens] - return [string for string in tokens if string] diff --git a/rasa/nlu/tokenizers/lm_tokenizer.py b/rasa/nlu/tokenizers/lm_tokenizer.py index 5e3bd61f41bb..fbee73158ef1 100644 --- a/rasa/nlu/tokenizers/lm_tokenizer.py +++ b/rasa/nlu/tokenizers/lm_tokenizer.py @@ -1,35 +1,27 @@ -from typing import Text, List, Any, Dict, Type +from typing import Dict, Text, Any -from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer -from rasa.nlu.components import Component -from rasa.nlu.utils.hugging_face.hf_transformers import HFTransformersNLP -from rasa.shared.nlu.training_data.message import Message +import rasa.shared.utils.io +from rasa.nlu.tokenizers.tokenizer import Tokenizer +from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer -from rasa.nlu.constants import LANGUAGE_MODEL_DOCS, TOKENS +class LanguageModelTokenizer(WhitespaceTokenizer): + """This tokenizer is deprecated and will be removed in the future. -class LanguageModelTokenizer(Tokenizer): - """Tokenizer using transformer based language models. - - Uses the output of HFTransformersNLP component to set the tokens - for dense featurizable attributes of each message object. + Use the LanguageModelFeaturizer with any other Tokenizer instead. """ - @classmethod - def required_components(cls) -> List[Type[Component]]: - return [HFTransformersNLP] - - defaults = { - # Flag to check whether to split intents - "intent_tokenization_flag": False, - # Symbol on which intent should be split - "intent_split_symbol": "_", - } - - def get_doc(self, message: Message, attribute: Text) -> Dict[Text, Any]: - return message.get(LANGUAGE_MODEL_DOCS[attribute]) - - def tokenize(self, message: Message, attribute: Text) -> List[Token]: - doc = self.get_doc(message, attribute) - - return doc[TOKENS] + def __init__(self, component_config: Dict[Text, Any] = None) -> None: + """Initializes LanguageModelTokenizer for tokenization. + + Args: + component_config: Configuration for the component. + """ + super().__init__(component_config) + rasa.shared.utils.io.raise_warning( + f"'{self.__class__.__name__}' is deprecated and " + f"will be removed in the future. 
" + f"It is recommended to use the '{WhitespaceTokenizer.__name__}' or " + f"another {Tokenizer.__name__} instead.", + category=DeprecationWarning, + ) diff --git a/rasa/nlu/utils/hugging_face/hf_transformers.py b/rasa/nlu/utils/hugging_face/hf_transformers.py index 8b818f3b8030..8a512876d200 100644 --- a/rasa/nlu/utils/hugging_face/hf_transformers.py +++ b/rasa/nlu/utils/hugging_face/hf_transformers.py @@ -1,22 +1,22 @@ import logging from typing import Any, Dict, List, Text, Tuple, Optional -from rasa.core.utils import get_dict_hash +import rasa.core.utils from rasa.nlu.model import Metadata from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer +from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer from rasa.nlu.components import Component from rasa.nlu.config import RasaNLUModelConfig from rasa.shared.nlu.training_data.training_data import TrainingData from rasa.shared.nlu.training_data.message import Message from rasa.nlu.tokenizers.tokenizer import Token +import rasa.shared.utils.io import rasa.utils.train_utils as train_utils import numpy as np from rasa.nlu.constants import ( LANGUAGE_MODEL_DOCS, DENSE_FEATURIZABLE_ATTRIBUTES, - TOKEN_IDS, - TOKENS, SENTENCE_FEATURES, SEQUENCE_FEATURES, NUMBER_OF_SUB_TOKENS, @@ -37,12 +37,9 @@ class HFTransformersNLP(Component): - """Utility Component for interfacing between Transformers library and Rasa OS. + """This component is deprecated and will be removed in the future. - The transformers(https://github.com/huggingface/transformers) library - is used to load pre-trained language models like BERT, GPT-2, etc. - The component also tokenizes and featurizes dense featurizable attributes of each - message. + Use the LanguageModelFeaturizer instead. """ defaults = { @@ -60,11 +57,19 @@ def __init__( component_config: Optional[Dict[Text, Any]] = None, skip_model_load: bool = False, ) -> None: + """Initializes HFTransformsNLP with the models specified.""" super(HFTransformersNLP, self).__init__(component_config) self._load_model_metadata() self._load_model_instance(skip_model_load) self.whitespace_tokenizer = WhitespaceTokenizer() + rasa.shared.utils.io.raise_warning( + f"'{self.__class__.__name__}' is deprecated and " + f"will be removed in the future. " + f"It is recommended to use the '{LanguageModelFeaturizer.__name__}' " + f"instead.", + category=DeprecationWarning, + ) def _load_model_metadata(self) -> None: @@ -78,7 +83,7 @@ def _load_model_metadata(self) -> None: if self.model_name not in model_class_dict: raise KeyError( f"'{self.model_name}' not a valid model name. Choose from " - f"{str(list(model_class_dict.keys()))} or create" + f"{str(list(model_class_dict.keys()))} or create " f"a new class inheriting from this class to support your model." ) @@ -95,12 +100,12 @@ def _load_model_metadata(self) -> None: self.max_model_sequence_length = MAX_SEQUENCE_LENGTHS[self.model_name] def _load_model_instance(self, skip_model_load: bool) -> None: - """Try loading the model instance + """Try loading the model instance. Args: - skip_model_load: Skip loading the model instances to save time. This should be True only for pytests + skip_model_load: Skip loading the model instances to save time. 
+ This should be True only for pytests """ - if skip_model_load: # This should be True only during pytests return @@ -131,10 +136,20 @@ def _load_model_instance(self, skip_model_load: bool) -> None: def cache_key( cls, component_meta: Dict[Text, Any], model_metadata: Metadata ) -> Optional[Text]: + """Cache the component for future use. + Args: + component_meta: configuration for the component. + model_metadata: configuration for the whole pipeline. + + Returns: key of the cache for future retrievals. + """ weights = component_meta.get("model_weights") or {} - return f"{cls.name}-{component_meta.get('model_name')}-{get_dict_hash(weights)}" + return ( + f"{cls.name}-{component_meta.get('model_name')}-" + f"{rasa.core.utils.get_dict_hash(weights)}" + ) @classmethod def required_packages(cls) -> List[Text]: @@ -212,7 +227,6 @@ def _post_process_sequence_embeddings( Returns: Sentence and sequence level representations. """ - from rasa.nlu.utils.hugging_face.registry import ( model_embeddings_post_processors, ) @@ -254,7 +268,6 @@ def _tokenize_example( List of token strings and token ids for the corresponding attribute of the message. """ - tokens_in = self.whitespace_tokenizer.tokenize(message, attribute) tokens_out = [] @@ -292,7 +305,6 @@ def _get_token_ids_for_batch( Returns: List of token strings and token ids for each example in the batch. """ - batch_token_ids = [] batch_tokens = [] for example in batch_examples: @@ -323,7 +335,6 @@ def _compute_attention_mask( Returns: Computed attention mask, 0 for padding and 1 for non-padding tokens. """ - attention_mask = [] for actual_sequence_length in actual_sequence_lengths: @@ -343,7 +354,16 @@ def _compute_attention_mask( def _extract_sequence_lengths( self, batch_token_ids: List[List[int]] ) -> Tuple[List[int], int]: + """Extracts the sequence length for each example and maximum sequence length. + + Args: + batch_token_ids: List of token ids for each example in the batch. + Returns: + Tuple consisting of: the actual sequence lengths for each example, + and the maximum input sequence length (taking into account the + maximum sequence length that the model can handle. + """ # Compute max length across examples max_input_sequence_length = 0 actual_sequence_lengths = [] @@ -498,7 +518,6 @@ def _add_extra_padding( Returns: Modified sequence embeddings with padding if necessary """ - if self.max_model_sequence_length == NO_LENGTH_RESTRICTION: # No extra padding needed because there wouldn't have been any truncation in the first place return sequence_embeddings @@ -640,7 +659,6 @@ def _get_docs_for_batch( Returns: List of language model docs for each message in batch. """ - batch_tokens, batch_token_ids = self._get_token_ids_for_batch( batch_examples, attribute ) @@ -658,8 +676,6 @@ def _get_docs_for_batch( batch_docs = [] for index in range(len(batch_examples)): doc = { - TOKEN_IDS: batch_token_ids[index], - TOKENS: batch_tokens[index], SEQUENCE_FEATURES: batch_sequence_features[index], SENTENCE_FEATURES: np.reshape(batch_sentence_features[index], (1, -1)), } @@ -680,7 +696,6 @@ def train( config: NLU pipeline config consisting of all components. """ - batch_size = 64 for attribute in DENSE_FEATURIZABLE_ATTRIBUTES: @@ -715,7 +730,6 @@ def process(self, message: Message, **kwargs: Any) -> None: Args: message: Incoming message object """ - # process of all featurizers operates only on TEXT and ACTION_TEXT attributes, # because all other attributes are labels which are featurized during training # and their features are stored by the model itself. 
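After the changes above, `LanguageModelFeaturizer` no longer relies on `HFTransformersNLP` or `LanguageModelTokenizer`: it loads the language model itself, computes sub-token information, and only needs plain `tokens` from whichever tokenizer runs before it. Below is a minimal sketch of the decoupled usage, mirroring the updated tests later in this diff; the `bert`/`bert-base-uncased` values are just the ones those tests use, and the snippet assumes the `transformers` dependency is installed.

```python
from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer
from rasa.shared.nlu.training_data.message import Message
from rasa.shared.nlu.training_data.training_data import TrainingData

tokenizer = WhitespaceTokenizer()
featurizer = LanguageModelFeaturizer(
    {"model_name": "bert", "model_weights": "bert-base-uncased"}
)

message = Message.build(text="here is the sentence I want embeddings for.")
training_data = TrainingData([message])

tokenizer.train(training_data)   # sets whitespace tokens on the message
featurizer.train(training_data)  # adds sub-token counts and sequence/sentence features
```

`ConveRTFeaturizer` follows the same pattern: the tests below construct it directly with a `model_url` and obtain sub-tokens via `featurizer.tokenize(...)` instead of going through the deprecated `ConveRTTokenizer`.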
diff --git a/tests/nlu/featurizers/test_convert_featurizer.py b/tests/nlu/featurizers/test_convert_featurizer.py index e4b90d5d1347..b219c7618cdb 100644 --- a/tests/nlu/featurizers/test_convert_featurizer.py +++ b/tests/nlu/featurizers/test_convert_featurizer.py @@ -1,37 +1,41 @@ import numpy as np import pytest -from typing import Text +from typing import Text, Optional, List, Tuple +from pathlib import Path +import os from _pytest.monkeypatch import MonkeyPatch -from rasa.nlu.tokenizers.convert_tokenizer import ( - ConveRTTokenizer, - RESTRICTED_ACCESS_URL, -) +from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer from rasa.shared.nlu.training_data.training_data import TrainingData from rasa.shared.nlu.training_data.message import Message -from rasa.nlu.constants import TOKENS_NAMES +from rasa.nlu.constants import TOKENS_NAMES, NUMBER_OF_SUB_TOKENS from rasa.shared.nlu.constants import TEXT, INTENT, RESPONSE from rasa.nlu.config import RasaNLUModelConfig -from rasa.nlu.featurizers.dense_featurizer.convert_featurizer import ConveRTFeaturizer +from rasa.nlu.featurizers.dense_featurizer.convert_featurizer import ( + ConveRTFeaturizer, + RESTRICTED_ACCESS_URL, + ORIGINAL_TF_HUB_MODULE_URL, +) +from rasa.exceptions import RasaException @pytest.mark.skip_on_windows -def test_convert_featurizer_process(component_builder, monkeypatch: MonkeyPatch): +def test_convert_featurizer_process(monkeypatch: MonkeyPatch): + tokenizer = WhitespaceTokenizer() monkeypatch.setattr( - ConveRTTokenizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL + ConveRTFeaturizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL ) - - component_config = {"name": "ConveRTTokenizer", "model_url": RESTRICTED_ACCESS_URL} - tokenizer = ConveRTTokenizer(component_config) - featurizer = component_builder.create_component_from_class(ConveRTFeaturizer) - + component_config = {"name": "ConveRTFeaturizer", "model_url": RESTRICTED_ACCESS_URL} + featurizer = ConveRTFeaturizer(component_config) sentence = "Hey how are you today ?" - message = Message(data={TEXT: sentence}) - tokens = tokenizer.tokenize(message, attribute=TEXT) - message.set(TOKENS_NAMES[TEXT], tokens) + message = Message.build(text=sentence) - featurizer.process(message, tf_hub_module=tokenizer.module) + td = TrainingData([message]) + tokenizer.train(td) + tokens = featurizer.tokenize(message, attribute=TEXT) + + featurizer.process(message, tf_hub_module=featurizer.module) expected = np.array([2.2636216, -0.26475656, -1.1358104, -0.49751878, -1.3946456]) expected_cls = np.array( @@ -49,26 +53,29 @@ def test_convert_featurizer_process(component_builder, monkeypatch: MonkeyPatch) @pytest.mark.skip_on_windows -def test_convert_featurizer_train(component_builder, monkeypatch: MonkeyPatch): +def test_convert_featurizer_train(monkeypatch: MonkeyPatch): + tokenizer = WhitespaceTokenizer() monkeypatch.setattr( - ConveRTTokenizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL + ConveRTFeaturizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL ) - component_config = {"name": "ConveRTTokenizer", "model_url": RESTRICTED_ACCESS_URL} - tokenizer = ConveRTTokenizer(component_config) - featurizer = component_builder.create_component_from_class(ConveRTFeaturizer) + component_config = {"name": "ConveRTFeaturizer", "model_url": RESTRICTED_ACCESS_URL} + featurizer = ConveRTFeaturizer(component_config) sentence = "Hey how are you today ?" 
message = Message(data={TEXT: sentence}) message.set(RESPONSE, sentence) - tokens = tokenizer.tokenize(message, attribute=TEXT) + td = TrainingData([message]) + tokenizer.train(td) + + tokens = featurizer.tokenize(message, attribute=TEXT) message.set(TOKENS_NAMES[TEXT], tokens) message.set(TOKENS_NAMES[RESPONSE], tokens) featurizer.train( - TrainingData([message]), RasaNLUModelConfig(), tf_hub_module=tokenizer.module + TrainingData([message]), RasaNLUModelConfig(), tf_hub_module=featurizer.module ) expected = np.array([2.2636216, -0.26475656, -1.1358104, -0.49751878, -1.3946456]) @@ -114,14 +121,143 @@ def test_convert_featurizer_train(component_builder, monkeypatch: MonkeyPatch): def test_convert_featurizer_tokens_to_text( sentence: Text, expected_text: Text, monkeypatch: MonkeyPatch ): + tokenizer = WhitespaceTokenizer() monkeypatch.setattr( - ConveRTTokenizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL + ConveRTFeaturizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL ) - component_config = {"name": "ConveRTTokenizer", "model_url": RESTRICTED_ACCESS_URL} - tokenizer = ConveRTTokenizer(component_config) - tokens = tokenizer.tokenize(Message(data={TEXT: sentence}), attribute=TEXT) + component_config = {"name": "ConveRTFeaturizer", "model_url": RESTRICTED_ACCESS_URL} + featurizer = ConveRTFeaturizer(component_config) + message = Message.build(text=sentence) + td = TrainingData([message]) + tokenizer.train(td) + tokens = featurizer.tokenize(message, attribute=TEXT) actual_text = ConveRTFeaturizer._tokens_to_text([tokens])[0] assert expected_text == actual_text + + +@pytest.mark.skip_on_windows +@pytest.mark.parametrize( + "text, expected_tokens, expected_indices", + [ + ( + "forecast for lunch", + ["forecast", "for", "lunch"], + [(0, 8), (9, 12), (13, 18)], + ), + ("hello", ["hello"], [(0, 5)]), + ("you're", ["you", "re"], [(0, 3), (4, 6)]), + ("r. n. 
b.", ["r", "n", "b"], [(0, 1), (3, 4), (6, 7)]), + ("rock & roll", ["rock", "&", "roll"], [(0, 4), (5, 6), (7, 11)]), + ("ńöñàśçií", ["ńöñàśçií"], [(0, 8)]), + ], +) +def test_convert_featurizer_token_edge_cases( + text: Text, + expected_tokens: List[Text], + expected_indices: List[Tuple[int]], + monkeypatch: MonkeyPatch, +): + tokenizer = WhitespaceTokenizer() + + monkeypatch.setattr( + ConveRTFeaturizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL + ) + component_config = {"name": "ConveRTFeaturizer", "model_url": RESTRICTED_ACCESS_URL} + featurizer = ConveRTFeaturizer(component_config) + message = Message.build(text=text) + td = TrainingData([message]) + tokenizer.train(td) + tokens = featurizer.tokenize(message, attribute=TEXT) + + assert [t.text for t in tokens] == expected_tokens + assert [t.start for t in tokens] == [i[0] for i in expected_indices] + assert [t.end for t in tokens] == [i[1] for i in expected_indices] + + +@pytest.mark.skip_on_windows +@pytest.mark.parametrize( + "text, expected_number_of_sub_tokens", + [("Aarhus is a city", [2, 1, 1, 1]), ("sentence embeddings", [1, 3])], +) +def test_convert_featurizer_number_of_sub_tokens( + text: Text, expected_number_of_sub_tokens: List[int], monkeypatch: MonkeyPatch +): + tokenizer = WhitespaceTokenizer() + + monkeypatch.setattr( + ConveRTFeaturizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL + ) + component_config = {"name": "ConveRTFeaturizer", "model_url": RESTRICTED_ACCESS_URL} + featurizer = ConveRTFeaturizer(component_config) + + message = Message.build(text=text) + td = TrainingData([message]) + tokenizer.train(td) + + tokens = featurizer.tokenize(message, attribute=TEXT) + + assert [ + t.get(NUMBER_OF_SUB_TOKENS) for t in tokens + ] == expected_number_of_sub_tokens + + +@pytest.mark.skip_on_windows +@pytest.mark.parametrize( + "model_url, exception_phrase", + [ + (ORIGINAL_TF_HUB_MODULE_URL, "which does not contain the model any longer"), + ( + RESTRICTED_ACCESS_URL, + "which is strictly reserved for pytests of Rasa Open Source only", + ), + (None, """"model_url" was not specified in the configuration"""), + ("", """"model_url" was not specified in the configuration"""), + ], +) +def test_raise_invalid_urls(model_url: Optional[Text], exception_phrase: Text): + + component_config = {"name": "ConveRTFeaturizer", "model_url": model_url} + with pytest.raises(RasaException) as excinfo: + _ = ConveRTFeaturizer(component_config) + + assert exception_phrase in str(excinfo.value) + + +@pytest.mark.skip_on_windows +def test_raise_wrong_model_directory(tmp_path: Path): + + component_config = {"name": "ConveRTFeaturizer", "model_url": str(tmp_path)} + + with pytest.raises(RasaException) as excinfo: + _ = ConveRTFeaturizer(component_config) + + assert "Re-check the files inside the directory" in str(excinfo.value) + + +@pytest.mark.skip_on_windows +def test_raise_wrong_model_file(tmp_path: Path): + + # create a dummy file + temp_file = os.path.join(tmp_path, "saved_model.pb") + f = open(temp_file, "wb") + f.close() + component_config = {"name": "ConveRTFeaturizer", "model_url": temp_file} + + with pytest.raises(RasaException) as excinfo: + _ = ConveRTFeaturizer(component_config) + + assert "set to the path of a file which is invalid" in str(excinfo.value) + + +@pytest.mark.skip_on_windows +def test_raise_invalid_path(): + + component_config = {"name": "ConveRTFeaturizer", "model_url": "saved_model.pb"} + + with pytest.raises(RasaException) as excinfo: + _ = ConveRTFeaturizer(component_config) + + 
assert "neither a valid remote URL nor a local directory" in str(excinfo.value) diff --git a/tests/nlu/featurizers/test_lm_featurizer.py b/tests/nlu/featurizers/test_lm_featurizer.py index bb87f8f90a79..4acdc78c8de4 100644 --- a/tests/nlu/featurizers/test_lm_featurizer.py +++ b/tests/nlu/featurizers/test_lm_featurizer.py @@ -1,6 +1,20 @@ +from typing import Text, List + import numpy as np import pytest +import logging + +from _pytest.logging import LogCaptureFixture +from rasa.nlu.constants import ( + TOKENS_NAMES, + NUMBER_OF_SUB_TOKENS, + SEQUENCE_FEATURES, + SENTENCE_FEATURES, + LANGUAGE_MODEL_DOCS, +) +from rasa.nlu.tokenizers.lm_tokenizer import LanguageModelTokenizer +from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer from rasa.shared.nlu.training_data.training_data import TrainingData from rasa.shared.nlu.training_data.message import Message from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer @@ -173,17 +187,15 @@ def test_lm_featurizer_shape_values( model_name, texts, expected_shape, expected_sequence_vec, expected_cls_vec ): - transformers_config = {"model_name": model_name} + config = {"model_name": model_name} - transformers_nlp = HFTransformersNLP(transformers_config) - lm_featurizer = LanguageModelFeaturizer() + lm_featurizer = LanguageModelFeaturizer(config) messages = [] for text in texts: messages.append(Message.build(text=text)) td = TrainingData(messages) - transformers_nlp.train(td) lm_featurizer.train(td) for index in range(len(texts)): @@ -223,3 +235,531 @@ def test_lm_featurizer_shape_values( assert intent_sequence_vec is None assert intent_sentence_vec is None + + +@pytest.mark.parametrize( + "input_sequence_length, model_name, should_overflow", + [(20, "bert", False), (1000, "bert", True), (1000, "xlnet", False)], +) +def test_sequence_length_overflow_train( + input_sequence_length: int, model_name: Text, should_overflow: bool +): + component = LanguageModelFeaturizer( + {"model_name": model_name}, skip_model_load=True + ) + message = Message.build(text=" ".join(["hi"] * input_sequence_length)) + if should_overflow: + with pytest.raises(RuntimeError): + component._validate_sequence_lengths( + [input_sequence_length], [message], "text", inference_mode=False + ) + else: + component._validate_sequence_lengths( + [input_sequence_length], [message], "text", inference_mode=False + ) + + +@pytest.mark.parametrize( + "sequence_embeddings, actual_sequence_lengths, model_name, padding_needed", + [ + (np.ones((1, 512, 5)), [1000], "bert", True), + (np.ones((1, 512, 5)), [1000], "xlnet", False), + (np.ones((1, 256, 5)), [256], "bert", False), + ], +) +def test_long_sequences_extra_padding( + sequence_embeddings: np.ndarray, + actual_sequence_lengths: List[int], + model_name: Text, + padding_needed: bool, +): + component = LanguageModelFeaturizer( + {"model_name": model_name}, skip_model_load=True + ) + modified_sequence_embeddings = component._add_extra_padding( + sequence_embeddings, actual_sequence_lengths + ) + if not padding_needed: + assert np.all(modified_sequence_embeddings) == np.all(sequence_embeddings) + else: + assert modified_sequence_embeddings.shape[1] == actual_sequence_lengths[0] + assert ( + modified_sequence_embeddings[0].shape[-1] + == sequence_embeddings[0].shape[-1] + ) + zero_embeddings = modified_sequence_embeddings[0][ + sequence_embeddings.shape[1] : + ] + assert np.all(zero_embeddings == 0) + + +@pytest.mark.parametrize( + "token_ids, max_sequence_length_model, resulting_length, padding_added", 
+ [ + ([[1] * 200], 512, 512, True), + ([[1] * 700], 512, 512, False), + ([[1] * 200], 200, 200, False), + ], +) +def test_input_padding( + token_ids: List[List[int]], + max_sequence_length_model: int, + resulting_length: int, + padding_added: bool, +): + component = LanguageModelFeaturizer({"model_name": "bert"}, skip_model_load=True) + component.pad_token_id = 0 + padded_input = component._add_padding_to_batch(token_ids, max_sequence_length_model) + assert len(padded_input[0]) == resulting_length + if padding_added: + original_length = len(token_ids[0]) + assert np.all(np.array(padded_input[0][original_length:]) == 0) + + +@pytest.mark.parametrize( + "sequence_length, model_name, model_weights, should_overflow", + [ + (1000, "bert", "bert-base-uncased", True), + (256, "bert", "bert-base-uncased", False), + ], +) +@pytest.mark.skip_on_windows +def test_log_longer_sequence( + sequence_length: int, + model_name: Text, + model_weights: Text, + should_overflow: bool, + caplog, +): + config = {"model_name": model_name, "model_weights": model_weights} + + featurizer = LanguageModelFeaturizer(config) + + text = " ".join(["hi"] * sequence_length) + tokenizer = WhitespaceTokenizer() + message = Message.build(text=text) + td = TrainingData([message]) + tokenizer.train(td) + caplog.set_level(logging.DEBUG) + featurizer.process(message) + if should_overflow: + assert "hi hi hi" in caplog.text + assert len(message.features) >= 2 + + +@pytest.mark.parametrize( + "actual_sequence_length, max_input_sequence_length, zero_start_index", + [(256, 512, 256), (700, 700, 700), (700, 512, 512)], +) +def test_attention_mask( + actual_sequence_length: int, max_input_sequence_length: int, zero_start_index: int +): + component = LanguageModelFeaturizer({"model_name": "bert"}, skip_model_load=True) + + attention_mask = component._compute_attention_mask( + [actual_sequence_length], max_input_sequence_length + ) + mask_ones = attention_mask[0][:zero_start_index] + mask_zeros = attention_mask[0][zero_start_index:] + + assert np.all(mask_ones == 1) + assert np.all(mask_zeros == 0) + + +# TODO: need to fix this failing test +@pytest.mark.skip(reason="Results in random crashing of github action workers") +@pytest.mark.parametrize( + "model_name, model_weights, texts, expected_tokens, expected_indices", + [ + ( + "bert", + None, + [ + "Good evening.", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["good", "evening"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sentence", + "i", + "want", + "em", + "bed", + "ding", + "s", + "for", + ], + ], + [ + [(0, 4), (5, 12)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 20), + (21, 22), + (23, 27), + (28, 30), + (30, 33), + (33, 37), + (37, 38), + (39, 42), + ], + ], + ), + ( + "bert", + "bert-base-chinese", + [ + "晚上好", # normal & easy case + "没问题!", # `!` is a Chinese punctuation + "去东畈村", # `畈` is a OOV token for bert-base-chinese + "好的😃", # include a emoji which is common in Chinese text-based chat + ], + [ + ["晚", "上", "好"], + ["没", "问", "题", "!"], + ["去", "东", "畈", "村"], + ["好", "的", "😃"], + ], + [ + [(0, 1), (1, 2), (2, 3)], + [(0, 1), (1, 2), (2, 3), (3, 4)], + [(0, 1), (1, 2), (2, 3), (3, 4)], + [(0, 1), (1, 2), (2, 3)], + ], + ), + ( + "gpt", + None, + [ + "Good evening.", + "hello", + "you're", + "r. n. 
b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["good", "evening"], + ["hello"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + ["here", "is", "the", "sentence", "i", "want", "embe", "ddings", "for"], + ], + [ + [(0, 4), (5, 12)], + [(0, 5)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 20), + (21, 22), + (23, 27), + (28, 32), + (32, 38), + (39, 42), + ], + ], + ), + ( + "gpt2", + None, + [ + "Good evening.", + "hello", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["Good", "even", "ing"], + ["hello"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sent", + "ence", + "I", + "want", + "embed", + "d", + "ings", + "for", + ], + ], + [ + [(0, 4), (5, 9), (9, 12)], + [(0, 5)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 16), + (16, 20), + (21, 22), + (23, 27), + (28, 33), + (33, 34), + (34, 38), + (39, 42), + ], + ], + ), + ( + "xlnet", + None, + [ + "Good evening.", + "hello", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["Good", "evening"], + ["hello"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sentence", + "I", + "want", + "embed", + "ding", + "s", + "for", + ], + ], + [4, 3, 4, 5, 5, 12], + ), + ( + "distilbert", + None, + [ + "Good evening.", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["good", "evening"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sentence", + "i", + "want", + "em", + "bed", + "ding", + "s", + "for", + ], + ], + [ + [(0, 4), (5, 12)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 20), + (21, 22), + (23, 27), + (28, 30), + (30, 33), + (33, 37), + (37, 38), + (39, 42), + ], + ], + ), + ( + "roberta", + None, + [ + "Good evening.", + "hello", + "you're", + "r. n. 
b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["Good", "even", "ing"], + ["hello"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sent", + "ence", + "I", + "want", + "embed", + "d", + "ings", + "for", + ], + ], + [ + [(0, 4), (5, 9), (9, 12)], + [(0, 5)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 16), + (16, 20), + (21, 22), + (23, 27), + (28, 33), + (33, 34), + (34, 38), + (39, 42), + ], + ], + ), + ], +) +@pytest.mark.skip_on_windows +def test_lm_featurizer_edge_cases( + model_name, model_weights, texts, expected_tokens, expected_indices +): + + if model_weights is None: + model_weights_config = {} + else: + model_weights_config = {"model_weights": model_weights} + transformers_config = {**{"model_name": model_name}, **model_weights_config} + + lm_featurizer = LanguageModelFeaturizer(transformers_config) + whitespace_tokenizer = WhitespaceTokenizer() + + for text, gt_tokens, gt_indices in zip(texts, expected_tokens, expected_indices): + + message = Message.build(text=text) + tokens = whitespace_tokenizer.tokenize(message, TEXT) + message.set(TOKENS_NAMES[TEXT], tokens) + lm_featurizer.process(message) + + assert [t.text for t in tokens] == gt_tokens + assert [t.start for t in tokens] == [i[0] for i in gt_indices] + assert [t.end for t in tokens] == [i[1] for i in gt_indices] + + +@pytest.mark.parametrize( + "text, expected_number_of_sub_tokens", + [("sentence embeddings", [1, 4]), ("this is a test", [1, 1, 1, 1])], +) +def test_lm_featurizer_number_of_sub_tokens(text, expected_number_of_sub_tokens): + config = { + "model_name": "bert", + "model_weights": "bert-base-uncased", + } # Test for one should be enough + + lm_featurizer = LanguageModelFeaturizer(config) + whitespace_tokenizer = WhitespaceTokenizer() + + message = Message.build(text=text) + + td = TrainingData([message]) + whitespace_tokenizer.train(td) + lm_featurizer.train(td) + + assert [ + t.get(NUMBER_OF_SUB_TOKENS) for t in message.get(TOKENS_NAMES[TEXT]) + ] == expected_number_of_sub_tokens + + +@pytest.mark.parametrize("text", [("hi there")]) +def test_log_deprecation_warning_with_old_config(text: str, caplog: LogCaptureFixture): + message = Message.build(text) + + transformers_nlp = HFTransformersNLP( + {"model_name": "bert", "model_weights": "bert-base-uncased"} + ) + transformers_nlp.process(message) + + caplog.set_level(logging.DEBUG) + lm_tokenizer = LanguageModelTokenizer() + lm_tokenizer.process(message) + lm_featurizer = LanguageModelFeaturizer(skip_model_load=True) + caplog.clear() + with caplog.at_level(logging.DEBUG): + lm_featurizer.process(message) + + assert "deprecated component HFTransformersNLP" in caplog.text + + +@pytest.mark.skip(reason="Results in random crashing of github action workers") +def test_preserve_sentence_and_sequence_features_old_config(): + attribute = "text" + message = Message.build("hi there") + + transformers_nlp = HFTransformersNLP( + {"model_name": "bert", "model_weights": "bert-base-uncased"} + ) + transformers_nlp.process(message) + lm_tokenizer = LanguageModelTokenizer() + lm_tokenizer.process(message) + + lm_featurizer = LanguageModelFeaturizer({"model_name": "gpt2"}) + lm_featurizer.process(message) + + message.set(LANGUAGE_MODEL_DOCS[attribute], None) + lm_docs = lm_featurizer._get_docs_for_batch( + [message], attribute=attribute, inference_mode=True + )[0] + hf_docs = transformers_nlp._get_docs_for_batch( + 
[message], attribute=attribute, inference_mode=True + )[0] + assert not (message.features[0].features == lm_docs[SEQUENCE_FEATURES]).any() + assert not (message.features[1].features == lm_docs[SENTENCE_FEATURES]).any() + assert (message.features[0].features == hf_docs[SEQUENCE_FEATURES]).all() + assert (message.features[1].features == hf_docs[SENTENCE_FEATURES]).all() diff --git a/tests/nlu/test_config.py b/tests/nlu/test_config.py index 0b052c9a6286..d682d5d490a1 100644 --- a/tests/nlu/test_config.py +++ b/tests/nlu/test_config.py @@ -54,7 +54,7 @@ def test_invalid_many_tokenizers_in_config(): { "pipeline": [ {"name": "WhitespaceTokenizer"}, - {"name": "LanguageModelFeaturizer"}, + {"name": "MitieIntentClassifier"}, ] } ), diff --git a/tests/nlu/test_train.py b/tests/nlu/test_train.py index 12f0520d1200..459a93933950 100644 --- a/tests/nlu/test_train.py +++ b/tests/nlu/test_train.py @@ -7,12 +7,16 @@ from rasa.shared.nlu.training_data.training_data import TrainingData from rasa.utils.tensorflow.constants import EPOCHS from tests.nlu.conftest import DEFAULT_DATA_PATH -from typing import Any, Dict, List, Tuple, Text, Union, Optional +from typing import Any, Dict, List, Tuple, Text, Union COMPONENTS_TEST_PARAMS = { "DIETClassifier": {EPOCHS: 1}, "ResponseSelector": {EPOCHS: 1}, "HFTransformersNLP": {"model_name": "bert", "model_weights": "bert-base-uncased"}, + "LanguageModelFeaturizer": { + "model_name": "bert", + "model_weights": "bert-base-uncased", + }, } @@ -112,8 +116,8 @@ def pipelines_for_non_windows_tests() -> List[Tuple[Text, List[Dict[Text, Any]]] def test_all_components_are_in_at_least_one_test_pipeline(): """There is a template that includes all components to test the train-persist-load-use cycle. Ensures that - really all components are in there.""" - + really all components are in there. + """ all_pipelines = pipelines_for_tests() + pipelines_for_non_windows_tests() all_components = [c["name"] for _, p in all_pipelines for c in p] diff --git a/tests/nlu/tokenizers/test_convert_tokenizer.py b/tests/nlu/tokenizers/test_convert_tokenizer.py deleted file mode 100644 index ca2770cae6b9..000000000000 --- a/tests/nlu/tokenizers/test_convert_tokenizer.py +++ /dev/null @@ -1,169 +0,0 @@ -import pytest -from typing import Text, List, Tuple, Optional -from pathlib import Path -import os -from _pytest.monkeypatch import MonkeyPatch - -from rasa.shared.nlu.training_data.training_data import TrainingData -from rasa.shared.nlu.training_data.message import Message -from rasa.nlu.constants import TOKENS_NAMES, NUMBER_OF_SUB_TOKENS -from rasa.shared.nlu.constants import TEXT, INTENT -from rasa.nlu.tokenizers.convert_tokenizer import ( - ConveRTTokenizer, - RESTRICTED_ACCESS_URL, - ORIGINAL_TF_HUB_MODULE_URL, -) -from rasa.exceptions import RasaException - - -@pytest.mark.skip_on_windows -@pytest.mark.parametrize( - "text, expected_tokens, expected_indices", - [ - ( - "forecast for lunch", - ["forecast", "for", "lunch"], - [(0, 8), (9, 12), (13, 18)], - ), - ("hello", ["hello"], [(0, 5)]), - ("you're", ["you", "re"], [(0, 3), (4, 6)]), - ("r. n. 
b.", ["r", "n", "b"], [(0, 1), (3, 4), (6, 7)]), - ("rock & roll", ["rock", "&", "roll"], [(0, 4), (5, 6), (7, 11)]), - ("ńöñàśçií", ["ńöñàśçií"], [(0, 8)]), - ], -) -def test_convert_tokenizer_edge_cases( - text: Text, - expected_tokens: List[Text], - expected_indices: List[Tuple[int]], - monkeypatch: MonkeyPatch, -): - - monkeypatch.setattr( - ConveRTTokenizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL - ) - - component_config = {"name": "ConveRTTokenizer", "model_url": RESTRICTED_ACCESS_URL} - tokenizer = ConveRTTokenizer(component_config) - - tokens = tokenizer.tokenize(Message(data={TEXT: text}), attribute=TEXT) - - assert [t.text for t in tokens] == expected_tokens - assert [t.start for t in tokens] == [i[0] for i in expected_indices] - assert [t.end for t in tokens] == [i[1] for i in expected_indices] - - -@pytest.mark.skip_on_windows -@pytest.mark.parametrize( - "text, expected_tokens", - [ - ("Forecast_for_LUNCH", ["Forecast_for_LUNCH"]), - ("Forecast for LUNCH", ["Forecast for LUNCH"]), - ], -) -def test_custom_intent_symbol( - text: Text, expected_tokens: List[Text], monkeypatch: MonkeyPatch -): - - monkeypatch.setattr( - ConveRTTokenizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL - ) - - component_config = { - "name": "ConveRTTokenizer", - "model_url": RESTRICTED_ACCESS_URL, - "intent_tokenization": True, - "intent_split_symbol": "+", - } - - tokenizer = ConveRTTokenizer(component_config) - - message = Message(data={TEXT: text}) - message.set(INTENT, text) - - tokenizer.train(TrainingData([message])) - - assert [t.text for t in message.get(TOKENS_NAMES[INTENT])] == expected_tokens - - -@pytest.mark.skip_on_windows -@pytest.mark.parametrize( - "text, expected_number_of_sub_tokens", - [("Aarhus is a city", [2, 1, 1, 1]), ("sentence embeddings", [1, 3])], -) -def test_convert_tokenizer_number_of_sub_tokens( - text: Text, expected_number_of_sub_tokens: List[int], monkeypatch: MonkeyPatch -): - monkeypatch.setattr( - ConveRTTokenizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL - ) - component_config = {"name": "ConveRTTokenizer", "model_url": RESTRICTED_ACCESS_URL} - tokenizer = ConveRTTokenizer(component_config) - - message = Message(data={TEXT: text}) - message.set(INTENT, text) - - tokenizer.train(TrainingData([message])) - - assert [ - t.get(NUMBER_OF_SUB_TOKENS) for t in message.get(TOKENS_NAMES[TEXT]) - ] == expected_number_of_sub_tokens - - -@pytest.mark.skip_on_windows -@pytest.mark.parametrize( - "model_url, exception_phrase", - [ - (ORIGINAL_TF_HUB_MODULE_URL, "which does not contain the model any longer"), - ( - RESTRICTED_ACCESS_URL, - "which is strictly reserved for pytests of Rasa Open Source only", - ), - (None, """"model_url" was not specified in the configuration"""), - ("", """"model_url" was not specified in the configuration"""), - ], -) -def test_raise_invalid_urls(model_url: Optional[Text], exception_phrase: Text): - - component_config = {"name": "ConveRTTokenizer", "model_url": model_url} - with pytest.raises(RasaException) as excinfo: - _ = ConveRTTokenizer(component_config) - - assert exception_phrase in str(excinfo.value) - - -@pytest.mark.skip_on_windows -def test_raise_wrong_model_directory(tmp_path: Path): - - component_config = {"name": "ConveRTTokenizer", "model_url": str(tmp_path)} - - with pytest.raises(RasaException) as excinfo: - _ = ConveRTTokenizer(component_config) - - assert "Re-check the files inside the directory" in str(excinfo.value) - - -@pytest.mark.skip_on_windows -def 
test_raise_wrong_model_file(tmp_path: Path): - - # create a dummy file - temp_file = os.path.join(tmp_path, "saved_model.pb") - f = open(temp_file, "wb") - f.close() - component_config = {"name": "ConveRTTokenizer", "model_url": temp_file} - - with pytest.raises(RasaException) as excinfo: - _ = ConveRTTokenizer(component_config) - - assert "set to the path of a file which is invalid" in str(excinfo.value) - - -@pytest.mark.skip_on_windows -def test_raise_invalid_path(): - - component_config = {"name": "ConveRTTokenizer", "model_url": "saved_model.pb"} - - with pytest.raises(RasaException) as excinfo: - _ = ConveRTTokenizer(component_config) - - assert "neither a valid remote URL nor a local directory" in str(excinfo.value) diff --git a/tests/nlu/tokenizers/test_lm_tokenizer.py b/tests/nlu/tokenizers/test_lm_tokenizer.py deleted file mode 100644 index 74ed9e87328d..000000000000 --- a/tests/nlu/tokenizers/test_lm_tokenizer.py +++ /dev/null @@ -1,430 +0,0 @@ -import pytest - -from rasa.shared.nlu.training_data.training_data import TrainingData -from rasa.shared.nlu.training_data.message import Message -from rasa.nlu.constants import ( - TOKENS_NAMES, - LANGUAGE_MODEL_DOCS, - TOKEN_IDS, - NUMBER_OF_SUB_TOKENS, -) -from rasa.shared.nlu.constants import TEXT, INTENT -from rasa.nlu.tokenizers.lm_tokenizer import LanguageModelTokenizer -from rasa.nlu.utils.hugging_face.hf_transformers import HFTransformersNLP - - -# TODO: need to fix this failing test -@pytest.mark.skip(reason="Results in random crashing of github action workers") -@pytest.mark.parametrize( - "model_name, model_weights, texts, expected_tokens, expected_indices, expected_num_token_ids", - [ - ( - "bert", - None, - [ - "Good evening.", - "you're", - "r. n. b.", - "rock & roll", - "here is the sentence I want embeddings for.", - ], - [ - ["good", "evening"], - ["you", "re"], - ["r", "n", "b"], - ["rock", "&", "roll"], - [ - "here", - "is", - "the", - "sentence", - "i", - "want", - "em", - "bed", - "ding", - "s", - "for", - ], - ], - [ - [(0, 4), (5, 12)], - [(0, 3), (4, 6)], - [(0, 1), (3, 4), (6, 7)], - [(0, 4), (5, 6), (7, 11)], - [ - (0, 4), - (5, 7), - (8, 11), - (12, 20), - (21, 22), - (23, 27), - (28, 30), - (30, 33), - (33, 37), - (37, 38), - (39, 42), - ], - ], - [4, 4, 5, 5, 13], - ), - ( - "bert", - "bert-base-chinese", - [ - "晚上好", # normal & easy case - "没问题!", # `!` is a Chinese punctuation - "去东畈村", # `畈` is a OOV token for bert-base-chinese - "好的😃", # include a emoji which is common in Chinese text-based chat - ], - [ - ["晚", "上", "好"], - ["没", "问", "题", "!"], - ["去", "东", "畈", "村"], - ["好", "的", "😃"], - ], - [ - [(0, 1), (1, 2), (2, 3)], - [(0, 1), (1, 2), (2, 3), (3, 4)], - [(0, 1), (1, 2), (2, 3), (3, 4)], - [(0, 1), (1, 2), (2, 3)], - ], - [3, 4, 4, 3], - ), - ( - "gpt", - None, - [ - "Good evening.", - "hello", - "you're", - "r. n. b.", - "rock & roll", - "here is the sentence I want embeddings for.", - ], - [ - ["good", "evening"], - ["hello"], - ["you", "re"], - ["r", "n", "b"], - ["rock", "&", "roll"], - ["here", "is", "the", "sentence", "i", "want", "embe", "ddings", "for"], - ], - [ - [(0, 4), (5, 12)], - [(0, 5)], - [(0, 3), (4, 6)], - [(0, 1), (3, 4), (6, 7)], - [(0, 4), (5, 6), (7, 11)], - [ - (0, 4), - (5, 7), - (8, 11), - (12, 20), - (21, 22), - (23, 27), - (28, 32), - (32, 38), - (39, 42), - ], - ], - [2, 1, 2, 3, 3, 9], - ), - ( - "gpt2", - None, - [ - "Good evening.", - "hello", - "you're", - "r. n. 
b.", - "rock & roll", - "here is the sentence I want embeddings for.", - ], - [ - ["Good", "even", "ing"], - ["hello"], - ["you", "re"], - ["r", "n", "b"], - ["rock", "&", "roll"], - [ - "here", - "is", - "the", - "sent", - "ence", - "I", - "want", - "embed", - "d", - "ings", - "for", - ], - ], - [ - [(0, 4), (5, 9), (9, 12)], - [(0, 5)], - [(0, 3), (4, 6)], - [(0, 1), (3, 4), (6, 7)], - [(0, 4), (5, 6), (7, 11)], - [ - (0, 4), - (5, 7), - (8, 11), - (12, 16), - (16, 20), - (21, 22), - (23, 27), - (28, 33), - (33, 34), - (34, 38), - (39, 42), - ], - ], - [3, 1, 2, 3, 3, 11], - ), - ( - "xlnet", - None, - [ - "Good evening.", - "hello", - "you're", - "r. n. b.", - "rock & roll", - "here is the sentence I want embeddings for.", - ], - [ - ["Good", "evening"], - ["hello"], - ["you", "re"], - ["r", "n", "b"], - ["rock", "&", "roll"], - [ - "here", - "is", - "the", - "sentence", - "I", - "want", - "embed", - "ding", - "s", - "for", - ], - ], - [ - [(0, 4), (5, 12)], - [(0, 5)], - [(0, 3), (4, 6)], - [(0, 1), (3, 4), (6, 7)], - [(0, 4), (5, 6), (7, 11)], - [ - (0, 4), - (5, 7), - (8, 11), - (12, 20), - (21, 22), - (23, 27), - (28, 33), - (33, 37), - (37, 38), - (39, 42), - ], - ], - [4, 3, 4, 5, 5, 12], - ), - ( - "distilbert", - None, - [ - "Good evening.", - "you're", - "r. n. b.", - "rock & roll", - "here is the sentence I want embeddings for.", - ], - [ - ["good", "evening"], - ["you", "re"], - ["r", "n", "b"], - ["rock", "&", "roll"], - [ - "here", - "is", - "the", - "sentence", - "i", - "want", - "em", - "bed", - "ding", - "s", - "for", - ], - ], - [ - [(0, 4), (5, 12)], - [(0, 3), (4, 6)], - [(0, 1), (3, 4), (6, 7)], - [(0, 4), (5, 6), (7, 11)], - [ - (0, 4), - (5, 7), - (8, 11), - (12, 20), - (21, 22), - (23, 27), - (28, 30), - (30, 33), - (33, 37), - (37, 38), - (39, 42), - ], - ], - [4, 4, 5, 5, 13], - ), - ( - "roberta", - None, - [ - "Good evening.", - "hello", - "you're", - "r. n. 
b.", - "rock & roll", - "here is the sentence I want embeddings for.", - ], - [ - ["Good", "even", "ing"], - ["hello"], - ["you", "re"], - ["r", "n", "b"], - ["rock", "&", "roll"], - [ - "here", - "is", - "the", - "sent", - "ence", - "I", - "want", - "embed", - "d", - "ings", - "for", - ], - ], - [ - [(0, 4), (5, 9), (9, 12)], - [(0, 5)], - [(0, 3), (4, 6)], - [(0, 1), (3, 4), (6, 7)], - [(0, 4), (5, 6), (7, 11)], - [ - (0, 4), - (5, 7), - (8, 11), - (12, 16), - (16, 20), - (21, 22), - (23, 27), - (28, 33), - (33, 34), - (34, 38), - (39, 42), - ], - ], - [5, 3, 4, 5, 5, 13], - ), - ], -) -@pytest.mark.skip_on_windows -def test_lm_tokenizer_edge_cases( - model_name, - model_weights, - texts, - expected_tokens, - expected_indices, - expected_num_token_ids, -): - - if model_weights is None: - model_weights_config = {} - else: - model_weights_config = {"model_weights": model_weights} - transformers_config = {**{"model_name": model_name}, **model_weights_config} - - transformers_nlp = HFTransformersNLP(transformers_config) - lm_tokenizer = LanguageModelTokenizer() - - for text, gt_tokens, gt_indices, gt_num_indices in zip( - texts, expected_tokens, expected_indices, expected_num_token_ids - ): - - message = Message.build(text=text) - transformers_nlp.process(message) - tokens = lm_tokenizer.tokenize(message, TEXT) - token_ids = message.get(LANGUAGE_MODEL_DOCS[TEXT])[TOKEN_IDS] - - assert [t.text for t in tokens] == gt_tokens - assert [t.start for t in tokens] == [i[0] for i in gt_indices] - assert [t.end for t in tokens] == [i[1] for i in gt_indices] - assert len(token_ids) == gt_num_indices - - -@pytest.mark.parametrize( - "text, expected_tokens", - [ - ("Forecast_for_LUNCH", ["Forecast_for_LUNCH"]), - ("Forecast for LUNCH", ["Forecast for LUNCH"]), - ("Forecast+for+LUNCH", ["Forecast", "for", "LUNCH"]), - ], -) -@pytest.mark.skip_on_windows -def test_lm_tokenizer_custom_intent_symbol(text, expected_tokens): - component_config = {"intent_tokenization_flag": True, "intent_split_symbol": "+"} - - transformers_config = { - "model_name": "bert", - "model_weights": "bert-base-uncased", - } # Test for one should be enough - - transformers_nlp = HFTransformersNLP(transformers_config) - lm_tokenizer = LanguageModelTokenizer(component_config) - - message = Message.build(text=text) - message.set(INTENT, text) - - td = TrainingData([message]) - - transformers_nlp.train(td) - lm_tokenizer.train(td) - - assert [t.text for t in message.get(TOKENS_NAMES[INTENT])] == expected_tokens - - -@pytest.mark.parametrize( - "text, expected_number_of_sub_tokens", - [("sentence embeddings", [1, 4]), ("this is a test", [1, 1, 1, 1])], -) -@pytest.mark.skip_on_windows -def test_lm_tokenizer_number_of_sub_tokens(text, expected_number_of_sub_tokens): - transformers_config = { - "model_name": "bert", - "model_weights": "bert-base-uncased", - } # Test for one should be enough - - transformers_nlp = HFTransformersNLP(transformers_config) - lm_tokenizer = LanguageModelTokenizer() - - message = Message.build(text=text) - - td = TrainingData([message]) - - transformers_nlp.train(td) - lm_tokenizer.train(td) - - assert [ - t.get(NUMBER_OF_SUB_TOKENS) for t in message.get(TOKENS_NAMES[TEXT]) - ] == expected_number_of_sub_tokens diff --git a/tests/nlu/utils/test_hf_transformers.py b/tests/nlu/utils/test_hf_transformers.py index 82949054f8f2..89362c822ca3 100644 --- a/tests/nlu/utils/test_hf_transformers.py +++ b/tests/nlu/utils/test_hf_transformers.py @@ -5,6 +5,9 @@ from rasa.nlu.utils.hugging_face.hf_transformers import 
HFTransformersNLP from rasa.shared.nlu.training_data.message import Message +from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer +from rasa.nlu.constants import TOKENS_NAMES +from rasa.shared.nlu.constants import TEXT @pytest.mark.parametrize( @@ -14,7 +17,6 @@ def test_sequence_length_overflow_train( input_sequence_length: int, model_name: Text, should_overflow: bool ): - component = HFTransformersNLP({"model_name": model_name}, skip_model_load=True) message = Message.build(text=" ".join(["hi"] * input_sequence_length)) if should_overflow: @@ -42,7 +44,6 @@ def test_long_sequences_extra_padding( model_name: Text, padding_needed: bool, ): - component = HFTransformersNLP({"model_name": model_name}, skip_model_load=True) modified_sequence_embeddings = component._add_extra_padding( sequence_embeddings, actual_sequence_lengths @@ -91,7 +92,6 @@ def test_input_padding( "sequence_length, model_name, should_overflow", [(1000, "bert", True), (256, "bert", False)], ) -@pytest.mark.skip_on_windows def test_log_longer_sequence( sequence_length: int, model_name: Text, should_overflow: bool, caplog ): @@ -132,3 +132,330 @@ def test_attention_mask( assert np.all(mask_ones == 1) assert np.all(mask_zeros == 0) + + +# TODO: need to fix this failing test +@pytest.mark.skip(reason="Results in random crashing of github action workers") +@pytest.mark.parametrize( + "model_name, model_weights, texts, expected_tokens, expected_indices", + [ + ( + "bert", + None, + [ + "Good evening.", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["good", "evening"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sentence", + "i", + "want", + "em", + "bed", + "ding", + "s", + "for", + ], + ], + [ + [(0, 4), (5, 12)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 20), + (21, 22), + (23, 27), + (28, 30), + (30, 33), + (33, 37), + (37, 38), + (39, 42), + ], + ], + ), + ( + "bert", + "bert-base-chinese", + [ + "晚上好", # normal & easy case + "没问题!", # `!` is a Chinese punctuation + "去东畈村", # `畈` is a OOV token for bert-base-chinese + "好的😃", # include a emoji which is common in Chinese text-based chat + ], + [ + ["晚", "上", "好"], + ["没", "问", "题", "!"], + ["去", "东", "畈", "村"], + ["好", "的", "😃"], + ], + [ + [(0, 1), (1, 2), (2, 3)], + [(0, 1), (1, 2), (2, 3), (3, 4)], + [(0, 1), (1, 2), (2, 3), (3, 4)], + [(0, 1), (1, 2), (2, 3)], + ], + ), + ( + "gpt", + None, + [ + "Good evening.", + "hello", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["good", "evening"], + ["hello"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + ["here", "is", "the", "sentence", "i", "want", "embe", "ddings", "for"], + ], + [ + [(0, 4), (5, 12)], + [(0, 5)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 20), + (21, 22), + (23, 27), + (28, 32), + (32, 38), + (39, 42), + ], + ], + ), + ( + "gpt2", + None, + [ + "Good evening.", + "hello", + "you're", + "r. n. 
b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["Good", "even", "ing"], + ["hello"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sent", + "ence", + "I", + "want", + "embed", + "d", + "ings", + "for", + ], + ], + [ + [(0, 4), (5, 9), (9, 12)], + [(0, 5)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 16), + (16, 20), + (21, 22), + (23, 27), + (28, 33), + (33, 34), + (34, 38), + (39, 42), + ], + ], + ), + ( + "xlnet", + None, + [ + "Good evening.", + "hello", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["Good", "evening"], + ["hello"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sentence", + "I", + "want", + "embed", + "ding", + "s", + "for", + ], + ], + [4, 3, 4, 5, 5, 12], + ), + ( + "distilbert", + None, + [ + "Good evening.", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["good", "evening"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sentence", + "i", + "want", + "em", + "bed", + "ding", + "s", + "for", + ], + ], + [ + [(0, 4), (5, 12)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 20), + (21, 22), + (23, 27), + (28, 30), + (30, 33), + (33, 37), + (37, 38), + (39, 42), + ], + ], + ), + ( + "roberta", + None, + [ + "Good evening.", + "hello", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["Good", "even", "ing"], + ["hello"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sent", + "ence", + "I", + "want", + "embed", + "d", + "ings", + "for", + ], + ], + [ + [(0, 4), (5, 9), (9, 12)], + [(0, 5)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 16), + (16, 20), + (21, 22), + (23, 27), + (28, 33), + (33, 34), + (34, 38), + (39, 42), + ], + ], + ), + ], +) +@pytest.mark.skip_on_windows +def test_hf_transformer_edge_cases( + model_name, model_weights, texts, expected_tokens, expected_indices +): + + if model_weights is None: + model_weights_config = {} + else: + model_weights_config = {"model_weights": model_weights} + transformers_config = {**{"model_name": model_name}, **model_weights_config} + + hf_transformer = HFTransformersNLP(transformers_config) + whitespace_tokenizer = WhitespaceTokenizer() + + for text, gt_tokens, gt_indices in zip(texts, expected_tokens, expected_indices): + + message = Message.build(text=text) + tokens = whitespace_tokenizer.tokenize(message, TEXT) + message.set(TOKENS_NAMES[TEXT], tokens) + hf_transformer.process(message) + + assert [t.text for t in tokens] == gt_tokens + assert [t.start for t in tokens] == [i[0] for i in gt_indices] + assert [t.end for t in tokens] == [i[1] for i in gt_indices]