diff --git a/changelog/7027.improvement.md b/changelog/7027.improvement.md new file mode 100644 index 000000000000..baaa4813790e --- /dev/null +++ b/changelog/7027.improvement.md @@ -0,0 +1,6 @@ +Remove dependency between `ConveRTTokenizer` and `ConveRTFeaturizer`. The `ConveRTTokenizer` is now deprecated, and the +`ConveRTFeaturizer` can be used with any other `Tokenizer`. + +Remove dependency between `HFTransformersNLP`, `LanguageModelTokenizer`, and `LanguageModelFeaturizer`. Both +`HFTransformersNLP` and `LanguageModelTokenizer` are now deprecated. `LanguageModelFeaturizer` implements the behavior +of the stack and can be used with any other `Tokenizer`. diff --git a/docs/docs/components.mdx b/docs/docs/components.mdx index c3cf7a003e24..62e308ab6c24 100644 --- a/docs/docs/components.mdx +++ b/docs/docs/components.mdx @@ -139,6 +139,10 @@ word vectors in your pipeline. ### HFTransformersNLP +:::caution Deprecated +The `HFTransformersNLP` is deprecated and will be removed in a future release. The [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) +now implements its behavior. +::: * **Short** @@ -406,6 +410,10 @@ word vectors in your pipeline. ### ConveRTTokenizer +:::caution Deprecated +The `ConveRTTokenizer` is deprecated and will be removed in a future release. The [ConveRTFeaturizer](./components.mdx#convertfeaturizer) +now implements its behavior. Any [tokenizer](./components.mdx#tokenizers) can be used in its place. +::: * **Short** @@ -466,42 +474,46 @@ word vectors in your pipeline. ### LanguageModelTokenizer +:::caution Deprecated +The `LanguageModelTokenizer` is deprecated and will be removed in a future release. The [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) +now implements its behavior. Any [tokenizer](./components.mdx#tokenizers) can be used in its place. +::: - * **Short** +* **Short** - Tokenizer from pre-trained language models +Tokenizer from pre-trained language models - * **Outputs** +* **Outputs** - `tokens` for user messages, responses (if present), and intents (if specified) +`tokens` for user messages, responses (if present), and intents (if specified) - * **Requires** +* **Requires** - [HFTransformersNLP](./components.mdx#hftransformersnlp) +[HFTransformersNLP](./components.mdx#hftransformersnlp) - * **Description** +* **Description** - Creates tokens using the pre-trained language model specified in upstream [HFTransformersNLP](./components.mdx#hftransformersnlp) component. - Must be used whenever the [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) is used. +Creates tokens using the pre-trained language model specified in upstream [HFTransformersNLP](./components.mdx#hftransformersnlp) component. +Must be used whenever the [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) is used. 
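+
+A minimal sketch of the replacement setup (mirroring the migration guide in this change): any
+[tokenizer](./components.mdx#tokenizers), for example the `WhitespaceTokenizer`, can be combined with the
+[LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer). The `"bert"` / `"rasa/LaBSE"` values below are
+the defaults listed in the table further down and are shown here only for illustration:
+
+```yaml-rasa
+pipeline:
+  - name: WhitespaceTokenizer
+  - name: LanguageModelFeaturizer
+    # Name of the language model to use
+    model_name: "bert"
+    # Pre-trained weights to be loaded
+    model_weights: "rasa/LaBSE"
+```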
- * **Configuration** +* **Configuration** - ```yaml-rasa - pipeline: - - name: "LanguageModelTokenizer" - # Flag to check whether to split intents - "intent_tokenization_flag": False - # Symbol on which intent should be split - "intent_split_symbol": "_" - ``` +```yaml-rasa +pipeline: +- name: "LanguageModelTokenizer" + # Flag to check whether to split intents + "intent_tokenization_flag": False + # Symbol on which intent should be split + "intent_split_symbol": "_" +``` ## Featurizers @@ -644,7 +656,7 @@ Note: The `feature-dimension` for sequence and sentence features does not have t * **Requires** - [ConveRTTokenizer](./components.mdx#converttokenizer) + `tokens` @@ -667,7 +679,7 @@ Note: The `feature-dimension` for sequence and sentence features does not have t ::: :::note - To use `ConveRTTokenizer`, install Rasa Open Source with `pip3 install rasa[convert]`. + To use `ConveRTFeaturizer`, install Rasa Open Source with `pip3 install rasa[convert]`. ::: @@ -698,7 +710,7 @@ Note: The `feature-dimension` for sequence and sentence features does not have t * **Requires** - [HFTransformersNLP](./components.mdx#hftransformersnlp) and [LanguageModelTokenizer](./components.mdx#languagemodeltokenizer) + `tokens`. @@ -711,8 +723,7 @@ Note: The `feature-dimension` for sequence and sentence features does not have t * **Description** Creates features for entity extraction, intent classification, and response selection. - Uses the pre-trained language model specified in upstream [HFTransformersNLP](./components.mdx#hftransformersnlp) component to compute vector - representations of input text. + Uses a pre-trained language model to compute vector representations of input text. :::note Please make sure that you use a language model which is pre-trained on the same language corpus as that of your @@ -724,14 +735,49 @@ Note: The `feature-dimension` for sequence and sentence features does not have t * **Configuration** - Include [HFTransformersNLP](./components.mdx#hftransformersnlp) and [LanguageModelTokenizer](./components.mdx#languagemodeltokenizer) components before this component. Use - [LanguageModelTokenizer](./components.mdx#languagemodeltokenizer) to ensure tokens are correctly set for all components throughout the pipeline. + Include a [Tokenizer](./components.mdx#tokenizers) component before this component. + + You should specify what language model to load via the parameter `model_name`. See the below table for the + available language models. + Additionally, you can also specify the architecture variation of the chosen language model by specifying the + parameter `model_weights`. + The full list of supported architectures can be found in the + [HuggingFace documentation](https://huggingface.co/transformers/pretrained_models.html). + If left empty, it uses the default model architecture that original Transformers library loads (see table below). 
+ + ``` + +----------------+--------------+-------------------------+ + | Language Model | Parameter | Default value for | + | | "model_name" | "model_weights" | + +----------------+--------------+-------------------------+ + | BERT | bert | rasa/LaBSE | + +----------------+--------------+-------------------------+ + | GPT | gpt | openai-gpt | + +----------------+--------------+-------------------------+ + | GPT-2 | gpt2 | gpt2 | + +----------------+--------------+-------------------------+ + | XLNet | xlnet | xlnet-base-cased | + +----------------+--------------+-------------------------+ + | DistilBERT | distilbert | distilbert-base-uncased | + +----------------+--------------+-------------------------+ + | RoBERTa | roberta | roberta-base | + +----------------+--------------+-------------------------+ + ``` + + The following configuration loads the language model BERT: ```yaml-rasa pipeline: - - name: "LanguageModelFeaturizer" - ``` + - name: LanguageModelFeaturizer + # Name of the language model to use + model_name: "bert" + # Pre-Trained weights to be loaded + model_weights: "rasa/LaBSE" + # An optional path to a specific directory to download and cache the pre-trained model weights. + # The `default` cache_dir is the same as https://huggingface.co/transformers/serialization.html#cache-directory . + cache_dir: null + ``` ### RegexFeaturizer diff --git a/docs/docs/migration-guide.mdx b/docs/docs/migration-guide.mdx index 3ab49ab105a1..eb01641c49a5 100644 --- a/docs/docs/migration-guide.mdx +++ b/docs/docs/migration-guide.mdx @@ -10,6 +10,34 @@ description: | This page contains information about changes between major versions and how you can migrate from one version to another. +## Rasa 2.0 to Rasa 2.1 + +### Deprecations + +`ConveRTTokenizer` is now deprecated. [ConveRTFeaturizer](./components.mdx#convertfeaturizer) now implements +its behaviour. To migrate, replace `ConveRTTokenizer` with any other tokenizer, for e.g.: + +```yaml +pipeline: + - name: WhitespaceTokenizer + - name: ConveRTFeaturizer + model_url: + ... +``` + +`HFTransformersNLP` and `LanguageModelTokenizer` components are now deprecated. +[LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) now implements their behaviour. +To migrate, replace both the above components with any tokenizer and specify the model architecture and model weights +as part of `LanguageModelFeaturizer`, for e.g.: + +```yaml +pipeline: + - name: WhitespaceTokenizer + - name: LanguageModelFeaturizer + model_name: "bert" + model_weights: "rasa/LaBSE" + ... 
+``` ## Rasa 1.10 to Rasa 2.0 diff --git a/rasa/nlu/constants.py b/rasa/nlu/constants.py index 49e0978b075b..14297822acb3 100644 --- a/rasa/nlu/constants.py +++ b/rasa/nlu/constants.py @@ -63,9 +63,6 @@ rasa.shared.nlu.constants.INTENT_RESPONSE_KEY: "intent_response_key_tokens", } -TOKENS = "tokens" -TOKEN_IDS = "token_ids" - SEQUENCE_FEATURES = "sequence_features" SENTENCE_FEATURES = "sentence_features" diff --git a/rasa/nlu/featurizers/dense_featurizer/convert_featurizer.py b/rasa/nlu/featurizers/dense_featurizer/convert_featurizer.py index 9d65e3ef3460..e24c82d27219 100644 --- a/rasa/nlu/featurizers/dense_featurizer/convert_featurizer.py +++ b/rasa/nlu/featurizers/dense_featurizer/convert_featurizer.py @@ -2,11 +2,14 @@ from typing import Any, Dict, List, NoReturn, Optional, Text, Tuple, Type from tqdm import tqdm +import os import rasa.shared.utils.io -from rasa.nlu.tokenizers.convert_tokenizer import ConveRTTokenizer +import rasa.core.utils +from rasa.utils import common +from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer +from rasa.nlu.model import Metadata from rasa.shared.constants import DOCS_URL_COMPONENTS -from rasa.nlu.tokenizers.tokenizer import Token from rasa.nlu.components import Component from rasa.nlu.featurizers.featurizer import DenseFeaturizer from rasa.shared.nlu.training_data.features import Features @@ -17,8 +20,16 @@ DENSE_FEATURIZABLE_ATTRIBUTES, FEATURIZER_CLASS_ALIAS, TOKENS_NAMES, + NUMBER_OF_SUB_TOKENS, ) -from rasa.shared.nlu.constants import TEXT, FEATURE_TYPE_SENTENCE, FEATURE_TYPE_SEQUENCE +from rasa.shared.nlu.constants import ( + TEXT, + FEATURE_TYPE_SENTENCE, + FEATURE_TYPE_SEQUENCE, + ACTION_TEXT, +) +from rasa.exceptions import RasaException +import rasa.nlu.utils import numpy as np import tensorflow as tf @@ -26,6 +37,16 @@ logger = logging.getLogger(__name__) +# URL to the old remote location of the model which +# users might use. The model is no longer hosted here. +ORIGINAL_TF_HUB_MODULE_URL = ( + "https://github.com/PolyAI-LDN/polyai-models/releases/download/v1.0/model.tar.gz" +) + +# Warning: This URL is only intended for running pytests on ConveRT +# related components. This URL should not be allowed to be used by the user. +RESTRICTED_ACCESS_URL = "https://storage.googleapis.com/continuous-integration-model-storage/convert_tf2.tar.gz" + class ConveRTFeaturizer(DenseFeaturizer): """Featurizer using ConveRT model. @@ -35,22 +56,135 @@ class ConveRTFeaturizer(DenseFeaturizer): for dense featurizable attributes of each message object. """ + defaults = { + # Remote URL/Local path to model files + "model_url": None + } + @classmethod def required_components(cls) -> List[Type[Component]]: - return [ConveRTTokenizer] + """Components that should be included in the pipeline before this component.""" + return [Tokenizer] @classmethod def required_packages(cls) -> List[Text]: + """Packages needed to be installed.""" return ["tensorflow_text", "tensorflow_hub"] def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None: + """Initializes ConveRTFeaturizer with the model and different + encoding signatures. + Args: + component_config: Configuration for the component. 
+ """ super(ConveRTFeaturizer, self).__init__(component_config) + self.model_url = self._get_validated_model_url() + + self.module = train_utils.load_tf_hub_model(self.model_url) + + self.tokenize_signature = self._get_signature("tokenize", self.module) + self.sequence_encoding_signature = self._get_signature( + "encode_sequence", self.module + ) + self.sentence_encoding_signature = self._get_signature("default", self.module) @staticmethod - def __get_signature(signature: Text, module: Any) -> NoReturn: - """Retrieve a signature from a (hopefully loaded) TF model.""" + def _validate_model_files_exist(model_directory: Text) -> None: + """Check if essential model files exist inside the model_directory. + + Args: + model_directory: Directory to investigate + """ + files_to_check = [ + os.path.join(model_directory, "saved_model.pb"), + os.path.join(model_directory, "variables/variables.index"), + os.path.join(model_directory, "variables/variables.data-00001-of-00002"), + os.path.join(model_directory, "variables/variables.data-00000-of-00002"), + ] + for file_path in files_to_check: + if not os.path.exists(file_path): + raise RasaException( + f"""File {file_path} does not exist. + Re-check the files inside the directory {model_directory}. + It should contain the following model + files - [{", ".join(files_to_check)}]""" + ) + + def _get_validated_model_url(self) -> Text: + """Validates the specified `model_url` parameter. + + The `model_url` parameter cannot be left empty. It can either + be set to a remote URL where the model is hosted or it can be + a path to a local directory. + + Returns: + Validated path to model + """ + model_url = self.component_config.get("model_url", None) + + if not model_url: + raise RasaException( + f"""Parameter "model_url" was not specified in the configuration + of "{ConveRTFeaturizer.__name__}". It is mandatory to pass a value for this parameter. + You can either use a community hosted URL of the model + or if you have a local copy of the model, pass the + path to the directory containing the model files.""" + ) + + if model_url == ORIGINAL_TF_HUB_MODULE_URL: + # Can't use the originally hosted URL + raise RasaException( + f"""Parameter "model_url" of "{ConveRTFeaturizer.__name__}" was + set to "{model_url}" which does not contain the model any longer. + You can either use a community hosted URL or if you have a + local copy of the model, pass the path to the directory + containing the model files.""" + ) + + if model_url == RESTRICTED_ACCESS_URL: + # Can't use the URL that is reserved for tests only + raise RasaException( + f"""Parameter "model_url" of "{ConveRTFeaturizer.__name__}" was + set to "{model_url}" which is strictly reserved for pytests of Rasa Open Source only. + Due to licensing issues you are not allowed to use the model from this URL. + You can either use a community hosted URL or if you have a + local copy of the model, pass the path to the directory + containing the model files.""" + ) + + if os.path.isfile(model_url): + # Definitely invalid since the specified path should be a directory + raise RasaException( + f"""Parameter "model_url" of "{ConveRTFeaturizer.__name__}" was + set to the path of a file which is invalid. You + can either use a community hosted URL or if you have a + local copy of the model, pass the path to the directory + containing the model files.""" + ) + + if rasa.nlu.utils.is_url(model_url): + return model_url + + if os.path.isdir(model_url): + # Looks like a local directory. 
Inspect the directory + # to see if model files exist. + self._validate_model_files_exist(model_url) + # Convert the path to an absolute one since + # TFHUB doesn't like relative paths + return os.path.abspath(model_url) + + raise RasaException( + f"""{model_url} is neither a valid remote URL nor a local directory. + You can either use a community hosted URL or if you have a + local copy of the model, pass the path to + the directory containing the model files.""" + ) + + @staticmethod + def _get_signature(signature: Text, module: Any) -> NoReturn: + """Retrieve a signature from a (hopefully loaded) TF model.""" if not module: raise Exception( "ConveRTFeaturizer needs a proper loaded tensorflow module when used. " @@ -60,39 +194,34 @@ def __get_signature(signature: Text, module: Any) -> NoReturn: return module.signatures[signature] def _compute_features( - self, batch_examples: List[Message], module: Any, attribute: Text = TEXT + self, batch_examples: List[Message], attribute: Text = TEXT ) -> Tuple[np.ndarray, np.ndarray]: - - sentence_encodings = self._compute_sentence_encodings( - batch_examples, module, attribute - ) + sentence_encodings = self._compute_sentence_encodings(batch_examples, attribute) ( sequence_encodings, number_of_tokens_in_sentence, - ) = self._compute_sequence_encodings(batch_examples, module, attribute) + ) = self._compute_sequence_encodings(batch_examples, attribute) return self._get_features( sentence_encodings, sequence_encodings, number_of_tokens_in_sentence ) def _compute_sentence_encodings( - self, batch_examples: List[Message], module: Any, attribute: Text = TEXT + self, batch_examples: List[Message], attribute: Text = TEXT ) -> np.ndarray: # Get text for attribute of each example batch_attribute_text = [ex.get(attribute) for ex in batch_examples] - sentence_encodings = self._sentence_encoding_of_text( - batch_attribute_text, module - ) + sentence_encodings = self._sentence_encoding_of_text(batch_attribute_text) # convert them to a sequence of 1 return np.reshape(sentence_encodings, (len(batch_examples), 1, -1)) def _compute_sequence_encodings( - self, batch_examples: List[Message], module: Any, attribute: Text = TEXT + self, batch_examples: List[Message], attribute: Text = TEXT ) -> Tuple[np.ndarray, List[int]]: list_of_tokens = [ - example.get(TOKENS_NAMES[attribute]) for example in batch_examples + self.tokenize(example, attribute) for example in batch_examples ] number_of_tokens_in_sentence = [ @@ -103,7 +232,7 @@ def _compute_sequence_encodings( # the returned embeddings from ConveRT matches the length of the tokens # (including sub-tokens) tokenized_texts = self._tokens_to_text(list_of_tokens) - token_features = self._sequence_encoding_of_text(tokenized_texts, module) + token_features = self._sequence_encoding_of_text(tokenized_texts) # ConveRT might split up tokens into sub-tokens # take the mean of the sub-token vectors and use that as the token vector @@ -120,7 +249,6 @@ def _get_features( number_of_tokens_in_sentence: List[int], ) -> Tuple[np.ndarray, np.ndarray]: """Get the sequence and sentence features.""" - sentence_embeddings = [] sequence_embeddings = [] @@ -138,8 +266,9 @@ def _get_features( def _tokens_to_text(list_of_tokens: List[List[Token]]) -> List[Text]: """Convert list of tokens to text. 
- Add a whitespace between two tokens if the end value of the first tokens is - not the same as the end value of the second token.""" + Add a whitespace between two tokens if the end value of the first tokens + is not the same as the end value of the second token. + """ texts = [] for tokens in list_of_tokens: text = "" @@ -154,23 +283,31 @@ def _tokens_to_text(list_of_tokens: List[List[Token]]) -> List[Text]: return texts - def _sentence_encoding_of_text(self, batch: List[Text], module: Any) -> np.ndarray: - signature = self.__get_signature("default", module) - return signature(tf.convert_to_tensor(batch))["default"].numpy() + def _sentence_encoding_of_text(self, batch: List[Text]) -> np.ndarray: - def _sequence_encoding_of_text(self, batch: List[Text], module: Any) -> np.ndarray: - signature = self.__get_signature("encode_sequence", module) + return self.sentence_encoding_signature(tf.convert_to_tensor(batch))[ + "default" + ].numpy() - return signature(tf.convert_to_tensor(batch))["sequence_encoding"].numpy() + def _sequence_encoding_of_text(self, batch: List[Text]) -> np.ndarray: + + return self.sequence_encoding_signature(tf.convert_to_tensor(batch))[ + "sequence_encoding" + ].numpy() def train( self, training_data: TrainingData, config: Optional[RasaNLUModelConfig] = None, - *, - tf_hub_module: Any = None, **kwargs: Any, ) -> None: + """Featurize all message attributes in the training data with the ConveRT model. + + Args: + training_data: Training data to be featurized + config: Pipeline configuration + **kwargs: Any other arguments. + """ if config is not None and config.language != "en": rasa.shared.utils.io.raise_warning( f"Since ``ConveRT`` model is trained only on an english " @@ -203,7 +340,7 @@ def train( ( batch_sequence_features, batch_sentence_features, - ) = self._compute_features(batch_examples, tf_hub_module, attribute) + ) = self._compute_features(batch_examples, attribute) self._set_features( batch_examples, @@ -212,14 +349,17 @@ def train( attribute, ) - def process( - self, message: Message, *, tf_hub_module: Any = None, **kwargs: Any - ) -> None: + def process(self, message: Message, **kwargs: Any) -> None: + """Featurize an incoming message with the ConveRT model. - for attribute in DENSE_FEATURIZABLE_ATTRIBUTES: + Args: + message: Message to be featurized + **kwargs: Any other arguments. + """ + for attribute in {TEXT, ACTION_TEXT}: if message.get(attribute): sequence_features, sentence_features = self._compute_features( - [message], tf_hub_module, attribute=attribute + [message], attribute=attribute ) self._set_features( @@ -249,3 +389,61 @@ def _set_features( self.component_config[FEATURIZER_CLASS_ALIAS], ) example.add_features(_sentence_features) + + @classmethod + def cache_key( + cls, component_meta: Dict[Text, Any], model_metadata: Metadata + ) -> Optional[Text]: + """Cache the component for future use. + + Args: + component_meta: configuration for the component. + model_metadata: configuration for the whole pipeline. + + Returns: key of the cache for future retrievals. 
+ """ + _config = common.update_existing_keys(cls.defaults, component_meta) + return f"{cls.name}-{rasa.core.utils.get_dict_hash(_config)}" + + def provide_context(self) -> Dict[Text, Any]: + """Store the model in pipeline context for future use.""" + return {"tf_hub_module": self.module} + + def _tokenize(self, sentence: Text) -> Any: + + return self.tokenize_signature(tf.convert_to_tensor([sentence]))[ + "default" + ].numpy() + + def tokenize(self, message: Message, attribute: Text) -> List[Token]: + """Tokenize the text using the ConveRT model. + + ConveRT adds a special char in front of (some) words and splits words into + sub-words. To ensure the entity start and end values matches the token values, + reuse the tokens that are already assigned to the message. If individual tokens + are split up into multiple tokens, add this information to the + respected tokens. + """ + tokens_in = message.get(TOKENS_NAMES[attribute]) + + tokens_out = [] + + for token in tokens_in: + # use ConveRT model to tokenize the text + split_token_strings = self._tokenize(token.text)[0] + + # clean tokens (remove special chars and empty tokens) + split_token_strings = self._clean_tokens(split_token_strings) + + token.set(NUMBER_OF_SUB_TOKENS, len(split_token_strings)) + + tokens_out.append(token) + + message.set(TOKENS_NAMES[attribute], tokens_out) + return tokens_out + + @staticmethod + def _clean_tokens(tokens: List[bytes]) -> List[Text]: + """Encode tokens and remove special char added by ConveRT.""" + tokens = [string.decode("utf-8").replace("﹏", "") for string in tokens] + return [string for string in tokens if string] diff --git a/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py b/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py index d0bea59d1c78..4583dcd6fad1 100644 --- a/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py +++ b/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py @@ -1,33 +1,776 @@ -from typing import Any, Optional, Text, List, Type +import numpy as np +import logging +from typing import Any, Optional, Text, List, Type, Dict, Tuple + +import rasa.core.utils from rasa.nlu.config import RasaNLUModelConfig -from rasa.nlu.components import Component +from rasa.nlu.components import Component, UnsupportedLanguageError from rasa.nlu.featurizers.featurizer import DenseFeaturizer +from rasa.nlu.model import Metadata from rasa.shared.nlu.training_data.features import Features -from rasa.nlu.utils.hugging_face.hf_transformers import HFTransformersNLP -from rasa.nlu.tokenizers.lm_tokenizer import LanguageModelTokenizer +from rasa.nlu.tokenizers.tokenizer import Tokenizer, Token from rasa.shared.nlu.training_data.training_data import TrainingData from rasa.shared.nlu.training_data.message import Message from rasa.nlu.constants import ( - LANGUAGE_MODEL_DOCS, DENSE_FEATURIZABLE_ATTRIBUTES, SEQUENCE_FEATURES, SENTENCE_FEATURES, FEATURIZER_CLASS_ALIAS, + NO_LENGTH_RESTRICTION, + NUMBER_OF_SUB_TOKENS, + TOKENS_NAMES, + LANGUAGE_MODEL_DOCS, +) +from rasa.shared.nlu.constants import ( + TEXT, + FEATURE_TYPE_SENTENCE, + FEATURE_TYPE_SEQUENCE, + ACTION_TEXT, ) -from rasa.shared.nlu.constants import TEXT, FEATURE_TYPE_SENTENCE, FEATURE_TYPE_SEQUENCE +from rasa.utils import train_utils + +MAX_SEQUENCE_LENGTHS = { + "bert": 512, + "gpt": 512, + "gpt2": 512, + "xlnet": NO_LENGTH_RESTRICTION, + "distilbert": 512, + "roberta": 512, +} + +logger = logging.getLogger(__name__) class LanguageModelFeaturizer(DenseFeaturizer): - """Featurizer using transformer based language models. 
+ """Featurizer using transformer-based language models. - Uses the output of HFTransformersNLP component to set the sequence and sentence - level representations for dense featurizable attributes of each message object. + The transformers(https://github.com/huggingface/transformers) library + is used to load pre-trained language models like BERT, GPT-2, etc. + The component also tokenizes and featurizes dense featurizable attributes of + each message. """ + defaults = { + # name of the language model to load. + "model_name": "bert", + # Pre-Trained weights to be loaded(string) + "model_weights": None, + # an optional path to a specific directory to download + # and cache the pre-trained model weights. + "cache_dir": None, + } + @classmethod def required_components(cls) -> List[Type[Component]]: - return [HFTransformersNLP, LanguageModelTokenizer] + """Packages needed to be installed.""" + return [Tokenizer] + + def __init__( + self, + component_config: Optional[Dict[Text, Any]] = None, + skip_model_load: bool = False, + hf_transformers_loaded: bool = False, + ) -> None: + """Initializes LanguageModelFeaturizer with the specified model. + + Args: + component_config: Configuration for the component. + skip_model_load: Skip loading the model for pytests. + hf_transformers_loaded: Skip loading of model and metadata, use + HFTransformers output instead. + """ + super(LanguageModelFeaturizer, self).__init__(component_config) + if hf_transformers_loaded: + return + self._load_model_metadata() + self._load_model_instance(skip_model_load) + + @classmethod + def create( + cls, component_config: Dict[Text, Any], config: RasaNLUModelConfig + ) -> "DenseFeaturizer": + language = config.language + if not cls.can_handle_language(language): + # check failed + raise UnsupportedLanguageError(cls.name, language) + # TODO: remove this when HFTransformersNLP is removed for good + if isinstance(config, Metadata): + hf_transformers_loaded = "HFTransformersNLP" in [ + c["name"] for c in config.metadata["pipeline"] + ] + else: + hf_transformers_loaded = "HFTransformersNLP" in config.component_names + return cls(component_config, hf_transformers_loaded=hf_transformers_loaded) + + @classmethod + def load( + cls, + meta: Dict[Text, Any], + model_dir: Optional[Text] = None, + model_metadata: Optional["Metadata"] = None, + cached_component: Optional["Component"] = None, + **kwargs: Any, + ) -> "Component": + """Load this component from file. + + After a component has been trained, it will be persisted by + calling `persist`. When the pipeline gets loaded again, + this component needs to be able to restore itself. + Components can rely on any context attributes that are + created by :meth:`components.Component.create` + calls to components previous to this one. + + This method differs from the parent method only in that it calls create + rather than the constructor if the component is not found. This is to + trigger the check for HFTransformersNLP and the method can be removed + when HFTRansformersNLP is removed. + + Args: + meta: Any configuration parameter related to the model. + model_dir: The directory to load the component from. + model_metadata: The model's :class:`rasa.nlu.model.Metadata`. + cached_component: The cached component. 
+ + Returns: + the loaded component + """ + # TODO: remove this when HFTransformersNLP is removed for good + if cached_component: + return cached_component + + return cls.create(meta, model_metadata) + + def _load_model_metadata(self) -> None: + """Load the metadata for the specified model and sets these properties. + + This includes the model name, model weights, cache directory and the + maximum sequence length the model can handle. + """ + from rasa.nlu.utils.hugging_face.registry import ( + model_class_dict, + model_weights_defaults, + ) + + self.model_name = self.component_config["model_name"] + + if self.model_name not in model_class_dict: + raise KeyError( + f"'{self.model_name}' not a valid model name. Choose from " + f"{str(list(model_class_dict.keys()))} or create" + f"a new class inheriting from this class to support your model." + ) + + self.model_weights = self.component_config["model_weights"] + self.cache_dir = self.component_config["cache_dir"] + + if not self.model_weights: + logger.info( + f"Model weights not specified. Will choose default model " + f"weights: {model_weights_defaults[self.model_name]}" + ) + self.model_weights = model_weights_defaults[self.model_name] + + self.max_model_sequence_length = MAX_SEQUENCE_LENGTHS[self.model_name] + + def _load_model_instance(self, skip_model_load: bool) -> None: + """Try loading the model instance. + + Args: + skip_model_load: Skip loading the model instances to save time. This + should be True only for pytests + """ + if skip_model_load: + # This should be True only during pytests + return + + from rasa.nlu.utils.hugging_face.registry import ( + model_class_dict, + model_tokenizer_dict, + ) + + logger.debug(f"Loading Tokenizer and Model for {self.model_name}") + + self.tokenizer = model_tokenizer_dict[self.model_name].from_pretrained( + self.model_weights, cache_dir=self.cache_dir + ) + self.model = model_class_dict[self.model_name].from_pretrained( + self.model_weights, cache_dir=self.cache_dir + ) + + # Use a universal pad token since all transformer architectures do not have a + # consistent token. Instead of pad_token_id we use unk_token_id because + # pad_token_id is not set for all architectures. We can't add a new token as + # well since vocabulary resizing is not yet supported for TF classes. + # Also, this does not hurt the model predictions since we use an attention mask + # while feeding input. + self.pad_token_id = self.tokenizer.unk_token_id + + @classmethod + def cache_key( + cls, component_meta: Dict[Text, Any], model_metadata: Metadata + ) -> Optional[Text]: + """Cache the component for future use. + + Args: + component_meta: configuration for the component. + model_metadata: configuration for the whole pipeline. + + Returns: key of the cache for future retrievals. + """ + weights = component_meta.get("model_weights") or {} + + return ( + f"{cls.name}-{component_meta.get('model_name')}-" + f"{rasa.core.utils.get_dict_hash(weights)}" + ) + + @classmethod + def required_packages(cls) -> List[Text]: + """Packages needed to be installed.""" + return ["transformers"] + + def _lm_tokenize(self, text: Text) -> Tuple[List[int], List[Text]]: + """Pass the text through the tokenizer of the language model. + + Args: + text: Text to be tokenized. + + Returns: List of token ids and token strings. 
+ """ + split_token_ids = self.tokenizer.encode(text, add_special_tokens=False) + + split_token_strings = self.tokenizer.convert_ids_to_tokens(split_token_ids) + + return split_token_ids, split_token_strings + + def _add_lm_specific_special_tokens( + self, token_ids: List[List[int]] + ) -> List[List[int]]: + """Add language model specific special tokens which were used during + their training. + + Args: + token_ids: List of token ids for each example in the batch. + + Returns: Augmented list of token ids for each example in the batch. + """ + from rasa.nlu.utils.hugging_face.registry import ( + model_special_tokens_pre_processors, + ) + + augmented_tokens = [ + model_special_tokens_pre_processors[self.model_name](example_token_ids) + for example_token_ids in token_ids + ] + return augmented_tokens + + def _lm_specific_token_cleanup( + self, split_token_ids: List[int], token_strings: List[Text] + ) -> Tuple[List[int], List[Text]]: + """Clean up special chars added by tokenizers of language models. + + Many language models add a special char in front/back of (some) words. We clean + up those chars as they are not + needed once the features are already computed. + + Args: + split_token_ids: List of token ids received as output from the language + model specific tokenizer. + token_strings: List of token strings received as output from the language + model specific tokenizer. + + Returns: Cleaned up token ids and token strings. + """ + from rasa.nlu.utils.hugging_face.registry import model_tokens_cleaners + + return model_tokens_cleaners[self.model_name](split_token_ids, token_strings) + + def _post_process_sequence_embeddings( + self, sequence_embeddings: np.ndarray + ) -> Tuple[np.ndarray, np.ndarray]: + """Compute sentence and sequence level representations for relevant tokens. + + Args: + sequence_embeddings: Sequence level dense features received as output from + language model. + + Returns: Sentence and sequence level representations. + """ + from rasa.nlu.utils.hugging_face.registry import ( + model_embeddings_post_processors, + ) + + sentence_embeddings = [] + post_processed_sequence_embeddings = [] + + for example_embedding in sequence_embeddings: + ( + example_sentence_embedding, + example_post_processed_embedding, + ) = model_embeddings_post_processors[self.model_name](example_embedding) + + sentence_embeddings.append(example_sentence_embedding) + post_processed_sequence_embeddings.append(example_post_processed_embedding) + + return ( + np.array(sentence_embeddings), + np.array(post_processed_sequence_embeddings), + ) + + def _tokenize_example( + self, message: Message, attribute: Text + ) -> Tuple[List[Token], List[int]]: + """Tokenize a single message example. + + Many language models add a special char in front of (some) words and split + words into sub-words. To ensure the entity start and end values matches the + token values, use the tokens produced by the Tokenizer component. If + individual tokens are split up into multiple tokens, we add this information + to the respected token. + + Args: + message: Single message object to be processed. + attribute: Property of message to be processed, one of ``TEXT`` or + ``RESPONSE``. + + Returns: List of token strings and token ids for the corresponding + attribute of the message. 
+ """ + tokens_in = message.get(TOKENS_NAMES[attribute]) + tokens_out = [] + + token_ids_out = [] + + for token in tokens_in: + # use lm specific tokenizer to further tokenize the text + split_token_ids, split_token_strings = self._lm_tokenize(token.text) + + (split_token_ids, split_token_strings) = self._lm_specific_token_cleanup( + split_token_ids, split_token_strings + ) + + token_ids_out += split_token_ids + + token.set(NUMBER_OF_SUB_TOKENS, len(split_token_strings)) + + tokens_out.append(token) + + return tokens_out, token_ids_out + + def _get_token_ids_for_batch( + self, batch_examples: List[Message], attribute: Text + ) -> Tuple[List[List[Token]], List[List[int]]]: + """Compute token ids and token strings for each example in batch. + + A token id is the id of that token in the vocabulary of the language model. + + Args: + batch_examples: Batch of message objects for which tokens need to be + computed. + attribute: Property of message to be processed, one of ``TEXT`` or + ``RESPONSE``. + + Returns: List of token strings and token ids for each example in the batch. + """ + batch_token_ids = [] + batch_tokens = [] + for example in batch_examples: + + example_tokens, example_token_ids = self._tokenize_example( + example, attribute + ) + batch_tokens.append(example_tokens) + batch_token_ids.append(example_token_ids) + + return batch_tokens, batch_token_ids + + @staticmethod + def _compute_attention_mask( + actual_sequence_lengths: List[int], max_input_sequence_length: int + ) -> np.ndarray: + """Compute a mask for padding tokens. + + This mask will be used by the language model so that it does not attend to + padding tokens. + + Args: + actual_sequence_lengths: List of length of each example without any + padding. + max_input_sequence_length: Maximum length of a sequence that will be + present in the input batch. This is + after taking into consideration the maximum input sequence the model + can handle. Hence it can never be + greater than self.max_model_sequence_length in case the model + applies length restriction. + + Returns: Computed attention mask, 0 for padding and 1 for non-padding + tokens. + """ + attention_mask = [] + + for actual_sequence_length in actual_sequence_lengths: + # add 1s for present tokens, fill up the remaining space up to max + # sequence length with 0s (non-existing tokens) + padded_sequence = [1] * min( + actual_sequence_length, max_input_sequence_length + ) + [0] * ( + max_input_sequence_length + - min(actual_sequence_length, max_input_sequence_length) + ) + attention_mask.append(padded_sequence) + + attention_mask = np.array(attention_mask).astype(np.float32) + return attention_mask + + def _extract_sequence_lengths( + self, batch_token_ids: List[List[int]] + ) -> Tuple[List[int], int]: + """Extracts the sequence length for each example and maximum sequence length. + + Args: + batch_token_ids: List of token ids for each example in the batch. + + Returns: + Tuple consisting of: the actual sequence lengths for each example, + and the maximum input sequence length (taking into account the + maximum sequence length that the model can handle. 
+ """ + # Compute max length across examples + max_input_sequence_length = 0 + actual_sequence_lengths = [] + + for example_token_ids in batch_token_ids: + sequence_length = len(example_token_ids) + actual_sequence_lengths.append(sequence_length) + max_input_sequence_length = max( + max_input_sequence_length, len(example_token_ids) + ) + + # Take into account the maximum sequence length the model can handle + max_input_sequence_length = ( + max_input_sequence_length + if self.max_model_sequence_length == NO_LENGTH_RESTRICTION + else min(max_input_sequence_length, self.max_model_sequence_length) + ) + + return actual_sequence_lengths, max_input_sequence_length + + def _add_padding_to_batch( + self, batch_token_ids: List[List[int]], max_sequence_length_model: int + ) -> List[List[int]]: + """Add padding so that all examples in the batch are of the same length. + + Args: + batch_token_ids: Batch of examples where each example is a non-padded list + of token ids. + max_sequence_length_model: Maximum length of any input sequence in the batch + to be fed to the model. + + Returns: + Padded batch with all examples of the same length. + """ + padded_token_ids = [] + + # Add padding according to max_sequence_length + # Some models don't contain pad token, we use unknown token as padding token. + # This doesn't affect the computation since we compute an attention mask + # anyways. + for example_token_ids in batch_token_ids: + + # Truncate any longer sequences so that they can be fed to the model + if len(example_token_ids) > max_sequence_length_model: + example_token_ids = example_token_ids[:max_sequence_length_model] + + padded_token_ids.append( + example_token_ids + + [self.pad_token_id] + * (max_sequence_length_model - len(example_token_ids)) + ) + return padded_token_ids + + @staticmethod + def _extract_nonpadded_embeddings( + embeddings: np.ndarray, actual_sequence_lengths: List[int] + ) -> np.ndarray: + """Extract embeddings for actual tokens. + + Use pre-computed non-padded lengths of each example to extract embeddings + for non-padding tokens. + + Args: + embeddings: sequence level representations for each example of the batch. + actual_sequence_lengths: non-padded lengths of each example of the batch. + + Returns: + Sequence level embeddings for only non-padding tokens of the batch. + """ + nonpadded_sequence_embeddings = [] + for index, embedding in enumerate(embeddings): + unmasked_embedding = embedding[: actual_sequence_lengths[index]] + nonpadded_sequence_embeddings.append(unmasked_embedding) + + return np.array(nonpadded_sequence_embeddings) + + def _compute_batch_sequence_features( + self, batch_attention_mask: np.ndarray, padded_token_ids: List[List[int]] + ) -> np.ndarray: + """Feed the padded batch to the language model. + + Args: + batch_attention_mask: Mask of 0s and 1s which indicate whether the token + is a padding token or not. + padded_token_ids: Batch of token ids for each example. The batch is padded + and hence can be fed at once. + + Returns: + Sequence level representations from the language model. 
+ """ + model_outputs = self.model( + np.array(padded_token_ids), attention_mask=np.array(batch_attention_mask) + ) + + # sequence hidden states is always the first output from all models + sequence_hidden_states = model_outputs[0] + + sequence_hidden_states = sequence_hidden_states.numpy() + return sequence_hidden_states + + def _validate_sequence_lengths( + self, + actual_sequence_lengths: List[int], + batch_examples: List[Message], + attribute: Text, + inference_mode: bool = False, + ) -> None: + """Validate if sequence lengths of all inputs are less the max sequence + length the model can handle. + + This method should throw an error during training, whereas log a debug + message during inference if any of the input examples have a length + greater than maximum sequence length allowed. + + Args: + actual_sequence_lengths: original sequence length of all inputs + batch_examples: all message instances in the batch + attribute: attribute of message object to be processed + inference_mode: Whether this is during training or during inferencing + """ + if self.max_model_sequence_length == NO_LENGTH_RESTRICTION: + # There is no restriction on sequence length from the model + return + + for sequence_length, example in zip(actual_sequence_lengths, batch_examples): + if sequence_length > self.max_model_sequence_length: + if not inference_mode: + raise RuntimeError( + f"The sequence length of '{example.get(attribute)[:20]}...' " + f"is too long({sequence_length} tokens) for the " + f"model chosen {self.model_name} which has a maximum " + f"sequence length of {self.max_model_sequence_length} tokens. Either " + f"shorten the message or use a model which has no " + f"restriction on input sequence length like XLNet." + ) + logger.debug( + f"The sequence length of '{example.get(attribute)[:20]}...' " + f"is too long({sequence_length} tokens) for the " + f"model chosen {self.model_name} which has a maximum " + f"sequence length of {self.max_model_sequence_length} tokens. " + f"Downstream model predictions may be affected because of this." + ) + + def _add_extra_padding( + self, sequence_embeddings: np.ndarray, actual_sequence_lengths: List[int] + ) -> np.ndarray: + """Add extra zero padding to match the original sequence length. + + This is only done if the input was truncated during the batch + preparation of input for the model. + Args: + sequence_embeddings: Embeddings returned from the model + actual_sequence_lengths: original sequence length of all inputs + + Returns: + Modified sequence embeddings with padding if necessary + """ + if self.max_model_sequence_length == NO_LENGTH_RESTRICTION: + # No extra padding needed because there wouldn't have been any + # truncation in the first place + return sequence_embeddings + + reshaped_sequence_embeddings = [] + for index, embedding in enumerate(sequence_embeddings): + embedding_size = embedding.shape[-1] + if actual_sequence_lengths[index] > self.max_model_sequence_length: + embedding = np.concatenate( + [ + embedding, + np.zeros( + ( + actual_sequence_lengths[index] + - self.max_model_sequence_length, + embedding_size, + ), + dtype=np.float32, + ), + ] + ) + reshaped_sequence_embeddings.append(embedding) + + return np.array(reshaped_sequence_embeddings) + + def _get_model_features_for_batch( + self, + batch_token_ids: List[List[int]], + batch_tokens: List[List[Token]], + batch_examples: List[Message], + attribute: Text, + inference_mode: bool = False, + ) -> Tuple[np.ndarray, np.ndarray]: + """Compute dense features of each example in the batch. 
+ + We first add the special tokens corresponding to each language model. Next, we + add appropriate padding and compute a mask for that padding so that it doesn't + affect the feature computation. The padded batch is next fed to the language + model and token level embeddings are computed. Using the pre-computed mask, + embeddings for non-padding tokens are extracted and subsequently sentence + level embeddings are computed. + + Args: + batch_token_ids: List of token ids of each example in the batch. + batch_tokens: List of token objects for each example in the batch. + batch_examples: List of examples in the batch. + attribute: attribute of the Message object to be processed. + inference_mode: Whether the call is during training or during inference. + + Returns: + Sentence and token level dense representations. + """ + # Let's first add tokenizer specific special tokens to all examples + batch_token_ids_augmented = self._add_lm_specific_special_tokens( + batch_token_ids + ) + + # Compute sequence lengths for all examples + ( + actual_sequence_lengths, + max_input_sequence_length, + ) = self._extract_sequence_lengths(batch_token_ids_augmented) + + # Validate that all sequences can be processed based on their sequence + # lengths and the maximum sequence length the model can handle + self._validate_sequence_lengths( + actual_sequence_lengths, batch_examples, attribute, inference_mode + ) + + # Add padding so that whole batch can be fed to the model + padded_token_ids = self._add_padding_to_batch( + batch_token_ids_augmented, max_input_sequence_length + ) + + # Compute attention mask based on actual_sequence_length + batch_attention_mask = self._compute_attention_mask( + actual_sequence_lengths, max_input_sequence_length + ) + + # Get token level features from the model + sequence_hidden_states = self._compute_batch_sequence_features( + batch_attention_mask, padded_token_ids + ) + + # Extract features for only non-padding tokens + sequence_nonpadded_embeddings = self._extract_nonpadded_embeddings( + sequence_hidden_states, actual_sequence_lengths + ) + + # Extract sentence level and post-processed features + ( + sentence_embeddings, + sequence_embeddings, + ) = self._post_process_sequence_embeddings(sequence_nonpadded_embeddings) + + # Pad zeros for examples which were truncated in inference mode. 
+ # This is intentionally done after sentence embeddings have been + # extracted so that they are not affected + sequence_embeddings = self._add_extra_padding( + sequence_embeddings, actual_sequence_lengths + ) + + # shape of matrix for all sequence embeddings + batch_dim = len(sequence_embeddings) + seq_dim = max(e.shape[0] for e in sequence_embeddings) + feature_dim = sequence_embeddings[0].shape[1] + shape = (batch_dim, seq_dim, feature_dim) + + # align features with tokens so that we have just one vector per token + # (don't include sub-tokens) + sequence_embeddings = train_utils.align_token_features( + batch_tokens, sequence_embeddings, shape + ) + + # sequence_embeddings is a padded numpy array + # remove the padding, keep just the non-zero vectors + sequence_final_embeddings = [] + for embeddings, tokens in zip(sequence_embeddings, batch_tokens): + sequence_final_embeddings.append(embeddings[: len(tokens)]) + sequence_final_embeddings = np.array(sequence_final_embeddings) + + return sentence_embeddings, sequence_final_embeddings + + def _get_docs_for_batch( + self, + batch_examples: List[Message], + attribute: Text, + inference_mode: bool = False, + ) -> List[Dict[Text, Any]]: + """Compute language model docs for all examples in the batch. + + Args: + batch_examples: Batch of message objects for which language model docs + need to be computed. + attribute: Property of message to be processed, one of ``TEXT`` or + ``RESPONSE``. + inference_mode: Whether the call is during inference or during training. + + + Returns: + List of language model docs for each message in batch. + """ + hf_transformers_doc = batch_examples[0].get(LANGUAGE_MODEL_DOCS[attribute]) + if hf_transformers_doc: + # This should only be the case if the deprecated + # HFTransformersNLP component is used in the pipeline + # TODO: remove this when HFTransformersNLP is removed for good + logging.debug( + f"'{LANGUAGE_MODEL_DOCS[attribute]}' set: this " + f"indicates you're using the deprecated component " + f"HFTransformersNLP, please remove it from your " + f"pipeline." + ) + return [ex.get(LANGUAGE_MODEL_DOCS[attribute]) for ex in batch_examples] + + batch_tokens, batch_token_ids = self._get_token_ids_for_batch( + batch_examples, attribute + ) + + ( + batch_sentence_features, + batch_sequence_features, + ) = self._get_model_features_for_batch( + batch_token_ids, batch_tokens, batch_examples, attribute, inference_mode + ) + + # A doc consists of + # {'sequence_features': ..., 'sentence_features': ...} + batch_docs = [] + for index in range(len(batch_examples)): + doc = { + SEQUENCE_FEATURES: batch_sequence_features[index], + SENTENCE_FEATURES: np.reshape(batch_sentence_features[index], (1, -1)), + } + batch_docs.append(doc) + + return batch_docs def train( self, @@ -35,32 +778,61 @@ def train( config: Optional[RasaNLUModelConfig] = None, **kwargs: Any, ) -> None: + """Compute tokens and dense features for each message in training data. - for example in training_data.training_examples: - for attribute in DENSE_FEATURIZABLE_ATTRIBUTES: - self._set_lm_features(example, attribute) - - def _get_doc(self, message: Message, attribute: Text) -> Any: - """ - Get the language model doc. A doc consists of - {'token_ids': ..., 'tokens': ..., - 'sequence_features': ..., 'sentence_features': ...} + Args: + training_data: NLU training data to be tokenized and featurized + config: NLU pipeline config consisting of all components. 
""" - return message.get(LANGUAGE_MODEL_DOCS[attribute]) + batch_size = 64 - def process(self, message: Message, **kwargs: Any) -> None: - """Sets the dense features from the language model doc to the incoming - message.""" for attribute in DENSE_FEATURIZABLE_ATTRIBUTES: - self._set_lm_features(message, attribute) - def _set_lm_features(self, message: Message, attribute: Text = TEXT) -> None: - """Adds the precomputed word vectors to the messages features.""" - doc = self._get_doc(message, attribute) + non_empty_examples = list( + filter(lambda x: x.get(attribute), training_data.training_examples) + ) - if doc is None: - return + batch_start_index = 0 + + while batch_start_index < len(non_empty_examples): + + batch_end_index = min( + batch_start_index + batch_size, len(non_empty_examples) + ) + # Collect batch examples + batch_messages = non_empty_examples[batch_start_index:batch_end_index] + + # Construct a doc with relevant features + # extracted(tokens, dense_features) + batch_docs = self._get_docs_for_batch(batch_messages, attribute) + + for index, ex in enumerate(batch_messages): + self._set_lm_features(batch_docs[index], ex, attribute) + batch_start_index += batch_size + def process(self, message: Message, **kwargs: Any) -> None: + """Process an incoming message by computing its tokens and dense features. + + Args: + message: Incoming message object + """ + # process of all featurizers operates only on TEXT and ACTION_TEXT attributes, + # because all other attributes are labels which are featurized during training + # and their features are stored by the model itself. + for attribute in {TEXT, ACTION_TEXT}: + if message.get(attribute): + self._set_lm_features( + self._get_docs_for_batch( + [message], attribute=attribute, inference_mode=True + )[0], + message, + attribute, + ) + + def _set_lm_features( + self, doc: Dict[Text, Any], message: Message, attribute: Text = TEXT + ) -> None: + """Adds the precomputed word vectors to the messages features.""" sequence_features = doc[SEQUENCE_FEATURES] sentence_features = doc[SENTENCE_FEATURES] diff --git a/rasa/nlu/tokenizers/convert_tokenizer.py b/rasa/nlu/tokenizers/convert_tokenizer.py index a2b4857732f1..369753791960 100644 --- a/rasa/nlu/tokenizers/convert_tokenizer.py +++ b/rasa/nlu/tokenizers/convert_tokenizer.py @@ -1,210 +1,28 @@ -from typing import Any, Dict, List, Optional, Text +from typing import Dict, Text, Any -from rasa.core.utils import get_dict_hash -from rasa.nlu.constants import NUMBER_OF_SUB_TOKENS -from rasa.nlu.model import Metadata -from rasa.nlu.tokenizers.tokenizer import Token +import rasa.shared.utils.io +from rasa.nlu.tokenizers.tokenizer import Tokenizer from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer -from rasa.shared.nlu.training_data.message import Message -from rasa.utils import common -import rasa.nlu.utils -import rasa.utils.train_utils as train_utils -from rasa.exceptions import RasaException -import tensorflow as tf -import os - - -# URL to the old remote location of the model which -# users might use. The model is no longer hosted here. -ORIGINAL_TF_HUB_MODULE_URL = ( - "https://github.com/PolyAI-LDN/polyai-models/releases/download/v1.0/model.tar.gz" -) - -# Warning: This URL is only intended for running pytests on ConveRT -# related components. This URL should not be allowed to be used by the user. 
-RESTRICTED_ACCESS_URL = "https://storage.googleapis.com/continuous-integration-model-storage/convert_tf2.tar.gz" class ConveRTTokenizer(WhitespaceTokenizer): - """Tokenizer using ConveRT model. + """This tokenizer is deprecated and will be removed in the future. - Loads the ConveRT(https://github.com/PolyAI-LDN/polyai-models#convert) - model from TFHub and computes sub-word tokens for dense - featurizable attributes of each message object. + The ConveRTFeaturizer component now sets the sub-token information + for dense featurizable attributes of each message object. """ - defaults = { - # Flag to check whether to split intents - "intent_tokenization_flag": False, - # Symbol on which intent should be split - "intent_split_symbol": "_", - # Regular expression to detect tokens - "token_pattern": None, - # Remote URL/Local path to model files - "model_url": None, - } - def __init__(self, component_config: Dict[Text, Any] = None) -> None: - """Construct a new tokenizer using the WhitespaceTokenizer framework. + """Initializes ConveRTTokenizer with the ConveRT model. Args: - component_config: User configuration for the component + component_config: Configuration for the component. """ super().__init__(component_config) - - self.model_url = self._get_validated_model_url() - - self.module = train_utils.load_tf_hub_model(self.model_url) - - self.tokenize_signature = self.module.signatures["tokenize"] - - @staticmethod - def _validate_model_files_exist(model_directory: Text) -> None: - """Check if essential model files exist inside the model_directory. - - Args: - model_directory: Directory to investigate - """ - files_to_check = [ - os.path.join(model_directory, "saved_model.pb"), - os.path.join(model_directory, "variables/variables.index"), - os.path.join(model_directory, "variables/variables.data-00001-of-00002"), - os.path.join(model_directory, "variables/variables.data-00000-of-00002"), - ] - - for file_path in files_to_check: - if not os.path.exists(file_path): - raise RasaException( - f"""File {file_path} does not exist. - Re-check the files inside the directory {model_directory}. - It should contain the following model - files - [{", ".join(files_to_check)}]""" - ) - - def _get_validated_model_url(self) -> Text: - """Validates the specified `model_url` parameter. - - The `model_url` parameter cannot be left empty. It can either - be set to a remote URL where the model is hosted or it can be - a path to a local directory. - - Returns: - Validated path to model - """ - model_url = self.component_config.get("model_url", None) - - if not model_url: - raise RasaException( - f"""Parameter "model_url" was not specified in the configuration - of "{ConveRTTokenizer.__name__}". - You can either use a community hosted URL of the model - or if you have a local copy of the model, pass the - path to the directory containing the model files.""" - ) - - if model_url == ORIGINAL_TF_HUB_MODULE_URL: - # Can't use the originally hosted URL - raise RasaException( - f"""Parameter "model_url" of "{ConveRTTokenizer.__name__}" was - set to "{model_url}" which does not contain the model any longer. 
- You can either use a community hosted URL or if you have a - local copy of the model, pass the path to the directory - containing the model files.""" - ) - - if model_url == RESTRICTED_ACCESS_URL: - # Can't use the URL that is reserved for tests only - raise RasaException( - f"""Parameter "model_url" of "{ConveRTTokenizer.__name__}" was - set to "{model_url}" which is strictly reserved for pytests of Rasa Open Source only. - Due to licensing issues you are not allowed to use the model from this URL. - You can either use a community hosted URL or if you have a - local copy of the model, pass the path to the directory - containing the model files.""" - ) - - if os.path.isfile(model_url): - # Definitely invalid since the specified path should be a directory - raise RasaException( - f"""Parameter "model_url" of "{ConveRTTokenizer.__name__}" was - set to the path of a file which is invalid. You - can either use a community hosted URL or if you have a - local copy of the model, pass the path to the directory - containing the model files.""" - ) - - if rasa.nlu.utils.is_url(model_url): - return model_url - - if os.path.isdir(model_url): - # Looks like a local directory. Inspect the directory - # to see if model files exist. - self._validate_model_files_exist(model_url) - # Convert the path to an absolute one since - # TFHUB doesn't like relative paths - return os.path.abspath(model_url) - - raise RasaException( - f"""{model_url} is neither a valid remote URL nor a local directory. - You can either use a community hosted URL or if you have a - local copy of the model, pass the path to - the directory containing the model files.""" + rasa.shared.utils.io.raise_warning( + f"'{self.__class__.__name__}' is deprecated and " + f"will be removed in the future. " + f"It is recommended to use the '{WhitespaceTokenizer.__name__}' or " + f"another {Tokenizer.__name__} instead.", + category=DeprecationWarning, ) - - @classmethod - def cache_key( - cls, component_meta: Dict[Text, Any], model_metadata: Metadata - ) -> Optional[Text]: - """Cache the component for future use. - - Args: - component_meta: configuration for the component. - model_metadata: configuration for the whole pipeline. - - Returns: key of the cache for future retrievals. - """ - _config = common.update_existing_keys(cls.defaults, component_meta) - return f"{cls.name}-{get_dict_hash(_config)}" - - def provide_context(self) -> Dict[Text, Any]: - return {"tf_hub_module": self.module} - - def _tokenize(self, sentence: Text) -> Any: - - return self.tokenize_signature(tf.convert_to_tensor([sentence]))[ - "default" - ].numpy() - - def tokenize(self, message: Message, attribute: Text) -> List[Token]: - """Tokenize the text using the ConveRT model. - ConveRT adds a special char in front of (some) words and splits words into - sub-words. To ensure the entity start and end values matches the token values, - tokenize the text first using the whitespace tokenizer. If individual tokens - are split up into multiple tokens, add this information to the - respected tokens. 
- """ - - # perform whitespace tokenization - tokens_in = super().tokenize(message, attribute) - - tokens_out = [] - - for token in tokens_in: - # use ConveRT model to tokenize the text - split_token_strings = self._tokenize(token.text)[0] - - # clean tokens (remove special chars and empty tokens) - split_token_strings = self._clean_tokens(split_token_strings) - - token.set(NUMBER_OF_SUB_TOKENS, len(split_token_strings)) - - tokens_out.append(token) - - return tokens_out - - @staticmethod - def _clean_tokens(tokens: List[bytes]) -> List[Text]: - """Encode tokens and remove special char added by ConveRT.""" - - tokens = [string.decode("utf-8").replace("﹏", "") for string in tokens] - return [string for string in tokens if string] diff --git a/rasa/nlu/tokenizers/lm_tokenizer.py b/rasa/nlu/tokenizers/lm_tokenizer.py index 5e3bd61f41bb..fbee73158ef1 100644 --- a/rasa/nlu/tokenizers/lm_tokenizer.py +++ b/rasa/nlu/tokenizers/lm_tokenizer.py @@ -1,35 +1,27 @@ -from typing import Text, List, Any, Dict, Type +from typing import Dict, Text, Any -from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer -from rasa.nlu.components import Component -from rasa.nlu.utils.hugging_face.hf_transformers import HFTransformersNLP -from rasa.shared.nlu.training_data.message import Message +import rasa.shared.utils.io +from rasa.nlu.tokenizers.tokenizer import Tokenizer +from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer -from rasa.nlu.constants import LANGUAGE_MODEL_DOCS, TOKENS +class LanguageModelTokenizer(WhitespaceTokenizer): + """This tokenizer is deprecated and will be removed in the future. -class LanguageModelTokenizer(Tokenizer): - """Tokenizer using transformer based language models. - - Uses the output of HFTransformersNLP component to set the tokens - for dense featurizable attributes of each message object. + Use the LanguageModelFeaturizer with any other Tokenizer instead. """ - @classmethod - def required_components(cls) -> List[Type[Component]]: - return [HFTransformersNLP] - - defaults = { - # Flag to check whether to split intents - "intent_tokenization_flag": False, - # Symbol on which intent should be split - "intent_split_symbol": "_", - } - - def get_doc(self, message: Message, attribute: Text) -> Dict[Text, Any]: - return message.get(LANGUAGE_MODEL_DOCS[attribute]) - - def tokenize(self, message: Message, attribute: Text) -> List[Token]: - doc = self.get_doc(message, attribute) - - return doc[TOKENS] + def __init__(self, component_config: Dict[Text, Any] = None) -> None: + """Initializes LanguageModelTokenizer for tokenization. + + Args: + component_config: Configuration for the component. + """ + super().__init__(component_config) + rasa.shared.utils.io.raise_warning( + f"'{self.__class__.__name__}' is deprecated and " + f"will be removed in the future. 
" + f"It is recommended to use the '{WhitespaceTokenizer.__name__}' or " + f"another {Tokenizer.__name__} instead.", + category=DeprecationWarning, + ) diff --git a/rasa/nlu/utils/hugging_face/hf_transformers.py b/rasa/nlu/utils/hugging_face/hf_transformers.py index 8b818f3b8030..8a512876d200 100644 --- a/rasa/nlu/utils/hugging_face/hf_transformers.py +++ b/rasa/nlu/utils/hugging_face/hf_transformers.py @@ -1,22 +1,22 @@ import logging from typing import Any, Dict, List, Text, Tuple, Optional -from rasa.core.utils import get_dict_hash +import rasa.core.utils from rasa.nlu.model import Metadata from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer +from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer from rasa.nlu.components import Component from rasa.nlu.config import RasaNLUModelConfig from rasa.shared.nlu.training_data.training_data import TrainingData from rasa.shared.nlu.training_data.message import Message from rasa.nlu.tokenizers.tokenizer import Token +import rasa.shared.utils.io import rasa.utils.train_utils as train_utils import numpy as np from rasa.nlu.constants import ( LANGUAGE_MODEL_DOCS, DENSE_FEATURIZABLE_ATTRIBUTES, - TOKEN_IDS, - TOKENS, SENTENCE_FEATURES, SEQUENCE_FEATURES, NUMBER_OF_SUB_TOKENS, @@ -37,12 +37,9 @@ class HFTransformersNLP(Component): - """Utility Component for interfacing between Transformers library and Rasa OS. + """This component is deprecated and will be removed in the future. - The transformers(https://github.com/huggingface/transformers) library - is used to load pre-trained language models like BERT, GPT-2, etc. - The component also tokenizes and featurizes dense featurizable attributes of each - message. + Use the LanguageModelFeaturizer instead. """ defaults = { @@ -60,11 +57,19 @@ def __init__( component_config: Optional[Dict[Text, Any]] = None, skip_model_load: bool = False, ) -> None: + """Initializes HFTransformsNLP with the models specified.""" super(HFTransformersNLP, self).__init__(component_config) self._load_model_metadata() self._load_model_instance(skip_model_load) self.whitespace_tokenizer = WhitespaceTokenizer() + rasa.shared.utils.io.raise_warning( + f"'{self.__class__.__name__}' is deprecated and " + f"will be removed in the future. " + f"It is recommended to use the '{LanguageModelFeaturizer.__name__}' " + f"instead.", + category=DeprecationWarning, + ) def _load_model_metadata(self) -> None: @@ -78,7 +83,7 @@ def _load_model_metadata(self) -> None: if self.model_name not in model_class_dict: raise KeyError( f"'{self.model_name}' not a valid model name. Choose from " - f"{str(list(model_class_dict.keys()))} or create" + f"{str(list(model_class_dict.keys()))} or create " f"a new class inheriting from this class to support your model." ) @@ -95,12 +100,12 @@ def _load_model_metadata(self) -> None: self.max_model_sequence_length = MAX_SEQUENCE_LENGTHS[self.model_name] def _load_model_instance(self, skip_model_load: bool) -> None: - """Try loading the model instance + """Try loading the model instance. Args: - skip_model_load: Skip loading the model instances to save time. This should be True only for pytests + skip_model_load: Skip loading the model instances to save time. 
+ This should be True only for pytests """ - if skip_model_load: # This should be True only during pytests return @@ -131,10 +136,20 @@ def _load_model_instance(self, skip_model_load: bool) -> None: def cache_key( cls, component_meta: Dict[Text, Any], model_metadata: Metadata ) -> Optional[Text]: + """Cache the component for future use. + Args: + component_meta: configuration for the component. + model_metadata: configuration for the whole pipeline. + + Returns: key of the cache for future retrievals. + """ weights = component_meta.get("model_weights") or {} - return f"{cls.name}-{component_meta.get('model_name')}-{get_dict_hash(weights)}" + return ( + f"{cls.name}-{component_meta.get('model_name')}-" + f"{rasa.core.utils.get_dict_hash(weights)}" + ) @classmethod def required_packages(cls) -> List[Text]: @@ -212,7 +227,6 @@ def _post_process_sequence_embeddings( Returns: Sentence and sequence level representations. """ - from rasa.nlu.utils.hugging_face.registry import ( model_embeddings_post_processors, ) @@ -254,7 +268,6 @@ def _tokenize_example( List of token strings and token ids for the corresponding attribute of the message. """ - tokens_in = self.whitespace_tokenizer.tokenize(message, attribute) tokens_out = [] @@ -292,7 +305,6 @@ def _get_token_ids_for_batch( Returns: List of token strings and token ids for each example in the batch. """ - batch_token_ids = [] batch_tokens = [] for example in batch_examples: @@ -323,7 +335,6 @@ def _compute_attention_mask( Returns: Computed attention mask, 0 for padding and 1 for non-padding tokens. """ - attention_mask = [] for actual_sequence_length in actual_sequence_lengths: @@ -343,7 +354,16 @@ def _compute_attention_mask( def _extract_sequence_lengths( self, batch_token_ids: List[List[int]] ) -> Tuple[List[int], int]: + """Extracts the sequence length for each example and maximum sequence length. + + Args: + batch_token_ids: List of token ids for each example in the batch. + Returns: + Tuple consisting of: the actual sequence lengths for each example, + and the maximum input sequence length (taking into account the + maximum sequence length that the model can handle. + """ # Compute max length across examples max_input_sequence_length = 0 actual_sequence_lengths = [] @@ -498,7 +518,6 @@ def _add_extra_padding( Returns: Modified sequence embeddings with padding if necessary """ - if self.max_model_sequence_length == NO_LENGTH_RESTRICTION: # No extra padding needed because there wouldn't have been any truncation in the first place return sequence_embeddings @@ -640,7 +659,6 @@ def _get_docs_for_batch( Returns: List of language model docs for each message in batch. """ - batch_tokens, batch_token_ids = self._get_token_ids_for_batch( batch_examples, attribute ) @@ -658,8 +676,6 @@ def _get_docs_for_batch( batch_docs = [] for index in range(len(batch_examples)): doc = { - TOKEN_IDS: batch_token_ids[index], - TOKENS: batch_tokens[index], SEQUENCE_FEATURES: batch_sequence_features[index], SENTENCE_FEATURES: np.reshape(batch_sentence_features[index], (1, -1)), } @@ -680,7 +696,6 @@ def train( config: NLU pipeline config consisting of all components. """ - batch_size = 64 for attribute in DENSE_FEATURIZABLE_ATTRIBUTES: @@ -715,7 +730,6 @@ def process(self, message: Message, **kwargs: Any) -> None: Args: message: Incoming message object """ - # process of all featurizers operates only on TEXT and ACTION_TEXT attributes, # because all other attributes are labels which are featurized during training # and their features are stored by the model itself. 
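After the changes above, `LanguageModelFeaturizer` no longer relies on `HFTransformersNLP` or `LanguageModelTokenizer`: it loads the language model itself, computes sub-token information, and only needs plain `tokens` from whichever tokenizer runs before it. Below is a minimal sketch of the decoupled usage, mirroring the updated tests later in this diff; the `bert`/`bert-base-uncased` values are just the ones those tests use, and the snippet assumes the `transformers` dependency is installed.

```python
from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer
from rasa.shared.nlu.training_data.message import Message
from rasa.shared.nlu.training_data.training_data import TrainingData

tokenizer = WhitespaceTokenizer()
featurizer = LanguageModelFeaturizer(
    {"model_name": "bert", "model_weights": "bert-base-uncased"}
)

message = Message.build(text="here is the sentence I want embeddings for.")
training_data = TrainingData([message])

tokenizer.train(training_data)   # sets whitespace tokens on the message
featurizer.train(training_data)  # adds sub-token counts and sequence/sentence features
```

`ConveRTFeaturizer` follows the same pattern: the tests below construct it directly with a `model_url` and obtain sub-tokens via `featurizer.tokenize(...)` instead of going through the deprecated `ConveRTTokenizer`.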
diff --git a/tests/nlu/featurizers/test_convert_featurizer.py b/tests/nlu/featurizers/test_convert_featurizer.py index e4b90d5d1347..b219c7618cdb 100644 --- a/tests/nlu/featurizers/test_convert_featurizer.py +++ b/tests/nlu/featurizers/test_convert_featurizer.py @@ -1,37 +1,41 @@ import numpy as np import pytest -from typing import Text +from typing import Text, Optional, List, Tuple +from pathlib import Path +import os from _pytest.monkeypatch import MonkeyPatch -from rasa.nlu.tokenizers.convert_tokenizer import ( - ConveRTTokenizer, - RESTRICTED_ACCESS_URL, -) +from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer from rasa.shared.nlu.training_data.training_data import TrainingData from rasa.shared.nlu.training_data.message import Message -from rasa.nlu.constants import TOKENS_NAMES +from rasa.nlu.constants import TOKENS_NAMES, NUMBER_OF_SUB_TOKENS from rasa.shared.nlu.constants import TEXT, INTENT, RESPONSE from rasa.nlu.config import RasaNLUModelConfig -from rasa.nlu.featurizers.dense_featurizer.convert_featurizer import ConveRTFeaturizer +from rasa.nlu.featurizers.dense_featurizer.convert_featurizer import ( + ConveRTFeaturizer, + RESTRICTED_ACCESS_URL, + ORIGINAL_TF_HUB_MODULE_URL, +) +from rasa.exceptions import RasaException @pytest.mark.skip_on_windows -def test_convert_featurizer_process(component_builder, monkeypatch: MonkeyPatch): +def test_convert_featurizer_process(monkeypatch: MonkeyPatch): + tokenizer = WhitespaceTokenizer() monkeypatch.setattr( - ConveRTTokenizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL + ConveRTFeaturizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL ) - - component_config = {"name": "ConveRTTokenizer", "model_url": RESTRICTED_ACCESS_URL} - tokenizer = ConveRTTokenizer(component_config) - featurizer = component_builder.create_component_from_class(ConveRTFeaturizer) - + component_config = {"name": "ConveRTFeaturizer", "model_url": RESTRICTED_ACCESS_URL} + featurizer = ConveRTFeaturizer(component_config) sentence = "Hey how are you today ?" - message = Message(data={TEXT: sentence}) - tokens = tokenizer.tokenize(message, attribute=TEXT) - message.set(TOKENS_NAMES[TEXT], tokens) + message = Message.build(text=sentence) - featurizer.process(message, tf_hub_module=tokenizer.module) + td = TrainingData([message]) + tokenizer.train(td) + tokens = featurizer.tokenize(message, attribute=TEXT) + + featurizer.process(message, tf_hub_module=featurizer.module) expected = np.array([2.2636216, -0.26475656, -1.1358104, -0.49751878, -1.3946456]) expected_cls = np.array( @@ -49,26 +53,29 @@ def test_convert_featurizer_process(component_builder, monkeypatch: MonkeyPatch) @pytest.mark.skip_on_windows -def test_convert_featurizer_train(component_builder, monkeypatch: MonkeyPatch): +def test_convert_featurizer_train(monkeypatch: MonkeyPatch): + tokenizer = WhitespaceTokenizer() monkeypatch.setattr( - ConveRTTokenizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL + ConveRTFeaturizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL ) - component_config = {"name": "ConveRTTokenizer", "model_url": RESTRICTED_ACCESS_URL} - tokenizer = ConveRTTokenizer(component_config) - featurizer = component_builder.create_component_from_class(ConveRTFeaturizer) + component_config = {"name": "ConveRTFeaturizer", "model_url": RESTRICTED_ACCESS_URL} + featurizer = ConveRTFeaturizer(component_config) sentence = "Hey how are you today ?" 
message = Message(data={TEXT: sentence}) message.set(RESPONSE, sentence) - tokens = tokenizer.tokenize(message, attribute=TEXT) + td = TrainingData([message]) + tokenizer.train(td) + + tokens = featurizer.tokenize(message, attribute=TEXT) message.set(TOKENS_NAMES[TEXT], tokens) message.set(TOKENS_NAMES[RESPONSE], tokens) featurizer.train( - TrainingData([message]), RasaNLUModelConfig(), tf_hub_module=tokenizer.module + TrainingData([message]), RasaNLUModelConfig(), tf_hub_module=featurizer.module ) expected = np.array([2.2636216, -0.26475656, -1.1358104, -0.49751878, -1.3946456]) @@ -114,14 +121,143 @@ def test_convert_featurizer_train(component_builder, monkeypatch: MonkeyPatch): def test_convert_featurizer_tokens_to_text( sentence: Text, expected_text: Text, monkeypatch: MonkeyPatch ): + tokenizer = WhitespaceTokenizer() monkeypatch.setattr( - ConveRTTokenizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL + ConveRTFeaturizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL ) - component_config = {"name": "ConveRTTokenizer", "model_url": RESTRICTED_ACCESS_URL} - tokenizer = ConveRTTokenizer(component_config) - tokens = tokenizer.tokenize(Message(data={TEXT: sentence}), attribute=TEXT) + component_config = {"name": "ConveRTFeaturizer", "model_url": RESTRICTED_ACCESS_URL} + featurizer = ConveRTFeaturizer(component_config) + message = Message.build(text=sentence) + td = TrainingData([message]) + tokenizer.train(td) + tokens = featurizer.tokenize(message, attribute=TEXT) actual_text = ConveRTFeaturizer._tokens_to_text([tokens])[0] assert expected_text == actual_text + + +@pytest.mark.skip_on_windows +@pytest.mark.parametrize( + "text, expected_tokens, expected_indices", + [ + ( + "forecast for lunch", + ["forecast", "for", "lunch"], + [(0, 8), (9, 12), (13, 18)], + ), + ("hello", ["hello"], [(0, 5)]), + ("you're", ["you", "re"], [(0, 3), (4, 6)]), + ("r. n. 
b.", ["r", "n", "b"], [(0, 1), (3, 4), (6, 7)]), + ("rock & roll", ["rock", "&", "roll"], [(0, 4), (5, 6), (7, 11)]), + ("ńöñàśçií", ["ńöñàśçií"], [(0, 8)]), + ], +) +def test_convert_featurizer_token_edge_cases( + text: Text, + expected_tokens: List[Text], + expected_indices: List[Tuple[int]], + monkeypatch: MonkeyPatch, +): + tokenizer = WhitespaceTokenizer() + + monkeypatch.setattr( + ConveRTFeaturizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL + ) + component_config = {"name": "ConveRTFeaturizer", "model_url": RESTRICTED_ACCESS_URL} + featurizer = ConveRTFeaturizer(component_config) + message = Message.build(text=text) + td = TrainingData([message]) + tokenizer.train(td) + tokens = featurizer.tokenize(message, attribute=TEXT) + + assert [t.text for t in tokens] == expected_tokens + assert [t.start for t in tokens] == [i[0] for i in expected_indices] + assert [t.end for t in tokens] == [i[1] for i in expected_indices] + + +@pytest.mark.skip_on_windows +@pytest.mark.parametrize( + "text, expected_number_of_sub_tokens", + [("Aarhus is a city", [2, 1, 1, 1]), ("sentence embeddings", [1, 3])], +) +def test_convert_featurizer_number_of_sub_tokens( + text: Text, expected_number_of_sub_tokens: List[int], monkeypatch: MonkeyPatch +): + tokenizer = WhitespaceTokenizer() + + monkeypatch.setattr( + ConveRTFeaturizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL + ) + component_config = {"name": "ConveRTFeaturizer", "model_url": RESTRICTED_ACCESS_URL} + featurizer = ConveRTFeaturizer(component_config) + + message = Message.build(text=text) + td = TrainingData([message]) + tokenizer.train(td) + + tokens = featurizer.tokenize(message, attribute=TEXT) + + assert [ + t.get(NUMBER_OF_SUB_TOKENS) for t in tokens + ] == expected_number_of_sub_tokens + + +@pytest.mark.skip_on_windows +@pytest.mark.parametrize( + "model_url, exception_phrase", + [ + (ORIGINAL_TF_HUB_MODULE_URL, "which does not contain the model any longer"), + ( + RESTRICTED_ACCESS_URL, + "which is strictly reserved for pytests of Rasa Open Source only", + ), + (None, """"model_url" was not specified in the configuration"""), + ("", """"model_url" was not specified in the configuration"""), + ], +) +def test_raise_invalid_urls(model_url: Optional[Text], exception_phrase: Text): + + component_config = {"name": "ConveRTFeaturizer", "model_url": model_url} + with pytest.raises(RasaException) as excinfo: + _ = ConveRTFeaturizer(component_config) + + assert exception_phrase in str(excinfo.value) + + +@pytest.mark.skip_on_windows +def test_raise_wrong_model_directory(tmp_path: Path): + + component_config = {"name": "ConveRTFeaturizer", "model_url": str(tmp_path)} + + with pytest.raises(RasaException) as excinfo: + _ = ConveRTFeaturizer(component_config) + + assert "Re-check the files inside the directory" in str(excinfo.value) + + +@pytest.mark.skip_on_windows +def test_raise_wrong_model_file(tmp_path: Path): + + # create a dummy file + temp_file = os.path.join(tmp_path, "saved_model.pb") + f = open(temp_file, "wb") + f.close() + component_config = {"name": "ConveRTFeaturizer", "model_url": temp_file} + + with pytest.raises(RasaException) as excinfo: + _ = ConveRTFeaturizer(component_config) + + assert "set to the path of a file which is invalid" in str(excinfo.value) + + +@pytest.mark.skip_on_windows +def test_raise_invalid_path(): + + component_config = {"name": "ConveRTFeaturizer", "model_url": "saved_model.pb"} + + with pytest.raises(RasaException) as excinfo: + _ = ConveRTFeaturizer(component_config) + + 
assert "neither a valid remote URL nor a local directory" in str(excinfo.value) diff --git a/tests/nlu/featurizers/test_lm_featurizer.py b/tests/nlu/featurizers/test_lm_featurizer.py index bb87f8f90a79..4acdc78c8de4 100644 --- a/tests/nlu/featurizers/test_lm_featurizer.py +++ b/tests/nlu/featurizers/test_lm_featurizer.py @@ -1,6 +1,20 @@ +from typing import Text, List + import numpy as np import pytest +import logging + +from _pytest.logging import LogCaptureFixture +from rasa.nlu.constants import ( + TOKENS_NAMES, + NUMBER_OF_SUB_TOKENS, + SEQUENCE_FEATURES, + SENTENCE_FEATURES, + LANGUAGE_MODEL_DOCS, +) +from rasa.nlu.tokenizers.lm_tokenizer import LanguageModelTokenizer +from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer from rasa.shared.nlu.training_data.training_data import TrainingData from rasa.shared.nlu.training_data.message import Message from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer @@ -173,17 +187,15 @@ def test_lm_featurizer_shape_values( model_name, texts, expected_shape, expected_sequence_vec, expected_cls_vec ): - transformers_config = {"model_name": model_name} + config = {"model_name": model_name} - transformers_nlp = HFTransformersNLP(transformers_config) - lm_featurizer = LanguageModelFeaturizer() + lm_featurizer = LanguageModelFeaturizer(config) messages = [] for text in texts: messages.append(Message.build(text=text)) td = TrainingData(messages) - transformers_nlp.train(td) lm_featurizer.train(td) for index in range(len(texts)): @@ -223,3 +235,531 @@ def test_lm_featurizer_shape_values( assert intent_sequence_vec is None assert intent_sentence_vec is None + + +@pytest.mark.parametrize( + "input_sequence_length, model_name, should_overflow", + [(20, "bert", False), (1000, "bert", True), (1000, "xlnet", False)], +) +def test_sequence_length_overflow_train( + input_sequence_length: int, model_name: Text, should_overflow: bool +): + component = LanguageModelFeaturizer( + {"model_name": model_name}, skip_model_load=True + ) + message = Message.build(text=" ".join(["hi"] * input_sequence_length)) + if should_overflow: + with pytest.raises(RuntimeError): + component._validate_sequence_lengths( + [input_sequence_length], [message], "text", inference_mode=False + ) + else: + component._validate_sequence_lengths( + [input_sequence_length], [message], "text", inference_mode=False + ) + + +@pytest.mark.parametrize( + "sequence_embeddings, actual_sequence_lengths, model_name, padding_needed", + [ + (np.ones((1, 512, 5)), [1000], "bert", True), + (np.ones((1, 512, 5)), [1000], "xlnet", False), + (np.ones((1, 256, 5)), [256], "bert", False), + ], +) +def test_long_sequences_extra_padding( + sequence_embeddings: np.ndarray, + actual_sequence_lengths: List[int], + model_name: Text, + padding_needed: bool, +): + component = LanguageModelFeaturizer( + {"model_name": model_name}, skip_model_load=True + ) + modified_sequence_embeddings = component._add_extra_padding( + sequence_embeddings, actual_sequence_lengths + ) + if not padding_needed: + assert np.all(modified_sequence_embeddings) == np.all(sequence_embeddings) + else: + assert modified_sequence_embeddings.shape[1] == actual_sequence_lengths[0] + assert ( + modified_sequence_embeddings[0].shape[-1] + == sequence_embeddings[0].shape[-1] + ) + zero_embeddings = modified_sequence_embeddings[0][ + sequence_embeddings.shape[1] : + ] + assert np.all(zero_embeddings == 0) + + +@pytest.mark.parametrize( + "token_ids, max_sequence_length_model, resulting_length, padding_added", 
+ [ + ([[1] * 200], 512, 512, True), + ([[1] * 700], 512, 512, False), + ([[1] * 200], 200, 200, False), + ], +) +def test_input_padding( + token_ids: List[List[int]], + max_sequence_length_model: int, + resulting_length: int, + padding_added: bool, +): + component = LanguageModelFeaturizer({"model_name": "bert"}, skip_model_load=True) + component.pad_token_id = 0 + padded_input = component._add_padding_to_batch(token_ids, max_sequence_length_model) + assert len(padded_input[0]) == resulting_length + if padding_added: + original_length = len(token_ids[0]) + assert np.all(np.array(padded_input[0][original_length:]) == 0) + + +@pytest.mark.parametrize( + "sequence_length, model_name, model_weights, should_overflow", + [ + (1000, "bert", "bert-base-uncased", True), + (256, "bert", "bert-base-uncased", False), + ], +) +@pytest.mark.skip_on_windows +def test_log_longer_sequence( + sequence_length: int, + model_name: Text, + model_weights: Text, + should_overflow: bool, + caplog, +): + config = {"model_name": model_name, "model_weights": model_weights} + + featurizer = LanguageModelFeaturizer(config) + + text = " ".join(["hi"] * sequence_length) + tokenizer = WhitespaceTokenizer() + message = Message.build(text=text) + td = TrainingData([message]) + tokenizer.train(td) + caplog.set_level(logging.DEBUG) + featurizer.process(message) + if should_overflow: + assert "hi hi hi" in caplog.text + assert len(message.features) >= 2 + + +@pytest.mark.parametrize( + "actual_sequence_length, max_input_sequence_length, zero_start_index", + [(256, 512, 256), (700, 700, 700), (700, 512, 512)], +) +def test_attention_mask( + actual_sequence_length: int, max_input_sequence_length: int, zero_start_index: int +): + component = LanguageModelFeaturizer({"model_name": "bert"}, skip_model_load=True) + + attention_mask = component._compute_attention_mask( + [actual_sequence_length], max_input_sequence_length + ) + mask_ones = attention_mask[0][:zero_start_index] + mask_zeros = attention_mask[0][zero_start_index:] + + assert np.all(mask_ones == 1) + assert np.all(mask_zeros == 0) + + +# TODO: need to fix this failing test +@pytest.mark.skip(reason="Results in random crashing of github action workers") +@pytest.mark.parametrize( + "model_name, model_weights, texts, expected_tokens, expected_indices", + [ + ( + "bert", + None, + [ + "Good evening.", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["good", "evening"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sentence", + "i", + "want", + "em", + "bed", + "ding", + "s", + "for", + ], + ], + [ + [(0, 4), (5, 12)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 20), + (21, 22), + (23, 27), + (28, 30), + (30, 33), + (33, 37), + (37, 38), + (39, 42), + ], + ], + ), + ( + "bert", + "bert-base-chinese", + [ + "晚上好", # normal & easy case + "没问题!", # `!` is a Chinese punctuation + "去东畈村", # `畈` is a OOV token for bert-base-chinese + "好的😃", # include a emoji which is common in Chinese text-based chat + ], + [ + ["晚", "上", "好"], + ["没", "问", "题", "!"], + ["去", "东", "畈", "村"], + ["好", "的", "😃"], + ], + [ + [(0, 1), (1, 2), (2, 3)], + [(0, 1), (1, 2), (2, 3), (3, 4)], + [(0, 1), (1, 2), (2, 3), (3, 4)], + [(0, 1), (1, 2), (2, 3)], + ], + ), + ( + "gpt", + None, + [ + "Good evening.", + "hello", + "you're", + "r. n. 
b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["good", "evening"], + ["hello"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + ["here", "is", "the", "sentence", "i", "want", "embe", "ddings", "for"], + ], + [ + [(0, 4), (5, 12)], + [(0, 5)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 20), + (21, 22), + (23, 27), + (28, 32), + (32, 38), + (39, 42), + ], + ], + ), + ( + "gpt2", + None, + [ + "Good evening.", + "hello", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["Good", "even", "ing"], + ["hello"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sent", + "ence", + "I", + "want", + "embed", + "d", + "ings", + "for", + ], + ], + [ + [(0, 4), (5, 9), (9, 12)], + [(0, 5)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 16), + (16, 20), + (21, 22), + (23, 27), + (28, 33), + (33, 34), + (34, 38), + (39, 42), + ], + ], + ), + ( + "xlnet", + None, + [ + "Good evening.", + "hello", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["Good", "evening"], + ["hello"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sentence", + "I", + "want", + "embed", + "ding", + "s", + "for", + ], + ], + [4, 3, 4, 5, 5, 12], + ), + ( + "distilbert", + None, + [ + "Good evening.", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["good", "evening"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sentence", + "i", + "want", + "em", + "bed", + "ding", + "s", + "for", + ], + ], + [ + [(0, 4), (5, 12)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 20), + (21, 22), + (23, 27), + (28, 30), + (30, 33), + (33, 37), + (37, 38), + (39, 42), + ], + ], + ), + ( + "roberta", + None, + [ + "Good evening.", + "hello", + "you're", + "r. n. 
b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["Good", "even", "ing"], + ["hello"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sent", + "ence", + "I", + "want", + "embed", + "d", + "ings", + "for", + ], + ], + [ + [(0, 4), (5, 9), (9, 12)], + [(0, 5)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 16), + (16, 20), + (21, 22), + (23, 27), + (28, 33), + (33, 34), + (34, 38), + (39, 42), + ], + ], + ), + ], +) +@pytest.mark.skip_on_windows +def test_lm_featurizer_edge_cases( + model_name, model_weights, texts, expected_tokens, expected_indices +): + + if model_weights is None: + model_weights_config = {} + else: + model_weights_config = {"model_weights": model_weights} + transformers_config = {**{"model_name": model_name}, **model_weights_config} + + lm_featurizer = LanguageModelFeaturizer(transformers_config) + whitespace_tokenizer = WhitespaceTokenizer() + + for text, gt_tokens, gt_indices in zip(texts, expected_tokens, expected_indices): + + message = Message.build(text=text) + tokens = whitespace_tokenizer.tokenize(message, TEXT) + message.set(TOKENS_NAMES[TEXT], tokens) + lm_featurizer.process(message) + + assert [t.text for t in tokens] == gt_tokens + assert [t.start for t in tokens] == [i[0] for i in gt_indices] + assert [t.end for t in tokens] == [i[1] for i in gt_indices] + + +@pytest.mark.parametrize( + "text, expected_number_of_sub_tokens", + [("sentence embeddings", [1, 4]), ("this is a test", [1, 1, 1, 1])], +) +def test_lm_featurizer_number_of_sub_tokens(text, expected_number_of_sub_tokens): + config = { + "model_name": "bert", + "model_weights": "bert-base-uncased", + } # Test for one should be enough + + lm_featurizer = LanguageModelFeaturizer(config) + whitespace_tokenizer = WhitespaceTokenizer() + + message = Message.build(text=text) + + td = TrainingData([message]) + whitespace_tokenizer.train(td) + lm_featurizer.train(td) + + assert [ + t.get(NUMBER_OF_SUB_TOKENS) for t in message.get(TOKENS_NAMES[TEXT]) + ] == expected_number_of_sub_tokens + + +@pytest.mark.parametrize("text", [("hi there")]) +def test_log_deprecation_warning_with_old_config(text: str, caplog: LogCaptureFixture): + message = Message.build(text) + + transformers_nlp = HFTransformersNLP( + {"model_name": "bert", "model_weights": "bert-base-uncased"} + ) + transformers_nlp.process(message) + + caplog.set_level(logging.DEBUG) + lm_tokenizer = LanguageModelTokenizer() + lm_tokenizer.process(message) + lm_featurizer = LanguageModelFeaturizer(skip_model_load=True) + caplog.clear() + with caplog.at_level(logging.DEBUG): + lm_featurizer.process(message) + + assert "deprecated component HFTransformersNLP" in caplog.text + + +@pytest.mark.skip(reason="Results in random crashing of github action workers") +def test_preserve_sentence_and_sequence_features_old_config(): + attribute = "text" + message = Message.build("hi there") + + transformers_nlp = HFTransformersNLP( + {"model_name": "bert", "model_weights": "bert-base-uncased"} + ) + transformers_nlp.process(message) + lm_tokenizer = LanguageModelTokenizer() + lm_tokenizer.process(message) + + lm_featurizer = LanguageModelFeaturizer({"model_name": "gpt2"}) + lm_featurizer.process(message) + + message.set(LANGUAGE_MODEL_DOCS[attribute], None) + lm_docs = lm_featurizer._get_docs_for_batch( + [message], attribute=attribute, inference_mode=True + )[0] + hf_docs = transformers_nlp._get_docs_for_batch( + 
[message], attribute=attribute, inference_mode=True + )[0] + assert not (message.features[0].features == lm_docs[SEQUENCE_FEATURES]).any() + assert not (message.features[1].features == lm_docs[SENTENCE_FEATURES]).any() + assert (message.features[0].features == hf_docs[SEQUENCE_FEATURES]).all() + assert (message.features[1].features == hf_docs[SENTENCE_FEATURES]).all() diff --git a/tests/nlu/test_config.py b/tests/nlu/test_config.py index 0b052c9a6286..d682d5d490a1 100644 --- a/tests/nlu/test_config.py +++ b/tests/nlu/test_config.py @@ -54,7 +54,7 @@ def test_invalid_many_tokenizers_in_config(): { "pipeline": [ {"name": "WhitespaceTokenizer"}, - {"name": "LanguageModelFeaturizer"}, + {"name": "MitieIntentClassifier"}, ] } ), diff --git a/tests/nlu/test_train.py b/tests/nlu/test_train.py index 12f0520d1200..459a93933950 100644 --- a/tests/nlu/test_train.py +++ b/tests/nlu/test_train.py @@ -7,12 +7,16 @@ from rasa.shared.nlu.training_data.training_data import TrainingData from rasa.utils.tensorflow.constants import EPOCHS from tests.nlu.conftest import DEFAULT_DATA_PATH -from typing import Any, Dict, List, Tuple, Text, Union, Optional +from typing import Any, Dict, List, Tuple, Text, Union COMPONENTS_TEST_PARAMS = { "DIETClassifier": {EPOCHS: 1}, "ResponseSelector": {EPOCHS: 1}, "HFTransformersNLP": {"model_name": "bert", "model_weights": "bert-base-uncased"}, + "LanguageModelFeaturizer": { + "model_name": "bert", + "model_weights": "bert-base-uncased", + }, } @@ -112,8 +116,8 @@ def pipelines_for_non_windows_tests() -> List[Tuple[Text, List[Dict[Text, Any]]] def test_all_components_are_in_at_least_one_test_pipeline(): """There is a template that includes all components to test the train-persist-load-use cycle. Ensures that - really all components are in there.""" - + really all components are in there. + """ all_pipelines = pipelines_for_tests() + pipelines_for_non_windows_tests() all_components = [c["name"] for _, p in all_pipelines for c in p] diff --git a/tests/nlu/tokenizers/test_convert_tokenizer.py b/tests/nlu/tokenizers/test_convert_tokenizer.py deleted file mode 100644 index ca2770cae6b9..000000000000 --- a/tests/nlu/tokenizers/test_convert_tokenizer.py +++ /dev/null @@ -1,169 +0,0 @@ -import pytest -from typing import Text, List, Tuple, Optional -from pathlib import Path -import os -from _pytest.monkeypatch import MonkeyPatch - -from rasa.shared.nlu.training_data.training_data import TrainingData -from rasa.shared.nlu.training_data.message import Message -from rasa.nlu.constants import TOKENS_NAMES, NUMBER_OF_SUB_TOKENS -from rasa.shared.nlu.constants import TEXT, INTENT -from rasa.nlu.tokenizers.convert_tokenizer import ( - ConveRTTokenizer, - RESTRICTED_ACCESS_URL, - ORIGINAL_TF_HUB_MODULE_URL, -) -from rasa.exceptions import RasaException - - -@pytest.mark.skip_on_windows -@pytest.mark.parametrize( - "text, expected_tokens, expected_indices", - [ - ( - "forecast for lunch", - ["forecast", "for", "lunch"], - [(0, 8), (9, 12), (13, 18)], - ), - ("hello", ["hello"], [(0, 5)]), - ("you're", ["you", "re"], [(0, 3), (4, 6)]), - ("r. n. 
b.", ["r", "n", "b"], [(0, 1), (3, 4), (6, 7)]), - ("rock & roll", ["rock", "&", "roll"], [(0, 4), (5, 6), (7, 11)]), - ("ńöñàśçií", ["ńöñàśçií"], [(0, 8)]), - ], -) -def test_convert_tokenizer_edge_cases( - text: Text, - expected_tokens: List[Text], - expected_indices: List[Tuple[int]], - monkeypatch: MonkeyPatch, -): - - monkeypatch.setattr( - ConveRTTokenizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL - ) - - component_config = {"name": "ConveRTTokenizer", "model_url": RESTRICTED_ACCESS_URL} - tokenizer = ConveRTTokenizer(component_config) - - tokens = tokenizer.tokenize(Message(data={TEXT: text}), attribute=TEXT) - - assert [t.text for t in tokens] == expected_tokens - assert [t.start for t in tokens] == [i[0] for i in expected_indices] - assert [t.end for t in tokens] == [i[1] for i in expected_indices] - - -@pytest.mark.skip_on_windows -@pytest.mark.parametrize( - "text, expected_tokens", - [ - ("Forecast_for_LUNCH", ["Forecast_for_LUNCH"]), - ("Forecast for LUNCH", ["Forecast for LUNCH"]), - ], -) -def test_custom_intent_symbol( - text: Text, expected_tokens: List[Text], monkeypatch: MonkeyPatch -): - - monkeypatch.setattr( - ConveRTTokenizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL - ) - - component_config = { - "name": "ConveRTTokenizer", - "model_url": RESTRICTED_ACCESS_URL, - "intent_tokenization": True, - "intent_split_symbol": "+", - } - - tokenizer = ConveRTTokenizer(component_config) - - message = Message(data={TEXT: text}) - message.set(INTENT, text) - - tokenizer.train(TrainingData([message])) - - assert [t.text for t in message.get(TOKENS_NAMES[INTENT])] == expected_tokens - - -@pytest.mark.skip_on_windows -@pytest.mark.parametrize( - "text, expected_number_of_sub_tokens", - [("Aarhus is a city", [2, 1, 1, 1]), ("sentence embeddings", [1, 3])], -) -def test_convert_tokenizer_number_of_sub_tokens( - text: Text, expected_number_of_sub_tokens: List[int], monkeypatch: MonkeyPatch -): - monkeypatch.setattr( - ConveRTTokenizer, "_get_validated_model_url", lambda x: RESTRICTED_ACCESS_URL - ) - component_config = {"name": "ConveRTTokenizer", "model_url": RESTRICTED_ACCESS_URL} - tokenizer = ConveRTTokenizer(component_config) - - message = Message(data={TEXT: text}) - message.set(INTENT, text) - - tokenizer.train(TrainingData([message])) - - assert [ - t.get(NUMBER_OF_SUB_TOKENS) for t in message.get(TOKENS_NAMES[TEXT]) - ] == expected_number_of_sub_tokens - - -@pytest.mark.skip_on_windows -@pytest.mark.parametrize( - "model_url, exception_phrase", - [ - (ORIGINAL_TF_HUB_MODULE_URL, "which does not contain the model any longer"), - ( - RESTRICTED_ACCESS_URL, - "which is strictly reserved for pytests of Rasa Open Source only", - ), - (None, """"model_url" was not specified in the configuration"""), - ("", """"model_url" was not specified in the configuration"""), - ], -) -def test_raise_invalid_urls(model_url: Optional[Text], exception_phrase: Text): - - component_config = {"name": "ConveRTTokenizer", "model_url": model_url} - with pytest.raises(RasaException) as excinfo: - _ = ConveRTTokenizer(component_config) - - assert exception_phrase in str(excinfo.value) - - -@pytest.mark.skip_on_windows -def test_raise_wrong_model_directory(tmp_path: Path): - - component_config = {"name": "ConveRTTokenizer", "model_url": str(tmp_path)} - - with pytest.raises(RasaException) as excinfo: - _ = ConveRTTokenizer(component_config) - - assert "Re-check the files inside the directory" in str(excinfo.value) - - -@pytest.mark.skip_on_windows -def 
test_raise_wrong_model_file(tmp_path: Path): - - # create a dummy file - temp_file = os.path.join(tmp_path, "saved_model.pb") - f = open(temp_file, "wb") - f.close() - component_config = {"name": "ConveRTTokenizer", "model_url": temp_file} - - with pytest.raises(RasaException) as excinfo: - _ = ConveRTTokenizer(component_config) - - assert "set to the path of a file which is invalid" in str(excinfo.value) - - -@pytest.mark.skip_on_windows -def test_raise_invalid_path(): - - component_config = {"name": "ConveRTTokenizer", "model_url": "saved_model.pb"} - - with pytest.raises(RasaException) as excinfo: - _ = ConveRTTokenizer(component_config) - - assert "neither a valid remote URL nor a local directory" in str(excinfo.value) diff --git a/tests/nlu/tokenizers/test_lm_tokenizer.py b/tests/nlu/tokenizers/test_lm_tokenizer.py deleted file mode 100644 index 74ed9e87328d..000000000000 --- a/tests/nlu/tokenizers/test_lm_tokenizer.py +++ /dev/null @@ -1,430 +0,0 @@ -import pytest - -from rasa.shared.nlu.training_data.training_data import TrainingData -from rasa.shared.nlu.training_data.message import Message -from rasa.nlu.constants import ( - TOKENS_NAMES, - LANGUAGE_MODEL_DOCS, - TOKEN_IDS, - NUMBER_OF_SUB_TOKENS, -) -from rasa.shared.nlu.constants import TEXT, INTENT -from rasa.nlu.tokenizers.lm_tokenizer import LanguageModelTokenizer -from rasa.nlu.utils.hugging_face.hf_transformers import HFTransformersNLP - - -# TODO: need to fix this failing test -@pytest.mark.skip(reason="Results in random crashing of github action workers") -@pytest.mark.parametrize( - "model_name, model_weights, texts, expected_tokens, expected_indices, expected_num_token_ids", - [ - ( - "bert", - None, - [ - "Good evening.", - "you're", - "r. n. b.", - "rock & roll", - "here is the sentence I want embeddings for.", - ], - [ - ["good", "evening"], - ["you", "re"], - ["r", "n", "b"], - ["rock", "&", "roll"], - [ - "here", - "is", - "the", - "sentence", - "i", - "want", - "em", - "bed", - "ding", - "s", - "for", - ], - ], - [ - [(0, 4), (5, 12)], - [(0, 3), (4, 6)], - [(0, 1), (3, 4), (6, 7)], - [(0, 4), (5, 6), (7, 11)], - [ - (0, 4), - (5, 7), - (8, 11), - (12, 20), - (21, 22), - (23, 27), - (28, 30), - (30, 33), - (33, 37), - (37, 38), - (39, 42), - ], - ], - [4, 4, 5, 5, 13], - ), - ( - "bert", - "bert-base-chinese", - [ - "晚上好", # normal & easy case - "没问题!", # `!` is a Chinese punctuation - "去东畈村", # `畈` is a OOV token for bert-base-chinese - "好的😃", # include a emoji which is common in Chinese text-based chat - ], - [ - ["晚", "上", "好"], - ["没", "问", "题", "!"], - ["去", "东", "畈", "村"], - ["好", "的", "😃"], - ], - [ - [(0, 1), (1, 2), (2, 3)], - [(0, 1), (1, 2), (2, 3), (3, 4)], - [(0, 1), (1, 2), (2, 3), (3, 4)], - [(0, 1), (1, 2), (2, 3)], - ], - [3, 4, 4, 3], - ), - ( - "gpt", - None, - [ - "Good evening.", - "hello", - "you're", - "r. n. b.", - "rock & roll", - "here is the sentence I want embeddings for.", - ], - [ - ["good", "evening"], - ["hello"], - ["you", "re"], - ["r", "n", "b"], - ["rock", "&", "roll"], - ["here", "is", "the", "sentence", "i", "want", "embe", "ddings", "for"], - ], - [ - [(0, 4), (5, 12)], - [(0, 5)], - [(0, 3), (4, 6)], - [(0, 1), (3, 4), (6, 7)], - [(0, 4), (5, 6), (7, 11)], - [ - (0, 4), - (5, 7), - (8, 11), - (12, 20), - (21, 22), - (23, 27), - (28, 32), - (32, 38), - (39, 42), - ], - ], - [2, 1, 2, 3, 3, 9], - ), - ( - "gpt2", - None, - [ - "Good evening.", - "hello", - "you're", - "r. n. 
b.", - "rock & roll", - "here is the sentence I want embeddings for.", - ], - [ - ["Good", "even", "ing"], - ["hello"], - ["you", "re"], - ["r", "n", "b"], - ["rock", "&", "roll"], - [ - "here", - "is", - "the", - "sent", - "ence", - "I", - "want", - "embed", - "d", - "ings", - "for", - ], - ], - [ - [(0, 4), (5, 9), (9, 12)], - [(0, 5)], - [(0, 3), (4, 6)], - [(0, 1), (3, 4), (6, 7)], - [(0, 4), (5, 6), (7, 11)], - [ - (0, 4), - (5, 7), - (8, 11), - (12, 16), - (16, 20), - (21, 22), - (23, 27), - (28, 33), - (33, 34), - (34, 38), - (39, 42), - ], - ], - [3, 1, 2, 3, 3, 11], - ), - ( - "xlnet", - None, - [ - "Good evening.", - "hello", - "you're", - "r. n. b.", - "rock & roll", - "here is the sentence I want embeddings for.", - ], - [ - ["Good", "evening"], - ["hello"], - ["you", "re"], - ["r", "n", "b"], - ["rock", "&", "roll"], - [ - "here", - "is", - "the", - "sentence", - "I", - "want", - "embed", - "ding", - "s", - "for", - ], - ], - [ - [(0, 4), (5, 12)], - [(0, 5)], - [(0, 3), (4, 6)], - [(0, 1), (3, 4), (6, 7)], - [(0, 4), (5, 6), (7, 11)], - [ - (0, 4), - (5, 7), - (8, 11), - (12, 20), - (21, 22), - (23, 27), - (28, 33), - (33, 37), - (37, 38), - (39, 42), - ], - ], - [4, 3, 4, 5, 5, 12], - ), - ( - "distilbert", - None, - [ - "Good evening.", - "you're", - "r. n. b.", - "rock & roll", - "here is the sentence I want embeddings for.", - ], - [ - ["good", "evening"], - ["you", "re"], - ["r", "n", "b"], - ["rock", "&", "roll"], - [ - "here", - "is", - "the", - "sentence", - "i", - "want", - "em", - "bed", - "ding", - "s", - "for", - ], - ], - [ - [(0, 4), (5, 12)], - [(0, 3), (4, 6)], - [(0, 1), (3, 4), (6, 7)], - [(0, 4), (5, 6), (7, 11)], - [ - (0, 4), - (5, 7), - (8, 11), - (12, 20), - (21, 22), - (23, 27), - (28, 30), - (30, 33), - (33, 37), - (37, 38), - (39, 42), - ], - ], - [4, 4, 5, 5, 13], - ), - ( - "roberta", - None, - [ - "Good evening.", - "hello", - "you're", - "r. n. 
b.", - "rock & roll", - "here is the sentence I want embeddings for.", - ], - [ - ["Good", "even", "ing"], - ["hello"], - ["you", "re"], - ["r", "n", "b"], - ["rock", "&", "roll"], - [ - "here", - "is", - "the", - "sent", - "ence", - "I", - "want", - "embed", - "d", - "ings", - "for", - ], - ], - [ - [(0, 4), (5, 9), (9, 12)], - [(0, 5)], - [(0, 3), (4, 6)], - [(0, 1), (3, 4), (6, 7)], - [(0, 4), (5, 6), (7, 11)], - [ - (0, 4), - (5, 7), - (8, 11), - (12, 16), - (16, 20), - (21, 22), - (23, 27), - (28, 33), - (33, 34), - (34, 38), - (39, 42), - ], - ], - [5, 3, 4, 5, 5, 13], - ), - ], -) -@pytest.mark.skip_on_windows -def test_lm_tokenizer_edge_cases( - model_name, - model_weights, - texts, - expected_tokens, - expected_indices, - expected_num_token_ids, -): - - if model_weights is None: - model_weights_config = {} - else: - model_weights_config = {"model_weights": model_weights} - transformers_config = {**{"model_name": model_name}, **model_weights_config} - - transformers_nlp = HFTransformersNLP(transformers_config) - lm_tokenizer = LanguageModelTokenizer() - - for text, gt_tokens, gt_indices, gt_num_indices in zip( - texts, expected_tokens, expected_indices, expected_num_token_ids - ): - - message = Message.build(text=text) - transformers_nlp.process(message) - tokens = lm_tokenizer.tokenize(message, TEXT) - token_ids = message.get(LANGUAGE_MODEL_DOCS[TEXT])[TOKEN_IDS] - - assert [t.text for t in tokens] == gt_tokens - assert [t.start for t in tokens] == [i[0] for i in gt_indices] - assert [t.end for t in tokens] == [i[1] for i in gt_indices] - assert len(token_ids) == gt_num_indices - - -@pytest.mark.parametrize( - "text, expected_tokens", - [ - ("Forecast_for_LUNCH", ["Forecast_for_LUNCH"]), - ("Forecast for LUNCH", ["Forecast for LUNCH"]), - ("Forecast+for+LUNCH", ["Forecast", "for", "LUNCH"]), - ], -) -@pytest.mark.skip_on_windows -def test_lm_tokenizer_custom_intent_symbol(text, expected_tokens): - component_config = {"intent_tokenization_flag": True, "intent_split_symbol": "+"} - - transformers_config = { - "model_name": "bert", - "model_weights": "bert-base-uncased", - } # Test for one should be enough - - transformers_nlp = HFTransformersNLP(transformers_config) - lm_tokenizer = LanguageModelTokenizer(component_config) - - message = Message.build(text=text) - message.set(INTENT, text) - - td = TrainingData([message]) - - transformers_nlp.train(td) - lm_tokenizer.train(td) - - assert [t.text for t in message.get(TOKENS_NAMES[INTENT])] == expected_tokens - - -@pytest.mark.parametrize( - "text, expected_number_of_sub_tokens", - [("sentence embeddings", [1, 4]), ("this is a test", [1, 1, 1, 1])], -) -@pytest.mark.skip_on_windows -def test_lm_tokenizer_number_of_sub_tokens(text, expected_number_of_sub_tokens): - transformers_config = { - "model_name": "bert", - "model_weights": "bert-base-uncased", - } # Test for one should be enough - - transformers_nlp = HFTransformersNLP(transformers_config) - lm_tokenizer = LanguageModelTokenizer() - - message = Message.build(text=text) - - td = TrainingData([message]) - - transformers_nlp.train(td) - lm_tokenizer.train(td) - - assert [ - t.get(NUMBER_OF_SUB_TOKENS) for t in message.get(TOKENS_NAMES[TEXT]) - ] == expected_number_of_sub_tokens diff --git a/tests/nlu/utils/test_hf_transformers.py b/tests/nlu/utils/test_hf_transformers.py index 82949054f8f2..89362c822ca3 100644 --- a/tests/nlu/utils/test_hf_transformers.py +++ b/tests/nlu/utils/test_hf_transformers.py @@ -5,6 +5,9 @@ from rasa.nlu.utils.hugging_face.hf_transformers import 
HFTransformersNLP from rasa.shared.nlu.training_data.message import Message +from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer +from rasa.nlu.constants import TOKENS_NAMES +from rasa.shared.nlu.constants import TEXT @pytest.mark.parametrize( @@ -14,7 +17,6 @@ def test_sequence_length_overflow_train( input_sequence_length: int, model_name: Text, should_overflow: bool ): - component = HFTransformersNLP({"model_name": model_name}, skip_model_load=True) message = Message.build(text=" ".join(["hi"] * input_sequence_length)) if should_overflow: @@ -42,7 +44,6 @@ def test_long_sequences_extra_padding( model_name: Text, padding_needed: bool, ): - component = HFTransformersNLP({"model_name": model_name}, skip_model_load=True) modified_sequence_embeddings = component._add_extra_padding( sequence_embeddings, actual_sequence_lengths @@ -91,7 +92,6 @@ def test_input_padding( "sequence_length, model_name, should_overflow", [(1000, "bert", True), (256, "bert", False)], ) -@pytest.mark.skip_on_windows def test_log_longer_sequence( sequence_length: int, model_name: Text, should_overflow: bool, caplog ): @@ -132,3 +132,330 @@ def test_attention_mask( assert np.all(mask_ones == 1) assert np.all(mask_zeros == 0) + + +# TODO: need to fix this failing test +@pytest.mark.skip(reason="Results in random crashing of github action workers") +@pytest.mark.parametrize( + "model_name, model_weights, texts, expected_tokens, expected_indices", + [ + ( + "bert", + None, + [ + "Good evening.", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["good", "evening"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sentence", + "i", + "want", + "em", + "bed", + "ding", + "s", + "for", + ], + ], + [ + [(0, 4), (5, 12)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 20), + (21, 22), + (23, 27), + (28, 30), + (30, 33), + (33, 37), + (37, 38), + (39, 42), + ], + ], + ), + ( + "bert", + "bert-base-chinese", + [ + "晚上好", # normal & easy case + "没问题!", # `!` is a Chinese punctuation + "去东畈村", # `畈` is a OOV token for bert-base-chinese + "好的😃", # include a emoji which is common in Chinese text-based chat + ], + [ + ["晚", "上", "好"], + ["没", "问", "题", "!"], + ["去", "东", "畈", "村"], + ["好", "的", "😃"], + ], + [ + [(0, 1), (1, 2), (2, 3)], + [(0, 1), (1, 2), (2, 3), (3, 4)], + [(0, 1), (1, 2), (2, 3), (3, 4)], + [(0, 1), (1, 2), (2, 3)], + ], + ), + ( + "gpt", + None, + [ + "Good evening.", + "hello", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["good", "evening"], + ["hello"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + ["here", "is", "the", "sentence", "i", "want", "embe", "ddings", "for"], + ], + [ + [(0, 4), (5, 12)], + [(0, 5)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 20), + (21, 22), + (23, 27), + (28, 32), + (32, 38), + (39, 42), + ], + ], + ), + ( + "gpt2", + None, + [ + "Good evening.", + "hello", + "you're", + "r. n. 
b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["Good", "even", "ing"], + ["hello"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sent", + "ence", + "I", + "want", + "embed", + "d", + "ings", + "for", + ], + ], + [ + [(0, 4), (5, 9), (9, 12)], + [(0, 5)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 16), + (16, 20), + (21, 22), + (23, 27), + (28, 33), + (33, 34), + (34, 38), + (39, 42), + ], + ], + ), + ( + "xlnet", + None, + [ + "Good evening.", + "hello", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["Good", "evening"], + ["hello"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sentence", + "I", + "want", + "embed", + "ding", + "s", + "for", + ], + ], + [4, 3, 4, 5, 5, 12], + ), + ( + "distilbert", + None, + [ + "Good evening.", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["good", "evening"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sentence", + "i", + "want", + "em", + "bed", + "ding", + "s", + "for", + ], + ], + [ + [(0, 4), (5, 12)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 20), + (21, 22), + (23, 27), + (28, 30), + (30, 33), + (33, 37), + (37, 38), + (39, 42), + ], + ], + ), + ( + "roberta", + None, + [ + "Good evening.", + "hello", + "you're", + "r. n. b.", + "rock & roll", + "here is the sentence I want embeddings for.", + ], + [ + ["Good", "even", "ing"], + ["hello"], + ["you", "re"], + ["r", "n", "b"], + ["rock", "&", "roll"], + [ + "here", + "is", + "the", + "sent", + "ence", + "I", + "want", + "embed", + "d", + "ings", + "for", + ], + ], + [ + [(0, 4), (5, 9), (9, 12)], + [(0, 5)], + [(0, 3), (4, 6)], + [(0, 1), (3, 4), (6, 7)], + [(0, 4), (5, 6), (7, 11)], + [ + (0, 4), + (5, 7), + (8, 11), + (12, 16), + (16, 20), + (21, 22), + (23, 27), + (28, 33), + (33, 34), + (34, 38), + (39, 42), + ], + ], + ), + ], +) +@pytest.mark.skip_on_windows +def test_hf_transformer_edge_cases( + model_name, model_weights, texts, expected_tokens, expected_indices +): + + if model_weights is None: + model_weights_config = {} + else: + model_weights_config = {"model_weights": model_weights} + transformers_config = {**{"model_name": model_name}, **model_weights_config} + + hf_transformer = HFTransformersNLP(transformers_config) + whitespace_tokenizer = WhitespaceTokenizer() + + for text, gt_tokens, gt_indices in zip(texts, expected_tokens, expected_indices): + + message = Message.build(text=text) + tokens = whitespace_tokenizer.tokenize(message, TEXT) + message.set(TOKENS_NAMES[TEXT], tokens) + hf_transformer.process(message) + + assert [t.text for t in tokens] == gt_tokens + assert [t.start for t in tokens] == [i[0] for i in gt_indices] + assert [t.end for t in tokens] == [i[1] for i in gt_indices]