
Architecture prototype: Explore partial dataset loading for NLU graph #8407

Closed · 4 tasks done

wochinge opened this issue Apr 9, 2021 · 5 comments
Labels: area:rasa-oss 🎡 Anything related to the open source Rasa framework · type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR · feature:rasa-3.0/architecture-prototype


wochinge commented Apr 9, 2021

Description of Problem:
Research already did a PR for training the NLU model in chunks, and any new model architecture will have to support this use case in the future to enable training large-scale models on huge datasets. This issue is limited to the NLU part of partial training data set loading; partial loading/processing of the rest of the training data doesn't need to be explored here.

Overview of the Solution:
There is no need for an implementation here. However, we need a clear picture of what an implementation with the proposed graph architecture could look like.

Blockers (if relevant):
None

Definition of Done:

  • Specify in a document how much control the Components need over the chunked training data (e.g. whether the components need intermediate write operations)
  • Specify the requirements Components must meet to be able to train on chunks (e.g. what kind of summary data they need beforehand; these things are currently done in prepare_partial_training)
  • Have an answer whether it's possible to abstract the current training (chunks == 1) as a subset of training with multiple chunks
  • Propose an interface which abstracts the passed training data (it shouldn't matter to the Component whether the data is on disk or in memory) and which integrates into the graph structure; a sketch follows below
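
To make the last point concrete, here is a minimal sketch of what such an interface could look like. All names (`TrainingDataStream`, `InMemoryStream`, `OnDiskChunkStream`) are hypothetical, messages are represented as plain dicts to keep the example self-contained, and the chunk loader is injected rather than assumed to exist in the code base:

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, Iterator, List


class TrainingDataStream(ABC):
    """Uniform, re-iterable view over training messages, wherever they live."""

    @abstractmethod
    def messages(self) -> Iterator[dict]:
        """Yield messages one at a time; callers may iterate more than once."""


class InMemoryStream(TrainingDataStream):
    def __init__(self, messages: List[dict]) -> None:
        self._messages = messages

    def messages(self) -> Iterator[dict]:
        return iter(self._messages)


class OnDiskChunkStream(TrainingDataStream):
    def __init__(
        self,
        chunk_paths: List[str],
        load_chunk: Callable[[str], Iterable[dict]],
    ) -> None:
        self._chunk_paths = chunk_paths
        self._load_chunk = load_chunk  # injected loader keeps the stream storage-agnostic

    def messages(self) -> Iterator[dict]:
        for path in self._chunk_paths:
            yield from self._load_chunk(path)
```

A Component would only ever call messages() and would not know (or care) whether iteration is backed by memory or by chunk files on disk.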

twerkmeister commented Apr 12, 2021

What currently happens in prepare_partial_training

  • general case: nothing
  • RegexEntityExtractor
    • collects patterns from TrainingData object
      • requires: lookup tables and defined regexes (also knowledge of which regex types are also entity annotations, although this might be solved in other ways)
  • DIETClassifier
    • sorts intents by name, and creates mapping {intent: id} and {id: intent} if intent classification is enabled
      • requires: all intents
    • creates mappings {entity-type: id}, {entity-role: id}, {entity-group: id}; if BILOU tagging is enabled, expands the mappings for each BILOU prefix
      • requires: all entity types, -roles, and -groups
    • checks all entity annotations of all messages and logs warnings
      • requires: all messages tokenized and entity annotations
  • EntitySynonymMapper
    • collects synonyms from TrainingData object
      • requires: synonyms
  • CountVectorsFeaturizer
    • trains the component and creates a vocabulary plus some buffer for fine-tuning
      • requires: message text, response text, action text (+ intent, action name, intent response key if trained on word level)
    • Actually the situation seems to be the following:
      • prepare_partial_training is the actual training/creation of the vectorizers
      • _train_on_examples is the processing of the messages, adding the vectorizer output as additional info to the messages
    • I don't think this component really needs a preparation step, hm. It definitely needs access to all messages, yes. But so does every component in the real training.
  • LexicalSyntacticFeaturizer
    • trains the component and collects all features
      • requires: tokenized message texts
    • similar situation as the CountVectorsFeaturizer: prepare_partial_training is the actual training, and _train_on_examples processes messages and adds the new features to the messages
  • RegexFeaturizer
    • collects all regex and lookup patterns
    • similar situation as the CountVectorsFeaturizer: prepare_partial_training is the actual training, and _train_on_examples processes messages and adds the new features to the messages
  • ResponseSelector
    • like DIETClassifier: sorts intents by name, and creates mapping {intent: id} and {id: intent}
      • requires: all intents
    • saves all retrieval intents
      • requires: all retrieval intents
    • saves all responses
      • requires: the response dict

Overall it seems the components do not collect highly specialized data in prepare_partial_training. One class of components collects mostly simple summaries (all intents, all entity types, all entity roles, etc.) which could be collected once at a more centralized level and made available for later consumption by components. The other class collects some statistics based on all message texts. However, it is not clear why this necessarily has to happen as a preparatory step: consuming a message stream twice would work. This is basically what the current implementation on main is doing: running over the List[Message] twice.
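
To illustrate the two-pass idea, here's a minimal sketch assuming plain-dict messages and a re-iterable message source; collect_summaries, component.prepare, and component.train_on_example are illustrative stand-ins, not existing Rasa APIs:

```python
from typing import Callable, Dict, Iterable, Set


def collect_summaries(messages: Iterable[dict]) -> Dict[str, Set[str]]:
    """First pass: centrally gather the simple summaries components need up front."""
    summaries: Dict[str, Set[str]] = {
        "intents": set(),
        "entity_types": set(),
        "entity_roles": set(),
        "entity_groups": set(),
    }
    for message in messages:
        if message.get("intent"):
            summaries["intents"].add(message["intent"])
        for entity in message.get("entities", []):
            summaries["entity_types"].add(entity["entity"])
            if "role" in entity:
                summaries["entity_roles"].add(entity["role"])
            if "group" in entity:
                summaries["entity_groups"].add(entity["group"])
    return summaries


def train_component(make_messages: Callable[[], Iterable[dict]], component) -> None:
    """Two passes over the same stream: summaries first, then actual training."""
    component.prepare(collect_summaries(make_messages()))  # replaces prepare_partial_training
    for message in make_messages():  # second pass over the stream
        component.train_on_example(message)
```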


twerkmeister commented Apr 12, 2021

How much control do the components need over chunked data

The chunked data is currently balanced once, featurized, and persisted to disk up front. So far only two components have an implementation for train_on_chunks, and they do not alter the chunks:

  • DIETClassifier and, with that, its subclass ResponseSelector
    • Currently does nothing to the chunked data; it loads some examples to initialize the model and has the RasaDataChunkFileGenerator batch the data for it.
    • RasaDataChunkFileGenerator shuffles the order of chunks internally on epoch end (not the messages inside the chunks though), but never rewrites the chunks
  • EntitySynonymMapper
    • reads the chunks and processes all messages, but no data is written back to disk, so that processing would seemingly be lost for later steps :/

I guess the EntitySynonymMapper and other extractors like the RegexEntityExtractor, whose results aren't immediately available from the training data, would have to write their results back somewhere, but not really to the data they are processing themselves: they are augmenting it further for other components down the line (see the sketch below).
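
A hedged sketch of that write-back/augmentation idea: instead of rewriting chunks on disk, an extractor could expose its result as a transformation that the graph applies lazily to every message flowing to downstream components. All names here are illustrative and messages are plain dicts:

```python
from typing import Callable, Iterable, Iterator, List

Augmenter = Callable[[dict], dict]


def augmented(messages: Iterable[dict], augmenters: List[Augmenter]) -> Iterator[dict]:
    """Lazily apply each extractor's augmentation to every message in the stream."""
    for message in messages:
        for augment in augmenters:
            message = augment(message)
        yield message


def synonym_augmenter(synonyms: dict) -> Augmenter:
    """Build an augmenter that maps extracted entity values to their canonical synonyms."""
    def augment(message: dict) -> dict:
        for entity in message.get("entities", []):
            entity["value"] = synonyms.get(entity["value"], entity["value"])
        return message
    return augment
```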


twerkmeister commented Apr 12, 2021

Have an answer whether it's possible to abstract the current training (chunks==1) as a subset of training with multiple chunks

I can't think of any reason why it wouldn't be possible. I just think the current design is suboptimal. By exposing the chunks to the components directly, you add unnecessary complexity to the components if you want to handle the chunks in a somewhat smart way. Probably you would end up having each component create some sort of data handler over the chunks to prevent code duplication. And then why not just hand them a data handler right away that abstracts away the details? Something like the sketch below.
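
A minimal sketch of that data-handler idea, with illustrative names and plain-dict messages; the point is that components only ever see batches(), so the current single-chunk training becomes the degenerate case:

```python
from typing import Iterator, List


class ChunkedDataHandler:
    """Components only see batches; chunking stays an internal detail."""

    def __init__(self, chunks: List[List[dict]]) -> None:
        self._chunks = chunks

    @classmethod
    def from_memory(cls, messages: List[dict]) -> "ChunkedDataHandler":
        # The current training (chunks == 1) is just this degenerate case.
        return cls([messages])

    def batches(self, batch_size: int) -> Iterator[List[dict]]:
        for chunk in self._chunks:
            for start in range(0, len(chunk), batch_size):
                yield chunk[start:start + batch_size]
```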

What I think we should do is take the good parts of the existing implementation, such as logic on how to create balanced chunks upfront, rearrange things a bit, and have a more general and cleaner solution.

Will think about this a bit more in any case...


twerkmeister commented Apr 14, 2021

Further notes (work in progress)

  • DIETClassifier, ResponseSelector, and TEDPolicy can be configured to create a train/eval split during their training procedure. This definitely needs some additional consideration: should the split happen beforehand, e.g. should there be a global split that is offered to the different components? Or should it happen during the first pass over the stream?
  • RasaModelData seems like a parallel implementation of TrainingData for the purpose of being consumed by the rasa tensorflow models
  • I think there's plenty of potential to streamline the transformation of our Python Message objects into the format used by components. It feels like, e.g., the DIETClassifier runs over the given messages many times and applies different transformations each time, such as adding BILOU tags, turning messages into feature arrays, filtering those that have intents assigned, and more. It might be possible to take a more functional approach here, defining a clear set of functions that is applied sequentially to all Message objects coming in for training (see the sketch after this list). Probably cleaner and easier to parallelize.
  • Multiprocessing is an important topic for accessing data streams. Consumption should probably be single-threaded and sequential, and the transformation work can then be spread out. The current implementation of RasaDataChunkFileGenerator, while allowing for non-sequential access to batches, would in the worst case load and process new chunks for each batch requested.
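
A sketch of the functional approach from the notes above, with illustrative names; the real transformations (adding BILOU tags, building feature arrays, filtering, ...) would slot in as the composed functions:

```python
from functools import reduce
from typing import Callable, Iterable, Iterator

Transform = Callable[[dict], dict]


def pipeline(*transforms: Transform) -> Transform:
    """Compose per-message transformations into a single pure function."""
    return lambda message: reduce(lambda m, t: t(m), transforms, message)


def preprocess(messages: Iterable[dict], transform: Transform) -> Iterator[dict]:
    # Single sequential consumer; because `transform` is pure per message,
    # the work could also be fanned out, e.g. via multiprocessing.Pool.imap.
    for message in messages:
        yield transform(message)
```

Usage would look like prepare = pipeline(add_bilou_tags, to_feature_arrays) with the actual (here hypothetical) transformations, applying everything once per message instead of one full pass per transformation.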


wochinge commented Apr 14, 2021

Thanks for the investigation!

“I guess the EntitySynonymMapper and other extractors like the RegexEntityExtractor, whose results aren't immediately available from the training data, would have to write their results back somewhere”

There are no components consuming their outputs during training anyway, are there?

Where do we do the balancing?
