
Architecture prototype: Explore partial dataset loading for NLU graph #8407

Closed · 4 tasks done

wochinge opened this issue Apr 9, 2021 · 5 comments
Labels: area:rasa-oss 🎡 Anything related to the open source Rasa framework · type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR · feature:rasa-3.0/architecture-prototype


wochinge commented Apr 9, 2021

Description of Problem:
Research already did a PR for training the NLU model in chunks, and any new model architecture will have to support this use case in the future to enable training large-scale models on huge datasets. This issue is limited to the NLU part of partial training data set loading; partial loading/processing of the rest of the training data doesn't need to be explored here.

Overview of the Solution:
There is no need for an implementation here. However, we need a clear picture of what an implementation with the proposed graph architecture could look like.

Blockers (if relevant):
None

Definition of Done:

  • Specify in a document how much control the Components need over the chunked training data (e.g. whether the components need intermediate write operations)
  • Specify the requirements Components must meet to be able to train on chunks (e.g. what kind of summary data they need beforehand; these things are currently done in prepare_partial_training)
  • Have an answer whether it's possible to abstract the current training (chunks == 1) as a subset of training with multiple chunks
  • Propose an interface which abstracts the passed training data (it shouldn't matter to the Component whether the data is on disk or in memory) and which integrates into the graph structure; a sketch follows below
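
To make the last point concrete, here is a minimal sketch of what such an interface could look like. All names (`TrainingDataStream`, `InMemoryStream`, `OnDiskChunkStream`) are hypothetical, messages are represented as plain dicts to keep the example self-contained, and the chunk loader is injected rather than assumed to exist in the code base:

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, Iterator, List


class TrainingDataStream(ABC):
    """Uniform, re-iterable view over training messages, wherever they live."""

    @abstractmethod
    def messages(self) -> Iterator[dict]:
        """Yield messages one at a time; callers may iterate more than once."""


class InMemoryStream(TrainingDataStream):
    def __init__(self, messages: List[dict]) -> None:
        self._messages = messages

    def messages(self) -> Iterator[dict]:
        return iter(self._messages)


class OnDiskChunkStream(TrainingDataStream):
    def __init__(
        self,
        chunk_paths: List[str],
        load_chunk: Callable[[str], Iterable[dict]],
    ) -> None:
        self._chunk_paths = chunk_paths
        self._load_chunk = load_chunk  # injected loader keeps the stream storage-agnostic

    def messages(self) -> Iterator[dict]:
        for path in self._chunk_paths:
            yield from self._load_chunk(path)
```

A Component would only ever call messages() and would not know (or care) whether iteration is backed by memory or by chunk files on disk.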

twerkmeister commented Apr 12, 2021

What currently happens in prepare_partial_training

  • general case: nothing
  • RegexEntityExtractor
    • collects patterns from TrainingData object
      • requires: lookup tables and defined regexes (also knowledge of which regex types are also entity annotations, although this might be solved in other ways)
  • DIETClassifier
    • sorts intents by name, and creates mapping {intent: id} and {id: intent} if intent classification is enabled
      • requires: all intents
    • creates mappings {entity-type: id}, {entity-role: id}, {entity-group: id}; if BILOU tagging is enabled, expands the mappings for each BILOU prefix
      • requires: all entity types, -roles, and -groups
    • checks all entity annotations of all messages and logs warnings
      • requires: all messages tokenized and entity annotations
  • EntitySynonymMapper
    • collects synonyms from TrainingData object
      • requires: synonyms
  • CountVectorsFeaturizer
    • trains the component and creates a vocabulary plus some buffer for fine-tuning
      • requires: message text, response text, action text (+ intent, action name, intent response key if trained on word level)
    • Actually the situation seems to be the following:
      • prepare_partial_training is the actual training/creation of the vectorizers
      • _train_on_examples is the processing of the messages, adding the vectorizer output as additional info to the messages
    • I don't think this component really needs a preparation step, hm. It definitely needs access to all messages, yes. But so does every component in the real training.
  • LexicalSyntacticFeaturizer
    • trains the component and collects all features
      • requires: tokenized message texts
    • similar situation as the CountVectorsFeaturizer: prepare_partial_training is the actual training, and _train_on_examples processes messages and adds the new features to the messages
  • RegexFeaturizer
    • collects all regex and lookup patterns
    • similar situation as the CountVectorsFeaturizer: prepare_partial_training is the actual training, and _train_on_examples processes messages and adds the new features to the messages
  • ResponseSelector
    • like DIETClassifier: sorts intents by name, and creates mapping {intent: id} and {id: intent}
      • requires: all intents
    • saves all retrieval intents
      • requires: all retrieval intents
    • saves all responses
      • requires: the response dict

Overall it seems the components do not collect highly specialized data in prepare_partial_training. One class of components collects mostly simple summaries (all intents, all entity types, all entity roles, etc.) which could be collected once at a more centralized level and made available for later consumption by components. The other class collects some statistics based on all message texts. However, it is not clear why this necessarily has to happen as a preparatory step: consuming a message stream twice would work. This is basically what the current implementation on main is doing: running over the List[Message] twice.
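
To illustrate the two-pass idea, here's a minimal sketch assuming plain-dict messages and a re-iterable message source; collect_summaries, component.prepare, and component.train_on_example are illustrative stand-ins, not existing Rasa APIs:

```python
from typing import Callable, Dict, Iterable, Set


def collect_summaries(messages: Iterable[dict]) -> Dict[str, Set[str]]:
    """First pass: centrally gather the simple summaries components need up front."""
    summaries: Dict[str, Set[str]] = {
        "intents": set(),
        "entity_types": set(),
        "entity_roles": set(),
        "entity_groups": set(),
    }
    for message in messages:
        if message.get("intent"):
            summaries["intents"].add(message["intent"])
        for entity in message.get("entities", []):
            summaries["entity_types"].add(entity["entity"])
            if "role" in entity:
                summaries["entity_roles"].add(entity["role"])
            if "group" in entity:
                summaries["entity_groups"].add(entity["group"])
    return summaries


def train_component(make_messages: Callable[[], Iterable[dict]], component) -> None:
    """Two passes over the same stream: summaries first, then actual training."""
    component.prepare(collect_summaries(make_messages()))  # replaces prepare_partial_training
    for message in make_messages():  # second pass over the stream
        component.train_on_example(message)
```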


twerkmeister commented Apr 12, 2021

How much control do the components need over chunked data

The chunked data is currently balanced once, featurized, and persisted to disk up front. So far only two components have an implementation for train_on_chunks, and they do not alter the chunks:

  • DIETClassifier and, with that, its subclass ResponseSelector
    • Currently does nothing to the chunked data; it loads some examples to initialize the model and has the RasaDataChunkFileGenerator batch the data for it.
    • RasaDataChunkFileGenerator shuffles the order of chunks internally on epoch end (not the messages inside the chunks though), but never rewrites the chunks
  • EntitySynonymMapper
    • reads the chunks and processes all messages, but no data is written back to disk, so that processing would seemingly be lost for later steps :/

I guess the EntitySynonymMapper and other extractors like the RegexEntityExtractor, whose results aren't immediately available from the training data, would have to write their results back somewhere, but not really to the data they are processing themselves: they are augmenting it further for other components down the line (see the sketch below).
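
A hedged sketch of that write-back/augmentation idea: instead of rewriting chunks on disk, an extractor could expose its result as a transformation that the graph applies lazily to every message flowing to downstream components. All names here are illustrative and messages are plain dicts:

```python
from typing import Callable, Iterable, Iterator, List

Augmenter = Callable[[dict], dict]


def augmented(messages: Iterable[dict], augmenters: List[Augmenter]) -> Iterator[dict]:
    """Lazily apply each extractor's augmentation to every message in the stream."""
    for message in messages:
        for augment in augmenters:
            message = augment(message)
        yield message


def synonym_augmenter(synonyms: dict) -> Augmenter:
    """Build an augmenter that maps extracted entity values to their canonical synonyms."""
    def augment(message: dict) -> dict:
        for entity in message.get("entities", []):
            entity["value"] = synonyms.get(entity["value"], entity["value"])
        return message
    return augment
```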


twerkmeister commented Apr 12, 2021

Have an answer whether it's possible to abstract the current training (chunks==1) as a subset of training with multiple chunks

I can't think of any reason why it wouldn't be possible. I just think the current design is suboptimal. By exposing the chunks to the components directly, you add unnecessary complexity to the components if you want to handle the chunks in a somewhat smart way. Probably you would end up having each component create some sort of data handler over the chunks to prevent code duplication. And then why not just hand them a data handler right away that abstracts away the details? Something like the sketch below.
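
A minimal sketch of that data-handler idea, with illustrative names and plain-dict messages; the point is that components only ever see batches(), so the current single-chunk training becomes the degenerate case:

```python
from typing import Iterator, List


class ChunkedDataHandler:
    """Components only see batches; chunking stays an internal detail."""

    def __init__(self, chunks: List[List[dict]]) -> None:
        self._chunks = chunks

    @classmethod
    def from_memory(cls, messages: List[dict]) -> "ChunkedDataHandler":
        # The current training (chunks == 1) is just this degenerate case.
        return cls([messages])

    def batches(self, batch_size: int) -> Iterator[List[dict]]:
        for chunk in self._chunks:
            for start in range(0, len(chunk), batch_size):
                yield chunk[start:start + batch_size]
```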

What I think we should do is take the good parts of the existing implementation, such as logic on how to create balanced chunks upfront, rearrange things a bit, and have a more general and cleaner solution.

Will think about this a bit more in any case...


twerkmeister commented Apr 14, 2021

Further notes (work in progress)

  • DIETClassifier, ResponseSelector, and TEDPolicy can be configured to create a train/eval split during their training procedure. This definitely needs some additional consideration: should the split happen beforehand, e.g. should there be a global split that is offered to the different components? Or should it happen during the first pass over the stream?
  • RasaModelData seems like a parallel implementation of TrainingData for the purpose of being consumed by the rasa tensorflow models
  • I think there's plenty of potential to streamline the transformation of our Python Message objects into the format used by components. It feels like, e.g., the DIETClassifier runs over the given messages many times and applies different transformations each time, such as adding BILOU tags, turning messages into feature arrays, filtering those that have intents assigned, and more. It might be possible to take a more functional approach here, defining a clear set of functions that is applied sequentially to all Message objects coming in for training (see the sketch after this list). Probably cleaner and easier to parallelize.
  • Multiprocessing is an important topic for accessing data streams. Consumption should probably be single-threaded and sequential, and the transformation work can then be spread out. The current implementation of RasaDataChunkFileGenerator, while allowing for non-sequential access to batches, would in the worst case load and process new chunks for each batch requested.
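
A sketch of the functional approach from the notes above, with illustrative names; the real transformations (adding BILOU tags, building feature arrays, filtering, ...) would slot in as the composed functions:

```python
from functools import reduce
from typing import Callable, Iterable, Iterator

Transform = Callable[[dict], dict]


def pipeline(*transforms: Transform) -> Transform:
    """Compose per-message transformations into a single pure function."""
    return lambda message: reduce(lambda m, t: t(m), transforms, message)


def preprocess(messages: Iterable[dict], transform: Transform) -> Iterator[dict]:
    # Single sequential consumer; because `transform` is pure per message,
    # the work could also be fanned out, e.g. via multiprocessing.Pool.imap.
    for message in messages:
        yield transform(message)
```

Usage would look like prepare = pipeline(add_bilou_tags, to_feature_arrays) with the actual (here hypothetical) transformations, applying everything once per message instead of one full pass per transformation.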


wochinge commented Apr 14, 2021

Thanks for the investigation!

“I guess the EntitySynonymMapper and other extractors like the RegexEntityExtractor, whose results aren't immediately available from the training data, would have to write their results back somewhere”

There are no components consuming their outputs during training anyway, are there?

Where do we do the balancing?
