
proof of concept for streaming messages in nlu training #8518

Closed
wants to merge 11 commits

Conversation


@twerkmeister (Contributor) commented Apr 21, 2021

**DRAFT FOR DISCUSSION PURPOSES**:

  • proof of concept for Architecture prototype: Explore partial dataset loading for NLU graph #8407
  • The implementation is pretty rough and is only meant to test some of the bigger risks in principle (such as whether DIET can be trained from a stream of messages)
  • implemented for CountVectorsFeaturizer, SpacyFeaturizer, CRFEntityExtractor, and DIETClassifier
  • the main training data object (with unfeaturized messages) stays in memory right now, as so many parts of the codebase use it that way
  • features can be built successively and stored on disk (right now this just pickles a list of features per message; ultimately a safer and more sophisticated format should be chosen)
  • messages can be joined with their features on the fly and consumed as a generator. For the matching I am using a fingerprint of the message that is based only on the following attributes: text, intent, response, action name, action text, and intent response key. This way the fingerprint should be invariant across processing steps and across runs (see the sketch after this list).
  • keras.Sequence can be used to hide the generator, but there are probably better abstractions available. Possibly some multiprocessing could also be added here to fill a batch queue in a second thread.
  • chunking directories are just based on the component name right now, and if some component requests a message stream, the latest chunking dir is used
  • tested with memory profiling on multiwoz, for details see below
  • shuffling is implemented by using smallish chunks, randomizing the chunk reading order, reading multiple chunks at once, and then shuffling the accumulated messages again
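
For illustration, a minimal sketch of what such an attribute-based fingerprint could look like; the function name and the hashing scheme are hypothetical, not the PR's actual implementation:

```python
import hashlib
from typing import Optional


def message_fingerprint(
    text: Optional[str] = None,
    intent: Optional[str] = None,
    response: Optional[str] = None,
    action_name: Optional[str] = None,
    action_text: Optional[str] = None,
    intent_response_key: Optional[str] = None,
) -> str:
    """Stable fingerprint based only on the raw message attributes.

    Because derived features are excluded, the fingerprint stays the same
    before and after featurization and across training runs, so features
    stored on disk can be joined back to the right message.
    """
    parts = [text, intent, response, action_name, action_text, intent_response_key]
    joined = "\x1f".join("" if part is None else part for part in parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```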

notes

  • I think the methods of many components could really benefit from taking a more message-centric perspective. Many of the components' methods operate on collections of messages, which makes the functions more complex, harder to understand, and harder to reuse. I was able to throw out a lot of code in the CountVectorsFeaturizer by focusing more on individual messages.
  • RasaModelData seems a bit like a parallel world to the TrainingData class. I think there are many parallel implementations of the same logic on these classes, such as splitting training data according to some labels. I am not sure this intermediate RasaModelData class is really necessary; it seems all that's needed is some logic to arrange messages in a certain way and to turn a bunch of them into tensors.

@twerkmeister twerkmeister requested a review from wochinge April 21, 2021 14:56
@twerkmeister twerkmeister marked this pull request as draft April 21, 2021 14:56
@wochinge wochinge requested a review from joejuzl April 22, 2021 13:10
self.component_config[EVAL_NUM_EXAMPLES],
self.component_config[RANDOM_SEED],
data_generator, validation_data_generator = (
MessageStreamDataGenerator(
Contributor

do you know if the Caution: part from here applies?

@twerkmeister (Contributor Author) commented Apr 22, 2021

Yeah maybe something like Dataset.from_generator would be the right way to go. Thanks for the pointer, will look into that more deeply!

Contributor

So it seems tf.data is preferred for non-Python preprocessing (i.e. all preprocessing is expressed via TensorFlow functions), while the Keras Sequence is preferred if you do a lot of Python preprocessing. In our case the Keras Sequence therefore seems preferable.

Contributor

> Yeah maybe something like Dataset.from_generator would be the right way to go

Constructing a tf.data.Dataset from Dataset.from_generator becomes cumbersome when you need control over what should happen after the end of every epoch. If you need that control, you need to write a custom train_step function instead of relying on keras.model.fit() to handle it for you.

tf.keras.utils.Sequence explicitly has a function for it. That's why we shifted from tf.data.Dataset.from_generator to tf.keras.utils.Sequence during the transition from pure TF to reusing Keras methods.

However, we recently noticed that TensorFlow has broken support for tf.RaggedTensors inside tf.keras.utils.Sequence.
Using tf.RaggedTensors will simplify a LOT of code inside our TF models. So I am all for exploring how we can use tf.data.Dataset.from_generator without having to write our own model.fit() function.
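
For context, a minimal sketch of what a tf.data.Dataset.from_generator pipeline could look like here; `stream_fn`, the output signature, and the batching parameters are placeholders, not the PR's actual code:

```python
import tensorflow as tf
from typing import Callable, Iterator, Tuple


def make_dataset(
    stream_fn: Callable[[], Iterator[Tuple[list, int]]], batch_size: int
) -> tf.data.Dataset:
    """Wrap a message-stream callable in a tf.data pipeline.

    tf.data re-invokes `stream_fn` for every iteration over the dataset,
    so per-epoch reshuffling can live inside the callable itself instead
    of requiring a custom train_step.
    """
    dataset = tf.data.Dataset.from_generator(
        stream_fn,
        output_signature=(
            tf.TensorSpec(shape=(None,), dtype=tf.float32),  # illustrative features
            tf.TensorSpec(shape=(), dtype=tf.int32),  # illustrative label
        ),
    )
    return dataset.shuffle(buffer_size=2048).padded_batch(batch_size)
```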


def on_epoch_end(self) -> None:
"""Update the data after every epoch."""
self.current_message_stream = self.training_data.stream_featurized_messages(
Contributor

how would we e.g. shuffle chunks after epoch end?

@twerkmeister (Contributor Author) commented Apr 22, 2021

The TrainingData object could tell the feature reader to read the chunks in a random/different order when creating a new stream. You then still need to ensure that the right message gets the right features, e.g. via some ID of the message that is also stored in the chunk.
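
Roughly sketched, and purely illustrative (the chunk file layout, the fingerprint-keyed dicts, and the function name are assumptions, not the PR's implementation):

```python
import pickle
import random
from pathlib import Path
from typing import Iterator, List, Optional, Tuple


def stream_chunked_features(
    chunk_dir: Path, shuffle_window: int = 3, seed: Optional[int] = None
) -> Iterator[Tuple[str, list]]:
    """Yield (fingerprint, features) pairs from pickled feature chunks.

    Chunks are read in a random order, `shuffle_window` chunks are
    accumulated at a time, and the accumulated pairs are shuffled again
    before being yielded; creating a new stream per epoch therefore gives
    a different chunk order each time.
    """
    rng = random.Random(seed)
    chunk_files: List[Path] = sorted(chunk_dir.glob("*.pkl"))
    rng.shuffle(chunk_files)

    for start in range(0, len(chunk_files), shuffle_window):
        window: List[Tuple[str, list]] = []
        for chunk_file in chunk_files[start : start + shuffle_window]:
            with chunk_file.open("rb") as f:
                # each chunk is assumed to map message fingerprint -> features
                window.extend(pickle.load(f).items())
        rng.shuffle(window)
        yield from window
```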

Contributor

sweet, thanks for implementing this as well 💯

@twerkmeister twerkmeister changed the title minimal proof of concept for streaming messages in nlu training proof of concept for streaming messages in nlu training Apr 26, 2021
@wochinge (Contributor) left a comment

How about showing this to Daksh / Vova? I'd be interested in their feedback. I think your prototype seems very promising! 💯



Comment on lines 587 to 592
self._message_to_features(msg, tag_name)
for msg in training_data.stream_featurized_messages()
)
y_train = (
self._crf_tokens_to_tags(sentence, tag_name) for sentence in df_train
self._message_to_tags(msg, tag_name)
for msg in training_data.stream_featurized_messages()
Contributor

that's potentially dangerous if shuffling were enabled, as X and Y would be shuffled differently, right?

Contributor Author

Yes, indeed that would break. The reason it is done this way is that you can only consume a generator once. However, you could split the generator if you can guarantee that no copy advances more than n elements ahead of the others. That would be possible here, as the CRF internally zips X and Y and then runs through them together.
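
One way to split the stream safely is itertools.tee, which buffers only the gap between the two copies; the helper below and the usage comments are illustrative, not the PR's code:

```python
import itertools
from typing import Iterator, Tuple, TypeVar

T = TypeVar("T")


def split_stream(stream: Iterator[T]) -> Tuple[Iterator[T], Iterator[T]]:
    """Split one stream into two independent iterators.

    itertools.tee only buffers the elements by which one copy runs ahead
    of the other, so memory stays bounded as long as the consumer (like
    the CRF, which zips X and Y) advances both copies in lockstep.
    """
    x_stream, y_stream = itertools.tee(stream, 2)
    return x_stream, y_stream


# hypothetical usage: build X and Y from the same featurized-message stream
# x_stream, y_stream = split_stream(training_data.stream_featurized_messages())
# x_train = (self._message_to_features(msg, tag_name) for msg in x_stream)
# y_train = (self._message_to_tags(msg, tag_name) for msg in y_stream)
```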

Contributor Author

updated that code again a bit

@twerkmeister (Contributor Author) commented Apr 27, 2021

I ran some memory profiling on the multiwoz dataset with roughly 55k NLU sentences, comparing the current main branch with this branch. It saves memory all around, but the effect very much depends on the specific component; it's least pronounced for the CRF in this case, as that model builds up the entire dataset internally anyway.

config

I think most of it is pretty standard; I added the crf features here so that text_dense_features from the spacy vectors are included.

language: en

pipeline:
  - name: "SpacyNLP"
  - name: "SpacyTokenizer"
  - name: "SpacyFeaturizer"
  - name: "CountVectorsFeaturizer"
  - name: "CRFEntityExtractor"
    max_iterations: 1
    features: [
    ["pos", "pos2"],
    [
      "bias",
      "prefix5",
      "prefix2",
      "suffix5",
      "suffix3",
      "suffix2",
      "pos",
      "pos2",
      "digit",
      "text_dense_features"
    ],
    ["pos", "pos2"],
    ]
    L1_c: 0.01
    L2_c: 0.05
  - name: "DIETClassifier"
    entity_recognition: False
    epochs: 1
    batch_size: 64

Results

main: memory profile plot (standard_with_child)

this branch: memory profile plot (streamed_with_child)

Memory:

  • For both branches, memory jumps to about 2.5 GB when loading the spacy model and tokenizing all the messages
  • During CRF data accumulation:
    • main: jumps to and hovers at around 5.5 to 6 GB
    • this branch: slow increase from 2.5 to about 4.5 GB
    • still wondering why the memory usage behaves so differently at this step...
  • During CRF training:
    • main: peak memory usage at about 7.5 GB
    • this branch: peak memory usage at about 6 GB
  • During DIET training:
    • main: constant memory consumption of around 4.3 GB
    • this branch: memory consumption of around 2-2.2 GB
  • Summary: for this pipeline and dataset, around 2 GB of memory saved
  • Depending on the size of the features and the size of the dataset, savings might differ

Time:

  • main: ~1600 s
  • this branch: ~1850 s
  • Maybe a minute of that difference comes from CRF training or before it; both branches finish that part at similar times, around 1000 s
  • The rest seems to be attributable to DIET training
  • I think this is mostly expected, as 1) the current implementation of the keras Sequence is very rough, and 2) caching on disk is fundamentally a trade-off between memory and time
  • The effect would most likely be larger if we trained DIET for multiple epochs; for CRF there wouldn't be a difference (as it stores the full feature set in memory anyway)

@twerkmeister (Contributor Author) commented Apr 27, 2021

> How about showing this to Daksh / Vova? I'd be interested in their feedback. I think your prototype seems very promising! 💯

I'd be interested in their thoughts as well! 💯 Might be good to do it as part of the overall pipeline talk with research?

@joejuzl (Contributor) commented Apr 29, 2021

How does changing the MSG_PER_CHUNK affect the memory usage?

@twerkmeister (Contributor Author)

> How does changing the MSG_PER_CHUNK affect the memory usage?

The larger the chunks, the more memory is used. At the extreme you could have a single chunk the size of the whole dataset, even for a big dataset.

The features in this particular setup for 55k messages take about 2 GB in memory, so 512 messages are about 20 MB.

For the shuffling I am currently using 3 random chunks, so that would be about 60 MB when shuffling is used.

Both the chunk size and the number of chunks read at once when shuffling are adjustable.

@twerkmeister (Contributor Author)

@wochinge @dakshvar22 Brief update on the benchmarking over multiple epochs:

Balanced Batching and Preprocessing

Balanced batching can make datasets quite a bit bigger. If you look at my two earlier graphs of main vs. streaming again, there's something interesting: it turns out the noisy pattern at the end is actually the DIET training. It was much faster on the streaming branch, but only because main used balanced batching by default. Given that the NLU data of multiwoz follows a power law for examples per intent (a few intents have thousands of examples, and a lot of intents have only a few), balanced batching effectively more than doubled the size of the dataset during training.

The first long flat stretch in the streaming training, on the other hand, was a preprocessing step that was very costly when streaming from disk; in the first naive implementation I would load the data from disk more than 100 times during this step.

With the batching strategy set to sequence (no rebalancing) and the preprocessing adjusted slightly, the two systems are now more comparable.

Timing comparison

I ran the experiments multiple times and consistently observed a 10-15% time penalty for the streaming solution. For example, on the latest run streaming took 39 s per epoch on an 80% multiwoz train set, while in-memory took 34 s. That was consistent over three runs of 15-25 epochs each on a Google Cloud K80 instance.

@joejuzl joejuzl removed their request for review July 21, 2021 07:22

stale bot commented Apr 16, 2022

This PR has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 16, 2022