
proof of concept for streaming messages in nlu training #8518

Closed
wants to merge 11 commits

Conversation


@twerkmeister (Contributor) commented Apr 21, 2021

**DRAFT FOR DISCUSSION PURPOSES**:

  • proof of concept for Architecture prototype: Explore partial dataset loading for NLU graph #8407
  • The implementation is pretty rough and is only meant to test some of the bigger risks in principle (such as whether DIET can be trained from a stream of messages)
  • implemented for CountVectorsFeaturizer, SpacyFeaturizer, CRFEntityExtractor, and DIETClassifier
  • the main training data object (with unfeaturized messages) stays in memory right now, as so many parts of the codebase use it that way
  • features can be built successively and stored on disk (right now this just pickles a list of features per message; ultimately a safer and more sophisticated format should be chosen)
  • messages can be joined with their features on the fly and consumed as a generator. For the matching I am using a fingerprint of the message that is based only on the following attributes: text, intent, response, action name, action text, and intent response key. This way the fingerprint should be invariant across processing steps and across runs (see the sketch after this list).
  • keras.Sequence can be used to hide the generator, but there are probably better abstractions available. Possibly some multiprocessing could also be added here to fill a batch queue in a second thread.
  • chunking directories are just based on the component name right now, and if some component requests a message stream, the latest chunking dir is used
  • tested with memory profiling on multiwoz, for details see below
  • shuffling is implemented by using smallish chunks, randomizing the chunk reading order, reading multiple chunks at once, and then shuffling the accumulated messages again
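
For illustration, a minimal sketch of what such an attribute-based fingerprint could look like; the function name and the hashing scheme are hypothetical, not the PR's actual implementation:

```python
import hashlib
from typing import Optional


def message_fingerprint(
    text: Optional[str] = None,
    intent: Optional[str] = None,
    response: Optional[str] = None,
    action_name: Optional[str] = None,
    action_text: Optional[str] = None,
    intent_response_key: Optional[str] = None,
) -> str:
    """Stable fingerprint based only on the raw message attributes.

    Because derived features are excluded, the fingerprint stays the same
    before and after featurization and across training runs, so features
    stored on disk can be joined back to the right message.
    """
    parts = [text, intent, response, action_name, action_text, intent_response_key]
    joined = "\x1f".join("" if part is None else part for part in parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```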

notes

  • I think the methods of many components could really benefit from taking a more message-centric perspective. Many of the components' methods operate on collections of messages, which makes the functions more complex, harder to understand, and harder to reuse. I was able to throw out a lot of code in the CountVectorsFeaturizer by focusing more on individual messages.
  • RasaModelData seems a bit like a parallel world to the TrainingData class. I think there are many parallel implementations of the same logic on these classes, such as splitting training data according to some labels. I am not sure this intermediate RasaModelData class is really necessary; it seems all that's needed is some logic to arrange messages in a certain way and to turn a bunch of them into tensors.

@twerkmeister twerkmeister requested a review from wochinge April 21, 2021 14:56
@twerkmeister twerkmeister marked this pull request as draft April 21, 2021 14:56
@wochinge wochinge requested a review from joejuzl April 22, 2021 13:10
self.component_config[EVAL_NUM_EXAMPLES],
self.component_config[RANDOM_SEED],
data_generator, validation_data_generator = (
MessageStreamDataGenerator(
Contributor

do you know if the Caution: part from here applies?

@twerkmeister (Contributor Author) commented Apr 22, 2021

Yeah maybe something like Dataset.from_generator would be the right way to go. Thanks for the pointer, will look into that more deeply!

Contributor

So it seems tf.data is preferred for non-Python preprocessing (i.e. all preprocessing is expressed via TensorFlow functions), while the Keras Sequence is preferred if you do a lot of Python preprocessing. In our case the Keras Sequence therefore seems preferable.

Contributor

> Yeah maybe something like Dataset.from_generator would be the right way to go

Constructing a tf.data.Dataset from Dataset.from_generator becomes cumbersome when you need control over what should happen after the end of every epoch. If you need that control, you need to write a custom train_step function instead of relying on keras.model.fit() to handle it for you.

tf.keras.utils.Sequence explicitly has a function for it. That's why we shifted from tf.data.Dataset.from_generator to tf.keras.utils.Sequence during the transition from pure TF to reusing Keras methods.

However, we recently noticed that TensorFlow has broken support for tf.RaggedTensors inside tf.keras.utils.Sequence.
Using tf.RaggedTensors will simplify a LOT of code inside our TF models. So I am all for exploring how we can use tf.data.Dataset.from_generator without having to write our own model.fit() function.
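
For context, a minimal sketch of what a tf.data.Dataset.from_generator pipeline could look like here; `stream_fn`, the output signature, and the batching parameters are placeholders, not the PR's actual code:

```python
import tensorflow as tf
from typing import Callable, Iterator, Tuple


def make_dataset(
    stream_fn: Callable[[], Iterator[Tuple[list, int]]], batch_size: int
) -> tf.data.Dataset:
    """Wrap a message-stream callable in a tf.data pipeline.

    tf.data re-invokes `stream_fn` for every iteration over the dataset,
    so per-epoch reshuffling can live inside the callable itself instead
    of requiring a custom train_step.
    """
    dataset = tf.data.Dataset.from_generator(
        stream_fn,
        output_signature=(
            tf.TensorSpec(shape=(None,), dtype=tf.float32),  # illustrative features
            tf.TensorSpec(shape=(), dtype=tf.int32),  # illustrative label
        ),
    )
    return dataset.shuffle(buffer_size=2048).padded_batch(batch_size)
```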


def on_epoch_end(self) -> None:
"""Update the data after every epoch."""
self.current_message_stream = self.training_data.stream_featurized_messages(
Contributor

how would we e.g. shuffle chunks after epoch end?

@twerkmeister (Contributor Author) commented Apr 22, 2021

The TrainingData object could tell the feature reader to read the chunks in a random/different order when creating a new stream. You then still need to ensure that the right message gets the right features, e.g. via some ID of the message that is also stored in the chunk.
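
Roughly sketched, and purely illustrative (the chunk file layout, the fingerprint-keyed dicts, and the function name are assumptions, not the PR's implementation):

```python
import pickle
import random
from pathlib import Path
from typing import Iterator, List, Optional, Tuple


def stream_chunked_features(
    chunk_dir: Path, shuffle_window: int = 3, seed: Optional[int] = None
) -> Iterator[Tuple[str, list]]:
    """Yield (fingerprint, features) pairs from pickled feature chunks.

    Chunks are read in a random order, `shuffle_window` chunks are
    accumulated at a time, and the accumulated pairs are shuffled again
    before being yielded; creating a new stream per epoch therefore gives
    a different chunk order each time.
    """
    rng = random.Random(seed)
    chunk_files: List[Path] = sorted(chunk_dir.glob("*.pkl"))
    rng.shuffle(chunk_files)

    for start in range(0, len(chunk_files), shuffle_window):
        window: List[Tuple[str, list]] = []
        for chunk_file in chunk_files[start : start + shuffle_window]:
            with chunk_file.open("rb") as f:
                # each chunk is assumed to map message fingerprint -> features
                window.extend(pickle.load(f).items())
        rng.shuffle(window)
        yield from window
```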

Contributor

sweet, thanks for implementing this as well 💯

@twerkmeister twerkmeister changed the title minimal proof of concept for streaming messages in nlu training proof of concept for streaming messages in nlu training Apr 26, 2021
@wochinge (Contributor) left a comment

How about showing this to Daksh / Vova? I'd be interested in their feedback. I think your prototype seems very promising! 💯



Comment on lines 587 to 592
self._message_to_features(msg, tag_name)
for msg in training_data.stream_featurized_messages()
)
y_train = (
self._crf_tokens_to_tags(sentence, tag_name) for sentence in df_train
self._message_to_tags(msg, tag_name)
for msg in training_data.stream_featurized_messages()
Contributor

that's potentially dangerous if shuffling were enabled, as X and Y would be shuffled differently, right?

Contributor Author

Yes, indeed that would break. The reason it is done this way is that you can only consume a generator once. However, you could split the generator if you can guarantee that no copy advances more than n elements ahead of the others. That would be possible here, as the CRF internally zips X and Y and then runs through them together.
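
One way to split the stream safely is itertools.tee, which buffers only the gap between the two copies; the helper below and the usage comments are illustrative, not the PR's code:

```python
import itertools
from typing import Iterator, Tuple, TypeVar

T = TypeVar("T")


def split_stream(stream: Iterator[T]) -> Tuple[Iterator[T], Iterator[T]]:
    """Split one stream into two independent iterators.

    itertools.tee only buffers the elements by which one copy runs ahead
    of the other, so memory stays bounded as long as the consumer (like
    the CRF, which zips X and Y) advances both copies in lockstep.
    """
    x_stream, y_stream = itertools.tee(stream, 2)
    return x_stream, y_stream


# hypothetical usage: build X and Y from the same featurized-message stream
# x_stream, y_stream = split_stream(training_data.stream_featurized_messages())
# x_train = (self._message_to_features(msg, tag_name) for msg in x_stream)
# y_train = (self._message_to_tags(msg, tag_name) for msg in y_stream)
```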

Contributor Author

updated that code again a bit

@twerkmeister (Contributor Author) commented Apr 27, 2021

I ran some memory profiling on the multiwoz dataset with roughly 55k NLU sentences, comparing the current main branch with this branch. It saves memory all around, but the effect very much depends on the specific component; it's least pronounced for the CRF in this case, as that model builds up the entire dataset internally anyway.

config

I think most of it is pretty standard; I added the crf features here so that text_dense_features from the spacy vectors are included.

language: en

pipeline:
  - name: "SpacyNLP"
  - name: "SpacyTokenizer"
  - name: "SpacyFeaturizer"
  - name: "CountVectorsFeaturizer"
  - name: "CRFEntityExtractor"
    max_iterations: 1
    features: [
    ["pos", "pos2"],
    [
      "bias",
      "prefix5",
      "prefix2",
      "suffix5",
      "suffix3",
      "suffix2",
      "pos",
      "pos2",
      "digit",
      "text_dense_features"
    ],
    ["pos", "pos2"],
    ]
    L1_c: 0.01
    L2_c: 0.05
  - name: "DIETClassifier"
    entity_recognition: False
    epochs: 1
    batch_size: 64

Results

main: memory profile plot (standard_with_child)

this branch: memory profile plot (streamed_with_child)

Memory:

  • For both branches, memory jumps to about 2.5 GB when loading the spacy model and tokenizing all the messages
  • During CRF data accumulation:
    • main: jumps to and hovers at around 5.5 to 6 GB
    • this branch: slow increase from 2.5 to about 4.5 GB
    • still wondering why the memory usage behaves so differently at this step...
  • During CRF training:
    • main: peak memory usage at about 7.5 GB
    • this branch: peak memory usage at about 6 GB
  • During DIET training:
    • main: constant memory consumption of around 4.3 GB
    • this branch: memory consumption of around 2-2.2 GB
  • Summary: for this pipeline and dataset, around 2 GB of memory saved
  • Depending on the size of the features and the size of the dataset, savings might differ

Time:

  • main: ~1600 s
  • this branch: ~1850 s
  • Maybe a minute of that difference comes from CRF training or before it; both branches finish that part at similar times, around 1000 s
  • The rest seems to be attributable to DIET training
  • I think this is mostly expected, as 1) the current implementation of the keras Sequence is very rough, and 2) caching on disk is fundamentally a trade-off between memory and time
  • The effect would most likely be larger if we trained DIET for multiple epochs; for CRF there wouldn't be a difference (as it stores the full feature set in memory anyway)

@twerkmeister (Contributor Author) commented Apr 27, 2021

> How about showing this to Daksh / Vova? I'd be interested in their feedback. I think your prototype seems very promising! 💯

I'd be interested in their thoughts as well! 💯 Might be good to do it as part of the overall pipeline talk with research?

@joejuzl (Contributor) commented Apr 29, 2021

How does changing the MSG_PER_CHUNK affect the memory usage?

@twerkmeister (Contributor Author)

> How does changing the MSG_PER_CHUNK affect the memory usage?

The larger the chunks, the more memory is used. At the extreme you could have a single chunk the size of the whole dataset, even for a big dataset.

The features in this particular setup for 55k messages take about 2 GB in memory, so 512 messages are about 20 MB.

For the shuffling I am currently using 3 random chunks, so that would be about 60 MB when shuffling is used.

Both the chunk size and the number of chunks read at once when shuffling are adjustable.

@twerkmeister (Contributor Author)

@wochinge @dakshvar22 Brief update on the benchmarking over multiple epochs:

Balanced Batching and Preprocessing

Balanced batching can make datasets quite a bit bigger. If you look at my two earlier graphs of main vs. streaming again, there's something interesting: it turns out the noisy pattern at the end is actually the DIET training. It was much faster on the streaming branch, but only because main used balanced batching by default. Given that the NLU data of multiwoz follows a power law for examples per intent (a few intents have thousands of examples, and a lot of intents have only a few), balanced batching effectively more than doubled the size of the dataset during training.

The first long flat stretch in the streaming training, on the other hand, was a preprocessing step that was very costly when streaming from disk; in the first naive implementation I would load the data from disk more than 100 times during this step.

With the batching strategy set to sequence (no rebalancing) and the preprocessing adjusted slightly, the two systems are now more comparable.

Timing comparison

I ran the experiments multiple times and consistently observed a 10-15% time penalty for the streaming solution. For example, on the latest run streaming took 39 s per epoch on an 80% multiwoz train set, while in-memory took 34 s. That was consistent over three runs of 15-25 epochs each on a Google Cloud K80 instance.

@joejuzl joejuzl removed their request for review July 21, 2021 07:22

stale bot commented Apr 16, 2022

This PR has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 16, 2022