proof of concept for streaming messages in nlu training #8518
Conversation
self.component_config[EVAL_NUM_EXAMPLES],
self.component_config[RANDOM_SEED],
data_generator, validation_data_generator = (
    MessageStreamDataGenerator(
do you know if the `Caution:` part from here applies?
Yeah maybe something like Dataset.from_generator would be the right way to go. Thanks for the pointer, will look into that more deeply!
so it seems `tf.data` is preferred for non-Python preprocessing (i.e. all preprocessing is expressed via TensorFlow functions), while the Keras Sequence is preferred if you do a lot of Python preprocessing. So in our case the Keras Sequence seems preferable.
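For illustration, here is a minimal sketch (not taken from this PR) of what a Keras Sequence with plain-Python preprocessing could look like; `MessageSequence`, `featurize`, and `messages` are hypothetical placeholders:

```python
import math
import random

import numpy as np
import tensorflow as tf


class MessageSequence(tf.keras.utils.Sequence):
    """Batches messages and runs arbitrary Python preprocessing per batch."""

    def __init__(self, messages, featurize, batch_size=64):
        self.messages = list(messages)   # raw training messages
        self.featurize = featurize       # plain-Python featurization function
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(len(self.messages) / self.batch_size)

    def __getitem__(self, idx):
        batch = self.messages[idx * self.batch_size:(idx + 1) * self.batch_size]
        return np.array([self.featurize(m) for m in batch])

    def on_epoch_end(self):
        # hook that fit() calls between epochs, used further down for shuffling
        random.shuffle(self.messages)
```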
> Yeah maybe something like Dataset.from_generator would be the right way to go

Constructing a `tf.data.Dataset` from `Dataset.from_generator` becomes cumbersome when you need control over what should happen after the end of every epoch. If you need that control, you have to write a custom `train_step` function instead of relying on `keras.Model.fit()` to handle it for you. `tf.keras.utils.Sequence` explicitly has a function for this (`on_epoch_end`). That's why we shifted from `tf.data.Dataset.from_generator` to `tf.keras.utils.Sequence` during the transition from pure TF to reusing Keras methods.

However, we recently recognized that TensorFlow has broken support for `tf.RaggedTensor`s inside `tf.keras.utils.Sequence`. Using `tf.RaggedTensor`s will simplify a LOT of code inside our TF models. So I am all for exploring how we can use `tf.data.Dataset.from_generator` without having to write our own `model.fit()` function.
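As a rough sketch of that route, and assuming the ragged `output_signature` of newer TF versions works for our use case, the generator could yield ragged features directly; `stream_featurized_messages` below is just a stand-in, not the PR's actual API:

```python
import tensorflow as tf


def stream_featurized_messages():
    # stand-in for a streaming featurizer: yields variable-length sequences
    yield tf.ragged.constant([[1.0, 2.0], [3.0]])
    yield tf.ragged.constant([[4.0], [5.0, 6.0, 7.0]])


dataset = tf.data.Dataset.from_generator(
    stream_featurized_messages,
    output_signature=tf.RaggedTensorSpec(shape=[None, None], dtype=tf.float32),
)

for ragged_batch in dataset:
    print(ragged_batch.shape)
```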
def on_epoch_end(self) -> None:
    """Update the data after every epoch."""
    self.current_message_stream = self.training_data.stream_featurized_messages(
how would we e.g. shuffle chunks after epoch end?
The TrainingData object could tell the feature reader to read the chunks in a random/different order upon creating a new stream. Then you still need to ensure that the right message gets the right features, e.g. via some ID of the message that is also stored in the chunk.
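A hedged sketch of that idea (names like `ChunkedFeatureReader` and `_load_chunk` are made up for illustration): the reader re-shuffles the chunk order whenever a new stream is created and uses a stored message ID to match features back to messages.

```python
import random


class ChunkedFeatureReader:
    """Streams featurized messages chunk by chunk from disk."""

    def __init__(self, chunk_paths):
        self.chunk_paths = list(chunk_paths)

    def stream(self, shuffle=True):
        order = list(self.chunk_paths)
        if shuffle:
            random.shuffle(order)              # fresh chunk order per stream/epoch
        for path in order:
            for message_id, features in self._load_chunk(path):
                # the stored message ID ties the features to the right message
                yield message_id, features

    def _load_chunk(self, path):
        # placeholder: e.g. np.load / pickle of (message_id, features) pairs
        raise NotImplementedError
```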
sweet, thanks for implementing this as well 💯
How about showing this to Daksh / Vova? I'd be interested in their feedback. I think your prototype seems very promising! 💯
    self._message_to_features(msg, tag_name)
    for msg in training_data.stream_featurized_messages()
)
y_train = (
    self._crf_tokens_to_tags(sentence, tag_name) for sentence in df_train
    self._message_to_tags(msg, tag_name)
    for msg in training_data.stream_featurized_messages()
that's potentially dangerous if shuffling were enabled, as X and Y would be shuffled differently, right?
Yes, indeed, that would break. The reason this is here is that you can only consume a generator once. However, you could split the generator if you can guarantee that no copy advances more than `n` elements ahead of the others. It would be possible to do this here, as the CRF internally zips X and Y and then runs through them together.
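A minimal, self-contained illustration of that generator-splitting idea using `itertools.tee` (the message structure below is made up; it only shows why lockstep consumption keeps the tee buffer small):

```python
import itertools


def featurized_messages():
    for i in range(3):
        yield {"tokens": [f"tok{i}"], "tags": [f"tag{i}"]}


msgs_for_x, msgs_for_y = itertools.tee(featurized_messages(), 2)
x_train = (m["tokens"] for m in msgs_for_x)
y_train = (m["tags"] for m in msgs_for_y)

# zip consumes both copies in lockstep, so neither runs ahead and
# tee only ever buffers a handful of elements
for x, y in zip(x_train, y_train):
    print(x, y)
```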
Updated that code again a bit.

I ran some memory profiling on the multiwoz dataset with roughly 55k NLU sentences, comparing the current main branch with my branch here. It saves memory all around, but the effect is very much dependent on the specific component. It's least pronounced for the CRF in this case, as that model builds up the entire dataset internally itself.

**Config**: I think most of it is pretty standard; added the CRF features here to get …

**Results** (main vs. this branch):
Memory: (see attached graphs)
Time: (see attached graphs)
I'd be interested in their thoughts as well! 💯 Might be good to do it as part of the entire pipeline talk with research?
How does changing the chunk size affect memory usage?
The larger the chunk, the more memory is used. At the extreme you could have a chunk that is the size of the dataset, even for a big dataset. The features in this particular setup for 55k messages are about 2GB in memory, so 512 messages are about 20MB. For shuffling I am currently using 3 random chunks, so that would make 60MB when shuffling is used. Both the size and the number of chunks used when shuffling are adjustable.
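To make that trade-off concrete, here is a small sketch (not the PR's code; `load_chunk` and `chunk_paths` are hypothetical) of shuffling with a buffer of a few chunks, along the lines of the 3 × 512-message example above:

```python
import random


def shuffled_message_stream(chunk_paths, load_chunk, buffer_chunks=3):
    """Yield messages in shuffled order while holding only a few chunks in memory."""
    paths = list(chunk_paths)
    random.shuffle(paths)                 # random chunk order every epoch
    buffer = []
    for i, path in enumerate(paths, start=1):
        buffer.extend(load_chunk(path))   # e.g. ~20MB for a 512-message chunk
        if i % buffer_chunks == 0:        # once N chunks are buffered...
            random.shuffle(buffer)        # ...shuffle and emit them
            yield from buffer
            buffer = []
    random.shuffle(buffer)
    yield from buffer                     # emit any leftover messages
```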
@wochinge @dakshvar22 Brief update on the benchmarking over multiple epochs:

**Balanced batching and preprocessing**

Balanced batching can make datasets quite a bit bigger. If you look at my two old graphs of main vs. streaming again, there's something interesting: the noisy pattern at the end is actually the DIET training. It was much faster on the streaming branch, but only because main used balanced batching by default. Given that the NLU data of multiwoz follows a power law for examples per intent (a few intents have thousands of examples, and a lot of intents only have a few), balanced batching effectively more than doubled the size of the dataset during training. The first long flat stretch in the streaming training, on the other hand, was a preprocessing step that was very costly when streaming from disk: the first naive implementation would load the data from disk more than 100 times during this step. With the batching strategy set to …

**Timing comparison**

I ran the experiments multiple times and consistently observed a 10-15% time penalty for the streaming solution. For example, on the latest run streaming took 39s per epoch on an 80% multiwoz train set, while in-memory took 34s. That was consistent over three runs, doing 15-25 epochs each on a Google Cloud K80 instance.
This PR has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
**DRAFT FOR DISCUSSION PURPOSES**:

notes

- … `CountVectorFeaturizer` by focusing more on messages
- `RasaModelData` seems a bit like a parallel world to the `TrainingData` class. I think there are many parallel implementations of the same logic on these classes, such as splitting training data according to some labels. I am not sure this intermediate `RasaModelData` class is really necessary. It seems all that's needed is some logic to arrange messages in a certain way and to turn a bunch of them into tensors.