make ted train on chunks #8353
Conversation
@Ghostvv heads up, I'm out starting tomorrow until next Friday.
The logic seems largely fine to me at first glance.
I take it that the code quality isn't expected to be merge-ready and this is just a prototype?
```python
Returns:
    - a dictionary of state types (INTENT, TEXT, ACTION_NAME, ACTION_TEXT,
      ENTITIES, SLOTS, ACTIVE_LOOP) to a list of features for all dialogue
      turns in all training trackers
    - the label ids (e.g. action ids) for every dialogue turn in all training
      trackers
    - a dictionary of entity type (ENTITY_TAGS) to a list of features
      containing entity tag ids for text user inputs, otherwise an empty dict,
      for all dialogue turns in all training trackers
```
Modify this docstring according to what the function actually does.
```python
    )
)

data_chunk_dir = TempDirectoryPath(tempfile.mkdtemp())
```
Use `create_temporary_directory` from `rasa.utils.io`.
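For illustration, a minimal sketch of the suggested change, assuming the helper is `create_temporary_directory` in `rasa.utils.io` as the reviewer names it:

```python
# Sketch of the reviewer's suggestion: replace the manual tempfile.mkdtemp()
# call with Rasa's own helper (assumes create_temporary_directory is
# available in rasa.utils.io, as the review indicates).
from rasa.utils.io import create_temporary_directory

data_chunk_dir = create_temporary_directory()
```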
```diff
@@ -240,6 +247,299 @@ def featurize_trackers(

         return tracker_state_features, label_ids, entity_tags

+    def featurize_trackers_in_chunks(
```
I'd break this function into smaller functions.
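As an illustration of that suggestion, a hedged sketch of one possible decomposition; all helper names here are hypothetical, not part of the PR:

```python
from typing import Any, List, Sequence


def split_into_chunks(items: Sequence[Any], number_of_chunks: int) -> List[List[Any]]:
    """Split items into roughly equal chunks (hypothetical helper)."""
    chunk_size = max(1, -(-len(items) // number_of_chunks))  # ceiling division
    return [list(items[i : i + chunk_size]) for i in range(0, len(items), chunk_size)]


def featurize_chunk(chunk: List[Any]) -> List[Any]:
    """Placeholder for the per-chunk featurization step."""
    return [f"features({item})" for item in chunk]


def featurize_in_chunks(items: Sequence[Any], number_of_chunks: int) -> List[Any]:
    """The top-level function then reads as a composition of small steps."""
    features = []
    for chunk in split_into_chunks(items, number_of_chunks):
        features.extend(featurize_chunk(chunk))
    return features


print(featurize_in_chunks(["t1", "t2", "t3", "t4", "t5"], number_of_chunks=2))
```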
```python
ids = np.random.permutation(num_label_examples)
trackers_as_states_for_label = trackers_as_states[labels == label][
    ids
].tolist()
```
I think you can keep them as np arrays?
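A minimal sketch of what that would look like, using toy stand-in arrays rather than the PR's real data:

```python
import numpy as np

# Toy stand-ins for the PR's trackers_as_states / labels arrays (illustrative).
trackers_as_states = np.array(["s0", "s1", "s2", "s3"])
labels = np.array([0, 1, 0, 1])

label = 0
num_label_examples = int(np.sum(labels == label))
ids = np.random.permutation(num_label_examples)

# Keeping the result as an ndarray (no .tolist()) preserves NumPy indexing
# for downstream chunk slicing and avoids an extra copy into Python lists.
trackers_as_states_for_label = trackers_as_states[labels == label][ids]
```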
```python
data_chunk_dir = TempDirectoryPath(tempfile.mkdtemp())
data_chunk_files = []
for chunk_index in range(number_of_chunks):
```
I think you can create a generic function which will accumulate the members of a chunk for a given `trackers_as_X` list, where `X` can be `states`, `actions`, or `entities` — see the sketch below.
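A hedged sketch of such a helper; the function name and toy data are illustrative, not from the PR:

```python
from typing import Any, List, Sequence


def accumulate_chunk_members(
    trackers_as_x: Sequence[Any], chunk_indices: Sequence[int]
) -> List[Any]:
    """Collect the members of one chunk from any trackers_as_X list.

    X can be states, actions, or entities; the accumulation logic is
    identical for all three, so it only needs to be written once.
    """
    return [trackers_as_x[i] for i in chunk_indices]


# Usage: the same helper serves all three parallel lists.
chunk_indices = [0, 2]
states_chunk = accumulate_chunk_members(["s0", "s1", "s2"], chunk_indices)
actions_chunk = accumulate_chunk_members(["a0", "a1", "a2"], chunk_indices)
entities_chunk = accumulate_chunk_members(["e0", "e1", "e2"], chunk_indices)
```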
```python
for turn_state_features, turn_label_ids, turn_entity_tags in zip(
    tracker_state_features_chunk, label_ids_chunk, entity_tags_chunk
):
    example = tf.train.Example(
        features=tf.train.Features(
            feature=self._to_tf_features(
                turn_state_features, turn_label_ids, turn_entity_tags
            )
        )
    )
```
Hmm, we should check if we can merge this function and `persist_chunk` inside `rasa.shared.nlu.training_data.training_data.py` and create a generic TFRecord writer function that can be reused.
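A rough sketch of what a shared writer could look like; the name, location, and signature here are assumptions, not the PR's API:

```python
from pathlib import Path
from typing import Any, Callable, Dict, Iterable

import tensorflow as tf


def write_tf_records(
    file_path: Path,
    examples: Iterable[Any],
    to_tf_features: Callable[[Any], Dict[str, tf.train.Feature]],
) -> Path:
    """Serialize examples to one TFRecord file via a caller-supplied converter.

    Both the Core and NLU persist_chunk code paths could call this with
    their own to_tf_features conversion function.
    """
    with tf.io.TFRecordWriter(str(file_path)) as writer:
        for example in examples:
            tf_example = tf.train.Example(
                features=tf.train.Features(feature=to_tf_features(example))
            )
            writer.write(tf_example.SerializeToString())
    return file_path
```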
```python
    return file_path

def load_chunk(
```
Could you add some comments to this function for future reference?
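For illustration, a hedged sketch of the kind of commenting that would help; the body is a guess based on the TFRecord-based persistence above, not the PR's actual code:

```python
import tensorflow as tf


def load_chunk(file_path: str) -> tf.data.TFRecordDataset:
    # Each chunk was persisted as one TFRecord file (see persist_chunk),
    # so loading a chunk is just re-opening that file as a dataset.
    # Parsing the serialized tf.train.Example protos back into feature
    # dictionaries happens in a later step via a feature description.
    return tf.data.TFRecordDataset([file_path])
```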
```diff
@@ -590,7 +590,9 @@ async def _train_core_with_validated_data(
     await rasa.core.train.train_in_chunks(
         domain_file=domain,
         training_resource=file_importer,
-        output_path=Path(_train_path, DEFAULT_CORE_SUBDIRECTORY_NAME),
+        output_path=os.path.join(
```
Can you leave it as `Path` to be consistent with the rest of the code? Any particular reason why it should change?
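A small comparison of the two spellings; the `_train_path` value is made up for illustration:

```python
import os
from pathlib import Path

_train_path = "models/unpacked"  # illustrative value, not from the PR
DEFAULT_CORE_SUBDIRECTORY_NAME = "core"

# Both spellings build the same path; Path(...) simply keeps the call site
# consistent with the surrounding code, as the reviewer asks.
as_os_path = os.path.join(_train_path, DEFAULT_CORE_SUBDIRECTORY_NAME)
as_pathlib = Path(_train_path, DEFAULT_CORE_SUBDIRECTORY_NAME)

assert Path(as_os_path) == as_pathlib
```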
this is the change from main
```diff
@@ -0,0 +1,152 @@
+import logging
```
We can add the generic persist and load methods for chunks here.
```diff
@@ -590,7 +590,9 @@ async def _train_core_with_validated_data(
     await rasa.core.train.train_in_chunks(
```
Do we want to support finetuning?
It's okay to leave it out in case this is going to be a prototype, but I think you will just need to add a `model_to_finetune` parameter to this method and set the agent's ensemble to `model_to_finetune`'s policy ensemble, and the rest should be fine.
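A minimal sketch of that suggestion; the `Agent` class here is a stand-in for illustration, not `rasa.core.agent.Agent`:

```python
from typing import Optional


class Agent:
    """Minimal stand-in for the real Agent class (illustration only)."""

    def __init__(self, policy_ensemble: Optional[object] = None) -> None:
        self.policy_ensemble = policy_ensemble


def configure_for_finetuning(
    agent: Agent, model_to_finetune: Optional[Agent] = None
) -> Agent:
    # Per the reviewer's suggestion: reuse the previous model's policy
    # ensemble so chunked training continues from it instead of starting
    # from freshly initialised policies.
    if model_to_finetune is not None:
        agent.policy_ensemble = model_to_finetune.policy_ensemble
    return agent
```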
My thinking was: if you do fine-tuning, you don't fine-tune on a lot of data, so normal training should be good.
Yes.
NOTE: there is a known bug (maybe two) with this implementation (see the discussion here).

The first, known bug occurs with datasets that have a long-tail distribution of action labels, for example end-to-end datasets with lots of unique bot utterances (as the label for these is the utterance text itself). This means that the chunks are not (even roughly) equal in size: the first chunk will contain examples for each label, while subsequent chunks will be smaller, as there are no more examples left for the rare labels. This occurs here. I suspect this may be a problem with NLU chunking as well, though there we are far less likely to see this type of label distribution.

The second, potential bug, which is possibly related, is that the subsequent smaller chunks may not have entities. We hypothesized that this may be because (in this scenario with the long tail of unique bot utterances) the large chunk contains all the unique utterances and the subsequent chunks contain examples only for the more frequent labels.

The bug can be reproduced with MultiWOZ for 400 dialogues (312 train; these dialogues are produced with the script here with the number-of-dialogues flag set to 400) and chunk=5.
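To make the first bug concrete, here is a toy model of a label-stratified chunk assignment (not the PR's actual code) that produces the described skew:

```python
import numpy as np

# Long-tail label distribution: label 0 has 10 examples, labels 1..20
# are singletons (mimicking many unique bot utterances as labels).
labels = np.array([0] * 10 + list(range(1, 21)))
number_of_chunks = 5

chunks = [[] for _ in range(number_of_chunks)]
for label in np.unique(labels):
    example_ids = np.where(labels == label)[0]
    # Each label's examples are spread over the chunks in order, so a
    # label with a single example only ever lands in the first chunk.
    for position, example_id in enumerate(example_ids):
        chunks[position % number_of_chunks].append(example_id)

# The first chunk absorbs all singleton labels and dominates in size.
print([len(chunk) for chunk in chunks])  # -> [22, 2, 2, 2, 2]
```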
This PR has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Proposed changes:

Status (please check what you already did):
- [ ] reformat files using `black` (please check Readme for instructions)