
make ted train on chunks #8353

Closed · wants to merge 3 commits

Conversation

@Ghostvv (Contributor) commented Apr 1, 2021

Proposed changes:

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

@Ghostvv Ghostvv requested review from dakshvar22, koernerfelicia, a team and Tawakalt and removed request for a team April 1, 2021 13:57
@Ghostvv Ghostvv marked this pull request as draft April 1, 2021 13:58
@Ghostvv Ghostvv removed the request for review from Tawakalt April 1, 2021 13:58
@koernerfelicia (Contributor)

@Ghostvv heads up I'm out starting tomorrow until next Friday

@dakshvar22 (Contributor) left a comment

The logic seems largely fine to me at first glance.
I take it the code quality isn't expected to be merge-ready and this is just a prototype?

Comment on lines +267 to +275
Returns:
    - a dictionary of state types (INTENT, TEXT, ACTION_NAME, ACTION_TEXT,
      ENTITIES, SLOTS, ACTIVE_LOOP) to a list of features for all dialogue
      turns in all training trackers
    - the label ids (e.g. action ids) for every dialogue turn in all training
      trackers
    - A dictionary of entity type (ENTITY_TAGS) to a list of features
      containing entity tag ids for text user inputs otherwise empty dict
      for all dialogue turns in all training trackers
Contributor

Modify according to what function does.

)
)

data_chunk_dir = TempDirectoryPath(tempfile.mkdtemp())
Contributor

Use create_temporary_directory from rasa.utils.io.
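
A minimal sketch of the suggested change, assuming create_temporary_directory in rasa.utils.io creates and manages the temporary directory as the reviewer indicates:

from rasa.utils.io import create_temporary_directory

# replaces TempDirectoryPath(tempfile.mkdtemp()); directory creation and
# cleanup are handled by the rasa.utils.io helper
data_chunk_dir = create_temporary_directory()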

@@ -240,6 +247,299 @@ def featurize_trackers(

return tracker_state_features, label_ids, entity_tags

def featurize_trackers_in_chunks(
Contributor

I'd break this function into smaller functions.

ids = np.random.permutation(num_label_examples)
trackers_as_states_for_label = trackers_as_states[labels == label][
    ids
].tolist()
Contributor

I think you can keep them as np arrays?
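
A sketch of what the suggestion amounts to, reusing the variables from the snippet above: index with the permutation and drop the .tolist() conversion so the result stays an np.ndarray:

ids = np.random.permutation(num_label_examples)
# keep the fancy-indexed result as an np.ndarray instead of a Python list
trackers_as_states_for_label = trackers_as_states[labels == label][ids]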


data_chunk_dir = TempDirectoryPath(tempfile.mkdtemp())
data_chunk_files = []
for chunk_index in range(number_of_chunks):
Contributor

I think you can create a generic function which will accumulate the members of chunk for a given trackers_as_X list. X can be states, actions, entities.
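
A hypothetical sketch of such a helper; the name and signature are illustrative, not from the PR:

from typing import Any, List

def collect_chunk_members(
    trackers_as_x: List[Any], chunk_member_indices: List[int]
) -> List[Any]:
    # accumulate the members of one chunk for a given trackers_as_X list,
    # where X can be states, actions, or entities
    return [trackers_as_x[i] for i in chunk_member_indices]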

Comment on lines +436 to +445
for turn_state_features, turn_label_ids, turn_entity_tags in zip(
    tracker_state_features_chunk, label_ids_chunk, entity_tags_chunk
):
    example = tf.train.Example(
        features=tf.train.Features(
            feature=self._to_tf_features(
                turn_state_features, turn_label_ids, turn_entity_tags
            )
        )
    )
Contributor

Hmm, we should check if we can merge this function and persist_chunk inside rasa.shared.nlu.training_data.training_data.py and create a generic TFRecord writer function that can be reused.
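
A hypothetical sketch of such a generic writer; the function name and signature are illustrative, while the tf.train and tf.io calls are standard TensorFlow:

from typing import Any, Callable, Dict, Iterable
import tensorflow as tf

def write_tf_record_file(
    file_path: str,
    examples: Iterable[Any],
    to_tf_features: Callable[[Any], Dict[str, tf.train.Feature]],
) -> str:
    # serialize every example via the supplied feature converter and
    # append it to a single TFRecord file
    with tf.io.TFRecordWriter(file_path) as writer:
        for example in examples:
            tf_example = tf.train.Example(
                features=tf.train.Features(feature=to_tf_features(example))
            )
            writer.write(tf_example.SerializeToString())
    return file_path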


return file_path

def load_chunk(
Contributor

Could you add some comments to this function for future reference?

@@ -590,7 +590,9 @@ async def _train_core_with_validated_data(
await rasa.core.train.train_in_chunks(
domain_file=domain,
training_resource=file_importer,
output_path=Path(_train_path, DEFAULT_CORE_SUBDIRECTORY_NAME),
output_path=os.path.join(
Contributor

Can you leave it as Path, to stay consistent with the rest of the code? Any particular reason why it should change?

Contributor Author

This is the change from main.

@@ -0,0 +1,152 @@
import logging
Contributor

We can add the generic persist and load methods for chunk here.
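
For the load side, a hypothetical counterpart to the writer sketched earlier; feature_description is the standard tf.io parsing spec, everything else is illustrative:

from typing import Any, Dict, List
import tensorflow as tf

def load_tf_record_chunk(
    file_path: str, feature_description: Dict[str, Any]
) -> List[Dict[str, Any]]:
    # parse every serialized tf.train.Example stored in one chunk file
    dataset = tf.data.TFRecordDataset([file_path])
    return [
        tf.io.parse_single_example(raw_record, feature_description)
        for raw_record in dataset
    ]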

@@ -590,7 +590,9 @@ async def _train_core_with_validated_data(
await rasa.core.train.train_in_chunks(
Contributor

Do we want to support finetuning?
It's okay to leave it out if this is going to be a prototype, but I think you would just need to add a model_to_finetune parameter to this method and set the agent's ensemble to model_to_finetune's policy ensemble; the rest should be fine.
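
A hypothetical sketch of that hook; the Agent and PolicyEnsemble classes below are placeholders for the real Rasa types:

from typing import Optional

class PolicyEnsemble:
    # placeholder for rasa.core.policies.ensemble.PolicyEnsemble
    pass

class Agent:
    def __init__(self) -> None:
        self.policy_ensemble: Optional[PolicyEnsemble] = PolicyEnsemble()

def prepare_agent_for_finetuning(
    agent: Agent, model_to_finetune: Optional[Agent]
) -> Agent:
    # continue training from the previously trained policy ensemble
    # instead of a freshly initialized one
    if model_to_finetune is not None:
        agent.policy_ensemble = model_to_finetune.policy_ensemble
    return agent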

Contributor Author

My thinking was: if you do fine-tuning, you don't fine-tune on a lot of data, so normal (non-chunked) training should be good enough.

@Ghostvv (Contributor, Author) commented Apr 9, 2021

I take it the code quality isn't expected to be merge-ready and this is just a prototype?

Yes.

@koernerfelicia (Contributor) commented Jun 7, 2021

NOTE: there is a known bug (maybe two) with this implementation (see discussion here).

The first, known bug occurs with datasets that have a long-tail distribution of action labels, for example end-to-end datasets with many unique bot utterances (since the label for these is the action_text).

This means the chunks are not even roughly equal in size: the first chunk contains examples for every label, while subsequent chunks are smaller because no examples remain for the rare labels. This occurs here. I suspect this may be a problem for NLU chunking as well, though there we are far less likely to see this type of distribution.

The second, potential bug, which is possibly related, is that the subsequent smaller chunks may contain no entities. We hypothesized that this is because (in this scenario, with its long tail of unique bot utterances) the large first chunk contains all the unique utterances and the subsequent chunks contain examples only for action_listen. It's not clear whether this is related, since Vova did make provision for missing entities here.

The bug can be reproduced with MultiWOZ using 400 dialogues (312 train; the dialogues are produced with the script here, with the number-of-dialogues flag set to 400) and chunk=5.
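
A toy illustration of the imbalance (not the PR's exact code): stratified assignment spreads each label's examples over the chunks, so a label that occurs only once lands exclusively in the first chunk:

import numpy as np

# 50 frequent action_listen turns plus 50 labels that occur exactly once,
# mimicking a long tail of unique bot utterances
labels = np.array(["action_listen"] * 50 + [f"utter_{i}" for i in range(50)])
number_of_chunks = 5

chunk_sizes = [0] * number_of_chunks
for label in np.unique(labels):
    label_count = int((labels == label).sum())
    per_chunk = np.array_split(np.arange(label_count), number_of_chunks)
    for chunk_index, members in enumerate(per_chunk):
        chunk_sizes[chunk_index] += len(members)

print(chunk_sizes)  # [60, 10, 10, 10, 10] -> the first chunk dominates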

@dakshvar22 dakshvar22 mentioned this pull request Jun 11, 2021
@stale stale bot commented Apr 16, 2022

This PR has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 16, 2022
@m-vdb m-vdb closed this Jun 2, 2022