make ted train on chunks #8353
Conversation
@Ghostvv heads up, I'm out starting tomorrow until next Friday.
The logic seems largely fine to me at first glance.
I take it that the code quality isn't expected to be merge-ready and this is just a prototype?
```python
Returns:
    - a dictionary of state types (INTENT, TEXT, ACTION_NAME, ACTION_TEXT,
      ENTITIES, SLOTS, ACTIVE_LOOP) to a list of features for all dialogue
      turns in all training trackers
    - the label ids (e.g. action ids) for every dialogue turn in all training
      trackers
    - a dictionary of entity type (ENTITY_TAGS) to a list of features
      containing entity tag ids for text user inputs, otherwise an empty dict,
      for all dialogue turns in all training trackers
```
Modify this docstring according to what the function actually does.
```python
    )
)

data_chunk_dir = TempDirectoryPath(tempfile.mkdtemp())
```
Use `create_temporary_directory` from `rasa.utils.io`.
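For illustration, a minimal sketch of the suggested change, assuming the helper is `create_temporary_directory` in `rasa.utils.io` as the reviewer names it:

```python
# Sketch of the reviewer's suggestion: replace the manual tempfile.mkdtemp()
# call with Rasa's own helper (assumes create_temporary_directory is
# available in rasa.utils.io, as the review indicates).
from rasa.utils.io import create_temporary_directory

data_chunk_dir = create_temporary_directory()
```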
```diff
@@ -240,6 +247,299 @@ def featurize_trackers(

         return tracker_state_features, label_ids, entity_tags

+    def featurize_trackers_in_chunks(
```
I'd break this function into smaller functions.
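As an illustration of that suggestion, a hedged sketch of one possible decomposition; all helper names here are hypothetical, not part of the PR:

```python
from typing import Any, List, Sequence


def split_into_chunks(items: Sequence[Any], number_of_chunks: int) -> List[List[Any]]:
    """Split items into roughly equal chunks (hypothetical helper)."""
    chunk_size = max(1, -(-len(items) // number_of_chunks))  # ceiling division
    return [list(items[i : i + chunk_size]) for i in range(0, len(items), chunk_size)]


def featurize_chunk(chunk: List[Any]) -> List[Any]:
    """Placeholder for the per-chunk featurization step."""
    return [f"features({item})" for item in chunk]


def featurize_in_chunks(items: Sequence[Any], number_of_chunks: int) -> List[Any]:
    """The top-level function then reads as a composition of small steps."""
    features = []
    for chunk in split_into_chunks(items, number_of_chunks):
        features.extend(featurize_chunk(chunk))
    return features


print(featurize_in_chunks(["t1", "t2", "t3", "t4", "t5"], number_of_chunks=2))
```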
```python
ids = np.random.permutation(num_label_examples)
trackers_as_states_for_label = trackers_as_states[labels == label][
    ids
].tolist()
```
I think you can keep them as np arrays?
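A minimal sketch of what that would look like, using toy stand-in arrays rather than the PR's real data:

```python
import numpy as np

# Toy stand-ins for the PR's trackers_as_states / labels arrays (illustrative).
trackers_as_states = np.array(["s0", "s1", "s2", "s3"])
labels = np.array([0, 1, 0, 1])

label = 0
num_label_examples = int(np.sum(labels == label))
ids = np.random.permutation(num_label_examples)

# Keeping the result as an ndarray (no .tolist()) preserves NumPy indexing
# for downstream chunk slicing and avoids an extra copy into Python lists.
trackers_as_states_for_label = trackers_as_states[labels == label][ids]
```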
```python
data_chunk_dir = TempDirectoryPath(tempfile.mkdtemp())
data_chunk_files = []
for chunk_index in range(number_of_chunks):
```
I think you can create a generic function which will accumulate the members of a chunk for a given `trackers_as_X` list, where `X` can be `states`, `actions`, or `entities` — see the sketch below.
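A hedged sketch of such a helper; the function name and toy data are illustrative, not from the PR:

```python
from typing import Any, List, Sequence


def accumulate_chunk_members(
    trackers_as_x: Sequence[Any], chunk_indices: Sequence[int]
) -> List[Any]:
    """Collect the members of one chunk from any trackers_as_X list.

    X can be states, actions, or entities; the accumulation logic is
    identical for all three, so it only needs to be written once.
    """
    return [trackers_as_x[i] for i in chunk_indices]


# Usage: the same helper serves all three parallel lists.
chunk_indices = [0, 2]
states_chunk = accumulate_chunk_members(["s0", "s1", "s2"], chunk_indices)
actions_chunk = accumulate_chunk_members(["a0", "a1", "a2"], chunk_indices)
entities_chunk = accumulate_chunk_members(["e0", "e1", "e2"], chunk_indices)
```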
```python
for turn_state_features, turn_label_ids, turn_entity_tags in zip(
    tracker_state_features_chunk, label_ids_chunk, entity_tags_chunk
):
    example = tf.train.Example(
        features=tf.train.Features(
            feature=self._to_tf_features(
                turn_state_features, turn_label_ids, turn_entity_tags
            )
        )
    )
```
Hmm, we should check if we can merge this function and `persist_chunk` inside `rasa.shared.nlu.training_data.training_data.py` and create a generic TFRecord writer function that can be reused.
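A rough sketch of what a shared writer could look like; the name, location, and signature here are assumptions, not the PR's API:

```python
from pathlib import Path
from typing import Any, Callable, Dict, Iterable

import tensorflow as tf


def write_tf_records(
    file_path: Path,
    examples: Iterable[Any],
    to_tf_features: Callable[[Any], Dict[str, tf.train.Feature]],
) -> Path:
    """Serialize examples to one TFRecord file via a caller-supplied converter.

    Both the Core and NLU persist_chunk code paths could call this with
    their own to_tf_features conversion function.
    """
    with tf.io.TFRecordWriter(str(file_path)) as writer:
        for example in examples:
            tf_example = tf.train.Example(
                features=tf.train.Features(feature=to_tf_features(example))
            )
            writer.write(tf_example.SerializeToString())
    return file_path
```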
```python
    return file_path

def load_chunk(
```
Could you add some comments to this function for future reference?
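For illustration, a hedged sketch of the kind of commenting that would help; the body is a guess based on the TFRecord-based persistence above, not the PR's actual code:

```python
import tensorflow as tf


def load_chunk(file_path: str) -> tf.data.TFRecordDataset:
    # Each chunk was persisted as one TFRecord file (see persist_chunk),
    # so loading a chunk is just re-opening that file as a dataset.
    # Parsing the serialized tf.train.Example protos back into feature
    # dictionaries happens in a later step via a feature description.
    return tf.data.TFRecordDataset([file_path])
```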
```diff
@@ -590,7 +590,9 @@ async def _train_core_with_validated_data(
     await rasa.core.train.train_in_chunks(
         domain_file=domain,
         training_resource=file_importer,
-        output_path=Path(_train_path, DEFAULT_CORE_SUBDIRECTORY_NAME),
+        output_path=os.path.join(
```
Can you leave it as `Path` to be consistent with the rest of the code? Any particular reason why it should change?
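A small comparison of the two spellings; the `_train_path` value is made up for illustration:

```python
import os
from pathlib import Path

_train_path = "models/unpacked"  # illustrative value, not from the PR
DEFAULT_CORE_SUBDIRECTORY_NAME = "core"

# Both spellings build the same path; Path(...) simply keeps the call site
# consistent with the surrounding code, as the reviewer asks.
as_os_path = os.path.join(_train_path, DEFAULT_CORE_SUBDIRECTORY_NAME)
as_pathlib = Path(_train_path, DEFAULT_CORE_SUBDIRECTORY_NAME)

assert Path(as_os_path) == as_pathlib
```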
this is the change from main
```diff
@@ -0,0 +1,152 @@
+import logging
```
We can add the generic persist and load methods for chunks here.
```diff
@@ -590,7 +590,9 @@ async def _train_core_with_validated_data(
     await rasa.core.train.train_in_chunks(
```
Do we want to support finetuning?
It's okay to leave it out in case this is going to be a prototype, but I think you will just need to add a `model_to_finetune` parameter to this method and set the agent's ensemble to `model_to_finetune`'s policy ensemble, and the rest should be fine.
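A minimal sketch of that suggestion; the `Agent` class here is a stand-in for illustration, not `rasa.core.agent.Agent`:

```python
from typing import Optional


class Agent:
    """Minimal stand-in for the real Agent class (illustration only)."""

    def __init__(self, policy_ensemble: Optional[object] = None) -> None:
        self.policy_ensemble = policy_ensemble


def configure_for_finetuning(
    agent: Agent, model_to_finetune: Optional[Agent] = None
) -> Agent:
    # Per the reviewer's suggestion: reuse the previous model's policy
    # ensemble so chunked training continues from it instead of starting
    # from freshly initialised policies.
    if model_to_finetune is not None:
        agent.policy_ensemble = model_to_finetune.policy_ensemble
    return agent
```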
My thinking was: if you do fine-tuning, you don't fine-tune on a lot of data, so normal training should be good.
Yes.
NOTE: there is a known bug (maybe two) with this implementation (see the discussion here).

The first, known bug occurs with datasets that have a long-tail distribution of action labels, for example end-to-end datasets with lots of unique bot utterances (as the label for these is the utterance text itself). This means that the chunks are not (even roughly) equal in size: the first chunk will contain examples for each label, while subsequent chunks will be smaller, as there are no more examples left for the rare labels. This occurs here. I suspect this may be a problem with NLU chunking as well, though there we are far less likely to see this type of label distribution.

The second, potential bug, which is possibly related, is that the subsequent smaller chunks may not have entities. We hypothesized that this may be because (in this scenario with the long tail of unique bot utterances) the large chunk contains all the unique utterances and the subsequent chunks contain examples only for the more frequent labels.

The bug can be reproduced with MultiWOZ for 400 dialogues (312 train; these dialogues are produced with the script here with the number-of-dialogues flag set to 400) and chunk=5.
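To make the first bug concrete, here is a toy model of a label-stratified chunk assignment (not the PR's actual code) that produces the described skew:

```python
import numpy as np

# Long-tail label distribution: label 0 has 10 examples, labels 1..20
# are singletons (mimicking many unique bot utterances as labels).
labels = np.array([0] * 10 + list(range(1, 21)))
number_of_chunks = 5

chunks = [[] for _ in range(number_of_chunks)]
for label in np.unique(labels):
    example_ids = np.where(labels == label)[0]
    # Each label's examples are spread over the chunks in order, so a
    # label with a single example only ever lands in the first chunk.
    for position, example_id in enumerate(example_ids):
        chunks[position % number_of_chunks].append(example_id)

# The first chunk absorbs all singleton labels and dominates in size.
print([len(chunk) for chunk in chunks])  # -> [22, 2, 2, 2, 2]
```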
This PR has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Proposed changes:

Status (please check what you already did):
- [ ] reformat files using `black` (please check Readme for instructions)