MTF train script #295
Great job! Initial thoughts:
# Option 1 of data loading using --data-path
# For T0, data has to be provided in the form --data-path input-data target-data input-data2 target-data2 ...
if args.data_path:
    # TODO: Not yet compatible with dataset weights (Will break at prefixes, weights = analyze_data_prefix(args.data_path))
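The alternating input/target layout described in the comment above could be parsed as follows. This is a hedged sketch, not the PR's actual code; `split_input_target_pairs` is a hypothetical helper name.

```python
def split_input_target_pairs(data_path):
    """Split the --data-path list into (input, target) prefix pairs.

    Illustrates the alternating layout described above:
    input-data target-data input-data2 target-data2 ...
    """
    if len(data_path) % 2 != 0:
        raise ValueError(
            "--data-path must contain an even number of prefixes "
            "(alternating input and target)"
        )
    # Even positions are input prefixes, odd positions are target prefixes
    return list(zip(data_path[0::2], data_path[1::2]))
```

For example, `split_input_target_pairs(["in1", "tgt1", "in2", "tgt2"])` yields `[("in1", "tgt1"), ("in2", "tgt2")]`.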
I think the question is:
- Do we want separate indexed datasets for each T0 dataset?
  - Easy to remove unwanted indexed datasets (e.g. if the bias filtering reveals something)
- Or do we want to combine them all into one giant indexed dataset?
  - Probably faster to get started, since there is no need to worry about weights; small datasets with too few items to make a batch; BlendableDataset compatibility
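If the separate-datasets route is taken, the weighting mechanism mentioned above boils down to sampling a dataset index proportionally to its weight. A minimal sketch of that idea, assuming normalized or unnormalized non-negative weights (this is a simplified illustration, not the actual BlendableDataset code):

```python
import random

def sample_dataset_index(weights, rng=random):
    """Pick a dataset index with probability proportional to its weight.

    Simplified illustration of weighted blending across multiple
    indexed datasets; a real implementation would precompute the
    sample-to-dataset mapping rather than draw per sample.
    """
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    # Guard against floating-point accumulation falling just short
    return len(weights) - 1
```

With weights `[0.7, 0.3]`, roughly 70% of draws land on dataset 0.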
- data_path accepts multiple arguments (I haven't tested it, but it should work; it's just the weighting mechanism that we should have, IMO)
- That works as well. I don't really know how much we're going to be able to experiment with different filtered versions of T0.
Co-authored-by: Niklas Muennighoff <[email protected]>
- Remove unnecessary code from MTFDataset
- Create size API for MTF dataset
- Use new size API to build packed index much faster
# TODO @thomasw21 handle the case where a single sample cannot fit inside a row. We can
# - silently skip that value [currently implemented]
# - truncate to `seq_length`, and keep the right part
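The skip-with-a-warning behavior discussed below could look something like this. A hedged sketch only; `pack_samples` and its signature are hypothetical, and the real packing code operates on indexed-dataset documents rather than a plain length list.

```python
import logging

logger = logging.getLogger(__name__)

def pack_samples(sample_lengths, seq_length):
    """Greedily pack sample indices into rows of at most seq_length tokens.

    Samples longer than seq_length are skipped with a warning, matching
    the first option in the TODO above (the alternative would be to
    truncate to seq_length and keep the right part).
    """
    rows, current, used = [], [], 0
    for idx, length in enumerate(sample_lengths):
        if length > seq_length:
            logger.warning(
                "Sample %d (length %d) exceeds seq_length %d; skipping",
                idx, length, seq_length,
            )
            continue
        if used + length > seq_length:
            # Current row is full: start a new one
            rows.append(current)
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        rows.append(current)
    return rows
```

For instance, `pack_samples([3, 3, 3], 6)` packs the first two samples into one row and starts a second row for the third.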
I'm wondering if we actually should warn someone about it? Like add a warning or something. I don't know.
I think it should almost never happen for T0 & seq len 2048, right?
So having a warning would be good imo
Yeah ... let me add that.
Too late for those that wanted to review.
Co-authored-by: Lintang Sutawika <[email protected]>
Co-authored-by: Muennighoff <[email protected]>