Add support for iterate_over_all for the CombinedDataset #122

Merged: 8 commits, May 7, 2024

@tchaton (Collaborator) commented May 7, 2024

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

This PR improves the behaviour of the CombinedDataset for more traditional use cases.

Before this PR, the CombinedDataset stopped iterating as soon as one of its wrapped datasets triggered a StopIteration. This is great for LLM pre-training but less practical for more traditional use cases.

This PR introduces the iterate_over_all argument, which lets the combined dataset see every sample from every wrapped dataset. Sampling across datasets is still random; only the last items, those left once the other datasets are exhausted, are no longer sampled.
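
To illustrate the difference between the two modes, here is a minimal, self-contained sketch. It is not the code from this PR: `combined_iter`, its parameters, and the weight handling are hypothetical stand-ins, and the real CombinedDataset may select datasets and handle exhaustion differently.

```python
import random
from typing import Iterable, Iterator, Optional, Sequence


def combined_iter(
    datasets: Sequence[Iterable],
    weights: Optional[Sequence[float]] = None,
    iterate_over_all: bool = False,
    seed: int = 42,
) -> Iterator:
    """Hypothetical stand-in for a combined dataset's sampling loop."""
    rng = random.Random(seed)
    iterators = [iter(d) for d in datasets]
    # Assumption: uniform weights unless the caller provides positive weights.
    weights = list(weights) if weights is not None else [1.0] * len(iterators)
    active = list(range(len(iterators)))  # indices of datasets not yet exhausted

    while active:
        # Randomly pick one of the remaining datasets, proportionally to its weight.
        idx = rng.choices(active, weights=[weights[i] for i in active], k=1)[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            if not iterate_over_all:
                return  # old behaviour: stop at the first exhausted dataset
            active.remove(idx)  # new behaviour: keep drawing from the others


small, large = range(3), range(100, 110)
stops_early = list(combined_iter([small, large]))
sees_everything = list(combined_iter([small, large], iterate_over_all=True))
print(len(stops_early), len(sees_everything))
```

With `iterate_over_all=True` in this sketch, every item from both inputs is yielded exactly once. Once only one dataset remains, the random choice has a single candidate, so the trailing items simply come out in that dataset's own order.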

What does this PR do?

Fixes #112

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@tchaton requested a review from awaelchli as a code owner May 7, 2024 13:56
@tchaton changed the title from "Add support for iterate_over_all for the CombinedDataset." to "Add support for iterate_over_all for the CombinedDataset" May 7, 2024
@tchaton merged commit 015f21c into main May 7, 2024
32 checks passed
@tchaton deleted the add_do_all branch May 7, 2024 14:46
Development

Successfully merging this pull request may close these issues.

Progress bar missing with litdata.StreamingDataset and wrong number of steps in an epoch
2 participants