Dataloader now shuffles the shards and documents within #52

Open: wants to merge 2 commits into base: master


fraserlove

Added a function that splits a shard into its documents, shuffles them, and concatenates them back into a single shard. The order of the shards themselves is also shuffled. This removes the unusual periodicity present in the fineweb-edu dataset and results in a smoother training loss trajectory (see image below).
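For readers who want a concrete picture of the approach described above, here is a minimal sketch, not the code from this PR. It assumes each shard is a 1-D NumPy array of token ids and that every document starts with the GPT-2 end-of-text token (id 50256). The names `shuffle_documents`, `shuffled_shard_order`, and `EOT_TOKEN` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Assumption: every document begins with the GPT-2 end-of-text token.
EOT_TOKEN = 50256


def shuffle_documents(tokens, rng, eot=EOT_TOKEN):
    """Split a shard into documents, shuffle them, and concatenate them back."""
    tokens = np.asarray(tokens)
    starts = np.flatnonzero(tokens == eot)  # position of every delimiter
    if starts.size == 0:
        return tokens.copy()  # no delimiter found, nothing to shuffle
    # np.split cuts before each index, so each piece starts with its delimiter.
    pieces = np.split(tokens, starts)
    # The first piece holds any tokens before the first delimiter (possibly
    # none); it stays in place because it is not a complete document.
    head, docs = pieces[0], pieces[1:]
    order = rng.permutation(len(docs))
    return np.concatenate([head] + [docs[i] for i in order])


def shuffled_shard_order(shards, rng):
    """Return the shards (for example, file paths) in a random order."""
    return [shards[i] for i in rng.permutation(len(shards))]


if __name__ == "__main__":
    rng = np.random.default_rng(1337)
    shard = np.array([50256, 1, 2, 50256, 3, 50256, 4, 5, 6])
    print(shuffle_documents(shard, rng))
    print(shuffled_shard_order(["shard_0", "shard_1", "shard_2"], rng))
```

Passing a seeded `numpy.random.Generator` keeps the shuffle reproducible across runs. Reseeding per epoch would give a different document order each time the data is revisited.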

Review thread on train_gpt2.py (outdated, resolved)
@dustinwloring1988

@lukasugar I have been training multiple models and have even customized the model. I wanted to see if you would be interested in working together on a new model. I have a FastTokenizer with a 128K vocab and new special tokens. If you are interested, I can share more information in a PM.


3 participants