Dataloader now shuffles the shards and documents within #52

fraserlove · 2024-07-03T20:03:00Z

Added a function that splits a shard into documents, shuffles them, and concatenates them back into a single shard. The order of the shards is also shuffled. This removes the unusual periodicity present in the fineweb-edu dataset and results in a smoother training loss trajectory (see image below).
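For readers who want a concrete picture of the approach described above, here is a minimal sketch, not the code from this PR. It assumes each shard is a 1-D array of GPT-2 token ids in which every document begins with the end-of-text token (id 50256); the names `shuffle_documents` and `shuffled_shard_order` are hypothetical.

```python
import numpy as np

EOT = 50256  # GPT-2 end-of-text id; assumed here to mark the start of each document


def shuffle_documents(tokens, rng, eot=EOT):
    """Return a copy of `tokens` with its documents in a random order."""
    tokens = np.asarray(tokens)
    # Positions where a new document begins.
    starts = np.flatnonzero(tokens == eot)
    # Split at every boundary except index 0, which would yield an empty first piece.
    # If the shard opens mid-document, that leading fragment is treated as one document.
    docs = np.split(tokens, starts[starts > 0])
    order = rng.permutation(len(docs))
    return np.concatenate([docs[i] for i in order])


def shuffled_shard_order(shard_paths, rng):
    """Return the shard paths in a random order without modifying the input list."""
    return [shard_paths[i] for i in rng.permutation(len(shard_paths))]
```

Passing the same `np.random.Generator` to both helpers on every pass over the data gives a different shard and document order each epoch, since the generator's state advances with each call.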

train_gpt2.py

dustinwloring1988 · 2024-07-08T20:35:10Z

@lukasugar I have been training multiple models and have even customized the model. I wanted to see if you would be interested in working together on a new model. I have a FastTokenizer with a 128K vocab and new special tokens. If you are interested, I can share more information in a PM.

Dataloader now shuffles the shards and documents within

4fb2aa2

lukasugar reviewed Jul 4, 2024

train_gpt2.py (outdated, resolved)

Dataloader now reshuffles after each epoch

faaed08

fraserlove force-pushed the dataloader-shuffle branch from dd75feb to faaed08 Compare July 4, 2024 19:06
