[WIP] Simplified preparation of pretraining datasets #1057

Draft: awaelchli wants to merge 4 commits into main

@awaelchli (Contributor) commented Mar 7, 2024

The idea is that data modules that expose prepare_data can be called in advance to prepare their data. For in-memory datasets (e.g. finetuning) this is a no-op and not required. But for pretraining datasets (terabytes), this is very useful because preparation can be scaled to a large cluster with a single command:

litgpt prepare --data TinyLlama --tokenizer_dir checkpoints/meta-llama/Llama-2-7b-hf
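
To illustrate the idea, here is a minimal sketch of the pattern described above. It does not use LitGPT's actual classes: `DataModule`, `InMemoryFinetuningData`, `PretrainingData`, and `prepare` are hypothetical names, and the "tokenizer" is a stand-in that just records word lengths. The sketch only shows how a no-op `prepare_data` and an expensive one can sit behind the same entry point.

```python
from dataclasses import dataclass, field
from pathlib import Path


class DataModule:
    """Hypothetical base class: prepare_data defaults to a no-op."""

    def prepare_data(self) -> None:
        pass


@dataclass
class InMemoryFinetuningData(DataModule):
    """Small dataset held in memory; it inherits the no-op prepare_data."""

    samples: list = field(default_factory=lambda: ["hello world"])


@dataclass
class PretrainingData(DataModule):
    """Large corpus that is processed to disk once, ahead of training."""

    raw_texts: list
    output_dir: Path

    def prepare_data(self) -> None:
        self.output_dir.mkdir(parents=True, exist_ok=True)
        for i, text in enumerate(self.raw_texts):
            shard = self.output_dir / f"shard_{i:05d}.txt"
            if shard.exists():
                # Already prepared: skip so the call is safe to repeat.
                continue
            # Stand-in for real tokenization (assumption): one number per word.
            token_ids = [str(len(word)) for word in text.split()]
            shard.write_text(" ".join(token_ids))


def prepare(data: DataModule) -> None:
    """Hypothetical entry point that a prepare command could dispatch to."""
    data.prepare_data()
```

With this shape, calling `prepare(...)` on a finetuning module does nothing, while calling it on a pretraining module writes shards that a later training run could read directly.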

@awaelchli added the enhancement (New feature or request) label Mar 7, 2024
@carmocca added this to the Configurability milestone Mar 13, 2024
@carmocca (Contributor) commented

This is blocked because two optimize calls cannot currently be run together. In the meantime, maybe the tutorials should suggest python -m litgpt.data.prepare_* for people who use this externally.

@carmocca removed this from the Configurability milestone Mar 14, 2024
@awaelchli force-pushed the refactor/prepare_data branch from ade1c1b to f9099f3 April 8, 2024 09:32
@awaelchli changed the base branch from wip to main April 8, 2024 09:33