
Refactor pre tokenization tool #219

Open · wants to merge 7 commits into main
Conversation

@eliebak (Contributor) commented Aug 21, 2024

Support Slurm for launching the tokenization job, and add a parquet reader. Expose more options from the datatrove library and refactor the parser to use a single flat parser with no subparsers.

Simple example on how to use:

Before

python3 tools/preprocess_data.py \
    --tokenizer-name-or-path meta-llama/Meta-Llama-3-8B \
    --output-folder datasets/emotion \
    --n-tasks 16 \
    --reader hf \
    --dataset dair-ai/emotion

After

With slurm

python3 tools/preprocess_data.py \
    --tokenizer-name-or-path HuggingFaceTB/cosmo2-tokenizer \
    --output-folder datasets/cosmopedia-v2 \
    --n-tasks 100 \
    --reader parquet \
    --dataset hf://datasets/HuggingFaceTB/smollm-corpus/cosmopedia-v2 \
    --column text \
    --slurm \
    --partition "insert_cpu_partition_name"

Without slurm

python3 tools/preprocess_data.py \
    --tokenizer-name-or-path HuggingFaceTB/cosmo2-tokenizer \
    --output-folder datasets/cosmopedia-v2 \
    --n-tasks 100 \
    --reader parquet \
    --dataset hf://datasets/HuggingFaceTB/smollm-corpus/cosmopedia-v2 \
    --column text
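For illustration, the flat (no-subparser) argument layout implied by the commands above can be sketched as follows. This is a minimal sketch, not the PR's actual code: the flag names are taken from the examples, while the defaults, choices, and help strings are assumptions; the real parser in tools/preprocess_data.py exposes additional datatrove options.

```python
import argparse


def get_parser() -> argparse.ArgumentParser:
    # A single flat parser: --slurm is a plain boolean flag rather than a
    # subcommand, so local and Slurm launches share one argument set.
    parser = argparse.ArgumentParser(
        description="Pre-tokenize a dataset (sketch of the refactored CLI)"
    )
    parser.add_argument("--tokenizer-name-or-path", required=True,
                        help="HF hub name or local path of the tokenizer")
    parser.add_argument("--output-folder", required=True,
                        help="where the tokenized output is written")
    parser.add_argument("--n-tasks", type=int, default=8,
                        help="number of parallel tokenization tasks")
    parser.add_argument("--reader", choices=["hf", "jsonl", "parquet"],
                        default="jsonl", help="input reader type")
    parser.add_argument("--dataset", required=True,
                        help="dataset name or hf:// path")
    parser.add_argument("--column", default="text",
                        help="text column to tokenize")
    parser.add_argument("--slurm", action="store_true",
                        help="launch via a Slurm executor instead of locally")
    parser.add_argument("--partition", default=None,
                        help="Slurm CPU partition (used with --slurm)")
    return parser


if __name__ == "__main__":
    # Parse the Slurm example from the PR description.
    args = get_parser().parse_args(
        "--tokenizer-name-or-path HuggingFaceTB/cosmo2-tokenizer "
        "--output-folder datasets/cosmopedia-v2 --n-tasks 100 "
        "--reader parquet "
        "--dataset hf://datasets/HuggingFaceTB/smollm-corpus/cosmopedia-v2 "
        "--column text --slurm --partition cpu".split()
    )
    print(args.reader, args.n_tasks, args.slurm)  # → parquet 100 True
```

With this shape, the launching code only has to branch once on `args.slurm` to pick a Slurm or local datatrove executor, instead of maintaining two subparsers with duplicated options.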

@3outeille 3outeille self-assigned this Sep 3, 2024