
Refactor pre tokenization tool #219

Open · wants to merge 7 commits into main
Conversation

@eliebak (Contributor) commented Aug 21, 2024

Support Slurm for launching the tokenization job, and add a parquet reader. Expose more options from the datatrove library and refactor the parser to use a single flat parser with no subparsers.

Simple example on how to use:

Before

python3 tools/preprocess_data.py \
    --tokenizer-name-or-path meta-llama/Meta-Llama-3-8B \
    --output-folder datasets/emotion \
    --n-tasks 16 \
    --reader hf \
    --dataset dair-ai/emotion

After

With slurm

python3 tools/preprocess_data.py \
    --tokenizer-name-or-path HuggingFaceTB/cosmo2-tokenizer \
    --output-folder datasets/cosmopedia-v2 \
    --n-tasks 100 \
    --reader parquet \
    --dataset hf://datasets/HuggingFaceTB/smollm-corpus/cosmopedia-v2 \
    --column text \
    --slurm \
    --partition "insert_cpu_partition_name"

Without slurm

python3 tools/preprocess_data.py \
    --tokenizer-name-or-path HuggingFaceTB/cosmo2-tokenizer \
    --output-folder datasets/cosmopedia-v2 \
    --n-tasks 100 \
    --reader parquet \
    --dataset hf://datasets/HuggingFaceTB/smollm-corpus/cosmopedia-v2 \
    --column text
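For illustration, the flat (no-subparser) argument layout implied by the commands above can be sketched as follows. This is a minimal sketch, not the PR's actual code: the flag names are taken from the examples, while the defaults, choices, and help strings are assumptions; the real parser in tools/preprocess_data.py exposes additional datatrove options.

```python
import argparse


def get_parser() -> argparse.ArgumentParser:
    # A single flat parser: --slurm is a plain boolean flag rather than a
    # subcommand, so local and Slurm launches share one argument set.
    parser = argparse.ArgumentParser(
        description="Pre-tokenize a dataset (sketch of the refactored CLI)"
    )
    parser.add_argument("--tokenizer-name-or-path", required=True,
                        help="HF hub name or local path of the tokenizer")
    parser.add_argument("--output-folder", required=True,
                        help="where the tokenized output is written")
    parser.add_argument("--n-tasks", type=int, default=8,
                        help="number of parallel tokenization tasks")
    parser.add_argument("--reader", choices=["hf", "jsonl", "parquet"],
                        default="jsonl", help="input reader type")
    parser.add_argument("--dataset", required=True,
                        help="dataset name or hf:// path")
    parser.add_argument("--column", default="text",
                        help="text column to tokenize")
    parser.add_argument("--slurm", action="store_true",
                        help="launch via a Slurm executor instead of locally")
    parser.add_argument("--partition", default=None,
                        help="Slurm CPU partition (used with --slurm)")
    return parser


if __name__ == "__main__":
    # Parse the Slurm example from the PR description.
    args = get_parser().parse_args(
        "--tokenizer-name-or-path HuggingFaceTB/cosmo2-tokenizer "
        "--output-folder datasets/cosmopedia-v2 --n-tasks 100 "
        "--reader parquet "
        "--dataset hf://datasets/HuggingFaceTB/smollm-corpus/cosmopedia-v2 "
        "--column text --slurm --partition cpu".split()
    )
    print(args.reader, args.n_tasks, args.slurm)  # → parquet 100 True
```

With this shape, the launching code only has to branch once on `args.slurm` to pick a Slurm or local datatrove executor, instead of maintaining two subparsers with duplicated options.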

@3outeille 3outeille self-assigned this Sep 3, 2024