
[Deepspeed] getting multiple prints of: Avoid using tokenizers before the fork if possible #10400

Closed
stas00 opened this issue Feb 25, 2021 · 2 comments

stas00 (Contributor) commented Feb 25, 2021

On master, when running with DeepSpeed, I started getting multiple dumps of:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

This script:

export BS=16; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0,1 deepspeed --num_gpus=2 examples/seq2seq/run_seq2seq.py --model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 --do_eval --do_train --do_predict --evaluation_strategy=steps  --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro  --val_max_target_length 128 --warmup_steps 500 --max_train_samples 100 --max_val_samples 100 --max_test_samples 100 --dataset_name wmt16 --dataset_config ro-en  --source_prefix "translate English to Romanian: " --deepspeed examples/tests/deepspeed/ds_config.json

prints the warning 15 times.

There aren't 15 forks, so the warning is probably triggered by threads. The problem doesn't happen with DDP or DP.

Thank you.

@LysandreJik, @n1t0

@huggingface deleted a comment from chrissyjsartt Feb 25, 2021

stas00 (Contributor, Author) commented Feb 25, 2021

@chrissyjsartt, you probably accidentally subscribed to ("set to Watching") the transformers repository, which will now send you every comment on every Issue or PR.

So urgently go to https://github.com/watching and "Unwatch" this or any other repositories you may have set to Watch. Then you will stop getting these notifications.

stas00 (Contributor, Author) commented Feb 25, 2021

@LysandreJik replied elsewhere to set TOKENIZERS_PARALLELISM=false and to read huggingface/tokenizers#187 (comment) for the explanation of why this is needed.

But setting it to false could make things slow, so trying TOKENIZERS_PARALLELISM=true first is a better idea - if it doesn't hang, then all is good.
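
For reference, a minimal sketch (my illustration, not from the thread) of setting the variable before the fast tokenizer is first used; the t5-small model and the sample sentence are just placeholders:

    import os

    # Must be set before the fast (Rust) tokenizer is first used in this process.
    os.environ["TOKENIZERS_PARALLELISM"] = "false"  # or "true" to try keeping parallelism

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    print(tokenizer("translate English to Romanian: Hello world")["input_ids"])

Equivalently, the variable can be prefixed to the launch command itself, e.g. TOKENIZERS_PARALLELISM=false deepspeed ... as in the script above.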

Also Anthony shared:

If the tokenizer wasn't used to encode before forking the process, it shouldn't happen. So just a new encode_batch somewhere before the fork happens can be enough to trigger this.
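
To make that concrete, here is a rough reproduction sketch (my own, not from Anthony); it assumes a Unix fork and a fast tokenizer, again with t5-small only as a placeholder:

    import os
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-small", use_fast=True)

    # A batched encode before the fork exercises the Rust-side parallelism.
    tokenizer(["translate English to Romanian: hello"] * 8)

    pid = os.fork()

    # In the child, the next encode may print the warning quoted above,
    # because parallelism was already used before the fork.
    tokenizer(["translate English to Romanian: world"] * 8)

    if pid == 0:
        os._exit(0)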

stas00 closed this as completed Feb 25, 2021