enable dynamic compile for mpi (training) #1509
base: main
Conversation
@libinta this commit is required with the next OH release
@chaojun-zhang @astachowiczhabana
I have tested this with the following tests and it looks fine.
@chaojun-zhang I suggest we add a dynamic-compile test to test_text_generation (a rough sketch follows the test logs below). Thanks.
PT_ENABLE_INT64_SUPPORT=1 PT_HPU_LAZY_MODE=0 python run_mlm.py \
--model_name_or_path roberta-large \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--do_train \
--do_eval \
--output_dir /tmp/test-mlm \
--overwrite_output_dir \
--use_habana \
--torch_compile_backend hpu_backend --use_lazy_mode False \
--torch_compile \
--gaudi_config_name Habana/roberta-base \
--throughput_warmup_steps 3 \
--bf16
***** train metrics *****
epoch = 3.0
max_memory_allocated (GB) = 94.03
memory_allocated (GB) = 5.13
total_flos = 12500244GF
total_memory_available (GB) = 94.62
train_loss = 1.1576
train_runtime = 0:02:37.23
train_samples = 4798
train_samples_per_second = 114.138
train_steps_per_second = 14.273
11/25/2024 22:33:47 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:831] 2024-11-25 22:33:47,483 >> The following columns in the evaluation set don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `RobertaForMaskedLM.forward`, you can safely ignore this message.
[INFO|trainer.py:1850] 2024-11-25 22:33:47,487 >>
***** Running Evaluation *****
[INFO|trainer.py:1852] 2024-11-25 22:33:47,487 >> Num examples = 496
[INFO|trainer.py:1855] 2024-11-25 22:33:47,487 >> Batch size = 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 62/62 [00:03<00:00, 16.69it/s]
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.7633
eval_loss = 1.0517
eval_runtime = 0:00:03.80
eval_samples = 496
eval_samples_per_second = 131.343
eval_steps_per_second = 16.418
max_memory_allocated (GB) = 94.03
memory_allocated (GB) = 5.13
perplexity = 2.8626
total_memory_available (GB) = 94.62
PT_ENABLE_INT64_SUPPORT=1 PT_HPU_LAZY_MODE=0 python ../gaudi_spawn.py \
--world_size 8 --use_mpi run_mlm.py \
--model_name_or_path roberta-large \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--do_train \
--do_eval \
--output_dir /tmp/test-mlm \
--use_habana \
--torch_compile_backend hpu_backend --use_lazy_mode False \
--torch_compile \
--gaudi_config_name Habana/roberta-base \
--throughput_warmup_steps 3 \
--bf16
***** train metrics *****
epoch = 3.0
max_memory_allocated (GB) = 16.2
memory_allocated (GB) = 6.06
total_flos = 12500244GF
total_memory_available (GB) = 94.62
train_loss = 1.1425
train_runtime = 0:00:50.12
train_samples = 4798
train_samples_per_second = 723.717
train_steps_per_second = 11.313
11/25/2024 21:12:33 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:831] 2024-11-25 21:12:33,775 >> The following columns in the evaluation set don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `RobertaForMaskedLM.forward`, you can safely ignore this message.
[INFO|trainer.py:1850] 2024-11-25 21:12:33,779 >>
***** Running Evaluation *****
[INFO|trainer.py:1852] 2024-11-25 21:12:33,779 >> Num examples = 496
[INFO|trainer.py:1855] 2024-11-25 21:12:33,779 >> Batch size = 8
100%|██████████| 8/8 [00:00<00:00, 13.71it/s]
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.768
eval_loss = 1.0268
eval_runtime = 0:00:00.65
eval_samples = 496
eval_samples_per_second = 650.538
eval_steps_per_second = 10.7
max_memory_allocated (GB) = 16.2
memory_allocated (GB) = 6.06
perplexity = 2.7921
total_memory_available (GB) = 94.62
GAUDI2_CI=1 RUN_SLOW=true python -m pytest tests/test_text_generation_example.py -s -v -k torch_compile
.
.
================= 2 passed, 54 deselected in 160.06s (0:02:40) =================
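As a rough illustration of the dynamic-compile test suggested above, something along these lines could sit next to the existing torch_compile tests. This is only a sketch: the test name, the relative paths, the environment handling, and the pass criterion (exit code only, no throughput baseline) are assumptions and would need to follow the conventions already used in tests/test_text_generation_example.py.

# Hypothetical sketch only -- names, paths, and env handling are assumptions,
# not the existing structure of this repo's test suite.
import os
import subprocess

import pytest


@pytest.mark.parametrize("world_size", [8])
def test_dynamic_compile_mpi_mlm(world_size, tmp_path):
    """Launch an MPI multi-card MLM run with torch.compile and check it completes."""
    cmd = [
        "python", "../gaudi_spawn.py",
        "--world_size", str(world_size), "--use_mpi", "run_mlm.py",
        "--model_name_or_path", "roberta-large",
        "--dataset_name", "wikitext",
        "--dataset_config_name", "wikitext-2-raw-v1",
        "--per_device_train_batch_size", "8",
        "--do_train",
        "--output_dir", str(tmp_path),
        "--use_habana",
        "--use_lazy_mode", "False",
        "--torch_compile",
        "--torch_compile_backend", "hpu_backend",
        "--gaudi_config_name", "Habana/roberta-base",
        "--bf16",
    ]
    # Same env vars as the manual runs above, merged into the current environment.
    env = {**os.environ, "PT_ENABLE_INT64_SUPPORT": "1", "PT_HPU_LAZY_MODE": "0"}
    result = subprocess.run(cmd, env=env, check=False)
    # A non-zero return code means the dynamically compiled MPI run failed.
    assert result.returncode == 0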
The code quality check failed, please run
@chaojun-zhang I agree with @imangohari1 that we should add a test. Also, please sync your branch with main and run
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for your review, @regisss. This fix is primarily aimed at model training using MPI and torch.compile; the inference path does not currently reach this code, so there is no MPI integration for inference at present.
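For context, here is a minimal, illustrative sketch of what "dynamic compile" means at the torch.compile level. The backend name matches the --torch_compile_backend hpu_backend flag used in the runs above, but this snippet is not the trainer code changed by this PR, and the rationale given in the comments is an assumption.

# Illustrative sketch only -- not the code touched by this PR.
# Requires the Habana PyTorch bridge so that the "hpu_backend" backend is registered.
import torch

model = torch.nn.Linear(1024, 1024)

# dynamic=None (the default) compiles with static shapes first and recompiles with
# dynamic shapes when a new input size shows up; dynamic=True marks shapes dynamic
# from the first compilation, which avoids repeated per-rank recompilations when
# MPI workers see differently shaped batches (assumed rationale for this change).
compiled_model = torch.compile(model, backend="hpu_backend", dynamic=True)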
@chaojun-zhang
What does this PR do?
Enable dynamic compile for MPI (training).
Fixes # (issue)
Before submitting