enable dynamic compile for mpi (training) #1509

Open

chaojun-zhang wants to merge 1 commit into main from auto-pr-e2dc35c

Conversation

@chaojun-zhang (Contributor) commented Nov 21, 2024

What does this PR do?

Enable dynamic compile for MPI (training).

Fixes # (issue)
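
For context, the sketch below illustrates what "dynamic compile" means in practice. It is not this PR's actual diff; the helper compile_for_mpi_training and the OMPI_COMM_WORLD_SIZE probing are assumptions for illustration.

import os

import torch


# Illustrative sketch only, not this PR's change: force dynamic-shape
# compilation when the process was launched through MPI.
def compile_for_mpi_training(model: torch.nn.Module) -> torch.nn.Module:
    # Open MPI launchers (mpirun, gaudi_spawn.py --use_mpi) export
    # OMPI_COMM_WORLD_SIZE to every rank; a value > 1 means an MPI run.
    is_mpi = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1")) > 1
    # dynamic=True compiles with symbolic shapes, so shape changes across
    # steps (or ranks) do not trigger a fresh compile for every new shape;
    # dynamic=None keeps torch.compile's default automatic behavior.
    return torch.compile(
        model,
        backend="hpu_backend",  # requires the Habana PyTorch bridge
        dynamic=True if is_mpi else None,
    )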

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@chaojun-zhang chaojun-zhang changed the title enable compile dynamic for mpi enable dynamic compile with mpi Nov 21, 2024
@chaojun-zhang chaojun-zhang changed the title enable dynamic compile with mpi enable dynamic compile for mpi Nov 21, 2024
@astachowiczhabana (Contributor)

@libinta this commit is required for the next OH release.

@imangohari1 (Contributor)

@chaojun-zhang @astachowiczhabana
How can we test these changes?

@imangohari1 (Contributor) left a comment

I have tested this with the following runs and it looks fine.
@chaojun-zhang I suggest we add a test for dynamic compile to test_text_generation. Thanks.

PT_ENABLE_INT64_SUPPORT=1 PT_HPU_LAZY_MODE=0 python run_mlm.py \
    --model_name_or_path roberta-large \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm \
    --overwrite_output_dir \
    --use_habana \
    --torch_compile_backend hpu_backend --use_lazy_mode False \
    --torch_compile \
    --gaudi_config_name Habana/roberta-base \
    --throughput_warmup_steps 3 \
    --bf16

***** train metrics *****
 epoch                       =        3.0
 max_memory_allocated (GB)   =      94.03
 memory_allocated (GB)       =       5.13
 total_flos                  = 12500244GF
 total_memory_available (GB) =      94.62
 train_loss                  =     1.1576
 train_runtime               = 0:02:37.23
 train_samples               =       4798
 train_samples_per_second    =    114.138
 train_steps_per_second      =     14.273
11/25/2024 22:33:47 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:831] 2024-11-25 22:33:47,483 >> The following columns in the evaluation set don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `RobertaForMaskedLM.forward`,  you can safely ignore this message.
[INFO|trainer.py:1850] 2024-11-25 22:33:47,487 >>
***** Running Evaluation *****
[INFO|trainer.py:1852] 2024-11-25 22:33:47,487 >>   Num examples = 496
[INFO|trainer.py:1855] 2024-11-25 22:33:47,487 >>   Batch size = 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 62/62 [00:03<00:00, 16.69it/s]
***** eval metrics *****
 epoch                       =        3.0
 eval_accuracy               =     0.7633
 eval_loss                   =     1.0517
 eval_runtime                = 0:00:03.80
 eval_samples                =        496
 eval_samples_per_second     =    131.343
 eval_steps_per_second       =     16.418
 max_memory_allocated (GB)   =      94.03
 memory_allocated (GB)       =       5.13
 perplexity                  =     2.8626
 total_memory_available (GB) =      94.62

PT_ENABLE_INT64_SUPPORT=1 PT_HPU_LAZY_MODE=0 python ../gaudi_spawn.py \
    --world_size 8 --use_mpi run_mlm.py \
    --model_name_or_path roberta-large \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm \
    --use_habana \
    --torch_compile_backend hpu_backend --use_lazy_mode False \
    --torch_compile \
    --gaudi_config_name Habana/roberta-base \
    --throughput_warmup_steps 3 \
    --bf16

***** train metrics *****
  epoch                       =        3.0
  max_memory_allocated (GB)   =       16.2
  memory_allocated (GB)       =       6.06
  total_flos                  = 12500244GF
  total_memory_available (GB) =      94.62
  train_loss                  =     1.1425
  train_runtime               = 0:00:50.12
  train_samples               =       4798
  train_samples_per_second    =    723.717
  train_steps_per_second      =     11.313
11/25/2024 21:12:33 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:831] 2024-11-25 21:12:33,775 >> The following columns in the evaluation set don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `RobertaForMaskedLM.forward`,  you can safely ignore this message.
[INFO|trainer.py:1850] 2024-11-25 21:12:33,779 >>
***** Running Evaluation *****
[INFO|trainer.py:1852] 2024-11-25 21:12:33,779 >>   Num examples = 496
[INFO|trainer.py:1855] 2024-11-25 21:12:33,779 >>   Batch size = 8
100%|██████████| 8/8 [00:00<00:00, 13.71it/s]
***** eval metrics *****
  epoch                       =        3.0
  eval_accuracy               =      0.768
  eval_loss                   =     1.0268
  eval_runtime                = 0:00:00.65
  eval_samples                =        496
  eval_samples_per_second     =    650.538
  eval_steps_per_second       =       10.7
  max_memory_allocated (GB)   =       16.2
  memory_allocated (GB)       =       6.06
  perplexity                  =     2.7921
  total_memory_available (GB) =      94.62

GAUDI2_CI=1 RUN_SLOW=true python -m pytest tests/test_text_generation_example.py -s -v -k torch_compile
.
.
================= 2 passed, 54 deselected in 160.06s (0:02:40) =================

The code quality check failed, please run make style.

@regisss (Collaborator) commented Nov 26, 2024

@chaojun-zhang I agree with @imangohari1 that we should add a test.

Also, please sync your branch with main and run make style again.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@chaojun-zhang force-pushed the auto-pr-e2dc35c branch 2 times, most recently from dd9fc04 to 32b7eba on November 28, 2024 03:56
…rf Degrade by 41.02%

enable compile dynamic for mpi
@chaojun-zhang (Contributor, Author)

> @chaojun-zhang I agree with @imangohari1 that we should add a test.
>
> Also, please sync your branch with main and run make style again.

Thanks for your review, @regisss. This fix is primarily aimed at model training that uses MPI with torch.compile; the inference path does not currently reach this code, so there is no MPI integration for inference at present.
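
A minimal sketch of that scoping, assuming a hypothetical maybe_compile helper (this is an illustration, not the PR's actual diff):

import torch


# Hypothetical helper, for illustration only: the dynamic flag is applied
# on the training path, so inference is left untouched.
def maybe_compile(model: torch.nn.Module, training: bool) -> torch.nn.Module:
    if training:
        # MPI training sees shapes vary across steps and ranks; symbolic
        # (dynamic) shapes avoid a recompilation for every new shape.
        return torch.compile(model, backend="hpu_backend", dynamic=True)
    # The inference path never reaches this integration today.
    return model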

@chaojun-zhang chaojun-zhang changed the title enable dynamic compile for mpi enable dynamic compile for mpi(training) Nov 28, 2024
@imangohari1 (Contributor)

> @chaojun-zhang I agree with @imangohari1 that we should add a test.
> Also, please sync your branch with main and run make style again.
>
> Thanks for your review, @regisss. This fix is primarily aimed at model training that uses MPI with torch.compile; the inference path does not currently reach this code, so there is no MPI integration for inference at present.

@chaojun-zhang
Independent of the use case, we need to include a pytest for this.
Please look at the tests/ folder and add the proper tests covering what this PR is intended to do.
Thanks.
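
As a starting point, here is a hedged sketch of what such a test might look like. The test name and placement are hypothetical, and the command simply mirrors the multi-card run reproduced above:

import os
import subprocess

import pytest


# Hypothetical test sketch, not committed code: launches the MPI training
# run with torch.compile enabled and asserts it finishes cleanly.
@pytest.mark.parametrize("world_size", [8])
def test_mlm_dynamic_compile_mpi(world_size, tmp_path):
    cmd = [
        "python", "../gaudi_spawn.py",
        "--world_size", str(world_size), "--use_mpi",
        "run_mlm.py",
        "--model_name_or_path", "roberta-large",
        "--dataset_name", "wikitext",
        "--dataset_config_name", "wikitext-2-raw-v1",
        "--do_train",
        "--output_dir", str(tmp_path),
        "--use_habana",
        "--use_lazy_mode", "False",
        "--torch_compile",
        "--torch_compile_backend", "hpu_backend",
        "--gaudi_config_name", "Habana/roberta-base",
        "--bf16",
    ]
    env = {**os.environ, "PT_ENABLE_INT64_SUPPORT": "1", "PT_HPU_LAZY_MODE": "0"}
    result = subprocess.run(cmd, env=env)
    assert result.returncode == 0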
