enable dynamic compile for mpi (training) #1509

Open

chaojun-zhang wants to merge 1 commit into main from auto-pr-e2dc35c

Conversation

@chaojun-zhang (Contributor) commented Nov 21, 2024

What does this PR do?

Enable dynamic compile for MPI (training).

Fixes # (issue)
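
For context, the sketch below illustrates what "dynamic compile" means in practice. It is not this PR's actual diff; the helper compile_for_mpi_training and the OMPI_COMM_WORLD_SIZE probing are assumptions for illustration.

import os

import torch


# Illustrative sketch only, not this PR's change: force dynamic-shape
# compilation when the process was launched through MPI.
def compile_for_mpi_training(model: torch.nn.Module) -> torch.nn.Module:
    # Open MPI launchers (mpirun, gaudi_spawn.py --use_mpi) export
    # OMPI_COMM_WORLD_SIZE to every rank; a value > 1 means an MPI run.
    is_mpi = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1")) > 1
    # dynamic=True compiles with symbolic shapes, so shape changes across
    # steps (or ranks) do not trigger a fresh compile for every new shape;
    # dynamic=None keeps torch.compile's default automatic behavior.
    return torch.compile(
        model,
        backend="hpu_backend",  # requires the Habana PyTorch bridge
        dynamic=True if is_mpi else None,
    )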

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@chaojun-zhang chaojun-zhang changed the title enable compile dynamic for mpi enable dynamic compile with mpi Nov 21, 2024
@chaojun-zhang chaojun-zhang changed the title enable dynamic compile with mpi enable dynamic compile for mpi Nov 21, 2024
@astachowiczhabana (Contributor)

@libinta this commit is required for the next OH release.

@imangohari1 (Contributor)

@chaojun-zhang @astachowiczhabana
How can we test these changes?

@imangohari1 (Contributor) left a comment

I have tested this with the following runs and it looks fine.
@chaojun-zhang I suggest we add a test for dynamic compile to test_text_generation. Thanks.

PT_ENABLE_INT64_SUPPORT=1 PT_HPU_LAZY_MODE=0 python run_mlm.py \
    --model_name_or_path roberta-large \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm \
    --overwrite_output_dir \
    --use_habana \
    --torch_compile_backend hpu_backend --use_lazy_mode False \
    --torch_compile \
    --gaudi_config_name Habana/roberta-base \
    --throughput_warmup_steps 3 \
    --bf16

***** train metrics *****
 epoch                       =        3.0
 max_memory_allocated (GB)   =      94.03
 memory_allocated (GB)       =       5.13
 total_flos                  = 12500244GF
 total_memory_available (GB) =      94.62
 train_loss                  =     1.1576
 train_runtime               = 0:02:37.23
 train_samples               =       4798
 train_samples_per_second    =    114.138
 train_steps_per_second      =     14.273
11/25/2024 22:33:47 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:831] 2024-11-25 22:33:47,483 >> The following columns in the evaluation set don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `RobertaForMaskedLM.forward`,  you can safely ignore this message.
[INFO|trainer.py:1850] 2024-11-25 22:33:47,487 >>
***** Running Evaluation *****
[INFO|trainer.py:1852] 2024-11-25 22:33:47,487 >>   Num examples = 496
[INFO|trainer.py:1855] 2024-11-25 22:33:47,487 >>   Batch size = 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 62/62 [00:03<00:00, 16.69it/s]
***** eval metrics *****
 epoch                       =        3.0
 eval_accuracy               =     0.7633
 eval_loss                   =     1.0517
 eval_runtime                = 0:00:03.80
 eval_samples                =        496
 eval_samples_per_second     =    131.343
 eval_steps_per_second       =     16.418
 max_memory_allocated (GB)   =      94.03
 memory_allocated (GB)       =       5.13
 perplexity                  =     2.8626
 total_memory_available (GB) =      94.62

PT_ENABLE_INT64_SUPPORT=1 PT_HPU_LAZY_MODE=0 python ../gaudi_spawn.py \
    --world_size 8 --use_mpi run_mlm.py \
    --model_name_or_path roberta-large \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm \
    --use_habana \
    --torch_compile_backend hpu_backend --use_lazy_mode False \
    --torch_compile \
    --gaudi_config_name Habana/roberta-base \
    --throughput_warmup_steps 3 \
    --bf16

***** train metrics *****
  epoch                       =        3.0
  max_memory_allocated (GB)   =       16.2
  memory_allocated (GB)       =       6.06
  total_flos                  = 12500244GF
  total_memory_available (GB) =      94.62
  train_loss                  =     1.1425
  train_runtime               = 0:00:50.12
  train_samples               =       4798
  train_samples_per_second    =    723.717
  train_steps_per_second      =     11.313
11/25/2024 21:12:33 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:831] 2024-11-25 21:12:33,775 >> The following columns in the evaluation set don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `RobertaForMaskedLM.forward`,  you can safely ignore this message.
[INFO|trainer.py:1850] 2024-11-25 21:12:33,779 >>
***** Running Evaluation *****
[INFO|trainer.py:1852] 2024-11-25 21:12:33,779 >>   Num examples = 496
[INFO|trainer.py:1855] 2024-11-25 21:12:33,779 >>   Batch size = 8
100%|██████████| 8/8 [00:00<00:00, 13.71it/s]
***** eval metrics *****
  epoch                       =        3.0
  eval_accuracy               =      0.768
  eval_loss                   =     1.0268
  eval_runtime                = 0:00:00.65
  eval_samples                =        496
  eval_samples_per_second     =    650.538
  eval_steps_per_second       =       10.7
  max_memory_allocated (GB)   =       16.2
  memory_allocated (GB)       =       6.06
  perplexity                  =     2.7921
  total_memory_available (GB) =      94.62

GAUDI2_CI=1 RUN_SLOW=true python -m pytest tests/test_text_generation_example.py -s -v -k torch_compile
.
.
================= 2 passed, 54 deselected in 160.06s (0:02:40) =================

The code quality check failed, please run make style.

@regisss (Collaborator) commented Nov 26, 2024

@chaojun-zhang I agree with @imangohari1 that we should add a test.

Also, please sync your branch with main and run make style again.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@chaojun-zhang force-pushed the auto-pr-e2dc35c branch 2 times, most recently from dd9fc04 to 32b7eba on November 28, 2024 03:56
…rf Degrade by 41.02%

enable compile dynamic for mpi
@chaojun-zhang (Contributor, Author)

> @chaojun-zhang I agree with @imangohari1 that we should add a test.
>
> Also, please sync your branch with main and run make style again.

Thanks for your review, @regisss. This fix is primarily aimed at model training that uses MPI with torch.compile; the inference path does not currently reach this code, so there is no MPI integration for inference at present.
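
A minimal sketch of that scoping, assuming a hypothetical maybe_compile helper (this is an illustration, not the PR's actual diff):

import torch


# Hypothetical helper, for illustration only: the dynamic flag is applied
# on the training path, so inference is left untouched.
def maybe_compile(model: torch.nn.Module, training: bool) -> torch.nn.Module:
    if training:
        # MPI training sees shapes vary across steps and ranks; symbolic
        # (dynamic) shapes avoid a recompilation for every new shape.
        return torch.compile(model, backend="hpu_backend", dynamic=True)
    # The inference path never reaches this integration today.
    return model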

@chaojun-zhang chaojun-zhang changed the title enable dynamic compile for mpi enable dynamic compile for mpi(training) Nov 28, 2024
@imangohari1 (Contributor)

> @chaojun-zhang I agree with @imangohari1 that we should add a test.
> Also, please sync your branch with main and run make style again.
>
> Thanks for your review, @regisss. This fix is primarily aimed at model training that uses MPI with torch.compile; the inference path does not currently reach this code, so there is no MPI integration for inference at present.

@chaojun-zhang
Independent of the use case, we need to include a pytest for this.
Please look at the tests/ folder and add the proper tests covering what this PR is intended to do.
Thanks.
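
As a starting point, here is a hedged sketch of what such a test might look like. The test name and placement are hypothetical, and the command simply mirrors the multi-card run reproduced above:

import os
import subprocess

import pytest


# Hypothetical test sketch, not committed code: launches the MPI training
# run with torch.compile enabled and asserts it finishes cleanly.
@pytest.mark.parametrize("world_size", [8])
def test_mlm_dynamic_compile_mpi(world_size, tmp_path):
    cmd = [
        "python", "../gaudi_spawn.py",
        "--world_size", str(world_size), "--use_mpi",
        "run_mlm.py",
        "--model_name_or_path", "roberta-large",
        "--dataset_name", "wikitext",
        "--dataset_config_name", "wikitext-2-raw-v1",
        "--do_train",
        "--output_dir", str(tmp_path),
        "--use_habana",
        "--use_lazy_mode", "False",
        "--torch_compile",
        "--torch_compile_backend", "hpu_backend",
        "--gaudi_config_name", "Habana/roberta-base",
        "--bf16",
    ]
    env = {**os.environ, "PT_ENABLE_INT64_SUPPORT": "1", "PT_HPU_LAZY_MODE": "0"}
    result = subprocess.run(cmd, env=env)
    assert result.returncode == 0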
