Add support to run distributed training tests in CI #926

Closed
Nic-Ma opened this issue Aug 20, 2020 · 2 comments · Fixed by #1295
Labels: CI/CD, enhancement (New feature or request)

Comments

@Nic-Ma
Contributor

Nic-Ma commented Aug 20, 2020

Is your feature request related to a problem? Please describe.
We already have some distributed training test cases, for example: https://github.com/Project-MONAI/MONAI/blob/master/tests/test_handler_rocauc_dist.py
They need to be executed in our CI system.
As a first step, we can run them with 2 GPUs on 1 node.

@Nic-Ma Nic-Ma added the enhancement New feature or request label Aug 20, 2020
@IsaacYangSLA
Contributor

This seems straightforward. The CI job running inside the container can see all GPUs, so we can either add an item to the bash script or run the distributed tests as a separate step in CI. I will try to include it.

@wyli
Contributor

wyli commented Aug 29, 2020

Just a follow-up on this: the CI environment already provides 2 GPUs per job. The remaining tasks are to

  • update runtests.sh to run commands such as
    # python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_PER_NODE
    # --nnodes=NUM_NODES --node_rank=INDEX_CURRENT_NODE
    # --master_addr="192.168.1.1" --master_port=1234
    # test_handler_rocauc_dist.py

    if multiple GPUs are detected locally
  • double-check that the CI pipeline runs the updated runtests.sh properly
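
For illustration, the "run the launcher only if multiple GPUs are detected" step could be sketched roughly as below. This is a minimal sketch, not the change that was merged: the helper names `detect_gpu_count`, `build_launch_command`, and `run_distributed_test` are hypothetical, and the single-node values (`--nnodes=1`, `--node_rank=0`, a loopback master address, port 1234) are assumptions chosen to match the "2 GPUs in 1 node" setup discussed above.

```python
import subprocess
import sys


def detect_gpu_count():
    """Return the number of CUDA devices visible to PyTorch (0 if torch is missing)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def build_launch_command(test_script, num_gpus, master_addr="127.0.0.1", master_port=1234):
    """Build the argv for a single-node torch.distributed.launch run.

    Returns None when fewer than 2 GPUs are available, so the caller can skip
    the distributed test. The address and port defaults are assumptions.
    """
    if num_gpus < 2:
        return None
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}",  # one process per local GPU
        "--nnodes=1",                    # single node
        "--node_rank=0",                 # this node is the only one
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        test_script,
    ]


def run_distributed_test(test_script):
    """Run the test under the launcher if possible; return its exit code (0 if skipped)."""
    cmd = build_launch_command(test_script, detect_gpu_count())
    if cmd is None:
        print("Fewer than 2 GPUs detected, skipping distributed test.")
        return 0
    return subprocess.call(cmd)
```

A script such as runtests.sh could then invoke something like `run_distributed_test("tests/test_handler_rocauc_dist.py")` and fail the job on a non-zero return code. Keeping the command construction separate from the GPU detection makes the logic checkable on machines without GPUs.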

@wyli wyli added the CI/CD label Nov 12, 2020
wyli added a commit that referenced this issue Nov 27, 2020
wyli added a commit to wyli/MONAI that referenced this issue Dec 15, 2020
Signed-off-by: Wenqi Li <[email protected]>
wyli added a commit to wyli/MONAI that referenced this issue Dec 15, 2020
Signed-off-by: Wenqi Li <[email protected]>
IsaacYangSLA added a commit that referenced this issue Jan 4, 2021
Signed-off-by: Wenqi Li <[email protected]>

Co-authored-by: Nic Ma <[email protected]>
Co-authored-by: Isaac Yang <[email protected]>