926 distributed training tests #1295

Merged
wyli merged 6 commits into Project-MONAI:master from 926-distributed-training-tests on Nov 27, 2020

@wyli wyli commented Nov 25, 2020

Signed-off-by: Wenqi Li [email protected]

Fixes #926

Description

Adds a utility for distributed tests.
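
To illustrate what a utility of this kind involves, here is a minimal sketch. It is not the code added in this PR. It assumes a single machine and uses only the Python standard library. The names `rank_env`, `_worker`, and `run_distributed` are hypothetical, and the default address and port are assumptions. The environment variable names follow the `env://` initialization convention of `torch.distributed`.

```python
import multiprocessing as mp
import os


def rank_env(rank, world_size, master_addr="127.0.0.1", master_port=29500):
    """Build the per-process environment (hypothetical helper, assumed defaults)."""
    return {
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }


def _worker(fn, env, queue):
    """Run fn in a child process and report success or the error text."""
    os.environ.update(env)
    try:
        fn()
        queue.put((env["RANK"], None))
    except Exception as exc:  # report the failure to the parent
        queue.put((env["RANK"], repr(exc)))


def run_distributed(fn, world_size=2, timeout=60):
    """Call fn once per rank in separate processes; raise if any rank fails."""
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    procs = [
        ctx.Process(target=_worker, args=(fn, rank_env(rank, world_size), queue))
        for rank in range(world_size)
    ]
    for proc in procs:
        proc.start()
    # One result per process; queue.get raises queue.Empty on timeout.
    results = [queue.get(timeout=timeout) for _ in procs]
    for proc in procs:
        proc.join(timeout)
    failures = {rank: err for rank, err in results if err is not None}
    if failures:
        raise RuntimeError(f"ranks failed: {failures}")
```

With the spawn start method, `fn` has to be a picklable module-level function, and the call to `run_distributed` belongs under an `if __name__ == "__main__":` guard when run as a script.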

Status

Ready

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh --codeformat --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

@wyli wyli marked this pull request as draft November 25, 2020 23:39
@wyli wyli force-pushed the 926-distributed-training-tests branch 4 times, most recently from a81a782 to 424b607 Compare November 26, 2020 11:10
@wyli wyli requested review from Nic-Ma and ericspod and removed request for Nic-Ma November 26, 2020 11:34
@wyli wyli marked this pull request as ready for review November 26, 2020 11:34
wyli commented Nov 26, 2020

- Tested single-machine multiprocess runs on macOS and Ubuntu without a GPU.

On the Windows CI, the distributed test fails with the following error:

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\hostedtoolcache\windows\Python\3.8.6\x64\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.

I think this is a limitation of the CI instance rather than a code issue, so I skip the tests on Windows.

Updates:

  • Fixed the Windows issue with the third-party GitHub Action al-cheb/[email protected].
  • Tested single-machine multiprocess runs on macOS, Ubuntu, and Windows without a GPU.
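
For reference, a platform skip like the one mentioned earlier in this comment can be written with the standard `unittest` decorators. This is a minimal sketch under assumptions: `skip_on_windows`, `ExampleDistributedTest`, and `run_example` are hypothetical names and are not taken from this PR.

```python
import sys
import unittest

# Hypothetical decorator: skips the decorated test or test class on Windows.
skip_on_windows = unittest.skipIf(
    sys.platform == "win32", "distributed tests are skipped on Windows"
)


@skip_on_windows
class ExampleDistributedTest(unittest.TestCase):
    def test_placeholder(self):
        # Stand-in for a real distributed test body.
        self.assertTrue(True)


def run_example():
    """Run the example test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleDistributedTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

On Windows the result lists the test under `skipped`; elsewhere it runs normally.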

@wyli wyli requested a review from Nic-Ma November 26, 2020 11:42
wyli commented Nov 26, 2020

/integration-test

wyli added 3 commits November 26, 2020 17:11
Signed-off-by: Wenqi Li <[email protected]>
Signed-off-by: Wenqi Li <[email protected]>
@wyli wyli force-pushed the 926-distributed-training-tests branch from 98b4034 to 4748431 Compare November 26, 2020 17:15
@wyli wyli force-pushed the 926-distributed-training-tests branch from 4748431 to da4092b Compare November 26, 2020 17:30
tests/utils.py (review thread resolved)
@Nic-Ma Nic-Ma self-requested a review November 27, 2020 14:15
@Nic-Ma Nic-Ma left a comment

Let's use spawn for now, good enough for the first version, maybe some expert can help update it later.

Thanks.
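
As a sketch of what choosing spawn looks like in standard-library terms, the snippet below selects a start method through a context object instead of changing the global default. The helper name `get_mp_context` is hypothetical and is not part of this PR.

```python
import multiprocessing as mp


def get_mp_context(method="spawn"):
    """Return a multiprocessing context for the requested start method."""
    available = mp.get_all_start_methods()
    if method not in available:
        raise ValueError(
            f"start method {method!r} not available; choose from {available}"
        )
    # get_context leaves the global default start method untouched.
    return mp.get_context(method)
```

The spawn method starts a fresh interpreter for each child and is available on Linux, macOS, and Windows, whereas fork is not available on Windows.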

wyli commented Nov 27, 2020

> Let's use spawn for now, good enough for the first version, maybe some expert can help update it later.
>
> Thanks.

Sure, thanks! This also needs to be extended to multi-node test cases.

@wyli wyli merged commit dcc0a38 into Project-MONAI:master Nov 27, 2020
@wyli wyli deleted the 926-distributed-training-tests branch April 12, 2021 14:21
Development

Successfully merging this pull request may close these issues.

Add support to run distributed training tests in CI
2 participants