Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tf32 warnings #6816

Merged
merged 27 commits into from
Aug 7, 2023
Merged

Tf32 warnings #6816

merged 27 commits into from
Aug 7, 2023

Conversation

qingpeng9802
Copy link
Contributor

@qingpeng9802 qingpeng9802 commented Aug 3, 2023

about #6754 .

Description

show a warning if any thing may enable tf32 is detected

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

qingpeng9802 and others added 8 commits August 3, 2023 16:52
Signed-off-by: Qingpeng Li <[email protected]>
Signed-off-by: Qingpeng Li <[email protected]>
when a function/class is imported

Signed-off-by: Qingpeng Li <[email protected]>
Signed-off-by: Qingpeng Li <[email protected]>
Signed-off-by: Qingpeng Li <[email protected]>
Signed-off-by: Qingpeng Li <[email protected]>
@qingpeng9802
Copy link
Contributor Author

qingpeng9802 commented Aug 3, 2023

/black

Signed-off-by: monai-bot <[email protected]>
Copy link
Contributor

@wyli wyli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! please see some minor inline comments

monai/losses/__init__.py Outdated Show resolved Hide resolved
monai/utils/tf32.py Show resolved Hide resolved
Signed-off-by: Qingpeng Li <[email protected]>
@qingpeng9802 qingpeng9802 marked this pull request as ready for review August 4, 2023 18:48
@wyli
Copy link
Contributor

wyli commented Aug 4, 2023

/black

Copy link
Contributor

@wyli wyli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it looks good to me, testing with more environments

@wyli wyli enabled auto-merge (squash) August 4, 2023 22:22
@wyli wyli disabled auto-merge August 4, 2023 22:37
@qingpeng9802
Copy link
Contributor Author

qingpeng9802 commented Aug 5, 2023

The failed test is caused by #2161 (also see kornia/kornia#1951). Pytorch only initializes torch.cuda once if torch.cuda is called, and there is no way to revert the initialized state (pytorch/pytorch#28829). The only solution is letting users to adjust the position of os.environ["CUDA_VISIBLE_DEVICES"], which is recommend by a pytorch maintainer.

@qingpeng9802
Copy link
Contributor Author

/black

@wyli
Copy link
Contributor

wyli commented Aug 5, 2023

The failed test is caused by #2161 (also see kornia/kornia#1951). Pytorch only initializes torch.cuda once if torch.cuda is called, and there is no way to revert the initialized state (pytorch/pytorch#28829). The only solution is letting users to adjust the position of os.environ["CUDA_VISIBLE_DEVICES"], which is recommend by a pytorch maintainer.

ok... in this case how about only calling detect_default_tf32 in the config printing ?

def get_gpu_info() -> OrderedDict:

currently the import monai is a bit slow already.

@qingpeng9802
Copy link
Contributor Author

Per my test, the function reset_torch_cuda_after_run in the commit 348f089 can resolve the torch.cuda issue. @wyli Could you trigger the GPU CIs to confirm?
Also, this new added warning increases time <0.005% so it should be fine.

@wyli
Copy link
Contributor

wyli commented Aug 5, 2023

/build

@qingpeng9802
Copy link
Contributor Author

This failed case is kind of interesting🤔
Since this issue pytorch/pytorch#80876 is resolved in PyTorch 1.12.1, @wyli Could you trigger the GPU CI PT113 and PT210?

My test env is Pytorch version: 2.0.0+cu117 NVIDIA-SMI 515.65.01 Driver Version: 516.94 CUDA Version: 11.7, and this env can pass tests.test_set_visible_devices.

@wyli
Copy link
Contributor

wyli commented Aug 6, 2023

Sure, I’ll try to rerun them. I think the utility function in the PR looks good, but adding it and the workaround to monai/__init__.py is a bit risky. (I read the relevant GitHub discussions, most of them are trying to avoid unnecessary early calls to torch.cuda) @ericspod @Nic-Ma

@qingpeng9802
Copy link
Contributor Author

qingpeng9802 commented Aug 6, 2023

The key of torch.cuda issue is calling to https://github.com/pytorch/pytorch/blob/v2.0.1/torch/cuda/__init__.py#L219. IMO, importlib.reload should be relatively safe on PyTorch(Python) side. The main risk is that it is unclear whether PyTorch's initialization operation on CUDA is idempotent. That is, we would expect get the same result on the CUDA state after twice initialization.

Actually, I have an ugly but safe solution: subprocess.check_output("nvidia-smi") as alternative 😅.

monai/utils/tf32.py Outdated Show resolved Hide resolved
@qingpeng9802
Copy link
Contributor Author

qingpeng9802 commented Aug 7, 2023

As the disscussion above, we prefer to use alt 2. Feel free to add any comment on alt 2. If alt 2. is okay, I can push an alt 2. commit.

Looks like PyNVML is a safe choice. There are some similar usages in PyTorch, such as https://github.com/pytorch/pytorch/blob/v2.0.1/torch/cuda/__init__.py#L771-L794

@wyli
Copy link
Contributor

wyli commented Aug 7, 2023

sure, I think we can go for option 2 if it works fine across the test platforms and the import timing doesn't increase too much python -X importtime -c "import monai"

Signed-off-by: Qingpeng Li <[email protected]>
@qingpeng9802
Copy link
Contributor Author

/black

@wyli
Copy link
Contributor

wyli commented Aug 7, 2023

/build

requirements.txt Outdated Show resolved Hide resolved
@wyli
Copy link
Contributor

wyli commented Aug 7, 2023

/build

@wyli wyli enabled auto-merge (squash) August 7, 2023 16:19
@wyli wyli merged commit ca96867 into Project-MONAI:dev Aug 7, 2023
@qingpeng9802 qingpeng9802 mentioned this pull request Aug 8, 2023
7 tasks
wyli pushed a commit that referenced this pull request Aug 8, 2023
Following #6816 
### Description

make `is_tf32_env()` safer.
check `cuda`  to prevent fallthrough case when `pynvml` is not found

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u
--net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick
--unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/`
folder.

Signed-off-by: Qingpeng Li <[email protected]>
@qingpeng9802 qingpeng9802 deleted the tf32-warnings branch August 15, 2023 14:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants