Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[nccl-pg] Avoid using NCCL_ prefix for non-NCCL env variables #114077

Closed
wants to merge 1 commit into from

Conversation

pavanbalaji
Copy link
Contributor

NCCL_ prefix should only be used for NCCL library's environment variables. We currently use a few environment variables in PyTorch with the NCCL_ prefix that are the NCCL library does not understand.

This patch renames such environment variables to use the TORCH_NCCL_ prefix instead. We still maintain the old NCCL_ variables, but throw a warning when they are used.

The following env changes have been made:

NCCL_BLOCKING_WAIT -> TORCH_NCCL_BLOCKING_WAIT
NCCL_ENABLE_TIMING -> TORCH_NCCL_ENABLE_TIMING
NCCL_DESYNC_DEBUG -> TORCH_NCCL_DESYNC_DEBUG
NCCL_ASYNC_ERROR_HANDLING -> TORCH_NCCL_ASYNC_ERROR_HANDLING
ENABLE_NCCL_HEALTH_CHECK -> TORCH_ENABLE_NCCL_HEALTH_CHECK
NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK -> TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK

Fixes #ISSUE_NUMBER

NCCL_ prefix should only be used for NCCL library's environment
variables.  We currently use a few environment variables in PyTorch
with the NCCL_ prefix that are the NCCL library does not understand.

This patch renames such environment variables to use the TORCH_NCCL_
prefix instead.  We still maintain the old NCCL_ variables, but throw
a warning when they are used.

The following env changes have been made:

`NCCL_BLOCKING_WAIT` -> `TORCH_NCCL_BLOCKING_WAIT`
`NCCL_ENABLE_TIMING` -> `TORCH_NCCL_ENABLE_TIMING`
`NCCL_DESYNC_DEBUG` -> `TORCH_NCCL_DESYNC_DEBUG`
`NCCL_ASYNC_ERROR_HANDLING` -> `TORCH_NCCL_ASYNC_ERROR_HANDLING`
`ENABLE_NCCL_HEALTH_CHECK` -> `TORCH_ENABLE_NCCL_HEALTH_CHECK`
`NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK` -> `TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK`
@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Nov 19, 2023
Copy link

pytorch-bot bot commented Nov 19, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/114077

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 36f9d06 with merge base 226384b (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Copy link
Contributor

@pavanbalaji has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 21, 2023
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

chipturner added a commit to chipturner/pytorch that referenced this pull request Nov 30, 2023
… to fix warnings

Previously:

```
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
```

With this PR, those warnings disappear.  They were introduced in pytorch#114077

This change was generated with this sed script, applied with `sed -i -f /tmp/x **/*.{py,hpp,cpp,cc}` and hand inspected.

```
s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g
s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g
s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g
s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g
s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g
s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g
```
pytorchmergebot pushed a commit that referenced this pull request Dec 1, 2023
… to fix warnings (#114880)

Previously:

```
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
```

With this PR, those warnings disappear.  They were introduced in #114077

This change was generated with this sed script, applied with `sed -i -f /tmp/x **/*.{py,hpp,cpp,cc}` and hand inspected.

```
s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g
s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g
s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g
s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g
s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g
s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g
```

Pull Request resolved: #114880
Approved by: https://github.com/kwen2501
dmenig pushed a commit to dmenig/pytorch that referenced this pull request Dec 21, 2023
… to fix warnings (pytorch#114880)

Previously:

```
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
```

With this PR, those warnings disappear.  They were introduced in pytorch#114077

This change was generated with this sed script, applied with `sed -i -f /tmp/x **/*.{py,hpp,cpp,cc}` and hand inspected.

```
s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g
s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g
s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g
s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g
s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g
s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g
```

Pull Request resolved: pytorch#114880
Approved by: https://github.com/kwen2501
@GuWei007
Copy link

GuWei007 commented Jun 7, 2024

@pavanbalaji
How to use the environment variables ?

simonbyrne added a commit to simonbyrne/modulus that referenced this pull request Nov 15, 2024
It looks like the patch from pytorch/pytorch#114077 landed in torch 2.2.0.

Fixes NVIDIA#568.
simonbyrne added a commit to simonbyrne/modulus that referenced this pull request Nov 19, 2024
It looks like the patch from pytorch/pytorch#114077 landed in torch 2.2.0.

Fixes NVIDIA#568.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: distributed (c10d) release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants