RuntimeError: Distributed package doesn't have NCCL built in #70

Closed
manoj21192 opened this issue Aug 31, 2023 · 8 comments

Comments

@manoj21192

When I try to run example_completion.py on my Windows laptop, I get the error below:
[Image: Error]

I am using PyTorch 2.0 with CUDA 11.7. When I run the following:
import torch.distributed as dist

if dist.is_nccl_available():
    print("NCCL is available and built into PyTorch.")
else:
    print("NCCL is not available in this PyTorch installation.")

I get the output "NCCL is not available in this PyTorch installation."
What should I do?

@GaganHonor

Based on the error message you provided, it seems that you are hitting a ChildFailedError when running example_completion.py on your Windows laptop. This error is related to distributed execution in PyTorch (torch.distributed).

To fix this issue, you can try the following steps:

Check PyTorch Documentation: Visit the PyTorch documentation page on elastic errors to enable traceback and get more information about the error.

Update PyTorch and CUDA Versions: Make sure you are using compatible versions of PyTorch and CUDA. It is recommended to use the latest stable versions to ensure compatibility and access to the latest

@manoj21192
Author

I have checked everything but still cannot resolve it. I am using the latest version of PyTorch with CUDA 11.7.

@srinivaskumarramdas

I am getting the same error on my MacBook as well.

@GaganHonor

Please explain the problem in more detail and share your system info and all logs, either as text files or as links to them.

@manoj21192
Author

manoj21192 commented Sep 1, 2023

As of now, the 7B-parameter model works on Windows after changing generator.py to call torch.distributed.init_process_group("gloo") instead of "nccl".
Is this approach fine if I want to use a larger model in the future?
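
For anyone who wants to avoid hard-coding the backend, here is a minimal sketch of the same workaround that only falls back to "gloo" when NCCL is not usable. The helper names choose_backend and init_distributed are hypothetical (they are not in the repository), and the sketch assumes the script is started by a launcher such as torchrun that sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.

```python
def choose_backend(cuda_available: bool, nccl_available: bool) -> str:
    """Prefer NCCL only when CUDA and an NCCL-enabled build are both present."""
    return "nccl" if (cuda_available and nccl_available) else "gloo"


def init_distributed() -> str:
    """Initialize the default process group and return the backend used.

    Hypothetical helper: a stand-in for the hard-coded
    torch.distributed.init_process_group("nccl") call.
    """
    # Imported here so choose_backend stays usable without PyTorch installed.
    import torch
    import torch.distributed as dist

    backend = choose_backend(torch.cuda.is_available(), dist.is_nccl_available())
    if not dist.is_initialized():
        # Assumes the launcher has set RANK, WORLD_SIZE, MASTER_ADDR and
        # MASTER_PORT, which the default env:// initialization reads.
        dist.init_process_group(backend)
    return backend
```

On the Windows setup described above, where dist.is_nccl_available() returns False, this picks "gloo"; on a Linux machine with CUDA and an NCCL-enabled build it keeps "nccl".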

@manoj21192 manoj21192 reopened this Sep 1, 2023
@srinivaskumarramdas

srinivaskumarramdas commented Sep 1, 2023

"gloo" worked on a Mac with an Apple M2 chip as well. I needed some additional changes, which are covered in this PR: #18.

@GaganHonor

Please close this issue if it is fixed.

@manoj21192
Author

Use gloo to make it work on Windows
