RuntimeError: Distributed package doesn't have NCCL built in #70
Based on the error message you provided, you are encountering a ChildFailedError when trying to run the example_completion.py file on your Windows laptop. This error is related to distributed training in PyTorch. To fix this issue, you can try the following steps:
- Check the PyTorch documentation: visit the PyTorch documentation page on elastic errors to enable tracebacks and get more information about the error.
- Update your PyTorch and CUDA versions: make sure you are using compatible versions of PyTorch and CUDA. It is recommended to use the latest stable versions to ensure compatibility and access to the latest …
I checked everything but was unable to resolve it. I am using the latest version of PyTorch with CUDA 11.7.
I am getting the same error on my MacBook as well.
Please explain in more detail and share information such as your system info and all logs, either as text files or links to them.
As of now, the 7B parameter model works on Windows after changing generator.py to use torch.distributed.init_process_group("gloo") instead of "nccl".
"gloo" worked on an Apple M2 Mac as well. Some additional changes are covered in PR #18.
Use gloo to make it work on Windows |
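The workaround above can be generalized: prefer NCCL when the PyTorch build includes it and a GPU is present, and fall back to Gloo (which works on Windows and macOS CPU builds) otherwise. A minimal sketch, assuming a helper named `pick_backend` that you would call in place of the hard-coded `"nccl"` string (the helper name is ours, not part of the repo):

```python
import torch
import torch.distributed as dist

def pick_backend() -> str:
    """Choose a distributed backend available in this PyTorch build.

    NCCL needs both a build that includes it and a visible CUDA device;
    Gloo is the portable CPU fallback (Windows, macOS).
    """
    if dist.is_nccl_available() and torch.cuda.is_available():
        return "nccl"
    return "gloo"

# In generator.py, replace the hard-coded backend with, e.g.:
# dist.init_process_group(pick_backend())
```

On a Windows or Apple Silicon machine this returns "gloo", matching the fix described above; on a Linux box with CUDA and an NCCL-enabled build it keeps the original "nccl" behavior.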
When trying to run the example_completion.py file on my Windows laptop, I get the error in the issue title. I am using PyTorch 2.0 with CUDA 11.7. When I run the following check:
```python
import torch.distributed as dist

if dist.is_nccl_available():
    print("NCCL is available and built into PyTorch.")
else:
    print("NCCL is not available in this PyTorch installation.")
```
I get the output "NCCL is not available in this PyTorch installation." What should I do?
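Beyond checking availability, you can confirm that the Gloo fallback actually initializes on a machine without NCCL by spinning up a single-process group. This is a sketch under assumptions: `verify_gloo_init` is a hypothetical helper, and the address/port values are local placeholders for the env:// rendezvous.

```python
import os
import torch.distributed as dist

def verify_gloo_init() -> bool:
    """Initialize a single-process gloo group as a smoke test."""
    # The default env:// init method requires these variables;
    # the values below are placeholders for a local, single-node run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)
    ok = dist.is_initialized()
    dist.destroy_process_group()
    return ok
```

If this returns True, the gloo-based fix for generator.py described in the earlier comments should work on the same machine.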