Distributed package doesn't have NCCL / The requested address is not valid in its context. #104
Comments
NCCL is not available on Windows. Switch to Linux, or change the backend from "nccl" to "gloo" in example.py.
Won't that use CPU instead of GPU?
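A minimal sketch of the backend switch suggested above, assuming a recent PyTorch. The helper names `pick_backend` and `detect_backend` are hypothetical and are not part of example.py. `torch.distributed.is_available()` and `torch.distributed.is_nccl_available()` are existing PyTorch calls. The sketch only picks a backend name; it does not decide which device the model runs on.

```python
import sys


def pick_backend(platform_name: str, nccl_available: bool) -> str:
    """Return "nccl" only off Windows and when NCCL is available, else "gloo"."""
    if platform_name.startswith("win") or not nccl_available:
        return "gloo"
    return "nccl"


def detect_backend() -> str:
    """Query the local PyTorch install; fall back to "gloo" if torch is missing."""
    try:
        import torch.distributed as dist
    except ImportError:
        return "gloo"
    # is_nccl_available() is only defined when the distributed package is built in,
    # so is_available() is checked first.
    nccl_ok = dist.is_available() and dist.is_nccl_available()
    return pick_backend(sys.platform, nccl_ok)


if __name__ == "__main__":
    print(detect_backend())
```

The returned string could then be passed wherever the script currently hard-codes "nccl", for example as the backend argument of `torch.distributed.init_process_group`.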
NCCL is a pain. I'm assuming you are running this on Windows in conda or a similar environment? The easiest way is to just use hpc-sdk, as it includes NCCL. However, you will most likely have to download the tar from NVIDIA and extract it yourself. Ensure you have full privileges or it won't work.
@Inserian I encounter the same error on Ubuntu 20.04 with the nvidia-hpc-sdk module enabled. Do you know if there might be another error preventing llama from using NCCL?
I assumed we would just be running the smaller models on our own GPU without distributed training.
I had the same issues y'all described. I tried everything I could find and finally found my problem: if you install PyTorch via conda, the standard package is CPU-only. Here is a link with further information on how to install the GPU variant of PyTorch: https://pytorch.org/get-started/locally/ I hope this helps at least some of you.
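One way to check for the CPU-only install described above, as a rough sketch. The helper names `describe_build` and `check_local_torch` are hypothetical. `torch.version.cuda` and `torch.cuda.is_available()` are existing PyTorch attributes and calls; the assumption here is that `torch.version.cuda` is `None` on a build without CUDA support.

```python
from typing import Optional


def describe_build(cuda_version: Optional[str], cuda_available: bool) -> str:
    """Classify a PyTorch install from its reported CUDA version and GPU visibility."""
    if cuda_version is None:
        # Assumption: a build without CUDA support reports no CUDA version.
        return "cpu-only build"
    if not cuda_available:
        return "cuda build, no usable gpu"
    return "cuda build, gpu usable"


def check_local_torch() -> str:
    """Inspect the local install; report clearly if torch is not importable."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    return describe_build(torch.version.cuda, torch.cuda.is_available())


if __name__ == "__main__":
    print(check_local_torch())
```

If this reports a CPU-only build, reinstalling with the command generated on the linked page for your CUDA version is the likely next step.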
Seems like the issue is resolved by the suggestions above. Please re-open as needed with more detail.