NCCL WARN Failed to open libibverbs.so[.1] #1168
Comments
Are you using nvidia-runtime? |
InfiniBand (IB) is not tested with tf-operator. Maybe you should use Ethernet instead. |
Yes, I am using nvidia-container-runtime=2.0.0+docker18.09.7-3 |
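One possible way to act on the Ethernet suggestion above is to steer NCCL away from InfiniBand with environment variables before the first collective op runs. This is only a sketch; the eth0 interface name is an assumption about the pod network, not something confirmed in this thread.

```python
import os

# Assumption: force NCCL to skip InfiniBand and use the TCP interface.
# These must be set before the first collective op initializes NCCL.
os.environ.setdefault("NCCL_IB_DISABLE", "1")        # do not try libibverbs / InfiniBand
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # pod network interface name (assumption)
os.environ.setdefault("NCCL_DEBUG", "INFO")          # verbose NCCL logs while debugging
```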
Can you show us the code? |
I am using the official example code.
|
I think it may be caused by this part of the example:
# if your GPUs don't support NCCL, replace "communication" with another
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.NCCL) |
My NVIDIA GPU should be able to use NCCL; I don't know why this is not working, and it is the same situation. |
Try to use RING and show the new log, thanks. |
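For reference, switching the snippet above to RING in the same TF 2.x experimental API would look roughly like this; it is a sketch of the suggestion, not the poster's actual change. RING uses a ring-based collective implementation that does not require NCCL.

```python
import tensorflow as tf

# Sketch: use the RING collective implementation instead of NCCL,
# with the same experimental API shown earlier in this thread.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.RING)
```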
Code
Pod logs
|
OK, I will dive into it. But it seems that it does not affect the training process, right? |
yes |
This is the official Dockerfile.
May I ask, does tensorflow/tensorflow:2.1.0-gpu-py3 have NCCL? |
I am not sure whether you can use NCCL or not. Just avoid it to prevent problems. |
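One way to check whether the image actually ships NCCL is to try loading the shared library inside the container. A quick hedged check; the library name libnccl.so.2 is an assumption about how the image packages NCCL, not something stated in this thread.

```python
import ctypes

# Assumption: NCCL, if installed, is exposed as libnccl.so.2 inside the container.
try:
    ctypes.CDLL("libnccl.so.2")
    print("NCCL library found")
except OSError as err:
    print("NCCL library not found:", err)
```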
OK, thank you! And if I do not use NCCL? |
I think it does not affect the training process. It is just a warning. |
It trains, but it is still using only one GPU. |
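A quick diagnostic for the single-GPU symptom is to print what each worker can actually see. This sketch is not part of the official example; it only reports the visible devices and the number of replicas the strategy ends up with.

```python
import tensorflow as tf

# Diagnostic sketch: confirm how many GPUs this worker can see.
print("Physical GPUs:", tf.config.experimental.list_physical_devices("GPU"))

# And how many replicas the multi-worker strategy actually creates.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)
```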
Hi, did you solve the problem? |
I use the official example https://github.com/kubeflow/tf-operator/tree/master/examples/v1/distribution_strategy/keras-API
1. The pod is working but only one GPU is used.
2. The second failure is this warning: eval_fn is not passed in. The worker_fn will be used if an "evaluator" task exists in the cluster.

System information
These are the pod logs: