NCCL WARN Failed to open libibverbs.so[.1] #1168
Comments
Are you using nvidia-runtime? |
InfiniBand (IB) is not tested with tf-operator. Maybe you should use Ethernet instead. |
Yes, I am using nvidia-container-runtime=2.0.0+docker18.09.7-3 |
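One possible way to act on the Ethernet suggestion above is to steer NCCL away from InfiniBand with environment variables before the first collective op runs. This is only a sketch; the eth0 interface name is an assumption about the pod network, not something confirmed in this thread.

```python
import os

# Assumption: force NCCL to skip InfiniBand and use the TCP interface.
# These must be set before the first collective op initializes NCCL.
os.environ.setdefault("NCCL_IB_DISABLE", "1")        # do not try libibverbs / InfiniBand
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # pod network interface name (assumption)
os.environ.setdefault("NCCL_DEBUG", "INFO")          # verbose NCCL logs while debugging
```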
Can you show us the code? |
I am using the official example code.
|
I think it may be caused by this part of the example:
# if your GPUs don't support NCCL, replace "communication" with another
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.NCCL) |
My NVIDIA GPU should be able to use NCCL; I don't know why this is not working, and it is the same situation. |
Try to use RING and show the new log, thanks. |
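For reference, switching the snippet above to RING in the same TF 2.x experimental API would look roughly like this; it is a sketch of the suggestion, not the poster's actual change. RING uses a ring-based collective implementation that does not require NCCL.

```python
import tensorflow as tf

# Sketch: use the RING collective implementation instead of NCCL,
# with the same experimental API shown earlier in this thread.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.RING)
```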
Code
Pod logs
|
OK, I will dive into it. But it seems that it does not affect the training process, right? |
yes |
This is the official Dockerfile.
May I ask, does tensorflow/tensorflow:2.1.0-gpu-py3 have NCCL? |
I am not sure whether you can use NCCL or not. Just avoid it to prevent problems. |
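One way to check whether the image actually ships NCCL is to try loading the shared library inside the container. A quick hedged check; the library name libnccl.so.2 is an assumption about how the image packages NCCL, not something stated in this thread.

```python
import ctypes

# Assumption: NCCL, if installed, is exposed as libnccl.so.2 inside the container.
try:
    ctypes.CDLL("libnccl.so.2")
    print("NCCL library found")
except OSError as err:
    print("NCCL library not found:", err)
```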
OK, thank you! And if I do not use NCCL? |
I think it does not affect the training process. It is just a warning. |
It trains, but it is still using only one GPU. |
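A quick diagnostic for the single-GPU symptom is to print what each worker can actually see. This sketch is not part of the official example; it only reports the visible devices and the number of replicas the strategy ends up with.

```python
import tensorflow as tf

# Diagnostic sketch: confirm how many GPUs this worker can see.
print("Physical GPUs:", tf.config.experimental.list_physical_devices("GPU"))

# And how many replicas the multi-worker strategy actually creates.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)
```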
Hi, did you solve the problem? |
I use the official example https://github.com/kubeflow/tf-operator/tree/master/examples/v1/distribution_strategy/keras-API
1. The pod is working but only one GPU is used.
2. The second failure is this warning: eval_fn is not passed in. The worker_fn will be used if an "evaluator" task exists in the cluster.

System information
These are the pod logs: