Describe the current behavior
I use multi-worker training with Keras, but it only uses one GPU.
Errors:
1. error: Internal: Complete shape not known for Adam/allreduce/CollectiveReduce_2
2. `eval_fn` is not passed in. The `worker_fn` will be used if an "evaluator" task exists in the cluster
3. Most importantly, training only runs on one GPU.
I ran the code described below:
TEST 1: (3 machines)
TEST 2: (2 machines)
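The training script (tp720_1.py, copied into the image by the Dockerfile below) follows the standard MultiWorkerMirroredStrategy pattern. Here is a minimal sketch of it; the layer sizes are taken from the model summary in the pod log, while the optimizer settings, loss, and batch size are assumptions rather than the exact values used:

# Sketch of the multi-worker setup in tp720_1.py. Layer sizes match the
# printed model summary; optimizer, loss, and batch size are assumptions.
import os

import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("TF_CONFIG:", os.environ.get("TF_CONFIG", "{}"))

# TF_CONFIG is injected by the TFJob operator; MultiWorkerMirroredStrategy
# reads it to build the collective all-reduce cluster.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
print("Number of devices:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.GRU(32, return_sequences=True, input_shape=(14, 1)),
        tf.keras.layers.GRU(64, return_sequences=True),
        tf.keras.layers.GRU(128),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])

# x_train / y_train are numpy arrays of shape (2519999, 14, 1) and (2519999, 1).
# model.fit(x_train, y_train, epochs=30)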
Describe the expected behavior
Training should use multiple GPUs.
GPU device
$ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
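Each worker pod reports a single visible GPU (see the pod log below). A quick check from inside a pod, not part of the training script, confirms what TensorFlow can see:

# Run inside a worker pod to list the GPUs visible to TensorFlow.
import tensorflow as tf

print("Visible GPUs:", tf.config.experimental.list_physical_devices("GPU"))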
My Dockerfile
FROM tensorflow/tensorflow:2.1.0-gpu-py3
RUN apt-get update
RUN apt-get install -y libsm6 libxext6 libxrender-dev
RUN pip install opencv-python
RUN pip install Pillow
RUN mkdir -p /app
ADD tp720_1.py /app/
COPY nspo /app/
Pod logs
2020-06-08 12:52:04.146322: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-06-08 12:52:04.147962: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-06-08 12:52:04.788595: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-08 12:52:04.795356: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:05:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 15 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 238.66GiB/s
2020-06-08 12:52:04.795467: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-08 12:52:04.795550: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-08 12:52:04.798357: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-08 12:52:04.799032: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-08 12:52:04.802641: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-08 12:52:04.804156: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-08 12:52:04.804221: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-08 12:52:04.805601: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-06-08 12:52:04.806525: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-08 12:52:04.814745: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2198720000 Hz
2020-06-08 12:52:04.815811: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5c54090 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-08 12:52:04.815832: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-06-08 12:52:04.933082: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5cb9890 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-08 12:52:04.933126: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1070, Compute Capability 6.1
2020-06-08 12:52:04.934468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:05:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 15 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 238.66GiB/s
2020-06-08 12:52:04.934575: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-08 12:52:04.934620: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-08 12:52:04.934658: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-08 12:52:04.934694: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-08 12:52:04.934746: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-08 12:52:04.934794: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-08 12:52:04.934850: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-08 12:52:04.938735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-06-08 12:52:04.938852: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-08 12:52:05.301356: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-08 12:52:05.301383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-06-08 12:52:05.301391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-06-08 12:52:05.302813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6927 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:05:00.0, compute capability: 6.1)
2020-06-08 12:52:05.305432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:05:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 15 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 238.66GiB/s
2020-06-08 12:52:05.305510: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-08 12:52:05.305552: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-08 12:52:05.305590: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-08 12:52:05.305617: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-08 12:52:05.305649: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-08 12:52:05.305686: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-08 12:52:05.305721: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-08 12:52:05.306950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-06-08 12:52:05.306975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-08 12:52:05.306983: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-06-08 12:52:05.306988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-06-08 12:52:05.308211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:worker/replica:0/task:1/device:GPU:0 with 6927 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:05:00.0, compute capability: 6.1)
2020-06-08 12:52:05.314693: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job chief -> {0 -> nspo-rice-chief-0.kubeflow.svc:2222}
2020-06-08 12:52:05.314716: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> nspo-rice-worker-0.kubeflow.svc:2222, 1 -> localhost:2222}
2020-06-08 12:52:05.315663: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:2222
WARNING:tensorflow:`eval_fn` is not passed in. The `worker_fn` will be used if an "evaluator" task exists in the cluster.
WARNING:tensorflow:`eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
WARNING:tensorflow:ModelCheckpoint callback is not provided. Workers will need to restart training if any fails.
2020-06-08 12:52:16.277123: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:440] error: Internal: Complete shape not known for Adam/allreduce/CollectiveReduce_2
2020-06-08 12:52:16.277150: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:1056] error: Internal: Complete shape not known for Adam/allreduce/CollectiveReduce_2
2020-06-08 12:52:16.277227: E tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:1073] ScopedAllocatorOptimizer: Internal: Complete shape not known for Adam/allreduce/CollectiveReduce_2
2020-06-08 12:52:16.277234: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:846] error: Internal: Complete shape not known for Adam/allreduce/CollectiveReduce_2
2020-06-08 12:52:16.277886: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:561] scoped_allocator_optimizer failed: Internal: Complete shape not known for Adam/allreduce/CollectiveReduce_2
2020-06-08 12:52:16.672402: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
TensorFlow version: 2.1.0
TF_CONFIG %s {"cluster":{"chief":["nspo-rice-chief-0.kubeflow.svc:2222"],"worker":["nspo-rice-worker-0.kubeflow.svc:2222","nspo-rice-worker-1.kubeflow.svc:2222"]},"task":{"type":"worker","index":1},"environment":"cloud"}
cluster={} job_name={} task_index={}} {'chief': ['nspo-rice-chief-0.kubeflow.svc:2222'], 'worker': ['nspo-rice-worker-0.kubeflow.svc:2222', 'nspo-rice-worker-1.kubeflow.svc:2222']} worker 1
Number of devices: 3
data0: (1760157, 14)
data1: (1839843, 14)
(2519999, 14, 1)
(2519999, 1)
(1080001, 14, 1)
(1080001, 1)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
252
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
gru (GRU) (None, 14, 32) 3360
_________________________________________________________________
gru_1 (GRU) (None, 14, 64) 18816
_________________________________________________________________
gru_2 (GRU) (None, 128) 74496
_________________________________________________________________
batch_normalization (BatchNo (None, 128) 512
_________________________________________________________________
dense (Dense) (None, 1) 129
=================================================================
Total params: 97,313
Trainable params: 97,057
Non-trainable params: 256
_________________________________________________________________
Train for 252 steps
Epoch 1/30
252/252 [==============================] - 53s 209ms/step - loss: 0.1431 - acc: 0.8200
Epoch 2/30
252/252 [==============================] - 38s 149ms/step - loss: 0.0953 - acc: 0.8733
Epoch 3/30
252/252 [==============================] - 38s 149ms/step - loss: 0.0890 - acc: 0.8798