There are non-GPU devices - GPU devices not detected. #37

Open
StormWeaver opened this issue Jun 22, 2022 · 0 comments

Describe the bug
After building and running the Docker image, I receive a warning that no GPUs were detected, and the model_main_tf2.py script fails to run.

Your repo and the TensorFlow repo offer several different suggestions and solutions for similar issues, so I tried a few of them:

  • changing the NVIDIA GPU apt-key (this appeared to be an issue at one point, but reverting it recently made no difference)
  • disabling the gcloud and gsutil commands
  • adding a gpu_device_name check to model_main_XX.py
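For reference, the kind of device check mentioned in the last item can be sketched as below. This is a minimal sketch, not the exact check I added: it assumes TensorFlow 2.x and uses `tf.config.list_physical_devices`; the helper name `visible_gpus` is made up for illustration.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # tf.config.list_physical_devices is the TF 2.x call for enumerating devices.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif not gpus:
    print("TensorFlow sees no GPU; training would fall back to the CPU")
else:
    print("TensorFlow sees:", gpus)
```

An empty list here would be consistent with the MirroredStrategy log line that lists only CPU:0.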

I tried to install nvidia-docker directly with sudo; however, a password prompt appeared, and neither setting a password in the docker run section nor searching for an existing password got me past it.
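As far as I understand, GPU access is normally requested when the container is started rather than installed from inside it. The sketch below builds such a run command; it assumes Docker 19.03 or newer with the NVIDIA Container Toolkit set up on the host, which is what the `--gpus` flag relies on. The helper `docker_run_command` is hypothetical, and `od` is the image tag from the build step.

```python
import shlex


def docker_run_command(image, use_gpu=True):
    """Build a `docker run` argument list, optionally requesting all host GPUs."""
    cmd = ["docker", "run"]
    if use_gpu:
        # --gpus all asks Docker to expose every host GPU to the container.
        cmd += ["--gpus", "all"]
    cmd += ["-it", image]
    return cmd


print(shlex.join(docker_run_command("od")))
# prints: docker run --gpus all -it od
```

The returned list can be passed to `subprocess.run` as-is, or the printed line can be pasted into a shell.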

I have run my modified Dockerfiles and the originals side by side, and both give the same result.

To Reproduce
Steps to reproduce the behavior:

  1. Build the Docker image following the instructions on Git and/or the blog page (e.g. docker build -f research/object_detection/dockerfiles/tf2/Dockerfile -t od . ) (Docker - Linux Containers)
  2. Run the container (e.g. docker run -it od)
  3. Inside the container, after creating train.record and test.record successfully, run the training script (e.g. python object_detection/model_main_tf2.py --pipeline_config_path=object_detection/training/ssd_efficientdet_d0_512x512_coco17_tpu-8.config --model_dir=object_detection/training/ --alsologtostderr)
  4. See the error listed below.

WARNING:tensorflow:There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
W0622 03:08:07.678852 140015840864064 cross_device_ops.py:1386] There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
I0622 03:08:07.691202 140015840864064 mirrored_strategy.py:374] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I0622 03:08:07.694155 140015840864064 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0622 03:08:07.694274 140015840864064 config_util.py:552] Maybe overwriting use_bfloat16: False
I0622 03:08:07.699931 140015840864064 ssd_efficientnet_bifpn_feature_extractor.py:145] EfficientDet EfficientNet backbone version: efficientnet-b0
I0622 03:08:07.700029 140015840864064 ssd_efficientnet_bifpn_feature_extractor.py:147] EfficientDet BiFPN num filters: 64
I0622 03:08:07.700090 140015840864064 ssd_efficientnet_bifpn_feature_extractor.py:148] EfficientDet BiFPN num iterations: 3
I0622 03:08:07.702906 140015840864064 efficientnet_model.py:143] round_filter input=32 output=32
I0622 03:08:07.740919 140015840864064 efficientnet_model.py:143] round_filter input=32 output=32
I0622 03:08:07.741045 140015840864064 efficientnet_model.py:143] round_filter input=16 output=16
I0622 03:08:07.798309 140015840864064 efficientnet_model.py:143] round_filter input=16 output=16
I0622 03:08:07.798432 140015840864064 efficientnet_model.py:143] round_filter input=24 output=24
I0622 03:08:07.944522 140015840864064 efficientnet_model.py:143] round_filter input=24 output=24
I0622 03:08:07.944638 140015840864064 efficientnet_model.py:143] round_filter input=40 output=40
I0622 03:08:08.091527 140015840864064 efficientnet_model.py:143] round_filter input=40 output=40
I0622 03:08:08.091642 140015840864064 efficientnet_model.py:143] round_filter input=80 output=80
I0622 03:08:08.317637 140015840864064 efficientnet_model.py:143] round_filter input=80 output=80
I0622 03:08:08.317753 140015840864064 efficientnet_model.py:143] round_filter input=112 output=112
I0622 03:08:08.537171 140015840864064 efficientnet_model.py:143] round_filter input=112 output=112
I0622 03:08:08.537288 140015840864064 efficientnet_model.py:143] round_filter input=192 output=192
I0622 03:08:08.839897 140015840864064 efficientnet_model.py:143] round_filter input=192 output=192
I0622 03:08:08.840018 140015840864064 efficientnet_model.py:143] round_filter input=320 output=320
I0622 03:08:08.912957 140015840864064 efficientnet_model.py:143] round_filter input=1280 output=1280
I0622 03:08:08.947752 140015840864064 efficientnet_model.py:453] Building model efficientnet with params ModelConfig(width_coefficient=1.0, depth_coefficient=1.0, resolution=224, dropout_rate=0.2, blocks=(BlockConfig(input_filters=32, output_filters=16, kernel_size=3, num_repeat=1, expand_ratio=1, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=16, output_filters=24, kernel_size=3, num_repeat=2, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=24, output_filters=40, kernel_size=5, num_repeat=2, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=40, output_filters=80, kernel_size=3, num_repeat=3, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=80, output_filters=112, kernel_size=5, num_repeat=3, expand_ratio=6, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=112, output_filters=192, kernel_size=5, num_repeat=4, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=192, output_filters=320, kernel_size=3, num_repeat=1, expand_ratio=6, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise')), stem_base_filters=32, top_base_filters=1280, activation='simple_swish', batch_norm='default', bn_momentum=0.99, bn_epsilon=0.001, weight_decay=5e-06, drop_connect_rate=0.2, depth_divisor=8, min_depth=None, use_se=True, input_channels=3, num_classes=1000, model_name='efficientnet', rescale_input=False, data_format='channels_last', dtype='float32')
WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W0622 03:08:08.973673 140015840864064 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['object_detection/training/train.record']
I0622 03:08:08.980458 140015840864064 dataset_builder.py:162] Reading unweighted datasets: ['object_detection/training/train.record']
INFO:tensorflow:Reading record datasets for input file: ['object_detection/training/train.record']
I0622 03:08:08.980628 140015840864064 dataset_builder.py:79] Reading record datasets for input file: ['object_detection/training/train.record']
INFO:tensorflow:Number of filenames to read: 0
I0622 03:08:08.980718 140015840864064 dataset_builder.py:80] Number of filenames to read: 0
Traceback (most recent call last):
File "object_detection/model_main_tf2.py", line 120, in <module>
tf.compat.v1.app.run()
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 36, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "object_detection/model_main_tf2.py", line 111, in main
model_lib_v2.train_loop(
File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 563, in train_loop
train_input = strategy.experimental_distribute_datasets_from_function(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 357, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1195, in experimental_distribute_datasets_from_function
return self.distribute_datasets_from_function(dataset_fn, options)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1186, in distribute_datasets_from_function
return self._extended._distribute_datasets_from_function( # pylint: disable=protected-access
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 593, in _distribute_datasets_from_function
return input_util.get_distributed_datasets_from_function(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_util.py", line 132, in get_distributed_datasets_from_function
return input_lib.DistributedDatasetsFromFunction(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 1372, in __init__
self.build()
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 1393, in build
_create_datasets_from_function_with_input_context(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 1875, in _create_datasets_from_function_with_input_context
dataset = dataset_fn(ctx)
File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 554, in train_dataset_fn
train_input = inputs.train_input(
File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/inputs.py", line 908, in train_input
dataset = INPUT_BUILDER_UTIL_MAP['dataset_build'](
File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py", line 243, in build
dataset = read_dataset(
File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py", line 163, in read_dataset
return _read_dataset_internal(file_read_func, input_files,
File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py", line 82, in _read_dataset_internal
raise RuntimeError('Did not find any input files matching the glob pattern '
RuntimeError: Did not find any input files matching the glob pattern ['object_detection/training/train.record']
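The RuntimeError is about the input files rather than the GPU: this run logs "Number of filenames to read: 0", while the run under Additional context logs 1 for the same pattern. Since the pattern is a relative path, it is resolved against the current working directory. A small sketch for checking this from inside the container follows; the helper `find_records` is hypothetical, and the test.record path is an assumption based on where train.record is expected.

```python
import glob
import os


def find_records(patterns, base_dir="."):
    """Map each glob pattern to the sorted list of files it matches under base_dir."""
    return {
        pattern: sorted(glob.glob(os.path.join(base_dir, pattern)))
        for pattern in patterns
    }


patterns = [
    "object_detection/training/train.record",  # pattern from the error message
    "object_detection/training/test.record",   # assumed location of the test set
]
print("Working directory:", os.getcwd())
for pattern, matches in find_records(patterns).items():
    print(pattern, "->", matches if matches else "NO MATCH")
```

If a pattern reports NO MATCH, either the script was started from a different directory than the one the config paths assume, or the record files were written somewhere else.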

Expected behavior
Based on the instructions, training should start. Instead, I receive a series of warnings and errors, and the process halts or fails.

Desktop

  • OS: Windows 11 Pro
  • Browser: Chrome
  • Version: 102.0.5005.115

Additional context
During one attempt I received a slightly different set of messages. However, after I tried some workarounds to build the NVIDIA Docker image, these messages have not reappeared in later attempts:

tensorflow@943f2e0f8488:~/models/research$ python object_detection/model_main_tf2.py --pipeline_config_path=object_detection/training/ssd_efficientdet_d0_512x512_coco17_tpu-8.config --model_dir=object_detection/training/ --alsologtostderr
WARNING:tensorflow:There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
W0622 00:20:06.607538 140269787281216 cross_device_ops.py:1386] There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
I0622 00:20:06.611269 140269787281216 mirrored_strategy.py:374] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I0622 00:20:06.613795 140269787281216 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0622 00:20:06.613889 140269787281216 config_util.py:552] Maybe overwriting use_bfloat16: False
I0622 00:20:06.618684 140269787281216 ssd_efficientnet_bifpn_feature_extractor.py:145] EfficientDet EfficientNet backbone version: efficientnet-b0
I0622 00:20:06.618793 140269787281216 ssd_efficientnet_bifpn_feature_extractor.py:147] EfficientDet BiFPN num filters: 64
I0622 00:20:06.618869 140269787281216 ssd_efficientnet_bifpn_feature_extractor.py:148] EfficientDet BiFPN num iterations: 3
I0622 00:20:06.621786 140269787281216 efficientnet_model.py:143] round_filter input=32 output=32
I0622 00:20:06.710441 140269787281216 efficientnet_model.py:143] round_filter input=32 output=32
I0622 00:20:06.710567 140269787281216 efficientnet_model.py:143] round_filter input=16 output=16
I0622 00:20:06.767254 140269787281216 efficientnet_model.py:143] round_filter input=16 output=16
I0622 00:20:06.767369 140269787281216 efficientnet_model.py:143] round_filter input=24 output=24
I0622 00:20:06.913850 140269787281216 efficientnet_model.py:143] round_filter input=24 output=24
I0622 00:20:06.913977 140269787281216 efficientnet_model.py:143] round_filter input=40 output=40
I0622 00:20:07.055300 140269787281216 efficientnet_model.py:143] round_filter input=40 output=40
I0622 00:20:07.055412 140269787281216 efficientnet_model.py:143] round_filter input=80 output=80
I0622 00:20:07.269554 140269787281216 efficientnet_model.py:143] round_filter input=80 output=80
I0622 00:20:07.269668 140269787281216 efficientnet_model.py:143] round_filter input=112 output=112
I0622 00:20:07.485285 140269787281216 efficientnet_model.py:143] round_filter input=112 output=112
I0622 00:20:07.485399 140269787281216 efficientnet_model.py:143] round_filter input=192 output=192
I0622 00:20:07.789512 140269787281216 efficientnet_model.py:143] round_filter input=192 output=192
I0622 00:20:07.789628 140269787281216 efficientnet_model.py:143] round_filter input=320 output=320
I0622 00:20:07.861017 140269787281216 efficientnet_model.py:143] round_filter input=1280 output=1280
I0622 00:20:07.895739 140269787281216 efficientnet_model.py:453] Building model efficientnet with params ModelConfig(width_coefficient=1.0, depth_coefficient=1.0, resolution=224, dropout_rate=0.2, blocks=(BlockConfig(input_filters=32, output_filters=16, kernel_size=3, num_repeat=1, expand_ratio=1, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=16, output_filters=24, kernel_size=3, num_repeat=2, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=24, output_filters=40, kernel_size=5, num_repeat=2, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=40, output_filters=80, kernel_size=3, num_repeat=3, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=80, output_filters=112, kernel_size=5, num_repeat=3, expand_ratio=6, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=112, output_filters=192, kernel_size=5, num_repeat=4, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=192, output_filters=320, kernel_size=3, num_repeat=1, expand_ratio=6, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise')), stem_base_filters=32, top_base_filters=1280, activation='simple_swish', batch_norm='default', bn_momentum=0.99, bn_epsilon=0.001, weight_decay=5e-06, drop_connect_rate=0.2, depth_divisor=8, min_depth=None, use_se=True, input_channels=3, num_classes=1000, model_name='efficientnet', rescale_input=False, data_format='channels_last', dtype='float32')
WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W0622 00:20:07.921413 140269787281216 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['object_detection/training/train.record']
I0622 00:20:07.925122 140269787281216 dataset_builder.py:162] Reading unweighted datasets: ['object_detection/training/train.record']
INFO:tensorflow:Reading record datasets for input file: ['object_detection/training/train.record']
I0622 00:20:07.925260 140269787281216 dataset_builder.py:79] Reading record datasets for input file: ['object_detection/training/train.record']
INFO:tensorflow:Number of filenames to read: 1
I0622 00:20:07.925341 140269787281216 dataset_builder.py:80] Number of filenames to read: 1
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W0622 00:20:07.925419 140269787281216 dataset_builder.py:86] num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.deterministic.
W0622 00:20:07.926657 140269787281216 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.deterministic.
WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.map()
W0622 00:20:07.940346 140269787281216 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.map()
WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
W0622 00:20:12.002727 140269787281216 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0622 00:20:14.406469 140269787281216 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
/home/tensorflow/.local/lib/python3.8/site-packages/keras/backend.py:450: UserWarning: tf.keras.backend.set_learning_phase is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the training argument of the __call__ method of your layer or model.
warnings.warn('tf.keras.backend.set_learning_phase is deprecated and '
WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py:629: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
W0622 00:20:38.413714 140262147876608 deprecation.py:554] From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py:629: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
WARNING:tensorflow:Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a lossargument?
W0622 00:20:45.288741 140262147876608 utils.py:76] Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a lossargument?
WARNING:tensorflow:Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a lossargument?
W0622 00:20:53.888063 140262147876608 utils.py:76] Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a lossargument?
WARNING:tensorflow:Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a lossargument?
W0622 00:21:02.133720 140262147876608 utils.py:76] Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a lossargument?
WARNING:tensorflow:Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a lossargument?
W0622 00:21:12.010699 140262147876608 utils.py:76] Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a lossargument?
Killed
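
A bare "Killed" with no Python traceback usually means the process received SIGKILL. One common cause is the kernel's out-of-memory killer, though I have not confirmed that this is what happened here. The sketch below reads /proc/meminfo to show how much memory the container can see; it assumes a Linux environment, and the helper `read_meminfo` is made up for illustration.

```python
import os


def read_meminfo(path="/proc/meminfo"):
    """Parse a meminfo-style file into a dict mapping field name to its integer value."""
    info = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            parts = rest.split()
            # Memory fields look like "MemTotal:  16384 kB"; keep only the number.
            if parts and parts[0].isdigit():
                info[key.strip()] = int(parts[0])
    return info


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    mem = read_meminfo()
    for field in ("MemTotal", "MemAvailable"):
        if field in mem:
            # /proc/meminfo reports these fields in kB.
            print(f"{field}: {mem[field] / 1024 / 1024:.1f} GiB")
```

If MemAvailable is low before training starts, lowering batch_size in the pipeline config or giving the container more memory would be the first things to try.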
