
Remove expensive plugin initialization code #1114

Merged 1 commit into master on May 17, 2018
Conversation

@jart (Contributor) commented Apr 9, 2018

This change does some code cleanup on Beholder to fix #1107, where invoking
TensorFlow routines caused a nontrivial amount of GPU to be reserved.

@jart jart requested a review from nfelt April 9, 2018 20:22
@nfelt (Contributor) left a comment:


So this change looks like it's mostly refactoring to keep this code from running at construction time. Does that fix #1107? If it does, does that mean TensorBoard will start using the GPU again if somebody activates Beholder (or anything else using encode_png(), for that matter) and the PersistentOpEvaluator config doesn't actually work?

@@ -137,8 +135,7 @@ def _serve_change_config(self, request):
   def _serve_section_info(self, request):
     path = '{}/{}'.format(
         self.PLUGIN_LOGDIR, shared_config.SECTION_INFO_FILENAME)
-    info = file_system_tools.read_pickle(path, default=self.most_recent_info)
-    self.most_recent_info = info
+    info = file_system_tools.read_pickle(path, default=DEFAULT_INFO)
@nfelt:
Again, I'm not sure why we're removing the "cache last data shown" behavior. It's not really necessary to fix this bug: we could still keep most_recent_info, default to None here, and then

if info is None:
  info = self.most_recent_info if self.most_recent_info is not None else DEFAULT_INFO
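The suggestion above is a read-with-cached-fallback pattern: serve fresh data when it can be read, fall back to the last value shown, and only then to a static default. A minimal, Beholder-independent sketch of that pattern (the read_pickle helper here is a hypothetical stand-in for file_system_tools.read_pickle, and DEFAULT_INFO is a placeholder default):

```python
import pickle

DEFAULT_INFO = {'sections': []}  # hypothetical placeholder default

def read_pickle(path, default=None):
    # Mirrors the read_pickle contract: return `default` when the file
    # is missing or unreadable instead of raising.
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except (IOError, OSError, EOFError, pickle.UnpicklingError):
        return default

class SectionInfoServer(object):
    def __init__(self):
        self.most_recent_info = None  # cache of the last data shown

    def serve_section_info(self, path):
        info = read_pickle(path, default=None)
        if info is None:
            # Fall back to the cached value, then to the static default.
            info = (self.most_recent_info
                    if self.most_recent_info is not None else DEFAULT_INFO)
        else:
            self.most_recent_info = info
        return info
```

With this shape, a transient read failure keeps showing the previous data on screen rather than snapping back to the default.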

@nfelt:

Could we preserve the caching behavior here too if we're doing it for the frame? Right now init is storing most_recent_info but it's never being used. I think just reverting the delta at these lines would suffice.

@jart:
Done

except (message.DecodeError, IOError, tf.errors.NotFoundError):
return self.most_recent_frame

with self._lock: # only to put boundaries on workload for now
@nfelt:
I don't understand this comment: what does the lock accomplish? This code should be called synchronously from the _frame_generator() helper, so the locking would only come into play if we were somehow streaming via two requests at once, which shouldn't normally be possible in the front-end logic. Also, if we did want to bound the workload during multiple streams, it would make more sense to run the frame-fetching logic in a single thread. As written, two request threads would compete to fetch frames, so you might end up round-robining frames between the two streams, which seems very weird.

@jart:
_frame_generator() is an endpoint for a multi-threaded HTTP server. I agree that a single thread should compute this for all viewers, but it might not be worth the effort quite yet, since there will likely only be a single viewer. We don't yet have the resources to make Beholder scale, but we will in the future.
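The single-thread design discussed in this thread could be sketched as follows: one background thread polls for new frames, and any number of streaming request threads read the latest one. This is a hedged sketch of the idea, not Beholder's actual implementation; fetch_frame is a hypothetical stand-in for the real frame-reading logic.

```python
import threading

class FrameBroker(object):
    """One poller thread fetches frames; many reader threads share the result."""

    def __init__(self, fetch_frame, interval_secs=0.1):
        self._fetch_frame = fetch_frame  # callable returning the newest frame
        self._interval = interval_secs
        self._latest = None
        self._cond = threading.Condition()
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stopped.set()
        self._thread.join()

    def _poll(self):
        # Only this thread ever fetches, so concurrent streams never
        # compete for (or round-robin) frames.
        while not self._stopped.is_set():
            frame = self._fetch_frame()
            with self._cond:
                self._latest = frame
                self._cond.notify_all()  # wake any streaming request threads
            self._stopped.wait(self._interval)

    def latest_frame(self, timeout=1.0):
        # Called from request threads; blocks until a first frame exists.
        with self._cond:
            if self._latest is None:
                self._cond.wait(timeout)
            return self._latest
```

Every viewer then sees the same frame sequence, and the fetch workload is bounded by the single poller regardless of how many streams are open.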

return frame

except (message.DecodeError, IOError, tf.errors.NotFoundError):
return self.most_recent_frame
@nfelt:
Why are we removing most_recent_frame? That's a behavior change (missing data now immediately goes blank instead of preserving the last frame and section info on screen), and it's not clear to me that this is better for users. It doesn't seem necessary for fixing the initialization logic; you can still keep most_recent_frame and, if it's unset, default to the no-data PNG.

@jart:
Fixed.

@jart force-pushed the disable-gpu branch 2 times, most recently from 8a5d91c to 769dd26 on April 16, 2018 23:26
@jart:
PTAL @nfelt


@jart commented Apr 26, 2018

PTAL @nfelt; this is blocking #1155.

@@ -419,7 +419,8 @@ def _lazily_initialize(self):
     graph = tf.Graph()
     with graph.as_default():
       self.initialize_graph()
-    self._session = tf.Session(graph=graph)
+    config = tf.ConfigProto(device_count={'GPU': 0})
@nfelt:
Maybe add a comment here noting that we're deliberately keeping this off the GPU?

@jart:
Done
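For context on this thread: the diff above pins the session to the CPU with tf.ConfigProto(device_count={'GPU': 0}). A related, process-wide alternative is to hide the devices from CUDA entirely via the CUDA_VISIBLE_DEVICES environment variable. A minimal sketch of that alternative approach (not what this PR does):

```python
import os

def hide_gpus():
    # CUDA treats an empty CUDA_VISIBLE_DEVICES as "no devices visible",
    # so any TensorFlow session created afterwards falls back to the CPU.
    # This must run before the process creates its first CUDA context.
    os.environ['CUDA_VISIBLE_DEVICES'] = ''

hide_gpus()
```

The per-session ConfigProto approach used in the PR is more surgical, since it keeps other sessions in the same process free to use the GPU.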

@jart jart merged commit 2b7bb8b into tensorflow:master May 17, 2018
@jart jart deleted the disable-gpu branch May 17, 2018 23:50
Successfully merging this pull request may close these issues.

TensorBoard 1.7.0 is reserving GPU