Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize the /data/plugins_listing response. #625

Closed
chihuahua opened this issue Oct 5, 2017 · 1 comment
Closed

Optimize the /data/plugins_listing response. #625

chihuahua opened this issue Oct 5, 2017 · 1 comment
Assignees

Comments

@chihuahua
Copy link
Member

chihuahua commented Oct 5, 2017

TensorBoard only renders dashboards after it determines which ones are active. To do the latter, TensorBoard issues a request to the /data/plugins_listing endpoint.

.then(updateActiveDashboards, onFailure);

Hence, the /data/plugins_listing endpoint must return quickly - any time we shave off of the response time maps directly to time reduced for loading TensorBoard.

The server logic for that endpoint serially calls the is_active method for all plugins (active or not):

{plugin.plugin_name: plugin.is_active() for plugin in self._plugins},

In the past, optimizing the is_active method for plugins has made the difference between say a 1s load vs a 30s load ... vs a frontend that fails to load due to outlandish response times.

Some Action Items

  1. is_active methods that check for presence of data should short-circuit when any data is found. See Make scalars is_active short circuit if apt #621 for example.
  2. is_active methods that read from disk should consider not only short-circuiting, but also (1) computing on a separate thread and (2) caching after the plugin is found to be active. See Make is_active for projector return quickly #326 for example. Peruse the profile and text dashboards for potential optimizations.
@nfelt nfelt self-assigned this Oct 19, 2017
nfelt added a commit that referenced this issue Oct 19, 2017
This is a preliminary step in addressing #625 that just adds logging (at INFO) for the amount of time taken by the is_active() call for each plugin loaded by TensorBoard when the plugins listing endpoint is invoked. This way as we make fixes we can monitor for regressions, and it also can help isolate the problematic plugin if we get reports from users that TensorBoard is taking a long time to respond with the list of active plugins (which is a prerequisite for basically any other usage of TensorBoard).

I also bundled in a couple small tweaks to other logging lines that can help diagnose slow plugin behavior.
nfelt added a commit that referenced this issue Oct 26, 2017
Addresses #625 by changing TextPlugin to avoid blocking its `is_active()` method on reading from the filesystem.  Note that "reading from the filesystem" is not only reading once, but reading in all plugin assets from all runs, which could be dozens of reads per run, potentially over the network if using a remote filesystem.  In my testing this could easily cause `is_active()` to take 30+ seconds that kind of scenario, which in turn blocks the `/plugins_listing` response for the whole of TensorBoard.

Since the TextPlugin's `is_active()` is based on its `index_impl()` (which backs its `/tags` route) and the expensive filesystem-checking logic is shared, this change puts the entirety of `index_impl()` on a background thread.  This way the `/tags` route can also benefit from not blocking.

The semantics are now that `is_active()` will return False and `index_impl()` will return a placeholder empty response when first invoked, but they'll kick off at most one background thread at a time to do the computation and store the resulting index response.  Once the plugin is detected to be active, `is_active()` will no longer kick off new background threads when invoked, but `index_impl()` will still kick off a new background thread when called (if one is not running already) so that refreshes will pick up new data as it gets written to the logdir.
jart pushed a commit to jart/tensorboard that referenced this issue Oct 31, 2017
)

Addresses tensorflow#625 by changing TextPlugin to avoid blocking its `is_active()` method on reading from the filesystem.  Note that "reading from the filesystem" is not only reading once, but reading in all plugin assets from all runs, which could be dozens of reads per run, potentially over the network if using a remote filesystem.  In my testing this could easily cause `is_active()` to take 30+ seconds that kind of scenario, which in turn blocks the `/plugins_listing` response for the whole of TensorBoard.

Since the TextPlugin's `is_active()` is based on its `index_impl()` (which backs its `/tags` route) and the expensive filesystem-checking logic is shared, this change puts the entirety of `index_impl()` on a background thread.  This way the `/tags` route can also benefit from not blocking.

The semantics are now that `is_active()` will return False and `index_impl()` will return a placeholder empty response when first invoked, but they'll kick off at most one background thread at a time to do the computation and store the resulting index response.  Once the plugin is detected to be active, `is_active()` will no longer kick off new background threads when invoked, but `index_impl()` will still kick off a new background thread when called (if one is not running already) so that refreshes will pick up new data as it gets written to the logdir.
jart pushed a commit to jart/tensorboard that referenced this issue Oct 31, 2017
This is a preliminary step in addressing tensorflow#625 that just adds logging (at INFO) for the amount of time taken by the is_active() call for each plugin loaded by TensorBoard when the plugins listing endpoint is invoked. This way as we make fixes we can monitor for regressions, and it also can help isolate the problematic plugin if we get reports from users that TensorBoard is taking a long time to respond with the list of active plugins (which is a prerequisite for basically any other usage of TensorBoard).

I also bundled in a couple small tweaks to other logging lines that can help diagnose slow plugin behavior.
jart pushed a commit that referenced this issue Nov 1, 2017
Addresses #625 by changing TextPlugin to avoid blocking its `is_active()` method on reading from the filesystem.  Note that "reading from the filesystem" is not only reading once, but reading in all plugin assets from all runs, which could be dozens of reads per run, potentially over the network if using a remote filesystem.  In my testing this could easily cause `is_active()` to take 30+ seconds that kind of scenario, which in turn blocks the `/plugins_listing` response for the whole of TensorBoard.

Since the TextPlugin's `is_active()` is based on its `index_impl()` (which backs its `/tags` route) and the expensive filesystem-checking logic is shared, this change puts the entirety of `index_impl()` on a background thread.  This way the `/tags` route can also benefit from not blocking.

The semantics are now that `is_active()` will return False and `index_impl()` will return a placeholder empty response when first invoked, but they'll kick off at most one background thread at a time to do the computation and store the resulting index response.  Once the plugin is detected to be active, `is_active()` will no longer kick off new background threads when invoked, but `index_impl()` will still kick off a new background thread when called (if one is not running already) so that refreshes will pick up new data as it gets written to the logdir.
jart pushed a commit that referenced this issue Nov 1, 2017
This is a preliminary step in addressing #625 that just adds logging (at INFO) for the amount of time taken by the is_active() call for each plugin loaded by TensorBoard when the plugins listing endpoint is invoked. This way as we make fixes we can monitor for regressions, and it also can help isolate the problematic plugin if we get reports from users that TensorBoard is taking a long time to respond with the list of active plugins (which is a prerequisite for basically any other usage of TensorBoard).

I also bundled in a couple small tweaks to other logging lines that can help diagnose slow plugin behavior.
chihuahua added a commit that referenced this issue Mar 15, 2018
…st before checking plugin assets. (#1021)

We make the text plugin's is_active method return true if it detects
relevant data present within the multiplexer. Subsequently, there would
be no more need to check for data stored within plugin assets.

This solves an issue: The text plugin used to be inactive when the user
initially loads into it because the thread for computing whether the plugin
is active would not have completed executing yet.

Part of #625.

Test plan:

Run the [text demo](https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/text/text_demo.py). Verify behavior. The text plugin should be active on
the first load.
@nfelt
Copy link
Contributor

nfelt commented Dec 17, 2019

I think we can mark this done; we moved the expensive is_active() calls into background threads. This introduces new problems as described in #2102 (plugins don't load initially and seem like they're just absent) but we can track that there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants