TensorBoard very slow on GCS #158

Closed
teamdandelion opened this issue Jun 26, 2017 · 9 comments

@teamdandelion
Contributor

From @bw4sz on #80:

Loading roughly 200 MB of logs via TensorBoard on GCS takes more than ten minutes, but if the same event files are copied locally, they load in seconds.

I am guessing we are running into an issue where on GCS we load the events without readahead chunks, so we make round-trip requests incredibly frequently, maybe as often as once per individual event. This makes for horrible performance.
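As a stopgap, the local-copy workaround described above can be scripted; the bucket and destination paths here are placeholders, not anything from this issue:

# Copy the event files out of GCS once, then read them from local disk.
gsutil -m cp -r gs://my-bucket/logs /tmp/tb-logs
tensorboard --logdir=/tmp/tb-logs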

potiuk added a commit to potiuk/appengine-tensorboard that referenced this issue Aug 7, 2017
The parameter "reload_frequency" has changed to reload_interval
some time between 0.8 and 1.2.1 This change makes it a consistent
now - there has been and inconsistency - in configuration files
and run_tensorboard.sh it was still RELOAD_FREQUENCY,
 where in the example-app there was
already (unused) RELOAD_INTERVAL enviroment variable.

This commit fixes it and makes it RELOAD_INTERVAL everywhere, together
with fixing the version of tensorboard in the Dockerfile (using latest
in such dockerfile is a bad practice - in case of such incompatible
changes in parameter values, it might simply silently stop working
properly as it did this time).

Also this commit changes the default value of the RELOAD_INTERVAL
parameter. Due to the issue:
tensorflow/tensorboard#158 it seems
that accessing GCS directly causes a lot of costs connected with
high GCP API count usage, therefore if you have thousands of log files
(which is not a lot) it is very easy to overcharge your GCP account with
millions of requests every day just having tensorboard idling and
checking for new data. In our case we got about 4 USD/day for around
3000 files which is quite incredible.
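For illustration only (a hypothetical snippet, not the actual run_tensorboard.sh), an environment variable like this typically feeds the flag as follows:

# RELOAD_INTERVAL (seconds) controls how often TensorBoard rescans the logdir;
# a larger value means fewer GCS API calls. The 600-second fallback is illustrative.
tensorboard --logdir="${LOG_DIR}" --reload_interval="${RELOAD_INTERVAL:-600}"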
@potiuk

potiuk commented Aug 7, 2017

Related to that (and more important) - we have also found that using tensorboard to read directly from GCS with the default reload interval incurs incredible costs.

There is a cost tied to the number of API calls made to access GCS, and it seems that every reload directly from GCS causes every single file to be accessed separately via the GCS API. With a bigger (but not outrageous) number of log files (a few thousand is enough) this can very quickly get out of control, and with tensorboard's default reload frequency it can drive your GCS costs unreasonably high.

In our case it was about 4 USD/day with just around 3000 files in the logs (!). About 4 million API requests were generated in 4 days just by running one tensorboard instance. Because of that, we are now changing strategy to sync the GCS data locally (using gsutil rsync) and read it from there; a sketch of that setup is below.
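A minimal sketch of that sync strategy, assuming a placeholder bucket, local path, and 60-second interval:

# Mirror the GCS logdir into a local directory in the background
# and let TensorBoard read only the local copy.
mkdir -p /tmp/tb-logs
while true; do
  gsutil -m rsync -r gs://my-bucket/logs /tmp/tb-logs
  sleep 60
done &
tensorboard --logdir=/tmp/tb-logs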

elibixby pushed a commit to GoogleCloudPlatform/appengine-tensorboard that referenced this issue Aug 7, 2017
The parameter "reload_frequency" has changed to reload_interval
some time between 0.8 and 1.2.1 This change makes it a consistent
now - there has been and inconsistency - in configuration files
and run_tensorboard.sh it was still RELOAD_FREQUENCY,
 where in the example-app there was
already (unused) RELOAD_INTERVAL enviroment variable.

This commit fixes it and makes it RELOAD_INTERVAL everywhere, together
with fixing the version of tensorboard in the Dockerfile (using latest
in such dockerfile is a bad practice - in case of such incompatible
changes in parameter values, it might simply silently stop working
properly as it did this time).

Also this commit changes the default value of the RELOAD_INTERVAL
parameter. Due to the issue:
tensorflow/tensorboard#158 it seems
that accessing GCS directly causes a lot of costs connected with
high GCP API count usage, therefore if you have thousands of log files
(which is not a lot) it is very easy to overcharge your GCP account with
millions of requests every day just having tensorboard idling and
checking for new data. In our case we got about 4 USD/day for around
3000 files which is quite incredible.
@potiuk

potiuk commented Aug 12, 2017

Just a comment - gsutil rsync has another problem (it deletes and recreates event files) - I opened a new issue for it: #349

@MtDersvan

Are there any updates on this issue?
After switching to GCS for storing checkpoints and events, refreshing logs is now nearly unusable.
When launching tensorboard, it fetches logs from the selected directory once, but it won't update or fetch anything afterwards. In our case, even leaving tensorboard running for hours won't refresh the stats unless it is fully restarted.
Neither specifying the GCS bucket directly:

tensorboard --logdir=${GCS_LOGS_BUCKET} --reload_interval=2

nor using gcsfuse to mirror the bucket into a local directory and pointing tensorboard at that directory works; in both cases it is necessary to restart tensorboard to refresh the logs.
Any fix or workaround would be much appreciated.
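For context, the gcsfuse approach mentioned above looks roughly like this (bucket name and mount point are placeholders); per the report above, it still required restarting TensorBoard to pick up new data:

# Mount the bucket as a local filesystem and point TensorBoard at the mount.
mkdir -p /mnt/tb-logs
gcsfuse my-training-bucket /mnt/tb-logs
tensorboard --logdir=/mnt/tb-logs --reload_interval=2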

@carlthome

Status?

jart added a commit to jart/tensorboard that referenced this issue May 14, 2018
This change fixes a Google Cloud Storage performance issue where TensorBoard
would read event logs multiple times. This was caused by code checking if the
Run directory was deleted, but GCS does not have real directories.

Progress on tensorflow#158
@nfelt
Contributor

nfelt commented Jun 5, 2018

We discovered an issue (#1225) involving a bad interaction between TensorBoard and the underlying TensorFlow tf.gfile API when running against GCS logdirs, which causes excessive network usage and API calls, as @potiuk describes above in #158 (comment). PR #1226 should address this in the upcoming 1.9 release of TensorBoard, as long as you are also using TensorFlow 1.9+.

That fix should hopefully address some of the general slowness with GCS logdirs, but there may still be other sources of unoptimized, slow performance.

The best way to ensure good performance and low network bandwidth when using a GCS logdir is to run TensorBoard within the same Google Cloud Platform location where GCS egress traffic is free of charge. For example, you can run TensorBoard on a GCE instance in the same region as your GCS bucket, and optionally port forward using SSH if you want to continue to access TensorBoard at localhost:6006.
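A sketch of that setup, assuming a placeholder GCE instance name, zone, and bucket:

# On a GCE instance in the same region as the bucket, run TensorBoard against GCS.
tensorboard --logdir=gs://my-bucket/logs --port=6006
# From your workstation, forward local port 6006 to the instance over SSH,
# then browse to http://localhost:6006 as usual.
gcloud compute ssh tb-vm --zone=us-central1-a -- -L 6006:localhost:6006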

@jart
Contributor

jart commented Jun 5, 2018

Performance aside, if you want to know whether or not GCS network egress is free, here's a helpful shell function:

is_gcs_free() {
  # Name of the GCS bucket to check, e.g. my-training-bucket (no gs:// prefix).
  GCS_BUCKET=$1
  # Ask the GCE metadata server which zone this VM runs in; fail if not on GCE.
  GCE_ZONE=$(curl -sfL metadata.google.internal/0.1/meta-data/zone) || { echo Not in GCE fleet >&2; return 1; }
  # Reduce the zone (e.g. us-central1-a) to its region (us-central1).
  GCE_REGION=$(printf %s\\n "${GCE_ZONE}" | sed -n 's!.*/\([^-/]*-[^-/]*\).*!\1!p')
  # Fetch an access token for the VM's default service account.
  GCP_TOKEN=$(curl -sfLH Metadata-Flavor:Google metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token | sed -n 's/.*"access_token": *"\([^"]*\)".*/\1/p')
  # Look up the bucket's location and lowercase it to match the region format.
  GCS_REGION=$(curl -sfLH "Authorization:Bearer ${GCP_TOKEN}" "https://www.googleapis.com/storage/v1/b/${GCS_BUCKET}" | sed -n 's/.*"location": *"\([^"]*\)".*/\1/p' | tr A-Z a-z)
  # Succeed if the bucket location matches the start of the VM's region (egress is free).
  [ $(expr "${GCE_REGION}" : "${GCS_REGION}") -gt 0 ]
}
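For example, from a GCE VM (the bucket name is a placeholder):

if is_gcs_free my-training-bucket; then
  echo "Bucket and VM are co-located, so GCS egress is free"
else
  echo "Bucket is in a different location (or this is not a GCE VM), so egress is billed"
fi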

I've been considering possibly integrating that into TensorFlow, so we can show a warning, since many people will run TensorBoard on their local machines, rather than something like gcloud compute ssh instance -- -L 6006:localhost:6006 tensorboard --logdir=gs://foo/bar. Would something like this be useful?

@nfelt
Contributor

nfelt commented Jun 7, 2018

I think #1087 should also address a lot of the slowness involved in scanning the GCS logdirs for event files. That PR should also be included in TensorBoard 1.9.

@amygdala

@jart : IMO integrating that GCS egress check into TB would be very helpful to our users.

@nfelt
Contributor

nfelt commented Dec 17, 2019

I'm hoping that the fixes from June 2018 were sufficient to address the problems; since I haven't seen any follow-up reports on this issue, I'm going to close it out.

If you see any problems where using TensorBoard with GCS is drastically slower than using it against local files, please file a new issue and we'll take a look.

@nfelt nfelt closed this as completed Dec 17, 2019