TensorBoard very slow on GCS #158

Closed
teamdandelion opened this issue Jun 26, 2017 · 9 comments

@teamdandelion
Contributor

From @bw4sz on #80:

Loading roughly 200 MB of logs via TensorBoard on GCS takes more than ten minutes, but if the same event files are copied locally, they load in seconds.

I am guessing we are running into an issue where on GCS we load the events without readahead chunks, so we make round-trip requests incredibly frequently, maybe as often as once per individual event. This makes for horrible performance.
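As a stopgap, the local-copy workaround described above can be scripted; the bucket and destination paths here are placeholders, not anything from this issue:

# Copy the event files out of GCS once, then read them from local disk.
gsutil -m cp -r gs://my-bucket/logs /tmp/tb-logs
tensorboard --logdir=/tmp/tb-logs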

potiuk added a commit to potiuk/appengine-tensorboard that referenced this issue Aug 7, 2017
The parameter "reload_frequency" has changed to reload_interval
some time between 0.8 and 1.2.1 This change makes it a consistent
now - there has been and inconsistency - in configuration files
and run_tensorboard.sh it was still RELOAD_FREQUENCY,
 where in the example-app there was
already (unused) RELOAD_INTERVAL enviroment variable.

This commit fixes it and makes it RELOAD_INTERVAL everywhere, together
with fixing the version of tensorboard in the Dockerfile (using latest
in such dockerfile is a bad practice - in case of such incompatible
changes in parameter values, it might simply silently stop working
properly as it did this time).

Also this commit changes the default value of the RELOAD_INTERVAL
parameter. Due to the issue:
tensorflow/tensorboard#158 it seems
that accessing GCS directly causes a lot of costs connected with
high GCP API count usage, therefore if you have thousands of log files
(which is not a lot) it is very easy to overcharge your GCP account with
millions of requests every day just having tensorboard idling and
checking for new data. In our case we got about 4 USD/day for around
3000 files which is quite incredible.
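For illustration only (a hypothetical snippet, not the actual run_tensorboard.sh), an environment variable like this typically feeds the flag as follows:

# RELOAD_INTERVAL (seconds) controls how often TensorBoard rescans the logdir;
# a larger value means fewer GCS API calls. The 600-second fallback is illustrative.
tensorboard --logdir="${LOG_DIR}" --reload_interval="${RELOAD_INTERVAL:-600}"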
@potiuk

potiuk commented Aug 7, 2017

Related to that (and more important) - we have also found that using tensorboard to read directly from GCS with the default reload interval incurs incredible costs.

There is a cost tied to the number of API calls made to access GCS, and it seems that every reload directly from GCS causes every single file to be accessed separately via the GCS API. With a bigger (but not outrageous) number of log files (a few thousand is enough) this can very quickly get out of control, and with tensorboard's default reload frequency it can drive your GCS costs unreasonably high.

In our case it was about 4 USD/day with just around 3000 files in the logs (!). About 4 million API requests were generated in 4 days just by running one tensorboard instance. Because of that, we are now changing strategy to sync the GCS data locally (using gsutil rsync) and read it from there; a sketch of that setup is below.
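A minimal sketch of that sync strategy, assuming a placeholder bucket, local path, and 60-second interval:

# Mirror the GCS logdir into a local directory in the background
# and let TensorBoard read only the local copy.
mkdir -p /tmp/tb-logs
while true; do
  gsutil -m rsync -r gs://my-bucket/logs /tmp/tb-logs
  sleep 60
done &
tensorboard --logdir=/tmp/tb-logs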

elibixby pushed a commit to GoogleCloudPlatform/appengine-tensorboard that referenced this issue Aug 7, 2017
The parameter "reload_frequency" has changed to reload_interval
some time between 0.8 and 1.2.1 This change makes it a consistent
now - there has been and inconsistency - in configuration files
and run_tensorboard.sh it was still RELOAD_FREQUENCY,
 where in the example-app there was
already (unused) RELOAD_INTERVAL enviroment variable.

This commit fixes it and makes it RELOAD_INTERVAL everywhere, together
with fixing the version of tensorboard in the Dockerfile (using latest
in such dockerfile is a bad practice - in case of such incompatible
changes in parameter values, it might simply silently stop working
properly as it did this time).

Also this commit changes the default value of the RELOAD_INTERVAL
parameter. Due to the issue:
tensorflow/tensorboard#158 it seems
that accessing GCS directly causes a lot of costs connected with
high GCP API count usage, therefore if you have thousands of log files
(which is not a lot) it is very easy to overcharge your GCP account with
millions of requests every day just having tensorboard idling and
checking for new data. In our case we got about 4 USD/day for around
3000 files which is quite incredible.
@potiuk

potiuk commented Aug 12, 2017

Just a comment - gsutil rsync has another problem (it deletes and recreates event files) - I opened a new issue for it: #349

@MtDersvan

Are there any updates on this issue?
After switching to GCS for storing checkpoints and events, refreshing logs is now nearly unusable.
When launching tensorboard, it fetches logs from the selected directory once, but it won't update or fetch anything afterwards. In our case, even leaving tensorboard running for hours won't refresh the stats unless it is fully restarted.
Neither specifying the GCS bucket directly:

tensorboard --logdir=${GCS_LOGS_BUCKET} --reload_interval=2

nor using gcsfuse to mirror the bucket into a local directory and pointing tensorboard at that directory works; in both cases it is necessary to restart tensorboard to refresh the logs.
Any fix or workaround would be much appreciated.
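For context, the gcsfuse approach mentioned above looks roughly like this (bucket name and mount point are placeholders); per the report above, it still required restarting TensorBoard to pick up new data:

# Mount the bucket as a local filesystem and point TensorBoard at the mount.
mkdir -p /mnt/tb-logs
gcsfuse my-training-bucket /mnt/tb-logs
tensorboard --logdir=/mnt/tb-logs --reload_interval=2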

@carlthome

Status?

jart added a commit to jart/tensorboard that referenced this issue May 14, 2018
This change fixes a Google Cloud Storage performance issue where TensorBoard
would read event logs multiple times. This was caused by code checking if the
Run directory was deleted, but GCS does not have real directories.

Progress on tensorflow#158
@nfelt
Contributor

nfelt commented Jun 5, 2018

We discovered an issue (#1225) involving a bad interaction between TensorBoard and the underlying TensorFlow tf.gfile API when running against GCS logdirs, which causes excessive network usage and API calls, as @potiuk describes above in #158 (comment). PR #1226 should address this in the upcoming 1.9 release of TensorBoard, as long as you are also using TensorFlow 1.9+.

That fix should hopefully address some of the general slowness with GCS logdirs, but there may still be other sources of unoptimized, slow performance.

The best way to ensure good performance and low network bandwidth when using a GCS logdir is to run TensorBoard within the same Google Cloud Platform location where GCS egress traffic is free of charge. For example, you can run TensorBoard on a GCE instance in the same region as your GCS bucket, and optionally port forward using SSH if you want to continue to access TensorBoard at localhost:6006.
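A sketch of that setup, assuming a placeholder GCE instance name, zone, and bucket:

# On a GCE instance in the same region as the bucket, run TensorBoard against GCS.
tensorboard --logdir=gs://my-bucket/logs --port=6006
# From your workstation, forward local port 6006 to the instance over SSH,
# then browse to http://localhost:6006 as usual.
gcloud compute ssh tb-vm --zone=us-central1-a -- -L 6006:localhost:6006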

@jart
Contributor

jart commented Jun 5, 2018

Performance aside, if you want to know whether or not GCS network egress is free, here's a helpful shell function:

is_gcs_free() {
  # Name of the GCS bucket to check, e.g. my-training-bucket (no gs:// prefix).
  GCS_BUCKET=$1
  # Ask the GCE metadata server which zone this VM runs in; fail if not on GCE.
  GCE_ZONE=$(curl -sfL metadata.google.internal/0.1/meta-data/zone) || { echo Not in GCE fleet >&2; return 1; }
  # Reduce the zone (e.g. us-central1-a) to its region (us-central1).
  GCE_REGION=$(printf %s\\n "${GCE_ZONE}" | sed -n 's!.*/\([^-/]*-[^-/]*\).*!\1!p')
  # Fetch an access token for the VM's default service account.
  GCP_TOKEN=$(curl -sfLH Metadata-Flavor:Google metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token | sed -n 's/.*"access_token": *"\([^"]*\)".*/\1/p')
  # Look up the bucket's location and lowercase it to match the region format.
  GCS_REGION=$(curl -sfLH "Authorization:Bearer ${GCP_TOKEN}" "https://www.googleapis.com/storage/v1/b/${GCS_BUCKET}" | sed -n 's/.*"location": *"\([^"]*\)".*/\1/p' | tr A-Z a-z)
  # Succeed if the bucket location matches the start of the VM's region (egress is free).
  [ $(expr "${GCE_REGION}" : "${GCS_REGION}") -gt 0 ]
}
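For example, from a GCE VM (the bucket name is a placeholder):

if is_gcs_free my-training-bucket; then
  echo "Bucket and VM are co-located, so GCS egress is free"
else
  echo "Bucket is in a different location (or this is not a GCE VM), so egress is billed"
fi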

I've been considering possibly integrating that into TensorFlow, so we can show a warning, since many people will run TensorBoard on their local machines, rather than something like gcloud compute ssh instance -- -L 6006:localhost:6006 tensorboard --logdir=gs://foo/bar. Would something like this be useful?

@nfelt
Contributor

nfelt commented Jun 7, 2018

I think #1087 should also address a lot of the slowness involved in scanning the GCS logdirs for event files. That PR should also be included in TensorBoard 1.9.

@amygdala

@jart : IMO integrating that GCS egress check into TB would be very helpful to our users.

@nfelt
Contributor

nfelt commented Dec 17, 2019

I'm hoping that the fixes from June 2018 were sufficient to address the problems; since I haven't seen any follow-up reports on this issue, I'm going to close it out.

If you see any problems where using TensorBoard with GCS is drastically slower than using it against local files, please file a new issue and we'll take a look.

@nfelt nfelt closed this as completed Dec 17, 2019