TensorBoard very slow on GCS #158
The parameter "reload_frequency" has changed to reload_interval some time between 0.8 and 1.2.1 This change makes it a consistent now - there has been and inconsistency - in configuration files and run_tensorboard.sh it was still RELOAD_FREQUENCY, where in the example-app there was already (unused) RELOAD_INTERVAL enviroment variable. This commit fixes it and makes it RELOAD_INTERVAL everywhere, together with fixing the version of tensorboard in the Dockerfile (using latest in such dockerfile is a bad practice - in case of such incompatible changes in parameter values, it might simply silently stop working properly as it did this time). Also this commit changes the default value of the RELOAD_INTERVAL parameter. Due to the issue: tensorflow/tensorboard#158 it seems that accessing GCS directly causes a lot of costs connected with high GCP API count usage, therefore if you have thousands of log files (which is not a lot) it is very easy to overcharge your GCP account with millions of requests every day just having tensorboard idling and checking for new data. In our case we got about 4 USD/day for around 3000 files which is quite incredible.
Related to that (and more important): we have also found that using TensorBoard to read directly from GCS with the default reload interval incurs incredible costs. There is a cost tied to the number of API calls made to GCS, and it seems that every reload causes every single file to be accessed separately via the GCS API. With a larger (but not outrageous) number of log files (a few thousand is enough), this can very quickly get out of control, and with TensorBoard's default reload frequency it can drive your GCS costs unreasonably high. In our case it was about 4 USD/day with just around 3000 files in the logs: roughly 4 million API requests generated in 4 days by running a single TensorBoard instance. Because of that, we are changing strategy to sync the GCS data locally (using gsutil rsync) and read it from there.
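For anyone hitting the same problem, two mitigations follow from the above (a sketch; the bucket and paths are placeholders): poll GCS less often via TensorBoard's `--reload_interval` flag, or mirror the bucket locally and point TensorBoard at the copy.

```sh
# Mitigation 1: poll GCS less often. --reload_interval is in seconds.
tensorboard --logdir gs://my-bucket/logs --reload_interval 300

# Mitigation 2: mirror the bucket locally and read from disk instead
# (but note the gsutil rsync caveat in the next comment).
gsutil -m rsync -r gs://my-bucket/logs /tmp/tb-logs
tensorboard --logdir /tmp/tb-logs
```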
Just a comment - gsutil rsync has another problem (it deletes and recreates event files) - I opened a new issue for it: #349
Are there any updates on this issue?
Status?
This change fixes a Google Cloud Storage performance issue where TensorBoard would read event logs multiple times. This was caused by code checking if the Run directory was deleted, but GCS does not have real directories. Progress on tensorflow#158
We discovered an issue (#1225) involving a bad interaction between TensorBoard and the underlying TensorFlow. Fixing it should hopefully address some of the general slowness with GCS logdirs, but there may still be other sources of unoptimized, slow performance. The best way to ensure good performance and low network cost when using a GCS logdir is to run TensorBoard within the same Google Cloud Platform location, where GCS egress traffic is free of charge. For example, you can run TensorBoard on a GCE instance in the same region as your GCS bucket, and optionally port forward over SSH if you want to keep accessing TensorBoard locally.
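For concreteness, a sketch of that setup (the instance name and zone are placeholders), forwarding TensorBoard's default port 6006 to your workstation:

```sh
# On a GCE instance in the same region as the bucket, run:
#   tensorboard --logdir gs://my-bucket/logs
# Then, from your workstation, forward port 6006 over SSH:
gcloud compute ssh my-tb-instance --zone us-central1-a -- -L 6006:localhost:6006
# TensorBoard is now reachable locally at http://localhost:6006
```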
Performance aside, if you want to know whether or not GCS network egress is free, here's a helpful shell function:

```sh
is_gcs_free() {
  # Succeeds iff this machine's GCE region matches the bucket's GCS location,
  # in which case egress from the bucket is free of charge.
  GCS_BUCKET=$1
  # Zone of the current instance, from the metadata server (fails if not on GCE).
  GCE_ZONE=$(curl -sfL metadata.google.internal/0.1/meta-data/zone) || { echo Not in GCE fleet >&2; return 1; }
  # Strip the zone suffix to get the region, e.g. us-central1-a -> us-central1.
  GCE_REGION=$(printf %s\\n "${GCE_ZONE}" | sed -n 's!.*/\([^-/]*-[^-/]*\).*!\1!p')
  # OAuth token for the instance's default service account.
  GCP_TOKEN=$(curl -sfLH Metadata-Flavor:Google metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token | sed -n 's/.*"access_token": *"\([^"]*\)".*/\1/p')
  # The bucket's location, lowercased, e.g. US-CENTRAL1 -> us-central1.
  GCS_REGION=$(curl -sfLH "Authorization:Bearer ${GCP_TOKEN}" "https://www.googleapis.com/storage/v1/b/${GCS_BUCKET}" | sed -n 's/.*"location": *"\([^"]*\)".*/\1/p' | tr A-Z a-z)
  # True when the GCE region starts with the GCS location (covers multi-regions like "us").
  [ $(expr "${GCE_REGION}" : "${GCS_REGION}") -gt 0 ]
}
```

I've been considering possibly integrating that into TensorFlow, so we can show a warning, since many people will run TensorBoard on their local machines rather than in the cloud.
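A possible usage pattern for the function above (the bucket and logdir names are placeholders):

```sh
# Warn before pointing TensorBoard at a bucket whose egress is billed.
if is_gcs_free my-bucket; then
  tensorboard --logdir gs://my-bucket/logs
else
  echo "Reading gs://my-bucket from this machine will incur GCS egress charges" >&2
fi
```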
I think #1087 should also address a lot of the slowness involved in scanning GCS logdirs for event files. That PR should be included in TensorBoard 1.9.
@jart: IMO integrating that GCS egress check into TB would be very helpful to our users.
I'm hoping that the fixes from June 2018 were sufficient to address the problems; since I haven't seen any follow-up reports on this issue, I'm going to close it out. If you see any problems where using TensorBoard with GCS is drastically slower than using it against local files, please file a new issue and we'll take a look.
From @bw4sz on #80:
Loading on the order of 200 MB of logs via TensorBoard from GCS takes more than ten minutes, but if the event files are copied locally, they load in seconds.
I am guessing we are running into an issue where, on GCS, we load events without readahead chunks, so we issue round-trip requests incredibly frequently, maybe as often as once per individual event. This makes for horrible performance.
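To illustrate why per-read round trips would dominate here (a hypothetical bash sketch, not TensorBoard's actual I/O path; the object URL and `GCP_TOKEN` are placeholder assumptions), compare one request for a whole object against the same bytes fetched as many small ranged reads:

```sh
OBJ="https://storage.googleapis.com/my-bucket/logs/events.out.tfevents.example"

# One request for the whole object: a single round trip.
time curl -sfH "Authorization: Bearer ${GCP_TOKEN}" "${OBJ}" -o /dev/null

# The same leading bytes as 100 tiny ranged reads: 100 round trips.
time for i in $(seq 0 99); do
  start=$((i * 4096)); end=$((start + 4095))
  curl -sfH "Authorization: Bearer ${GCP_TOKEN}" \
       -H "Range: bytes=${start}-${end}" "${OBJ}" -o /dev/null
done
```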