TensorBoard doesn't load new data when summaries are stored in GCS. #867
Comments
Do you happen to have the TensorBoard logs stored anywhere? And is the problem reproducible or did it just happen this once after waiting a long time? |
I have been experiencing this for the last two hours and it's still ongoing. Where can I get the TensorBoard log? |
TensorBoard should log to standard out by default, so if you started it manually on Compute Engine it would just be logging to the console. If it's more customized it might be logging to How big is your logdir - is it fairly large? Sometimes very large or deeply nested logdirs can take so long to scan through that TensorBoard essentially appears to be frozen after it loads data for the first time. |
The logdir is not large at all. The deepest level is two. I have no more than 200 files in total since I just started the training. The event file is about 80MB at this moment. I used default TensorBoard installation on Compute Engine, nothing customized. I don't see any interesting logs other than this in console:
And this is all I have in
|
Also, I should mention that this happens even when I run tensorboard on my local machine. |
When TensorBoard reads individual event log files, it polls the end of the file in case records get appended. That behavior isn't possible on GCS since objects are sort of immutable and atomic. Out of curiosity, when you store the newer summaries, do you put them in a second event log file, with a higher timestamp? |
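The "poll the end of the file" idea described above can be sketched roughly as follows. This is only an illustration on a local file, not TensorBoard's actual loader; the class name `AppendPoller` is hypothetical, and the sketch assumes the file only ever grows by appending.

```python
import os
import tempfile


class AppendPoller:
    """Remembers a byte offset and returns only bytes appended since the last poll.

    Hypothetical helper for illustration; assumes the file is append-only.
    """

    def __init__(self, path):
        self.path = path
        self.offset = 0  # how many bytes have already been consumed

    def poll(self):
        # Reopen the file on every poll and read from the saved offset to EOF.
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            data = f.read()
        self.offset += len(data)
        return data
```

Calling `poll()` repeatedly returns an empty bytes object until something is appended. This relies on the reader being able to see an existing file grow, which is the property the comment above says GCS objects do not straightforwardly provide.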
Here is the code I use to write summaries. Does this answer your question? It seems a new event file gets created only when the current event file has grown to a certain size. Maybe TensorBoard only updates when the second event file is created. However, my impression is that back in December last year TensorBoard updated much more frequently than what I'm seeing right now. |
I think the multiple event files are due to restarting training. Assuming that's the case, I'm fairly certain that I have seen TensorBoard correctly reloading new data from a single event file stored on GCS as recently as December last year. |
Are you training with multiple threads/processes? If so, could you ensure that only 1 thread is writing summaries to disk? Specifically, ensure that only 1 thread has Also, that the TensorBoard fails to pick up new summaries if multiple |
I can take a closer look at the script. The object detection project comes with samples that write events to GCS. I'm just using their scripts. However, I want to emphasize that this "used to work" as recently as last month. I haven't changed anything in the training script. Is it possible that something in GCS has changed that causes this? |
Hmm, you noted that this happens even when running tensorboard on a local machine, right? Furthermore, when you restart TensorBoard, you do see the most recent events reflected, right? To me, that seems to hint at a problem on the data-reading side. |
Yes to both of your questions. By "data-reading" side, do you mean TensorBoard or even the GCS SDK? |
Oh ... maybe. By data-reading, I mean TensorBoard logic that reads data from disk to render in the web app. GCS logic seems related, but it seems less of a culprit if we can restart TensorBoard and see updated data. That means we can successfully read from GCS. |
The default TensorBoard version on Compute Engine hasn't changed between December and January this year, right? |
Could you please note how you are running the scripts? Which shell commands? Are several commands running at once? |
True. The bug in which TensorBoard does not pick up data from multiple event writers has existed for a long time though. |
I will post the exact command line shortly. Here's the job description ML engine sees. I'm pretty certain that I only submitted the job once.
|
Thank you. One thing that I'm trying to figure out is where this Could we try something? At this call to Could we pass it a
    if is_chief:
        summary_writer = tf.summary.FileWriter(train_dir, graph=tf.get_default_graph())
    else:
        summary_writer = None |
Thank you for looking into this. I will try it out. It may take me a couple of days as I will be traveling. Will report back. |
Oh it's possible you might have accidentally installed |
Is this likely to be resolved? I'm experiencing this failure mode (TensorBoard does not detect new scalar information on GCS) even with the latest tb-nightly. |
@robieta We've been bug scrubbing this week to improve GCS support. TensorBoard nightly users can expect to see @nfelt Could this be related to the truncation issue you encountered a few days ago in the TF GCS impl? If so, please reopen and assign. |
@robieta There's a known issue where in some circumstances TensorBoard won't load fresh data from GCS. This would happen only in a case where you have 1 or 2 separate run subdirectories containing event files within your logdir (if you have 3 or more, you hit a different problem, but it shouldn't get stuck loading). If that matches your situation and you're willing to use If that still doesn't help, could you add the environment variable |
@nfelt Thanks for the response. I am using tf-nightly as well as tb-nightly. Setting
The reason I think this might be useful to you is that Let me know if you need anything else. |
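The problem still exists on TensorBoard 1.10.0. The log files are generated on the remote server. I mounted the log path on my local machine via cifs and ran a TensorBoard server on the local machine as well. A stupid yet effective workaround is to restart the TensorBoard server every
    #!/bin/bash
    # Kill the current tensorboard process and exit when the user presses Ctrl-C.
    trap ctrl_c INT
    function ctrl_c() {
        echo
        echo "Ctrl-C by user"
        kill -9 $pid
        exit
    }
    # Pass all script arguments through to tensorboard.
    args=$@
    while true
    do
        # Start tensorboard in the background, discarding its output.
        eval "tensorboard $args > /dev/null 2>&1 &"
        pid=$!
        echo "tensorboard running on $pid"
        # Let it serve for 20 seconds, then kill it so the next iteration reloads all data.
        sleep 20
        kill -9 $pid
    done
Name this file as |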
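For anyone who prefers Python, the same restart-on-a-timer workaround might look roughly like the sketch below. It is an assumption-laden illustration rather than a recommended fix: the function name `restart_periodically` and its `cycles` parameter are hypothetical, and it asks the child to terminate gracefully before falling back to a hard kill, which differs slightly from the `kill -9` in the script above.

```python
import subprocess
import sys
import time


def restart_periodically(cmd, period_s, cycles=None):
    """Start cmd, let it run for period_s seconds, stop it, and repeat.

    cycles=None loops forever; an integer limits the number of starts.
    Returns the number of times the command was started.
    """
    starts = 0
    while cycles is None or starts < cycles:
        # Discard the child's output, as the shell version does.
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        starts += 1
        try:
            time.sleep(period_s)
        finally:
            # Runs on normal expiry and on Ctrl-C, so no child is left behind.
            proc.terminate()
            try:
                proc.wait(timeout=5)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
    return starts
```

A call such as `restart_periodically(["tensorboard", "--logdir", logdir], period_s=20)`, with `logdir` set to your own log directory, would mirror the 20-second cycle of the shell script.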
@DeanChan I'm not sure how cifs works and I'm also not sure how you're using it with GCS - more detail would be helpful. But if you're not reading directly from GCS itself, e.g. with It seems more likely that the problem you're seeing is #349, a known issue where TensorBoard doesn't detect if event files it already has opened are replaced with files containing new data. If that's the case, the only fix right now is changing the syncing/mounting logic to ensure that the same files are used, e.g. for rsync specifying the If you think it's not issue #349, please open a new GitHub issue for this problem specifically. |
@nfelt Many thanks for your reply. I was not reading directly from GCS but reading the log files as they were on my local machine, i.e. Currently I'm not sure if cifs supports inplace like rsync does. I'll give it a try. |
I think I am having the same issue with S3 files with multiple models. |
Is this issue still monitored despite being closed? Do I need to open a new issue? I have encountered the same issue. I am using TensorFlow 1.13 in a Google Datalab notebook running on a VM on Compute Engine. I initialize TensorBoard the following way: from google.datalab.ml import TensorBoard I am assuming that google.datalab.ml is calling the TensorBoard version installed with TensorFlow 1.13, right? An estimator is run on AI Platform via gcloud and it stores its summaries etc. in a directory on Google Storage specified as its output dir, and I start TensorBoard specifying this output dir. It works, but it never refreshes and I have to start a new instance to see the newest steps. Thanks in advance for any help |
@KennethKJ Do you have multiple "tfevents" summary files generated in a single directory by your estimator? If so you may be hitting #1063 where only the last summary file to be created will be polled for new data after the initial pass over the files. If that doesn't sound like it, then please open a new issue. |
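To check whether a logdir matches the situation described above, a small script along these lines could list directories that contain more than one "tfevents" file. This is a minimal sketch that only walks a local directory tree; the function name `dirs_with_multiple_event_files` is hypothetical, and a logdir on GCS would need a GCS-aware file API instead of `os.walk`, which is not shown here.

```python
import os
import tempfile
from collections import defaultdict


def dirs_with_multiple_event_files(logdir):
    """Return {directory: sorted event file names} for every directory
    under logdir that holds more than one file with 'tfevents' in its name."""
    found = defaultdict(list)
    for root, _dirs, files in os.walk(logdir):
        for name in files:
            if "tfevents" in name:
                found[root].append(name)
    # Keep only directories where several event files sit side by side.
    return {d: sorted(names) for d, names in found.items() if len(names) > 1}
```

An empty result suggests the multiple-event-file case is not what you are hitting, at least for the part of the logdir that was scanned.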
Thanks a lot for your quick reply! Yes, looks like that is indeed the case. It seems that the default behavior of the Estimator is to store multiple tfevent files directly in the specified output directory. I will refer to issue #1063 instead and comment there if necessary for my use case. Thanks! |
I have been storing summaries in a GCS bucket and running TensorBoard on a Compute Engine instance. This setup worked fairly well. Coming back from the holiday break, I noticed that TensorBoard no longer reloads new summary events. The only way for me to see new data is to restart the TensorBoard process.