backend: implement event file replacement detection #5529
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This implements an optional mode in our Python
EventFileLoader
that will attempt to detect and handle the case when an event file is replaced entirely with a new version of the file containing additional data (rather than the new data being appended to the existing file, the normal behavior we expect).This situation occurs for example when using
rsync
as described in #349. When run without the--inplace
option,rsync
will transfer data to a temporary file and then swap it in for the final file, but this means that TensorBoard will only ever see the first version of that file and never shows any updates to it until it's restarted. A similar issue happens for internal users - Googlers, please see b/201113906 for context.The new logic uses
stat()
to monitor the size of the file, and if it grows, we will re-open the underlying iterator for that filename, resuming at the prior read offset. Care has been taken to try to avoid adding too much filesystem I/O overhead, but it's possible that this new mode is more expensive (since it re-opens the file potentially many times), hence the reason to leave it off by default, at least for now.This functionality requires that the underlying iterator supports re-opening, which is under development for
tf.compat.v1.summary.tf_record_iterator()
(Googlers, please see cl/423871645). It is not currently supported for the no-TF mode stub iterator implementation (although I suspect, but have not confirmed, that the stub implementation is not affected by the original issue since it has a simpler albeit likely less efficient implementation that just re-opens files each time it reads). Also, since this code change is currently only in our Python event loading code, it has no effect on RustBoard (our rust data server). If there's demand, however, it should be possible to port this logic.Note that there is currently no way to activate this mode - #5530 is the followup PR with the CLI flag and plumbing.
Tested: ran TensorBoard with these changes enabled (see followup PR) and confirmed that it now picks up new data when replacing an event file entirely rather than appending to it, writes the appropriate log messages, etc.