Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backend: implement event file replacement detection #5529

Merged
merged 1 commit into from
Jan 31, 2022

Conversation

nfelt
Copy link
Contributor

@nfelt nfelt commented Jan 25, 2022

This implements an optional mode in our Python EventFileLoader that will attempt to detect and handle the case when an event file is replaced entirely with a new version of the file containing additional data (rather than the new data being appended to the existing file, the normal behavior we expect).

This situation occurs for example when using rsync as described in #349. When run without the --inplace option, rsync will transfer data to a temporary file and then swap it in for the final file, but this means that TensorBoard will only ever see the first version of that file and never shows any updates to it until it's restarted. A similar issue happens for internal users - Googlers, please see b/201113906 for context.

The new logic uses stat() to monitor the size of the file, and if it grows, we will re-open the underlying iterator for that filename, resuming at the prior read offset. Care has been taken to try to avoid adding too much filesystem I/O overhead, but it's possible that this new mode is more expensive (since it re-opens the file potentially many times), hence the reason to leave it off by default, at least for now.

This functionality requires that the underlying iterator supports re-opening, which is under development for tf.compat.v1.summary.tf_record_iterator() (Googlers, please see cl/423871645). It is not currently supported for the no-TF mode stub iterator implementation (although I suspect, but have not confirmed, that the stub implementation is not affected by the original issue since it has a simpler albeit likely less efficient implementation that just re-opens files each time it reads). Also, since this code change is currently only in our Python event loading code, it has no effect on RustBoard (our rust data server). If there's demand, however, it should be possible to port this logic.

Note that there is currently no way to activate this mode - #5530 is the followup PR with the CLI flag and plumbing.

Tested: ran TensorBoard with these changes enabled (see followup PR) and confirmed that it now picks up new data when replacing an event file entirely rather than appending to it, writes the appropriate log messages, etc.

@nfelt nfelt marked this pull request as draft January 25, 2022 00:11
@nfelt nfelt requested a review from bileschi January 25, 2022 00:27
@nfelt nfelt marked this pull request as ready for review January 25, 2022 00:27
@nfelt
Copy link
Contributor Author

nfelt commented Jan 25, 2022

FYI @bileschi - the CI test failures are expected until tf-nightly contains the changes to tf_record_iterator (see cl/423871645), but I thought I'd send it now anyway in case you have questions about the approach. I can re-ping when CI is fully passing.

Copy link
Collaborator

@bileschi bileschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants