Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Debugger Plugin] Exit TensorBoard on SIGINT even with debugger enabled #975

Merged
merged 5 commits into from
Feb 16, 2018
Merged

Conversation

caisq
Copy link
Contributor

@caisq caisq commented Feb 15, 2018

No description provided.

Copy link
Member

@chihuahua chihuahua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @caisq for fixing this. I just have some nits!

2. Send `SIGINT` to the TensorBoard process running the Debugger Plugin, e.g.,
by using `Ctrl+C`.

# Limitations and Unknown Issues
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: These are known (not unknown) issues.

* Hitting Ctrl+C (issuing a SIGINT signal) might fail to terminate execution for a model that is instrumented with
`TensorBoardDebugWrapperSession` or its corresponding hook. The same limitation may be present in the tensorboard
process as well. In those cases, the user must manually
[kill](https://www.linux.com/learn/intro-to-linux/2017/5/how-kill-process-command-line) the processes.
* The debugger dashboard does not yet support multiple users debugging at once.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe change this phrasing to "debugger plugin" to match other mentions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


# Limitations and Unknown Issues

The Debugger Plugin has the following limitations and known issues. We plan to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe lowercase debugger plugin to match other uses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# Limitations
# Frequently Asked Questions (FAQ)

## Q: How to exit the debugging?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe: How to exit debugging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# let the debugged tensorflow runtime proceed past the current Session.run
# in the C++ layer and return to the Python layer, so the SIGINT handler
# registered there may be triggered.
for _ in xrange(len(self._debugger_data_server.breakpoints) + 1):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Extra space before 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# registered there may be triggered.
for _ in xrange(len(self._debugger_data_server.breakpoints) + 1):
self._debugger_data_server.put_incoming_message(True)
# while True: # DEBUG
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get rid of these debug comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor Author

@caisq caisq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!

# Limitations
# Frequently Asked Questions (FAQ)

## Q: How to exit the debugging?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


# Limitations and Unknown Issues

The Debugger Plugin has the following limitations and known issues. We plan to
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* Hitting Ctrl+C (issuing a SIGINT signal) might fail to terminate execution for a model that is instrumented with
`TensorBoardDebugWrapperSession` or its corresponding hook. The same limitation may be present in the tensorboard
process as well. In those cases, the user must manually
[kill](https://www.linux.com/learn/intro-to-linux/2017/5/how-kill-process-command-line) the processes.
* The debugger dashboard does not yet support multiple users debugging at once.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# registered there may be triggered.
for _ in xrange(len(self._debugger_data_server.breakpoints) + 1):
self._debugger_data_server.put_incoming_message(True)
# while True: # DEBUG
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# let the debugged tensorflow runtime proceed past the current Session.run
# in the C++ layer and return to the Python layer, so the SIGINT handler
# registered there may be triggered.
for _ in xrange(len(self._debugger_data_server.breakpoints) + 1):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@nfelt nfelt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW it appears based on grpc/grpc#3820 (comment) that we can make Ctrl+C successfully kill TensorBoard without the two-stage shutdown process by adding a signal handler for SIGINT that instead does os.kill(os.getpid(), signal.SIGTERM) i.e. it basically tells the OS to resend the signal as SIGTERM instead of SIGINT. I tried this locally and it does successfully kill TensorBoard with a single Ctrl+C (albeit now with a Terminated message, from the shell I think).

It's maybe worth checking if we can that this works on Windows, but overall I think it would be good to preserve the ability of a single Ctrl+C to kill TensorBoard if it all possible. It's less graceful of a shutdown but it's much easier for users. If we aren't going to do that, we should probably add logic to the startup message so that if the debugger plugin is active, the message no longer says Ctrl+C alone will kill TensorBoard - otherwise I think users will rightfully consider it a bug.

@@ -524,6 +524,9 @@ def start_the_debugger_data_receiving_server(self):
"""
self.run_server()

def stop_the_debugger_data_receiving_server(self):
self.stop_server()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this alias really necessary? The callsite could just call stop_server() directly, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Removed the stop_the_debugger_data_receiving_server and start_the_debugger_data_receiving_server methods.


signal.signal(signal.SIGINT, self.signal_handler)

def signal_handler(self, unused_signal, unused_frame):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to note somewhere that this is required because of arguably a deficiency in grpc/python 2.7:
grpc/grpc#6999 (They marked it as wontfix somewhat unreasonably, I think, given that the debugger plugin is exactly that kind of "rare case" where this has an effect in the real world.) E.g. in python 3, Ctrl+C will successfully kill TensorBoard (actually, it requires two of them in a row, but still it's closer).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments. Thanks.

@nfelt
Copy link
Contributor

nfelt commented Feb 15, 2018

One other thing: should we also be adding this handler to the non-interactive debugger_plugin.py? Or is it not affected by this issue?

@caisq
Copy link
Contributor Author

caisq commented Feb 15, 2018

@nfelt, you wrote:

I tried this locally and it does successfully kill TensorBoard with a single Ctrl+C (albeit now with a Terminated message, from the shell I think).

Did you try it from a tensorboard binary (without bazel), or with bazel run? I think bazel run handles SIGINT differently, it'll run the binary as a subprocess and kill it no matter what.

@nfelt
Copy link
Contributor

nfelt commented Feb 15, 2018

I was running bazel-bin/tensorboard/tensorboard directly, not under bazel run.

@caisq
Copy link
Contributor Author

caisq commented Feb 16, 2018

@nfelt

About the non-interactive debugger_plugin.py, those are used mainly during the debugging of long-running (possibly remotely running training) jobs. So interactive termination of those are not as important.

Thanks for the info about the os.kill(os.getpid(), signal.SIGTERM) approach. I prefer the current sys.exit() approach because it leads to a more orderly shutdown (without any scary-looking stack traces, for one thing).

@caisq
Copy link
Contributor Author

caisq commented Feb 16, 2018

Thank you both for the review, @chihuahua and @nfelt. I'm going to merge the PR now. If you have any remaining comments, please let me know and I'll be happy to address them in future PRs.

@caisq caisq merged commit 8b30876 into tensorflow:master Feb 16, 2018
@nfelt
Copy link
Contributor

nfelt commented Feb 16, 2018

@caisq In general, I think it's important that if starting TensorBoard advertises that you can kill it with Ctrl+C, we need to be able to kill TensorBoard with Ctrl+C in all expected use cases, without a special procedure that you need to read the FAQ (of a specific plugin, no less) in order to discover. (And for the same reason, I am inclined to think the non-interactive debugger_plugin.py matters too.)

Testing out this PR, however, I'm able to Ctrl+C from TensorBoard only and it seems to work without having to Ctrl+C the TF job as well (this is using your tdp_client binary in experimental). But it sounds like there are circumstances where the TF job will get stuck? If there's a way to avoid those cases so that Ctrl+C on the TF job isn't required in any case, that's still what I would prefer.

If we can't do that, I'm still kind of inclined to do os.kill(), maybe as a fallback (say, we wait 3 seconds after handling the SIGINT to see if we can shutdown gracefully, otherwise we do SIGTERM). At least in my testing I'm not seeing that producing any stacktraces, though if it did that also isn't a dealbreaker in my opinion (it's typical for Ctrl+C on a python process to produce a stacktrace from the KeyboardInterrupt bubbling up; this happens even with Ctrl+C on python -c 'while True: pass'). And I'm not sure I'm understanding if there are concrete benefits to being more "orderly" about the shutdown - unless an abrupt shutdown might actually disrupt the TF job, but then that seems like a more general problem, since TensorBoard might still crash for other reasons (OOM, etc.) and bringing down the TF job with it a result seems rather fragile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants