-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Debugger Plugin] Exit TensorBoard on SIGINT even with debugger enabled #975
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @caisq for fixing this. I just have some nits!
2. Send `SIGINT` to the TensorBoard process running the Debugger Plugin, e.g., | ||
by using `Ctrl+C`. | ||
|
||
# Limitations and Unknown Issues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: These are known (not unknown) issues.
* Hitting Ctrl+C (issuing a SIGINT signal) might fail to terminate execution for a model that is instrumented with | ||
`TensorBoardDebugWrapperSession` or its corresponding hook. The same limitation may be present in the tensorboard | ||
process as well. In those cases, the user must manually | ||
[kill](https://www.linux.com/learn/intro-to-linux/2017/5/how-kill-process-command-line) the processes. | ||
* The debugger dashboard does not yet support multiple users debugging at once. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe change this phrasing to "debugger plugin" to match other mentions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
# Limitations and Unknown Issues | ||
|
||
The Debugger Plugin has the following limitations and known issues. We plan to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Maybe lowercase debugger plugin to match other uses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
# Limitations | ||
# Frequently Asked Questions (FAQ) | ||
|
||
## Q: How to exit the debugging? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe: How to exit debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
# let the debugged tensorflow runtime proceed past the current Session.run | ||
# in the C++ layer and return to the Python layer, so the SIGINT handler | ||
# registered there may be triggered. | ||
for _ in xrange(len(self._debugger_data_server.breakpoints) + 1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Extra space before 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
# registered there may be triggered. | ||
for _ in xrange(len(self._debugger_data_server.breakpoints) + 1): | ||
self._debugger_data_server.put_incoming_message(True) | ||
# while True: # DEBUG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get rid of these debug comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review!
# Limitations | ||
# Frequently Asked Questions (FAQ) | ||
|
||
## Q: How to exit the debugging? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
# Limitations and Unknown Issues | ||
|
||
The Debugger Plugin has the following limitations and known issues. We plan to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* Hitting Ctrl+C (issuing a SIGINT signal) might fail to terminate execution for a model that is instrumented with | ||
`TensorBoardDebugWrapperSession` or its corresponding hook. The same limitation may be present in the tensorboard | ||
process as well. In those cases, the user must manually | ||
[kill](https://www.linux.com/learn/intro-to-linux/2017/5/how-kill-process-command-line) the processes. | ||
* The debugger dashboard does not yet support multiple users debugging at once. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
# registered there may be triggered. | ||
for _ in xrange(len(self._debugger_data_server.breakpoints) + 1): | ||
self._debugger_data_server.put_incoming_message(True) | ||
# while True: # DEBUG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
# let the debugged tensorflow runtime proceed past the current Session.run | ||
# in the C++ layer and return to the Python layer, so the SIGINT handler | ||
# registered there may be triggered. | ||
for _ in xrange(len(self._debugger_data_server.breakpoints) + 1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW it appears based on grpc/grpc#3820 (comment) that we can make Ctrl+C successfully kill TensorBoard without the two-stage shutdown process by adding a signal handler for SIGINT
that instead does os.kill(os.getpid(), signal.SIGTERM)
i.e. it basically tells the OS to resend the signal as SIGTERM instead of SIGINT. I tried this locally and it does successfully kill TensorBoard with a single Ctrl+C (albeit now with a Terminated
message, from the shell I think).
It's maybe worth checking if we can that this works on Windows, but overall I think it would be good to preserve the ability of a single Ctrl+C to kill TensorBoard if it all possible. It's less graceful of a shutdown but it's much easier for users. If we aren't going to do that, we should probably add logic to the startup message so that if the debugger plugin is active, the message no longer says Ctrl+C alone will kill TensorBoard - otherwise I think users will rightfully consider it a bug.
@@ -524,6 +524,9 @@ def start_the_debugger_data_receiving_server(self): | |||
""" | |||
self.run_server() | |||
|
|||
def stop_the_debugger_data_receiving_server(self): | |||
self.stop_server() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this alias really necessary? The callsite could just call stop_server() directly, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Removed the stop_the_debugger_data_receiving_server
and start_the_debugger_data_receiving_server
methods.
|
||
signal.signal(signal.SIGINT, self.signal_handler) | ||
|
||
def signal_handler(self, unused_signal, unused_frame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be good to note somewhere that this is required because of arguably a deficiency in grpc/python 2.7:
grpc/grpc#6999 (They marked it as wontfix somewhat unreasonably, I think, given that the debugger plugin is exactly that kind of "rare case" where this has an effect in the real world.) E.g. in python 3, Ctrl+C will successfully kill TensorBoard (actually, it requires two of them in a row, but still it's closer).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments. Thanks.
One other thing: should we also be adding this handler to the non-interactive |
@nfelt, you wrote:
Did you try it from a tensorboard binary (without bazel), or with bazel run? I think bazel run handles SIGINT differently, it'll run the binary as a subprocess and kill it no matter what. |
I was running |
About the non-interactive Thanks for the info about the |
Thank you both for the review, @chihuahua and @nfelt. I'm going to merge the PR now. If you have any remaining comments, please let me know and I'll be happy to address them in future PRs. |
@caisq In general, I think it's important that if starting TensorBoard advertises that you can kill it with Ctrl+C, we need to be able to kill TensorBoard with Ctrl+C in all expected use cases, without a special procedure that you need to read the FAQ (of a specific plugin, no less) in order to discover. (And for the same reason, I am inclined to think the non-interactive Testing out this PR, however, I'm able to Ctrl+C from TensorBoard only and it seems to work without having to Ctrl+C the TF job as well (this is using your tdp_client binary in experimental). But it sounds like there are circumstances where the TF job will get stuck? If there's a way to avoid those cases so that Ctrl+C on the TF job isn't required in any case, that's still what I would prefer. If we can't do that, I'm still kind of inclined to do |
No description provided.