Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jupyter_client v6.20 (and maybe also v7.0.0?) interrupts jupyter-enterprise-gateway on kernel change, instead of just interrupting the kernel #682

Closed
telamonian opened this issue Aug 19, 2021 · 1 comment · Fixed by #683

Comments

@telamonian
Copy link

telamonian commented Aug 19, 2021

I ran into a strange issue this morning: I upgraded the version of jupyter_client in my jupyter-enterprise-gateway (JEG) deployment to v6.2.0 and then became unable to switch kernels. I could still start as many kernels as I liked normally, but the moment I tried to restart, stop, or change a kernel the gateway instance would crash. This was on a work machine, so unfortunately I can't copy/paste the error message, but essentially the os.killpg line here:

if self.has_kernel:
if hasattr(os, "getpgid") and hasattr(os, "killpg"):
try:
pgid = os.getpgid(self.kernel.pid) # type: ignore
os.killpg(pgid, signum) # type: ignore
return

was somehow sending SIGINT to JEG whenever a kernel change was requested, which then "gracefully" shut down due to "keyboard interrupt". Apparently the process group id of JEG itself is somehow getting passed into os.killpg.

For me the fix was to downgrade to jupyter_client v6.1.12, so I'm fairly certain that this is a jupyter_client issue and not a JEG problem. Since jupyter_client v6.2.0 has already been yanked from pypi for an unrelated version conflict issue, the question now is whether the problem is still present in jupyter_client v7.0.0. Though it moved around a bit, the relevant os.killpg command is still present in the latest HEAD:

# Prefer process-group over process
if self.pgid and hasattr(os, "killpg"):
try:
os.killpg(self.pgid, signum)
return

so it may still cause problems for JEG.

For right now, this can't be directly tested as the latest version of JEG is pinned jupyter_client<7, since JEG is currently incompatible with the kernel provisioner stuff. So this issue is mostly just something to keep an eye out for in the near future.

pinging @kevin-bates

@kevin-bates
Copy link
Member

Thanks for the heads-up @telamonian - I think I see the issue.

Enterprise Gateway's RemoteKernelManager fully overrides AsyncKernelManager.signal_kernel() which is called from other methods within AsyncKernelManager. This override does not issue killpg. PR #623 did some fairly radical refactoring that nicely eliminated a lot of duplication. These changes introduced an aliasing approach that essentially mapped the class's method name to an internal implementation. Initially, the internal methods were called from other methods and I raised this approach as problematic for subclass implementations and many were addressed and I even added some tests to ensure subclass call sequences because of this. Unfortunately, those tests didn't go far enough.

It looks like there are a few _async internal methods being called from other methods that need to be replaced with their "alias" - one of which is signal_kernel(). The primary culprit you're running into is relative to interrupt_kernel - which shutdown_kernel now (post 6.1.12) calls. As you can see, interrupt_kernel is calling _async_signal_kernel() rather than signal_kernel() - which then skips EG's override.

This will be an issue in 7.0 for subclasses of KernelManager which happen to override these remaining, lower-level, methods. Not so much from the killpg standpoint, but more from the fact that overrides are not getting called. That said, I don't think this will be an issue for Enterprise Gateway once it is compatible with 7.0 because EG will no longer need to override methods within its RemoteKernelManager to that degree (and perhaps at all).

I will spend time, first reproducing this by extending the test I added, then patch the remaining instances of unaliased methods such that the test passes.

Thanks again for bringing this to our attention.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants