Fix race condition with async kernel management #5875

Merged Dec 23, 2020

kevin-bates
Member

With AsyncMappingKernelManager enabled, a race condition can occur between the shutdown of a kernel and a concurrent fetch of the active kernels, or even a fetch of that kernel alone. This happens because the kernel shutdown method removes the kernel_id key from the _kernel_connections dictionary before awaiting the call to the superclass shutdown method. Some shutdowns take a while (especially with remote kernels), and during that window the front-end (especially JupyterLab) keeps polling the list of active kernels and then each active kernel (every 10 seconds or so). A poll that lands inside the window can raise a KeyError when accessing the _kernel_connections entry for the kernel whose shutdown is still being awaited.
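
To make the window concrete, here is a minimal asyncio sketch of the pre-fix ordering. It is not the notebook code: `SketchKernelManager`, `_slow_superclass_shutdown`, and `poll_during_shutdown` are hypothetical names, and the 0.05 second sleep only stands in for a slow (e.g. remote) shutdown.

```python
import asyncio


class SketchKernelManager:
    """Hypothetical stand-in for a kernel manager; not the notebook API."""

    def __init__(self):
        self._kernels = {"k1": object()}
        self._kernel_connections = {"k1": 0}

    async def _slow_superclass_shutdown(self, kernel_id):
        # Assumption: the kernel stays in the kernel list until the
        # (slow) superclass shutdown has completed.
        await asyncio.sleep(0.05)
        del self._kernels[kernel_id]

    async def shutdown_kernel(self, kernel_id):
        # Pre-fix ordering: the connections entry is removed *before* the await.
        self._kernel_connections.pop(kernel_id, None)
        await self._slow_superclass_shutdown(kernel_id)

    def kernel_model(self, kernel_id):
        return {"id": kernel_id, "connections": self._kernel_connections[kernel_id]}

    def list_kernels(self):
        return [self.kernel_model(kid) for kid in list(self._kernels)]


async def poll_during_shutdown():
    km = SketchKernelManager()
    shutdown = asyncio.create_task(km.shutdown_kernel("k1"))
    await asyncio.sleep(0)  # let the shutdown run up to its first await
    try:
        km.list_kernels()
        outcome = "ok"
    except KeyError:
        outcome = "KeyError"
    await shutdown
    return outcome
```

In this sketch, the poll runs while the kernel is still listed but its connections entry is already gone, so `asyncio.run(poll_during_shutdown())` ends up in the `KeyError` branch.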

This change moves the removal of the kernel_id key from _kernel_connections to after the superclass shutdown method has been awaited, eliminating this race condition. It also keeps building the list of active kernels when an exception for a (now) non-existent kernel is encountered, rather than aborting the collection of active kernels.
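
The two parts of the change can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual diff: `FixedSketchKernelManager`, `_slow_superclass_shutdown`, and `poll_around_shutdown` are hypothetical names, the sleep stands in for a slow shutdown, and catching only `KeyError` is a simplification of "non-existent kernel exception".

```python
import asyncio


class FixedSketchKernelManager:
    """Hypothetical stand-in illustrating the reordered shutdown; not the notebook API."""

    def __init__(self, kernel_ids):
        self._kernels = {kid: object() for kid in kernel_ids}
        self._kernel_connections = {kid: 0 for kid in kernel_ids}

    async def _slow_superclass_shutdown(self, kernel_id):
        # Assumption: the kernel leaves the kernel list only once shutdown completes.
        await asyncio.sleep(0.05)
        del self._kernels[kernel_id]

    async def shutdown_kernel(self, kernel_id):
        # Post-fix ordering: await the shutdown first, then drop the entry.
        await self._slow_superclass_shutdown(kernel_id)
        self._kernel_connections.pop(kernel_id, None)

    def kernel_model(self, kernel_id):
        return {"id": kernel_id, "connections": self._kernel_connections[kernel_id]}

    def list_kernels(self):
        models = []
        for kernel_id in list(self._kernels):
            try:
                models.append(self.kernel_model(kernel_id))
            except KeyError:
                # Kernel vanished in the meantime; keep building the list.
                continue
        return models


async def poll_around_shutdown():
    km = FixedSketchKernelManager(["k1", "k2"])
    shutdown = asyncio.create_task(km.shutdown_kernel("k1"))
    await asyncio.sleep(0)  # let the shutdown run up to its first await
    during = [m["id"] for m in km.list_kernels()]
    await shutdown
    after = [m["id"] for m in km.list_kernels()]
    return during, after
```

With this ordering, a poll that lands inside the shutdown window still finds a connections entry for the kernel being shut down, and the `try`/`except` in `list_kernels` means a single missing entry no longer discards the rest of the list.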

This isn't an issue with the synchronous shutdown method, but I moved the dictionary pop statement there as well for future maintainability.

Here are examples of these occurrences:

[E 201117 15:48:35 base_events:1285] Future exception was never retrieved
    future: <Future finished exception=KeyError('03b698c4-a8cd-45d6-bd2c-a7b630440130',)>
    Traceback (most recent call last):
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper
        yielded = next(result)
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/notebook/services/kernels/handlers.py", line 31, in get
        kernels = yield maybe_future(km.list_kernels())
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/notebook/services/kernels/kernelmanager.py", line 379, in list_kernels
        model = self.kernel_model(kernel_id)
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/notebook/services/kernels/kernelmanager.py", line 370, in kernel_model
        "connections": self._kernel_connections[kernel_id],
    KeyError: '03b698c4-a8cd-45d6-bd2c-a7b630440130'
[I 201117 15:48:35 web:2162] 200 GET /api/kernels (9.163.38.87) 5.97ms


[E 201117 16:06:28 web:1670] Uncaught exception GET /api/kernels/10def553-b82e-4996-a4c3-5768afdcb922 (9.163.38.87)
    HTTPServerRequest(protocol='http', host='yarn-eg-node-1.fyre.ibm.com:8888', method='GET', uri='/api/kernels/10def553-b82e-4996-a4c3-5768afdcb922', version='HTTP/1.1', remote_ip='9.163.38.87')
    Traceback (most recent call last):
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/tornado/web.py", line 1590, in _execute
        result = method(*self.path_args, **self.path_kwargs)
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/tornado/web.py", line 3006, in wrapper
        return method(self, *args, **kwargs)
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/enterprise_gateway/services/kernels/handlers.py", line 134, in get
        model = km.kernel_model(kernel_id)
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/notebook/services/kernels/kernelmanager.py", line 370, in kernel_model
        "connections": self._kernel_connections[kernel_id],
    KeyError: '10def553-b82e-4996-a4c3-5768afdcb922'
[E 201117 16:06:28 web:2162] 500 GET /api/kernels/10def553-b82e-4996-a4c3-5768afdcb922 (9.163.38.87) 5.58ms

This should be ported to jupyter_server once merged.

@Zsailer Zsailer self-requested a review December 10, 2020 00:32
@Zsailer Zsailer merged commit 6345008 into jupyter:master Dec 23, 2020
Zsailer commented Dec 23, 2020

Similar fix in jupyter-server/jupyter_server#365

@kevin-bates kevin-bates deleted the kernel-list-race-condition branch January 29, 2021 23:44
@blink1073 blink1073 added this to the 6.2 milestone Mar 18, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 15, 2021