Adjust buffer replay logic for restarts w/ new ports #5450

kevin-bates · 2020-05-16T00:29:06Z

This change derives from the issues jupyterlab/jupyterlab#8013 and jupyter-server/enterprise_gateway#756. The problem is that the server generally assumes that a kernel's ports remain the same across restarts. The Notebook UI reestablished its Websocket connection to the server on restarts, while Lab did not. However, even if Lab adds a reconnect (see jupyterlab/jupyterlab#8432) it does so using the same client id - which is reconnecting to the old channels (ports).

This change (proposed by @jasongrout - thank you!) preserves any buffer replay logic that might correspond to the client id, but unconditionally establishes the channels, irrespective of whether the underlying ports were changed, thereby restoring communication to the restarted kernel.

Co-authored-by: Jason Grout <[email protected]>

blink1073 · 2020-05-16T08:47:57Z

Test failure looks legitimate:

FAIL "function () {

        return this.evaluate(function (events) {

            return IPython._events_triggered.length >= events.length;

        }, [events]);

    }" did not evaluate to something truthy in 10000ms

#    type: uncaughtError

#    file: /home/travis/build/jupyter/notebook/notebook/tests/services/kernel.js

#    error: "function () {

        return this.evaluate(function (events) {

            return IPython._events_triggered.length >= events.length;

        }, [events]);

    }" did not evaluate to something truthy in 10000ms

#    stack: not provided

Captured console.log:

    Welcome to Project Jupyter! Explore the various tools available and their corresponding documentation. If you are interested in contributing to the platform, please visit the community resources section at http://jupyter.org/community.html.

    Loaded moment locale en

    actions jupyter-notebook:find-and-replace does not exist, still binding it in case it will be defined later...

    Loaded moment locale en

    Widgets are not available.  Please install widgetsnbextension or ipywidgets 4.0

    Session: kernel_created (1e271404-e8d0-45b0-afd9-4efeea19cbb9)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/54407a5a-e1e2-46a9-9187-5bfbdaad77dc

    Kernel: kernel_connected (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_ready (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_interrupting (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_restarting (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_created (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/54407a5a-e1e2-46a9-9187-5bfbdaad77dc

    Kernel: kernel_connected (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_starting (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_ready (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_ready (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_reconnecting (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/54407a5a-e1e2-46a9-9187-5bfbdaad77dc

    Kernel: kernel_connected (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_ready (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_killed (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_starting (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_created (690611e1-1269-4ee4-b424-fb4bbda12c21)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/690611e1-1269-4ee4-b424-fb4bbda12c21

    Kernel: kernel_connected (690611e1-1269-4ee4-b424-fb4bbda12c21)

    Kernel: kernel_ready (690611e1-1269-4ee4-b424-fb4bbda12c21)

    Kernel: kernel_killed (690611e1-1269-4ee4-b424-fb4bbda12c21)

    Kernel: kernel_starting (690611e1-1269-4ee4-b424-fb4bbda12c21)

    Kernel: kernel_created (96f841b7-29ff-49ad-bf40-e546e59ccb9e)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/96f841b7-29ff-49ad-bf40-e546e59ccb9e

    Kernel: kernel_connected (96f841b7-29ff-49ad-bf40-e546e59ccb9e)

    Kernel: kernel_ready (96f841b7-29ff-49ad-bf40-e546e59ccb9e)

    Kernel: kernel_killed (96f841b7-29ff-49ad-bf40-e546e59ccb9e)

    Kernel: kernel_starting (96f841b7-29ff-49ad-bf40-e546e59ccb9e)

    Kernel: kernel_created (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341

    Kernel: kernel_connected (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Kernel: kernel_ready (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Kernel: kernel_reconnecting (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341

    Kernel: kernel_connected (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Kernel: kernel_ready (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Kernel: kernel_restarting (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Kernel: kernel_created (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341

    Kernel: kernel_connected (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    WebSocket closed early Object {

      "code": 1006,

      "reason": "",

      "wasClean": false,

      "returnValue": true,

      "timeStamp": 1589589147812,

      "eventPhase": 2,

      "target":

        "readyState": 3,

        "bufferedAmount": 0,

        "extensions": "",

        "url": "ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341/channels?session_id=9f54db2268d34c779bac45be05294444",

        "binaryType": "blob",

        "protocol": "",

        "URL": "ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341/channels?session_id=9f54db2268d34c779bac45be05294444"

      "defaultPrevented": false,

      "srcElement":

        "readyState": 3,

        "bufferedAmount": 0,

        "extensions": "",

        "url": "ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341/channels?session_id=9f54db2268d34c779bac45be05294444",

        "binaryType": "blob",

        "protocol": "",

        "URL": "ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341/channels?session_id=9f54db2268d34c779bac45be05294444"

      "type": "close",

      "cancelable": false,

      "currentTarget":

        "readyState": 3,

        "bufferedAmount": 0,

        "extensions": "",

        "url": "ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341/channels?session_id=9f54db2268d34c779bac45be05294444",

        "binaryType": "blob",

        "protocol": "",

        "URL": "ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341/channels?session_id=9f54db2268d34c779bac45be05294444"

      "bubbles": false,

      "cancelBubble": false

    }

    Kernel: kernel_disconnected (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Connection lost, reconnecting in 1 seconds.

    Kernel: kernel_reconnecting (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341

kevin-bates · 2020-05-16T14:51:37Z

I agree - this test failure looks legit and I should have noticed this yesterday (too anxious to get the weekend started). Unfortunately, I have no clue how that particular test works. I suspect these changes may introduce another "event" since a reconnect to the ports (via create_stream()) now occurs irrespective of session. Are you able to determine if this is just a matter of updating a test that may be too tight or an indication of an incompatible change?

To be honest, these server changes aren't required for EG's remote kernels since a new session_key is introduced anyway (due to the way the kernel requests are proxied), which results in open() always calling self.create_stream() anyway. As a result, the logic in this PR primarily addresses kernels spawned from the current server that recreate their ports on restarts - which, I suppose, can exist.

I've verified that restarts from Lab with EG kernels now work with only the fix provided by @jasongrout in jupyterlab/jupyterlab#8432. I also checked restarts with local kernels (that retain ports across restarts) still work using only the Lab change. So, IMHO, this PR could probably be deferred to jupyter_server - where kernel providers will definitely introduce port changes on restarts.

blink1073 · 2020-05-16T15:31:13Z

I don't have the bandwidth to dive into this, I'd say defer it

jasongrout · 2020-05-16T15:50:05Z

Iirc, I needed this when I was I insisting that the ports change on restart by explicitly setting the port caching to false (see the original issue)

kevin-bates · 2020-05-16T17:59:17Z

Thank you for the responses and clarification (Jason). Since this runs a non-zero chance of destabilizing behaviors due to the assumption that ports are static across restarts, I'm in favor of closing this PR and addressing this in jupyter_server - where the wiggle room for changes like this is a bit wider.

kevin-bates · 2020-05-17T17:55:27Z

Spent a bit of time with this and see that restarts produce the following for local kernels via the Notebook UI:

[E 08:49:40.345 LabApp] Uncaught exception GET /api/kernels/ab29e63f-fcc9-4ad0-add0-d3d45aea7800/channels?session_id=d2da0011331244768b766f9a1c488c31 (127.0.0.1)
    HTTPServerRequest(protocol='http', host='0.0.0.0:8888', method='GET', uri='/api/kernels/ab29e63f-fcc9-4ad0-add0-d3d45aea7800/channels?session_id=d2da0011331244768b766f9a1c488c31', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/opt/anaconda3/envs/notebook-dev/lib/python3.7/site-packages/tornado/websocket.py", line 956, in _accept_connection
        open_result = handler.open(*handler.open_args, **handler.open_kwargs)
      File "/Users/kbates/repos/oss/jupyter/notebook/notebook/services/kernels/handlers.py", line 270, in open
        self._on_zmq_reply(stream, msg_list)
      File "/Users/kbates/repos/oss/jupyter/notebook/notebook/services/kernels/handlers.py", line 423, in _on_zmq_reply
        super(ZMQChannelsHandler, self)._on_zmq_reply(stream, msg)
      File "/Users/kbates/repos/oss/jupyter/notebook/notebook/base/zmqhandlers.py", line 242, in _on_zmq_reply
        if self.ws_connection is None or stream.closed():
    AttributeError: 'dict' object has no attribute 'closed'

This does not happen using the Lab UI.

The reason this works via the Notebook UI against an EG server is due to the lack of handling of the session key, which winds up skipping the replay logic where the issue occurs. It also appears that buffer replay is slightly different via the Lab UI since there no messages get buffered when the Lab UI is used. I suspect Notebook UI must be triggering a close() that Lab Ui doesn't - thereby triggering the buffering of the status messages across a restart.

Here's the log sequence for a restart via Lab:

[D 09:06:32.823 LabApp] Accepting token-authenticated connection from 127.0.0.1
[D 09:06:33.147 LabApp] Starting kernel: ['/opt/anaconda3/envs/notebook-dev/bin/python', '-m', 'ipykernel_launcher', '-f', '/Users/kbates/Library/Jupyter/runtime/kernel-04f76333-de41-40ab-b15b-76177a895328.json']
[D 09:06:33.164 LabApp] Connecting to: tcp://127.0.0.1:57764
[I 09:06:33.166 LabApp] Kernel restarted: 04f76333-de41-40ab-b15b-76177a895328
[D 09:06:33.167 LabApp] Connecting to: tcp://127.0.0.1:57760
[D 09:06:33.177 LabApp] 200 POST /api/kernels/04f76333-de41-40ab-b15b-76177a895328/restart?1589731592813 (127.0.0.1) 358.23ms
[D 09:06:33.203 LabApp] activity on 04f76333-de41-40ab-b15b-76177a895328: status (busy)
[D 09:06:33.227 LabApp] activity on 04f76333-de41-40ab-b15b-76177a895328: shutdown_reply
[D 09:06:33.251 LabApp] activity on 04f76333-de41-40ab-b15b-76177a895328: status (idle)
[D 09:06:33.253 LabApp] Websocket closed 04f76333-de41-40ab-b15b-76177a895328:b757a8b6-1bfa-4776-8462-bee15dc8601c
[D 09:06:33.264 LabApp] Initializing websocket connection /api/kernels/04f76333-de41-40ab-b15b-76177a895328/channels

And that of restart via Notebook:

[D 09:09:59.479 LabApp] Websocket closed fc9693cd-f3a8-43fe-962d-b6222143c307:9325611663dd49f995fc3150124b5559
[I 09:09:59.481 LabApp] Starting buffering for fc9693cd-f3a8-43fe-962d-b6222143c307:9325611663dd49f995fc3150124b5559
[D 09:09:59.482 LabApp] Clearing buffer for fc9693cd-f3a8-43fe-962d-b6222143c307
[D 09:09:59.817 LabApp] Starting kernel: ['/opt/anaconda3/envs/notebook-dev/bin/python', '-m', 'ipykernel_launcher', '-f', '/Users/kbates/Library/Jupyter/runtime/kernel-fc9693cd-f3a8-43fe-962d-b6222143c307.json']
[D 09:09:59.828 LabApp] Connecting to: tcp://127.0.0.1:53672
[I 09:09:59.830 LabApp] Kernel restarted: fc9693cd-f3a8-43fe-962d-b6222143c307
[D 09:09:59.832 LabApp] Connecting to: tcp://127.0.0.1:53668
[D 09:09:59.844 LabApp] 200 POST /api/kernels/fc9693cd-f3a8-43fe-962d-b6222143c307/restart (127.0.0.1) 354.31ms
[D 09:09:59.853 LabApp] activity on fc9693cd-f3a8-43fe-962d-b6222143c307: status (busy)
[D 09:09:59.855 LabApp] Buffering msg on fc9693cd-f3a8-43fe-962d-b6222143c307:iopub
[D 09:09:59.865 LabApp] activity on fc9693cd-f3a8-43fe-962d-b6222143c307: shutdown_reply
[D 09:09:59.866 LabApp] Buffering msg on fc9693cd-f3a8-43fe-962d-b6222143c307:iopub
[D 09:09:59.877 LabApp] activity on fc9693cd-f3a8-43fe-962d-b6222143c307: status (idle)
[D 09:09:59.878 LabApp] Buffering msg on fc9693cd-f3a8-43fe-962d-b6222143c307:iopub
[D 09:09:59.885 LabApp] Initializing websocket connection /api/kernels/fc9693cd-f3a8-43fe-962d-b6222143c307/channels
[I 09:09:59.902 LabApp] Adapting from protocol version 5.1 (kernel fc9693cd-f3a8-43fe-962d-b6222143c307) to 5.3 (client).
[D 09:09:59.908 LabApp] 101 GET /api/kernels/fc9693cd-f3a8-43fe-962d-b6222143c307/channels?session_id=9325611663dd49f995fc3150124b5559 (127.0.0.1) 26.94ms
[D 09:09:59.915 LabApp] Opening websocket /api/kernels/fc9693cd-f3a8-43fe-962d-b6222143c307/channels
[D 09:10:03.267 LabApp] Getting buffer for fc9693cd-f3a8-43fe-962d-b6222143c307
[I 09:10:17.076 LabApp] Restoring connection for fc9693cd-f3a8-43fe-962d-b6222143c307:9325611663dd49f995fc3150124b5559
[I 09:10:19.688 LabApp] Replaying 3 buffered messages
[E 09:10:39.691 LabApp] Uncaught exception GET /api/kernels/fc9693cd-f3a8-43fe-962d-b6222143c307/channels?session_id=9325611663dd49f995fc3150124b5559 (127.0.0.1)
    HTTPServerRequest(protocol='http', host='0.0.0.0:8888', method='GET', uri='/api/kernels/fc9693cd-f3a8-43fe-962d-b6222143c307/channels?session_id=9325611663dd49f995fc3150124b5559', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/opt/anaconda3/envs/notebook-dev/lib/python3.7/site-packages/tornado/websocket.py", line 956, in _accept_connection
        open_result = handler.open(*handler.open_args, **handler.open_kwargs)
      File "/Users/kbates/repos/oss/jupyter/notebook/notebook/services/kernels/handlers.py", line 270, in open
        self._on_zmq_reply(stream, msg_list)
      File "/Users/kbates/repos/oss/jupyter/notebook/notebook/services/kernels/handlers.py", line 423, in _on_zmq_reply
        super(ZMQChannelsHandler, self)._on_zmq_reply(stream, msg)
      File "/Users/kbates/repos/oss/jupyter/notebook/notebook/base/zmqhandlers.py", line 242, in _on_zmq_reply
        if self.ws_connection is None or stream.closed():
    AttributeError: 'dict' object has no attribute 'closed'

I think the proper change is that stream = buffer_info['channels'] should be:

     stream = buffer_info['channels'][channel]

When that is done, restarts also work via Notebook when replays are active. However, the kernel.js test still has failures, but of a different nature: https://travis-ci.org/github/kevin-bates/notebook/jobs/688097176 and I just don't know that particular test mechanics to determine if it's just an overzealous test or a real side-affecting issue.

Adjust buffer replay logic for restarts w/ new ports

a49e037

Co-authored-by: Jason Grout <[email protected]>

kevin-bates requested a review from blink1073 May 16, 2020 00:29

kevin-bates mentioned this pull request May 16, 2020

Release 6.1.0 #5423

Closed

24 tasks

kevin-bates closed this May 16, 2020

suryag10 mentioned this pull request Aug 19, 2020

JupyterHub+JupyterLab+JupyterEnterpriseGateway Restart Kernel hangs jupyterlab/jupyterlab#8013

Closed

github-actions bot added the status:resolved-locked label Mar 25, 2021

github-actions bot locked as resolved and limited conversation to collaborators Mar 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adjust buffer replay logic for restarts w/ new ports #5450

Adjust buffer replay logic for restarts w/ new ports #5450

kevin-bates commented May 16, 2020

blink1073 commented May 16, 2020

kevin-bates commented May 16, 2020

blink1073 commented May 16, 2020

jasongrout commented May 16, 2020

kevin-bates commented May 16, 2020

kevin-bates commented May 17, 2020

Adjust buffer replay logic for restarts w/ new ports #5450

Adjust buffer replay logic for restarts w/ new ports #5450

Conversation

kevin-bates commented May 16, 2020

blink1073 commented May 16, 2020

kevin-bates commented May 16, 2020

blink1073 commented May 16, 2020

jasongrout commented May 16, 2020

kevin-bates commented May 16, 2020

kevin-bates commented May 17, 2020