Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust buffer replay logic for restarts w/ new ports #5450

Closed
wants to merge 1 commit into from

Conversation

kevin-bates
Copy link
Member

This change derives from the issues jupyterlab/jupyterlab#8013 and jupyter-server/enterprise_gateway#756. The problem is that the server generally assumes that a kernel's ports remain the same across restarts. The Notebook UI reestablished its Websocket connection to the server on restarts, while Lab did not. However, even if Lab adds a reconnect (see jupyterlab/jupyterlab#8432) it does so using the same client id - which is reconnecting to the old channels (ports).

This change (proposed by @jasongrout - thank you!) preserves any buffer replay logic that might correspond to the client id, but unconditionally establishes the channels, irrespective of whether the underlying ports were changed, thereby restoring communication to the restarted kernel.

@kevin-bates kevin-bates mentioned this pull request May 16, 2020
24 tasks
@blink1073
Copy link
Contributor

Test failure looks legitimate:

FAIL "function () {

        return this.evaluate(function (events) {

            return IPython._events_triggered.length >= events.length;

        }, [events]);

    }" did not evaluate to something truthy in 10000ms

#    type: uncaughtError

#    file: /home/travis/build/jupyter/notebook/notebook/tests/services/kernel.js

#    error: "function () {

        return this.evaluate(function (events) {

            return IPython._events_triggered.length >= events.length;

        }, [events]);

    }" did not evaluate to something truthy in 10000ms

#    stack: not provided

Captured console.log:

    Welcome to Project Jupyter! Explore the various tools available and their corresponding documentation. If you are interested in contributing to the platform, please visit the community resources section at http://jupyter.org/community.html.

    Loaded moment locale en

    actions jupyter-notebook:find-and-replace does not exist, still binding it in case it will be defined later...

    Loaded moment locale en

    Widgets are not available.  Please install widgetsnbextension or ipywidgets 4.0

    Session: kernel_created (1e271404-e8d0-45b0-afd9-4efeea19cbb9)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/54407a5a-e1e2-46a9-9187-5bfbdaad77dc

    Kernel: kernel_connected (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_ready (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_interrupting (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_restarting (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_created (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/54407a5a-e1e2-46a9-9187-5bfbdaad77dc

    Kernel: kernel_connected (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_starting (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_ready (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_ready (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_reconnecting (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/54407a5a-e1e2-46a9-9187-5bfbdaad77dc

    Kernel: kernel_connected (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_ready (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_killed (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_starting (54407a5a-e1e2-46a9-9187-5bfbdaad77dc)

    Kernel: kernel_created (690611e1-1269-4ee4-b424-fb4bbda12c21)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/690611e1-1269-4ee4-b424-fb4bbda12c21

    Kernel: kernel_connected (690611e1-1269-4ee4-b424-fb4bbda12c21)

    Kernel: kernel_ready (690611e1-1269-4ee4-b424-fb4bbda12c21)

    Kernel: kernel_killed (690611e1-1269-4ee4-b424-fb4bbda12c21)

    Kernel: kernel_starting (690611e1-1269-4ee4-b424-fb4bbda12c21)

    Kernel: kernel_created (96f841b7-29ff-49ad-bf40-e546e59ccb9e)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/96f841b7-29ff-49ad-bf40-e546e59ccb9e

    Kernel: kernel_connected (96f841b7-29ff-49ad-bf40-e546e59ccb9e)

    Kernel: kernel_ready (96f841b7-29ff-49ad-bf40-e546e59ccb9e)

    Kernel: kernel_killed (96f841b7-29ff-49ad-bf40-e546e59ccb9e)

    Kernel: kernel_starting (96f841b7-29ff-49ad-bf40-e546e59ccb9e)

    Kernel: kernel_created (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341

    Kernel: kernel_connected (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Kernel: kernel_ready (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Kernel: kernel_reconnecting (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341

    Kernel: kernel_connected (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Kernel: kernel_ready (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Kernel: kernel_restarting (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Kernel: kernel_created (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341

    Kernel: kernel_connected (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    WebSocket closed early Object {

      "code": 1006,

      "reason": "",

      "wasClean": false,

      "returnValue": true,

      "timeStamp": 1589589147812,

      "eventPhase": 2,

      "target":

        "readyState": 3,

        "bufferedAmount": 0,

        "extensions": "",

        "url": "ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341/channels?session_id=9f54db2268d34c779bac45be05294444",

        "binaryType": "blob",

        "protocol": "",

        "URL": "ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341/channels?session_id=9f54db2268d34c779bac45be05294444"

      "defaultPrevented": false,

      "srcElement":

        "readyState": 3,

        "bufferedAmount": 0,

        "extensions": "",

        "url": "ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341/channels?session_id=9f54db2268d34c779bac45be05294444",

        "binaryType": "blob",

        "protocol": "",

        "URL": "ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341/channels?session_id=9f54db2268d34c779bac45be05294444"

      "type": "close",

      "cancelable": false,

      "currentTarget":

        "readyState": 3,

        "bufferedAmount": 0,

        "extensions": "",

        "url": "ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341/channels?session_id=9f54db2268d34c779bac45be05294444",

        "binaryType": "blob",

        "protocol": "",

        "URL": "ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341/channels?session_id=9f54db2268d34c779bac45be05294444"

      "bubbles": false,

      "cancelBubble": false

    }

    Kernel: kernel_disconnected (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Connection lost, reconnecting in 1 seconds.

    Kernel: kernel_reconnecting (5e3ca932-8c9b-4a89-bf61-a9f230a0b341)

    Starting WebSockets: ws://localhost:8888/a@b/api/kernels/5e3ca932-8c9b-4a89-bf61-a9f230a0b341

@kevin-bates
Copy link
Member Author

I agree - this test failure looks legit and I should have noticed this yesterday (too anxious to get the weekend started). Unfortunately, I have no clue how that particular test works. I suspect these changes may introduce another "event" since a reconnect to the ports (via create_stream()) now occurs irrespective of session. Are you able to determine if this is just a matter of updating a test that may be too tight or an indication of an incompatible change?

To be honest, these server changes aren't required for EG's remote kernels since a new session_key is introduced anyway (due to the way the kernel requests are proxied), which results in open() always calling self.create_stream() anyway. As a result, the logic in this PR primarily addresses kernels spawned from the current server that recreate their ports on restarts - which, I suppose, can exist.

I've verified that restarts from Lab with EG kernels now work with only the fix provided by @jasongrout in jupyterlab/jupyterlab#8432. I also checked restarts with local kernels (that retain ports across restarts) still work using only the Lab change. So, IMHO, this PR could probably be deferred to jupyter_server - where kernel providers will definitely introduce port changes on restarts.

@blink1073
Copy link
Contributor

I don't have the bandwidth to dive into this, I'd say defer it

@jasongrout
Copy link
Member

Iirc, I needed this when I was I insisting that the ports change on restart by explicitly setting the port caching to false (see the original issue)

@kevin-bates
Copy link
Member Author

Thank you for the responses and clarification (Jason). Since this runs a non-zero chance of destabilizing behaviors due to the assumption that ports are static across restarts, I'm in favor of closing this PR and addressing this in jupyter_server - where the wiggle room for changes like this is a bit wider.

@kevin-bates
Copy link
Member Author

Spent a bit of time with this and see that restarts produce the following for local kernels via the Notebook UI:

[E 08:49:40.345 LabApp] Uncaught exception GET /api/kernels/ab29e63f-fcc9-4ad0-add0-d3d45aea7800/channels?session_id=d2da0011331244768b766f9a1c488c31 (127.0.0.1)
    HTTPServerRequest(protocol='http', host='0.0.0.0:8888', method='GET', uri='/api/kernels/ab29e63f-fcc9-4ad0-add0-d3d45aea7800/channels?session_id=d2da0011331244768b766f9a1c488c31', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/opt/anaconda3/envs/notebook-dev/lib/python3.7/site-packages/tornado/websocket.py", line 956, in _accept_connection
        open_result = handler.open(*handler.open_args, **handler.open_kwargs)
      File "/Users/kbates/repos/oss/jupyter/notebook/notebook/services/kernels/handlers.py", line 270, in open
        self._on_zmq_reply(stream, msg_list)
      File "/Users/kbates/repos/oss/jupyter/notebook/notebook/services/kernels/handlers.py", line 423, in _on_zmq_reply
        super(ZMQChannelsHandler, self)._on_zmq_reply(stream, msg)
      File "/Users/kbates/repos/oss/jupyter/notebook/notebook/base/zmqhandlers.py", line 242, in _on_zmq_reply
        if self.ws_connection is None or stream.closed():
    AttributeError: 'dict' object has no attribute 'closed'

This does not happen using the Lab UI.

The reason this works via the Notebook UI against an EG server is due to the lack of handling of the session key, which winds up skipping the replay logic where the issue occurs. It also appears that buffer replay is slightly different via the Lab UI since there no messages get buffered when the Lab UI is used. I suspect Notebook UI must be triggering a close() that Lab Ui doesn't - thereby triggering the buffering of the status messages across a restart.

Here's the log sequence for a restart via Lab:

[D 09:06:32.823 LabApp] Accepting token-authenticated connection from 127.0.0.1
[D 09:06:33.147 LabApp] Starting kernel: ['/opt/anaconda3/envs/notebook-dev/bin/python', '-m', 'ipykernel_launcher', '-f', '/Users/kbates/Library/Jupyter/runtime/kernel-04f76333-de41-40ab-b15b-76177a895328.json']
[D 09:06:33.164 LabApp] Connecting to: tcp://127.0.0.1:57764
[I 09:06:33.166 LabApp] Kernel restarted: 04f76333-de41-40ab-b15b-76177a895328
[D 09:06:33.167 LabApp] Connecting to: tcp://127.0.0.1:57760
[D 09:06:33.177 LabApp] 200 POST /api/kernels/04f76333-de41-40ab-b15b-76177a895328/restart?1589731592813 (127.0.0.1) 358.23ms
[D 09:06:33.203 LabApp] activity on 04f76333-de41-40ab-b15b-76177a895328: status (busy)
[D 09:06:33.227 LabApp] activity on 04f76333-de41-40ab-b15b-76177a895328: shutdown_reply
[D 09:06:33.251 LabApp] activity on 04f76333-de41-40ab-b15b-76177a895328: status (idle)
[D 09:06:33.253 LabApp] Websocket closed 04f76333-de41-40ab-b15b-76177a895328:b757a8b6-1bfa-4776-8462-bee15dc8601c
[D 09:06:33.264 LabApp] Initializing websocket connection /api/kernels/04f76333-de41-40ab-b15b-76177a895328/channels

And that of restart via Notebook:

[D 09:09:59.479 LabApp] Websocket closed fc9693cd-f3a8-43fe-962d-b6222143c307:9325611663dd49f995fc3150124b5559
[I 09:09:59.481 LabApp] Starting buffering for fc9693cd-f3a8-43fe-962d-b6222143c307:9325611663dd49f995fc3150124b5559
[D 09:09:59.482 LabApp] Clearing buffer for fc9693cd-f3a8-43fe-962d-b6222143c307
[D 09:09:59.817 LabApp] Starting kernel: ['/opt/anaconda3/envs/notebook-dev/bin/python', '-m', 'ipykernel_launcher', '-f', '/Users/kbates/Library/Jupyter/runtime/kernel-fc9693cd-f3a8-43fe-962d-b6222143c307.json']
[D 09:09:59.828 LabApp] Connecting to: tcp://127.0.0.1:53672
[I 09:09:59.830 LabApp] Kernel restarted: fc9693cd-f3a8-43fe-962d-b6222143c307
[D 09:09:59.832 LabApp] Connecting to: tcp://127.0.0.1:53668
[D 09:09:59.844 LabApp] 200 POST /api/kernels/fc9693cd-f3a8-43fe-962d-b6222143c307/restart (127.0.0.1) 354.31ms
[D 09:09:59.853 LabApp] activity on fc9693cd-f3a8-43fe-962d-b6222143c307: status (busy)
[D 09:09:59.855 LabApp] Buffering msg on fc9693cd-f3a8-43fe-962d-b6222143c307:iopub
[D 09:09:59.865 LabApp] activity on fc9693cd-f3a8-43fe-962d-b6222143c307: shutdown_reply
[D 09:09:59.866 LabApp] Buffering msg on fc9693cd-f3a8-43fe-962d-b6222143c307:iopub
[D 09:09:59.877 LabApp] activity on fc9693cd-f3a8-43fe-962d-b6222143c307: status (idle)
[D 09:09:59.878 LabApp] Buffering msg on fc9693cd-f3a8-43fe-962d-b6222143c307:iopub
[D 09:09:59.885 LabApp] Initializing websocket connection /api/kernels/fc9693cd-f3a8-43fe-962d-b6222143c307/channels
[I 09:09:59.902 LabApp] Adapting from protocol version 5.1 (kernel fc9693cd-f3a8-43fe-962d-b6222143c307) to 5.3 (client).
[D 09:09:59.908 LabApp] 101 GET /api/kernels/fc9693cd-f3a8-43fe-962d-b6222143c307/channels?session_id=9325611663dd49f995fc3150124b5559 (127.0.0.1) 26.94ms
[D 09:09:59.915 LabApp] Opening websocket /api/kernels/fc9693cd-f3a8-43fe-962d-b6222143c307/channels
[D 09:10:03.267 LabApp] Getting buffer for fc9693cd-f3a8-43fe-962d-b6222143c307
[I 09:10:17.076 LabApp] Restoring connection for fc9693cd-f3a8-43fe-962d-b6222143c307:9325611663dd49f995fc3150124b5559
[I 09:10:19.688 LabApp] Replaying 3 buffered messages
[E 09:10:39.691 LabApp] Uncaught exception GET /api/kernels/fc9693cd-f3a8-43fe-962d-b6222143c307/channels?session_id=9325611663dd49f995fc3150124b5559 (127.0.0.1)
    HTTPServerRequest(protocol='http', host='0.0.0.0:8888', method='GET', uri='/api/kernels/fc9693cd-f3a8-43fe-962d-b6222143c307/channels?session_id=9325611663dd49f995fc3150124b5559', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/opt/anaconda3/envs/notebook-dev/lib/python3.7/site-packages/tornado/websocket.py", line 956, in _accept_connection
        open_result = handler.open(*handler.open_args, **handler.open_kwargs)
      File "/Users/kbates/repos/oss/jupyter/notebook/notebook/services/kernels/handlers.py", line 270, in open
        self._on_zmq_reply(stream, msg_list)
      File "/Users/kbates/repos/oss/jupyter/notebook/notebook/services/kernels/handlers.py", line 423, in _on_zmq_reply
        super(ZMQChannelsHandler, self)._on_zmq_reply(stream, msg)
      File "/Users/kbates/repos/oss/jupyter/notebook/notebook/base/zmqhandlers.py", line 242, in _on_zmq_reply
        if self.ws_connection is None or stream.closed():
    AttributeError: 'dict' object has no attribute 'closed'

I think the proper change is that stream = buffer_info['channels'] should be:

     stream = buffer_info['channels'][channel]

When that is done, restarts also work via Notebook when replays are active. However, the kernel.js test still has failures, but of a different nature: https://travis-ci.org/github/kevin-bates/notebook/jobs/688097176 and I just don't know that particular test mechanics to determine if it's just an overzealous test or a real side-affecting issue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants