Execute Requests Immediately Following and Error are Aborted in 5.5 #609

Closed
MSeal opened this issue Mar 21, 2021 · 9 comments · Fixed by #618
@MSeal
Contributor

MSeal commented Mar 21, 2021

In debugging nteract/testbook#88, which was traced to an issue in testbook behavior causing empty execute response messages with ipykernel 5.5 in several tests, I believe we have tracked down the change in behavior to this commit. It appears to have changed the error abort interaction such that any additional execute requests within stop_on_error_timeout after an error occurs are aborted.

The docs state: "Requests that arrive within this [stop_on_error_timeout] window after an error will be cancelled." What was the intention for this feature and why did it change behavior in 5.5? The messages sent are also odd if this was intentional behavior because there's no indicator in the execute_result that the intended execution was aborted. I feel an error message or status of sorts would be expected unless I am missing something here?

I believe this is unintended behavior, as 5.4.3 does not have this behavior and the 5.5.0 changelog has no mention of it. Additionally notebooks with allow-error flags could also hit this issue unexpectedly, depending on how the notebook executor was queuing / processing the requests. Testbook certainly hits this because it's often testing that something failed or trying optimistic execution paths. For now I think the only way around the issue is to wait after failed executions or set the stop_on_error_timeout to 0.
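
As a rough sketch of the second workaround, the timeout can be passed to the kernel as a traitlets command-line override when the kernel is launched. This assumes the trait is exposed as `Kernel.stop_on_error_timeout` and that `start_new_kernel` forwards `extra_arguments` to the kernel command line. The helper names `abort_window_args` and `start_kernel_without_abort_window` are hypothetical.

```python
def abort_window_args(timeout=0.0):
    """Build kernel CLI arguments that override stop_on_error_timeout.

    Assumption: ipykernel exposes the trait as Kernel.stop_on_error_timeout.
    """
    if timeout < 0:
        raise ValueError("timeout must be non-negative")
    return [f"--Kernel.stop_on_error_timeout={timeout}"]


def start_kernel_without_abort_window():
    """Start a python3 kernel with the post-error abort window set to 0."""
    # Imported lazily so abort_window_args works without jupyter_client installed.
    from jupyter_client.manager import start_new_kernel

    # Assumption: extra_arguments is appended to the kernel launch command.
    return start_new_kernel(
        kernel_name="python3",
        extra_arguments=abort_window_args(0.0),
    )
```

The other workaround, waiting after a failed execution, is the `time.sleep(0.2)` variant shown further down.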

I've drafted the simplest example I could to demonstrate the behavior with jupyter_client. Please let me know if any additional details are needed, or if this was an intended or incidental behavior change:

In [1]: from jupyter_client.manager import start_new_kernel

In [2]: km, kc = start_new_kernel(kernel_name='python3')

# This is a dumbed down version of what nbclient does to poll until a cell has finished executing
In [3]: def poll_output_msg(parent_msg_id):
   ...:     msgs = []
   ...:     while True:
   ...:         msg = kc.iopub_channel.get_msg()
   ...:         if msg['parent_header'].get('msg_id') == parent_msg_id:
   ...:             msgs.append(msg)
   ...:         if msg['msg_type'] == 'status' and msg['content']['execution_state'] == 'idle':
   ...:             break
   ...:     return msgs
   ...: 

In [4]: def execute(req):
   ...:     return poll_output_msg(kc.execute(req))
   ...: 

# The first execution will result in a NameError; the second will not print 'post-error' because it is executed immediately after
In [5]: a, b = execute("error_req"), execute("print('post-error')")

Here's the message contents showing the first cell a's error

In [6]: a
Out[6]: 
[{'header': {'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_4',
   'msg_type': 'status',
   'username': 'mseal',
   'session': 'a6355c9a-25ad0390854e5e5912019c1e',
   'date': datetime.datetime(2021, 3, 21, 22, 8, 31, 518706, tzinfo=tzutc()),
   'version': '5.3'},
  'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_4',
  'msg_type': 'status',
  'parent_header': {'msg_id': 'df05c2aa-0ec98f3fa0453ce00aff4699_1',
   'msg_type': 'execute_request',
   'username': 'mseal',
   'session': 'df05c2aa-0ec98f3fa0453ce00aff4699',
   'date': datetime.datetime(2021, 3, 21, 22, 8, 31, 517670, tzinfo=tzutc()),
   'version': '5.3'},
  'metadata': {},
  'content': {'execution_state': 'busy'},
  'buffers': []},
 {'header': {'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_5',
   'msg_type': 'execute_input',
   'username': 'mseal',
   'session': 'a6355c9a-25ad0390854e5e5912019c1e',
   'date': datetime.datetime(2021, 3, 21, 22, 8, 31, 518983, tzinfo=tzutc()),
   'version': '5.3'},
  'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_5',
  'msg_type': 'execute_input',
  'parent_header': {'msg_id': 'df05c2aa-0ec98f3fa0453ce00aff4699_1',
   'msg_type': 'execute_request',
   'username': 'mseal',
   'session': 'df05c2aa-0ec98f3fa0453ce00aff4699',
   'date': datetime.datetime(2021, 3, 21, 22, 8, 31, 517670, tzinfo=tzutc()),
   'version': '5.3'},
  'metadata': {},
  'content': {'code': 'error_req', 'execution_count': 1},
  'buffers': []},
 {'header': {'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_6',
   'msg_type': 'error',
   'username': 'mseal',
   'session': 'a6355c9a-25ad0390854e5e5912019c1e',
   'date': datetime.datetime(2021, 3, 21, 22, 8, 31, 570318, tzinfo=tzutc()),
   'version': '5.3'},
  'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_6',
  'msg_type': 'error',
  'parent_header': {'msg_id': 'df05c2aa-0ec98f3fa0453ce00aff4699_1',
   'msg_type': 'execute_request',
   'username': 'mseal',
   'session': 'df05c2aa-0ec98f3fa0453ce00aff4699',
   'date': datetime.datetime(2021, 3, 21, 22, 8, 31, 517670, tzinfo=tzutc()),
   'version': '5.3'},
  'metadata': {},
  'content': {'traceback': ['\x1b[0;31m---------------------------------------------------------------------------\x1b[0m',
    '\x1b[0;31mNameError\x1b[0m                                 Traceback (most recent call last)',
    '\x1b[0;32m<ipython-input-1-88f24347380f>\x1b[0m in \x1b[0;36m<module>\x1b[0;34m\x1b[0m\n\x1b[0;32m----> 1\x1b[0;31m \x1b[0merror_req\x1b[0m\x1b[0;34m\x1b[0m\x1b[0;34m\x1b[0m\x1b[0m\n\x1b[0m',
    "\x1b[0;31mNameError\x1b[0m: name 'error_req' is not defined"],
   'ename': 'NameError',
   'evalue': "name 'error_req' is not defined"},
  'buffers': []},
 {'header': {'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_8',
   'msg_type': 'status',
   'username': 'mseal',
   'session': 'a6355c9a-25ad0390854e5e5912019c1e',
   'date': datetime.datetime(2021, 3, 21, 22, 8, 31, 572037, tzinfo=tzutc()),
   'version': '5.3'},
  'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_8',
  'msg_type': 'status',
  'parent_header': {'msg_id': 'df05c2aa-0ec98f3fa0453ce00aff4699_1',
   'msg_type': 'execute_request',
   'username': 'mseal',
   'session': 'df05c2aa-0ec98f3fa0453ce00aff4699',
   'date': datetime.datetime(2021, 3, 21, 22, 8, 31, 517670, tzinfo=tzutc()),
   'version': '5.3'},
  'metadata': {},
  'content': {'execution_state': 'idle'},
  'buffers': []}]

and here's cell b where no print message is made.

In [7]: b
Out[7]: 
[{'header': {'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_9',
   'msg_type': 'status',
   'username': 'mseal',
   'session': 'a6355c9a-25ad0390854e5e5912019c1e',
   'date': datetime.datetime(2021, 3, 21, 22, 8, 31, 573157, tzinfo=tzutc()),
   'version': '5.3'},
  'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_9',
  'msg_type': 'status',
  'parent_header': {'msg_id': 'df05c2aa-0ec98f3fa0453ce00aff4699_2',
   'msg_type': 'execute_request',
   'username': 'mseal',
   'session': 'df05c2aa-0ec98f3fa0453ce00aff4699',
   'date': datetime.datetime(2021, 3, 21, 22, 8, 31, 572676, tzinfo=tzutc()),
   'version': '5.3'},
  'metadata': {},
  'content': {'execution_state': 'busy'},
  'buffers': []},
 {'header': {'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_11',
   'msg_type': 'status',
   'username': 'mseal',
   'session': 'a6355c9a-25ad0390854e5e5912019c1e',
   'date': datetime.datetime(2021, 3, 21, 22, 8, 31, 573335, tzinfo=tzutc()),
   'version': '5.3'},
  'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_11',
  'msg_type': 'status',
  'parent_header': {'msg_id': 'df05c2aa-0ec98f3fa0453ce00aff4699_2',
   'msg_type': 'execute_request',
   'username': 'mseal',
   'session': 'df05c2aa-0ec98f3fa0453ce00aff4699',
   'date': datetime.datetime(2021, 3, 21, 22, 8, 31, 572676, tzinfo=tzutc()),
   'version': '5.3'},
  'metadata': {},
  'content': {'execution_state': 'idle'},
  'buffers': []}]

If instead we add a sleep(0.2) between the execution requests, the print request correctly outputs a message:

In [12]: import time

In [13]: a, _, b = execute("error_req"), time.sleep(0.2), execute("print('post-error')")

In [14]: b
Out[14]: 
[{'header': {'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_27',
   'msg_type': 'status',
   'username': 'mseal',
   'session': 'a6355c9a-25ad0390854e5e5912019c1e',
   'date': datetime.datetime(2021, 3, 21, 22, 19, 2, 303236, tzinfo=tzutc()),
   'version': '5.3'},
  'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_27',
  'msg_type': 'status',
  'parent_header': {'msg_id': 'df05c2aa-0ec98f3fa0453ce00aff4699_6',
   'msg_type': 'execute_request',
   'username': 'mseal',
   'session': 'df05c2aa-0ec98f3fa0453ce00aff4699',
   'date': datetime.datetime(2021, 3, 21, 22, 19, 2, 300607, tzinfo=tzutc()),
   'version': '5.3'},
  'metadata': {},
  'content': {'execution_state': 'busy'},
  'buffers': []},
 {'header': {'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_28',
   'msg_type': 'execute_input',
   'username': 'mseal',
   'session': 'a6355c9a-25ad0390854e5e5912019c1e',
   'date': datetime.datetime(2021, 3, 21, 22, 19, 2, 303942, tzinfo=tzutc()),
   'version': '5.3'},
  'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_28',
  'msg_type': 'execute_input',
  'parent_header': {'msg_id': 'df05c2aa-0ec98f3fa0453ce00aff4699_6',
   'msg_type': 'execute_request',
   'username': 'mseal',
   'session': 'df05c2aa-0ec98f3fa0453ce00aff4699',
   'date': datetime.datetime(2021, 3, 21, 22, 19, 2, 300607, tzinfo=tzutc()),
   'version': '5.3'},
  'metadata': {},
  'content': {'code': "print('post-error')", 'execution_count': 5},
  'buffers': []},
 {'header': {'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_29',
   'msg_type': 'stream',
   'username': 'mseal',
   'session': 'a6355c9a-25ad0390854e5e5912019c1e',
   'date': datetime.datetime(2021, 3, 21, 22, 19, 2, 307224, tzinfo=tzutc()),
   'version': '5.3'},
  'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_29',
  'msg_type': 'stream',
  'parent_header': {'msg_id': 'df05c2aa-0ec98f3fa0453ce00aff4699_6',
   'msg_type': 'execute_request',
   'username': 'mseal',
   'session': 'df05c2aa-0ec98f3fa0453ce00aff4699',
   'date': datetime.datetime(2021, 3, 21, 22, 19, 2, 300607, tzinfo=tzutc()),
   'version': '5.3'},
  'metadata': {},
  'content': {'name': 'stdout', 'text': 'post-error\n'},
  'buffers': []},
 {'header': {'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_31',
   'msg_type': 'status',
   'username': 'mseal',
   'session': 'a6355c9a-25ad0390854e5e5912019c1e',
   'date': datetime.datetime(2021, 3, 21, 22, 19, 2, 309692, tzinfo=tzutc()),
   'version': '5.3'},
  'msg_id': 'a6355c9a-25ad0390854e5e5912019c1e_31',
  'msg_type': 'status',
  'parent_header': {'msg_id': 'df05c2aa-0ec98f3fa0453ce00aff4699_6',
   'msg_type': 'execute_request',
   'username': 'mseal',
   'session': 'df05c2aa-0ec98f3fa0453ce00aff4699',
   'date': datetime.datetime(2021, 3, 21, 22, 19, 2, 300607, tzinfo=tzutc()),
   'version': '5.3'},
  'metadata': {},
  'content': {'execution_state': 'idle'},
  'buffers': []}]
@blink1073
Contributor

@glentakahashi do you mind taking a look into this?

@blink1073 blink1073 added this to the 5.5 milestone Mar 22, 2021
@glentakahashi
Contributor

glentakahashi commented Mar 22, 2021

To shed some light on the setting, the main use case is for slow network connections or for large notebooks. The docs have it written:

time (in seconds) to wait for messages to arrive when aborting queued requests after an error.
Requests that arrive within this window after an error will be cancelled.
Increase in the event of unusually slow network causing significant delays, which can manifest as e.g. “Run all” in a notebook aborting some, but not all, messages after an error.

We run ipykernel in situations with slow networks, so we noticed exactly this behavior: when hitting "Run all", it would randomly run some of the cells near the end of a notebook even though cells above them had errored.

This has been a feature for ~3yrs but had an issue before which caused it to not function as intended and not properly respect the timeout. Between the releases you noted, I pushed a fix which now causes it to respect the timeout.

As for not getting aborted messages, are you listening on the shell channel as well? I'm seeing aborted messages come through properly in the backend. (And the frontend behavior works with jupyterlab/notebook as well.) https://jupyter-client.readthedocs.io/en/stable/messaging.html#execution-results

An example:
First cell:
Code:

raise Exception("throw error")

Messages:

{
  header: {
    msg_id: '80697e55-7b17ddc90e9cee2ca3d00189_21',
    msg_type: 'status',
    username: 'glen',
    session: '80697e55-7b17ddc90e9cee2ca3d00189',
    date: '2021-03-22T17:27:46.659032Z',
    version: '5.3'
  },
  parent_header: {
    msg_id: 'ad6d622e-26d4-4167-8372-2000ffe229f6',
    date: '2021-03-22T17:27:46.653000Z',
    version: '5.2',
    msg_type: 'execute_request',
    username: 'glen',
    session: '915e0e60-2991-47c9-83b3-8a3f23c8865c',
    traceId: '8e5141f171d24b80bfd92e0c06ab3f43'
  },
  metadata: {},
  content: { execution_state: 'busy' },
  buffers: [],
  channel: 'iopub'
}
{
  header: {
    msg_id: '80697e55-7b17ddc90e9cee2ca3d00189_22',
    msg_type: 'execute_input',
    username: 'glen',
    session: '80697e55-7b17ddc90e9cee2ca3d00189',
    date: '2021-03-22T17:27:46.659744Z',
    version: '5.3'
  },
  parent_header: {
    msg_id: 'ad6d622e-26d4-4167-8372-2000ffe229f6',
    date: '2021-03-22T17:27:46.653000Z',
    version: '5.2',
    msg_type: 'execute_request',
    username: 'glen',
    session: '915e0e60-2991-47c9-83b3-8a3f23c8865c',
    traceId: '8e5141f171d24b80bfd92e0c06ab3f43'
  },
  metadata: {},
  content: { code: 'raise Exception("throw error")', execution_count: 4 },
  buffers: [],
  channel: 'iopub'
}
{
  header: {
    msg_id: '80697e55-7b17ddc90e9cee2ca3d00189_23',
    msg_type: 'error',
    username: 'glen',
    session: '80697e55-7b17ddc90e9cee2ca3d00189',
    date: '2021-03-22T17:27:46.924699Z',
    version: '5.3'
  },
  parent_header: {
    msg_id: 'ad6d622e-26d4-4167-8372-2000ffe229f6',
    date: '2021-03-22T17:27:46.653000Z',
    version: '5.2',
    msg_type: 'execute_request',
    username: 'glen',
    session: '915e0e60-2991-47c9-83b3-8a3f23c8865c',
    traceId: '8e5141f171d24b80bfd92e0c06ab3f43'
  },
  metadata: {},
  content: {
    traceback: [
      '\u001b[0;31m---------------------------------------------------------------------------\u001b[0m',
      '\u001b[0;31mException\u001b[0m                                 Traceback (most recent call last)',
      '\u001b[0;32m<ipython-input-4-71db4764d9cb>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n' +
        '\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m"throw error"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n' +
        '\u001b[0m',
      '\u001b[0;31mException\u001b[0m: throw error'
    ],
    ename: 'Exception',
    evalue: 'throw error'
  },
  buffers: [],
  channel: 'iopub'
}
{
  header: {
    msg_id: '80697e55-7b17ddc90e9cee2ca3d00189_24',
    msg_type: 'execute_reply',
    username: 'glen',
    session: '80697e55-7b17ddc90e9cee2ca3d00189',
    date: '2021-03-22T17:27:46.926239Z',
    version: '5.3'
  },
  parent_header: {
    msg_id: 'ad6d622e-26d4-4167-8372-2000ffe229f6',
    date: '2021-03-22T17:27:46.653000Z',
    version: '5.2',
    msg_type: 'execute_request',
    username: 'glen',
    session: '915e0e60-2991-47c9-83b3-8a3f23c8865c',
    traceId: '8e5141f171d24b80bfd92e0c06ab3f43'
  },
  metadata: {
    started: '2021-03-22T17:27:46.659670Z',
    dependencies_met: true,
    engine: 'b6cdd4f1-9f9b-439e-be12-286ce8769df4',
    status: 'error'
  },
  content: {
    status: 'error',
    traceback: [
      '\u001b[0;31m---------------------------------------------------------------------------\u001b[0m',
      '\u001b[0;31mException\u001b[0m                                 Traceback (most recent call last)',
      '\u001b[0;32m<ipython-input-4-71db4764d9cb>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n' +
        '\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m"throw error"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n' +
        '\u001b[0m',
      '\u001b[0;31mException\u001b[0m: throw error'
    ],
    ename: 'Exception',
    evalue: 'throw error',
    engine_info: {
      engine_uuid: 'b6cdd4f1-9f9b-439e-be12-286ce8769df4',
      engine_id: -1,
      method: 'execute'
    },
    execution_count: 4,
    user_expressions: {},
    payload: []
  },
  buffers: [],
  channel: 'shell'
}
{
  header: {
    msg_id: '80697e55-7b17ddc90e9cee2ca3d00189_25',
    msg_type: 'status',
    username: 'glen',
    session: '80697e55-7b17ddc90e9cee2ca3d00189',
    date: '2021-03-22T17:27:46.927386Z',
    version: '5.3'
  },
  parent_header: {
    msg_id: 'ad6d622e-26d4-4167-8372-2000ffe229f6',
    date: '2021-03-22T17:27:46.653000Z',
    version: '5.2',
    msg_type: 'execute_request',
    username: 'glen',
    session: '915e0e60-2991-47c9-83b3-8a3f23c8865c',
    traceId: '8e5141f171d24b80bfd92e0c06ab3f43'
  },
  metadata: {},
  content: { execution_state: 'idle' },
  buffers: [],
  channel: 'iopub'
}

Run next cell
Code:

print("hello world")

Messages:

{
  header: {
    msg_id: '80697e55-7b17ddc90e9cee2ca3d00189_26',
    msg_type: 'status',
    username: 'glen',
    session: '80697e55-7b17ddc90e9cee2ca3d00189',
    date: '2021-03-22T17:27:46.928822Z',
    version: '5.3'
  },
  parent_header: {
    msg_id: '4f577826-8280-464e-954f-021a824f27de',
    date: '2021-03-22T17:27:46.656000Z',
    version: '5.2',
    msg_type: 'execute_request',
    username: 'glen',
    session: '915e0e60-2991-47c9-83b3-8a3f23c8865c',
    traceId: '8e5141f171d24b80bfd92e0c06ab3f43'
  },
  metadata: {},
  content: { execution_state: 'busy' },
  buffers: [],
  channel: 'iopub'
}
{
  header: {
    msg_id: '80697e55-7b17ddc90e9cee2ca3d00189_28',
    msg_type: 'status',
    username: 'glen',
    session: '80697e55-7b17ddc90e9cee2ca3d00189',
    date: '2021-03-22T17:27:46.929407Z',
    version: '5.3'
  },
  parent_header: {
    msg_id: '4f577826-8280-464e-954f-021a824f27de',
    date: '2021-03-22T17:27:46.656000Z',
    version: '5.2',
    msg_type: 'execute_request',
    username: 'glen',
    session: '915e0e60-2991-47c9-83b3-8a3f23c8865c',
    traceId: '8e5141f171d24b80bfd92e0c06ab3f43'
  },
  metadata: {},
  content: { execution_state: 'idle' },
  buffers: [],
  channel: 'iopub'
}
{
  header: {
    msg_id: '80697e55-7b17ddc90e9cee2ca3d00189_27',
    msg_type: 'execute_reply',
    username: 'glen',
    session: '80697e55-7b17ddc90e9cee2ca3d00189',
    date: '2021-03-22T17:27:46.929229Z',
    version: '5.3'
  },
  parent_header: {
    msg_id: '4f577826-8280-464e-954f-021a824f27de',
    date: '2021-03-22T17:27:46.656000Z',
    version: '5.2',
    msg_type: 'execute_request',
    username: 'glen',
    session: '915e0e60-2991-47c9-83b3-8a3f23c8865c',
    traceId: '8e5141f171d24b80bfd92e0c06ab3f43'
  },
  metadata: { engine: 'b6cdd4f1-9f9b-439e-be12-286ce8769df4', status: 'aborted' },
  content: { status: 'aborted' },
  buffers: [],
  channel: 'shell'
}
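
For clients that only watch iopub, the aborted request looks like an empty busy/idle pair, as in the second cell above. A minimal sketch of reading the outcome from the shell-channel `execute_reply` instead follows. The helper names are hypothetical, and `get_shell_msg` is assumed to be the blocking client's method for reading shell replies.

```python
def reply_status(reply):
    """Return 'ok', 'error' or 'aborted' from an execute_reply message dict."""
    if reply.get("header", {}).get("msg_type") != "execute_reply":
        raise ValueError("not an execute_reply message")
    return reply["content"]["status"]


def wait_for_reply(kc, msg_id, timeout=5):
    """Read shell messages from client kc until the reply for msg_id arrives."""
    while True:
        # Assumption: raises queue.Empty if nothing arrives within the timeout.
        reply = kc.get_shell_msg(timeout=timeout)
        if reply["parent_header"].get("msg_id") == msg_id:
            return reply


# Trimmed copy of the aborted reply shown above.
aborted = {
    "header": {"msg_type": "execute_reply"},
    "parent_header": {"msg_id": "4f577826-8280-464e-954f-021a824f27de"},
    "content": {"status": "aborted"},
}
print(reply_status(aborted))  # aborted
```

A poller could call `wait_for_reply(kc, kc.execute(code))` alongside the iopub loop and treat `'aborted'` as a distinct outcome rather than as an empty result.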

@MSeal
Contributor Author

MSeal commented Mar 23, 2021

We run ipykernel in situations with slow networks, so we noticed exactly this behavior: when hitting "Run all", it would randomly run some of the cells near the end of a notebook even though cells above them had errored.

A 100ms reject window is a bit of a rough solution to this, as it forces interfaces to now be aware of safe windows to send requests depending on underlying kernels / versions of kernels. I could imagine that network latency on a bad connection is sometimes worse than 100ms for the underlying problem too. Not saying I see a perfect solution, but there is some bleeding of abstraction responsibilities in a fuzzy way here that was unexpected (from my pov).

This has been a feature for ~3yrs but had an issue before which caused it to not function as intended and not properly respect the timeout. Between the releases you noted, I pushed a fix which now causes it to respect the timeout.

I somewhat pieced that together, but I wasn't sure what the intended vs actual behavior was given this was not the behavior in 5.4.3. It would have saved a fair bit of debugging time if this had made the changelog for the release, fyi.

As for not getting aborted messages, are you listening on the shell channel as well?

Hmm I'm not sure why my example didn't pick up the shell channel message off-hand. Apologies I didn't notice that before posting.

@glentakahashi
Contributor

I can see that on your first point RE bleeding abstractions. I think that in a perfect world this wouldn't be necessary, and instead the Jupyter protocol would allow for queueing multiple source/cell requests in a single request, or potentially queueing them up before submitting, but I think that's unlikely given how big of a change it might be. (I wouldn't have the time to dedicate to such a thing.)

The other thing that I see that could make this less leaky is just removing the feature altogether and instead have clients implement this aborting behavior, but since that's a breaking change it would have to be done in stages. I do think it would also be a better long term solution, as the clients themselves would have more knowledge about what the intended queueing/execution behavior is, and could choose how they wanted to implement the stop_on_error timeout themselves.

Also my fault on the changelog, I should have made it more clear what the implications of the fix were and what to look out for in tests.

@MSeal what do you suggest the best path forward here is? I don't think we should revert anything with regards to the fix, and I think that the fix you implemented in testbook is a perfectly reasonable fix, as that's why the variable is meant to be configured. If testbook depends on continuing to execute code even after errors, you could set the stop_on_error: false in all the execute_requests that need this behavior or keep using stop_on_error_timeout
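
A sketch of the first option, assuming the client's `execute` method accepts a `stop_on_error` keyword that is copied into the `execute_request` content. The function name `execute_request_content` is hypothetical; the fields follow the Jupyter messaging spec.

```python
def execute_request_content(code, stop_on_error=True):
    """Build the content dict of an execute_request message."""
    return {
        "code": code,
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
        # If False, queued execute_requests still run after this one raises.
        "stop_on_error": stop_on_error,
    }


content = execute_request_content("error_req", stop_on_error=False)
print(content["stop_on_error"])  # False
```

With jupyter_client the equivalent call would be `kc.execute("error_req", stop_on_error=False)` on the request that is expected to fail.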

@MSeal
Contributor Author

MSeal commented Mar 23, 2021

Yeah, I had similar initial thoughts that the Jupyter server (or other interfaces) should be where the change is actually made to manage queued requests. But that's a bit of a larger upstream change.

In the shorter term (maybe for the 6.0 release?) maybe it makes sense to change the timeout value? Would it make sense to set stop_on_error_timeout to 0.0 in future releases so interfaces can opt in to the behavior instead of having to opt out?

All good on the missing changelog entry. I've made similar misses in the past, but I thought I'd bring it up for future changes. I'd recommend retroactively adding the behavioral change to the changelog in case someone else hits the issue unexpectedly (e.g. colab folks).

@glentakahashi
Contributor

I created a PR for the changelog here: #613

But in terms of a short-term fix, I don't think the right option is to change the default stop_on_error_timeout to 0 either. stop_on_error has actually worked for a long time, but the timeout that it waited was just not accurate before. Many other products depend on this being the default (e.g. jupyter notebook), so I think it would be breaking to remove. If anything, maybe we could just reduce the timeout to be slightly less, but still non-zero? E.g. 0.05 or 0.01? I would want to think through the implications of that first, though, to ensure it doesn't break any existing workflows in apps like jupyterlab or notebook.

@MSeal
Contributor Author

MSeal commented Mar 23, 2021

Thanks for updating the changelog.

Maybe it makes sense for issues / conversation to be in the lab / notebook server repos then? There are lots of other kernels (albeit ipykernel is the most common) that run in those environments that don't have a stop_on_error_timeout so the behavior from a notebook interface is going to be inconsistent at a minimum.

Reducing the delay doesn't really buy any gains in guarantee unless it's turned off completely imho. It just makes the potential races win/lose at different spots where interfaces are not aware or make non-timing based assumptions.

@glentakahashi
Contributor

Actually, I was wrong. Now that I think about it, the behavior of stop_on_error_timeout before I implemented the fix was equivalent to stop_on_error_timeout being 0.0 (see #539). So actually I think it's reasonable to set the default to 0.0, although I'd still be wary and want to make sure it doesn't break anything.

I think if you were to submit a PR to set the default to 0.0 and add a short circuit here to skip the future if the timeout is <=0, that would make sense to me. I would want to see that the new version still has tests pass in notebook and jupyterlab and maybe some other common jupyter clients.
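
To make the proposal concrete, here is a toy model of the abort window. It is not ipykernel's implementation: it only models requests that arrive after the error (how already-queued requests are handled is left out), and the function and parameter names are illustrative. The 0.1 default mirrors the 100ms window discussed above.

```python
def aborted_requests(error_time, arrival_times, timeout=0.1):
    """Return the arrival times of requests the toy kernel would abort."""
    if timeout <= 0:
        # Short circuit: with no window, no later request is aborted.
        return []
    window_end = error_time + timeout
    return [t for t in arrival_times if error_time <= t <= window_end]


# One request arrives 1 ms after the error, another 200 ms after.
print(aborted_requests(0.0, [0.001, 0.2]))               # [0.001]
print(aborted_requests(0.0, [0.001, 0.2], timeout=0.0))  # []
```

In this model a client that sends its next request immediately after an error loses it under the 0.1 default, while a default of 0.0 makes the behavior opt-in.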

@MSeal
Contributor Author

MSeal commented Mar 24, 2021

@glentakahashi Sounds good. Let me take a stab at it this weekend and see how far I can get in testing different interfaces I maintain or work adjacent to with such a change.
