Improve the busy/idle execution state tracking for kernels. #1429

ojarjur · 2024-06-05T22:39:22Z

This PR makes two adjustments/enhancements for the way kernel execution state is tracked:

Distinguish between "busy"/"idle" status messages with different parent IDs.

A "busy" or "idle" status message is not sent in isolation, but rather in response to a specific parent message, and
the kernel might still be busy with one parent message even after reporting that it is done with a different parent
message (i.e. after sending an "idle" status for it).

Accordingly, the overall kernel execution state is only switched to "idle" once all previously seen "busy" status
messages have had a corresponding "idle" message sent with the same parent message ID.
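The matching rule described above can be sketched as follows. This is only an illustration of the idea, not the code from this PR: the class name `ParentAwareStateTracker` and the helper `status_msg` are hypothetical, and the merged change may be structured differently (a later commit in this PR drops an in-memory cache of message IDs).

```python
class ParentAwareStateTracker:
    """Hypothetical helper: the kernel counts as busy while any parent is unresolved."""

    def __init__(self):
        self.busy_parents = set()      # parent msg_ids with a "busy" but no "idle" yet
        self.execution_state = "idle"

    def on_status(self, msg):
        parent_id = msg["parent_header"].get("msg_id")
        state = msg["content"]["execution_state"]
        if state == "busy":
            self.busy_parents.add(parent_id)
        elif state == "idle":
            # Only the "idle" for the same parent clears that parent's "busy".
            self.busy_parents.discard(parent_id)
        # Other states (e.g. "starting") are ignored in this sketch.
        self.execution_state = "busy" if self.busy_parents else "idle"
        return self.execution_state


def status_msg(state, parent_id):
    """Build a minimal status message containing only the fields used above."""
    return {
        "parent_header": {"msg_id": parent_id},
        "content": {"execution_state": state},
    }


tracker = ParentAwareStateTracker()
tracker.on_status(status_msg("busy", "req-1"))
tracker.on_status(status_msg("busy", "req-2"))
after_first_idle = tracker.on_status(status_msg("idle", "req-2"))   # req-1 still pending
after_second_idle = tracker.on_status(status_msg("idle", "req-1"))  # nothing pending
```

Here the overall state stays "busy" after the first "idle" because `req-1` has not yet been answered, and only flips to "idle" once both parents have been resolved.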

Distinguish between different types of parent message when determining whether or not the overall kernel is "busy".

Not every message type corresponds to a user action, so not all of them should be included in the logic for determining
that a kernel is "busy". This is especially true for the case of kernel culling where the distinction between "idle" and
"busy" is used to decide whether or not to cull the kernels.

Since there might be different setups where the server admin might want to include/exclude different types of
messages from this calculation, the set of message types used for status tracking is configurable.

This commit changes the way Jupyter server tracks both activity and execution state so that both are based solely on user actions rather than also incorporating control messages and other channels. This matters both for correctly tracking the state of the kernel and for culling kernels.

Previously, the last activity timestamp was updated on every message on the iopub channel regardless of the message type, and execution state was based on every `status` message on that channel regardless of which message the `status` was in response to. That behavior breaks both kernel culling and kernel status reporting. In particular, kernels can be culled either too early or too late, depending on what the user is doing.

For example, if a user ran a long-running code cell and then switched tabs within the JupyterLab UI, the JupyterLab UI would send a control message to the kernel, and the kernel would respond with a status message referencing that control message as its parent. As a result, the Jupyter server would update its execution state to `idle` even though the long-running code cell was still executing. This could cause the kernel to be culled too soon.

Alternatively, if the user was not running anything and just switched tabs within the JupyterLab UI, the last activity timestamp would be updated even though the user didn't run any code cells. That would cause the kernel to be culled too late.

This change fixes both scenarios by making sure that only user actions (and kernel messages in response to those user actions) cause the execution state and last activity timestamp to be updated.

This initial commit contains just the code change so that I can solicit feedback on the proposed change early. The core of the change will have to be new tests, which will come in a subsequent commit as part of the same pull request.
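The too-early culling scenario can be illustrated with a small sketch. Everything here is an assumption made for illustration: the class `ActivityTracker`, its `tracked_types` field, the `status` helper, and the use of `debug_request` as a stand-in for the control message sent on a tab switch are all hypothetical and are not taken from the Jupyter server code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActivityTracker:
    """Hypothetical tracker that only reacts to messages with tracked parent types."""

    tracked_types: frozenset
    execution_state: str = "idle"
    last_activity: Optional[float] = None

    def on_iopub(self, msg, now):
        parent_type = msg["parent_header"].get("msg_type")
        if parent_type not in self.tracked_types:
            return  # not caused by a user action: leave state and timestamp alone
        self.last_activity = now
        if msg["header"]["msg_type"] == "status":
            self.execution_state = msg["content"]["execution_state"]


def status(state, parent_type):
    """Build a minimal iopub status message with only the fields used above."""
    return {
        "header": {"msg_type": "status"},
        "parent_header": {"msg_type": parent_type},
        "content": {"execution_state": state},
    }


tracker = ActivityTracker(tracked_types=frozenset({"execute_request"}))
# The user starts a long-running cell.
tracker.on_iopub(status("busy", "execute_request"), now=100.0)
# A tab switch triggers a control message (stand-in type); its busy/idle pair is ignored.
tracker.on_iopub(status("busy", "debug_request"), now=200.0)
tracker.on_iopub(status("idle", "debug_request"), now=201.0)
```

After these three messages the tracker still reports "busy" with the activity timestamp from the cell execution, so a culler reading these values would neither see a spurious "idle" nor a refreshed timestamp.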


krassowski · 2024-06-06T18:26:30Z

jupyter_server/services/kernels/kernelmanager.py

+            "execute_input",
+            "execute_reply",
+            "execute_request",
+            "inspect_request",


I see it was also discussed in depth in #1360 (comment). I am not sure if there was a resolution though.

> I am not sure if there was a resolution though.

We didn't come to a consensus on the set of message types to include, but I think we did get a pretty clear consensus that the set of message types should be configurable so that server admins can override whatever default we choose.

I don't know if we can get a consensus on the set of types to leave in the default value for the config, but if we do get a consensus that something should be added to (or removed from) this list, then please let me know.

As discussed on the call today, it might be more conservative to change this list to a `not_tracked_message_types` list, as there would be less risk of accidentally forgetting that a certain message type should be counted as kernel activity. It would also mean that the risk changes from "culling a kernel that shouldn't be culled" (bad because of potential data loss) to "not culling a kernel that can be culled" (bad because of potential cost / resource use), where the former seems more disruptive.

I've put together a tentative list of what I think would be the default set of not-tracked message types.

I got this by going through this page, and selecting all of the message types that were either:

Not sent on the shell channel, or

Of the form *_info_(request|reply).

This is almost disjoint from the proposed list for tracked message types. The difference is that the execute_input message that was going to be tracked is included here in the not-tracked list because it is sent on the IOPub channel instead of the shell channel.

The full list is:

comm_info_request,
comm_info_reply,
kernel_info_request,
kernel_info_reply,
shutdown_request,
shutdown_reply,
interrupt_request,
interrupt_reply,
debug_request,
debug_reply,
stream,
display_data,
update_display_data,
execute_input,
execute_result,
error,
status,
clear_output,
debug_event,
input_request,
input_reply
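The selection rule described above can be checked mechanically. The sketch below is an assumption-heavy illustration: the `CHANNEL_BY_TYPE` table is transcribed by hand from the Jupyter messaging documentation rather than read from any library, the comm messages are omitted because they can travel on more than one channel, and `derive_untracked` is a hypothetical helper name.

```python
import re

# Assumed channel for each message type (hand-transcribed, not an official table).
CHANNEL_BY_TYPE = {
    "execute_request": "shell", "execute_reply": "shell",
    "inspect_request": "shell", "inspect_reply": "shell",
    "complete_request": "shell", "complete_reply": "shell",
    "history_request": "shell", "history_reply": "shell",
    "is_complete_request": "shell", "is_complete_reply": "shell",
    "comm_info_request": "shell", "comm_info_reply": "shell",
    "kernel_info_request": "shell", "kernel_info_reply": "shell",
    "shutdown_request": "control", "shutdown_reply": "control",
    "interrupt_request": "control", "interrupt_reply": "control",
    "debug_request": "control", "debug_reply": "control",
    "stream": "iopub", "display_data": "iopub", "update_display_data": "iopub",
    "execute_input": "iopub", "execute_result": "iopub", "error": "iopub",
    "status": "iopub", "clear_output": "iopub", "debug_event": "iopub",
    "input_request": "stdin", "input_reply": "stdin",
}

# Matches names of the form *_info_(request|reply).
INFO_PATTERN = re.compile(r".+_info_(request|reply)")


def derive_untracked(channel_by_type):
    """Select types that are not on the shell channel or that are *_info_* messages."""
    return {
        msg_type
        for msg_type, channel in channel_by_type.items()
        if channel != "shell" or INFO_PATTERN.fullmatch(msg_type)
    }


untracked = derive_untracked(CHANNEL_BY_TYPE)
```

Under these assumptions the derived set contains the 21 types listed above, while shell messages such as `execute_request` and `inspect_request` remain tracked.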

@krassowski @vidartf @Zsailer Can you all please help me double check this?

In particular:

Is my methodology described above the right approach to take?

Does the list I came up with look right?

Thanks, in advance!

Thanks @ojarjur. I'll give this a closer look later today.

This list looks right to me.

Thanks, @ojarjur. I believe this is good to go.

ojarjur · 2024-06-06T20:08:02Z

It looks like the newly added test is flaky and I was able to reproduce that on my local machine; 6 out of 100 runs failed with the same error.

I'll try to eliminate the flakiness and then update this PR.

ojarjur · 2024-06-06T20:49:55Z

> It looks like the newly added test is flaky and I was able to reproduce that on my local machine; 6 out of 100 runs failed with the same error.
>
> I'll try to eliminate the flakiness and then update this PR.

Fixed now; I ran the test 100 times locally and it passed every time.


ojarjur · 2024-06-07T21:47:10Z

> > It looks like the newly added test is flaky and I was able to reproduce that on my local machine; 6 out of 100 runs failed with the same error.
> > I'll try to eliminate the flakiness and then update this PR.
>
> Fixed now; I ran the test 100 times locally and it passed every time.

I think there's a second race condition contributing to the test flakiness; the second one isn't an issue in the test but rather a preexisting race condition in the code that the test uncovered.

Specifically, this code can wind up being run after this code if the kernel starts up quickly enough...

That, in turn, can result in a correctly set "busy" and/or "idle" kernel execution state being overwritten with an erroneous state of "starting".

I'm not sure if the kernel execution state should be set at all in the _async_start_kernel method (it seems to only matter if the kernel manager's ready method raises an exception), but I'm quite certain that if it is set then it must be set before the _finish_kernel_start method is invoked.
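The ordering problem can be reproduced in isolation with a small asyncio sketch. The names below (`FakeKernel`, `finish_start`, `start_kernel`, `state_first`) are hypothetical stand-ins for the methods mentioned above, not the Jupyter server code; the sketch only shows why the "starting" state has to be written before the finishing step is scheduled.

```python
import asyncio


class FakeKernel:
    """Hypothetical stand-in for a kernel object with an execution_state attribute."""

    def __init__(self):
        self.execution_state = None


async def finish_start(kernel):
    # Stands in for the step that records the kernel's real reported state.
    kernel.execution_state = "idle"


async def start_kernel(state_first):
    kernel = FakeKernel()
    if state_first:
        # Safe ordering: write "starting" before the finishing step can run.
        kernel.execution_state = "starting"
    task = asyncio.ensure_future(finish_start(kernel))
    await asyncio.sleep(0)  # yield once, simulating a kernel that starts very quickly
    if not state_first:
        # Racy ordering: this overwrites the state that finish_start already recorded.
        kernel.execution_state = "starting"
    await task
    return kernel.execution_state


racy_result = asyncio.run(start_kernel(state_first=False))
safe_result = asyncio.run(start_kernel(state_first=True))
```

With the racy ordering the final state is the stale "starting" value, while the safe ordering leaves the state reported by the finishing step in place.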


Zsailer · 2024-07-18T16:33:59Z

Amazing work, @ojarjur! Merging 🚀

ojarjur added 11 commits November 17, 2023 00:16

Improve kernel status tracking by matching idle status messages against the corresponding busy status message.

30b45db

Merge branch 'main' of https://github.com/jupyter-server/jupyter_server into ojarjur/fix-kernel-status

d2a952d

Test fixes

4ce5539

Reduce the diff against the current code

fadf5e8

Add a test for execute_state

bdb9814

Fix lint warnings

f37557c

Fix race conditions and deadlocks in the test_execution_state tests

f1eb5a6

Remove sleep that caused test failures on some jobs

c870f7e

Revert unexpected behavior change that affected tests on Windows

50c5f9c

Restore accidentally deleted pydoc

982cb1e

Zsailer mentioned this pull request Jun 6, 2024

Meeting Notes 2024 jupyter-server/team-compass#57

Open

Reduce the diff against the main branch and drop the unneeded, in-memory cache of message IDs

4921682

krassowski added the bug label Jun 6, 2024

krassowski reviewed Jun 6, 2024


Respect status messages that explicitly report a status of "starting"

6257ca1

Fix flakiness in the kernel test_execution_state test

170cde0

ojarjur added 3 commits June 6, 2024 22:19

Make the kernel test_execution_state test more reliable by increasing the threshold for determining that the kernel state is consistent

e583410

Make kernel execution state test more reliable

3ecc26d

Make the test/test_utils.py test pass on Windows

bcc2f1e

ojarjur added 2 commits June 7, 2024 21:49

Fix a race condition in setting the initial kernel execution state

300008e

Simplify the retry logic for the kernel execution state test... instead of retrying the call to get the kernel state just mark the whole test as flaky so it gets retried

101ebed

ianthomas23 mentioned this pull request Jun 13, 2024

Kernel subshells (JEP91) implementation ipython/ipykernel#1249

Merged

ojarjur and others added 3 commits July 10, 2024 16:04

Merge branch 'main' into ojarjur/fix-kernel-status

0e8bd97

Switch from having a list of tracked message types for user activity monitoring to a list of untracked message types

54ea903

Re-introduce retries in the execution status test to further reduce flakiness caused by race conditions

1b3ea06

Zsailer approved these changes Jul 18, 2024


Zsailer merged commit a6d2d35 into jupyter-server:main Jul 18, 2024
36 checks passed

ojarjur deleted the ojarjur/fix-kernel-status branch July 30, 2024 16:03

ojarjur commented Jun 5, 2024

This comment was marked as resolved.

krassowski Jun 6, 2024

ojarjur Jun 13, 2024

vidartf Jul 11, 2024

ojarjur Jul 12, 2024

Zsailer Jul 12, 2024

Zsailer Jul 18, 2024

ojarjur commented Jun 6, 2024

ojarjur commented Jun 6, 2024

ojarjur commented Jun 7, 2024

Zsailer commented Jul 18, 2024
