-
-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Execute Requests Immediately Following and Error are Aborted in 5.5 #609
Comments
@glentakahashi do you mind taking a look into this? |
To shed some light on the setting, the main use case is for slow network connections or for large notebooks. The docs have it written:
We run This has been a feature for ~3yrs but had an issue before which caused it to not function as intended and not properly respect the timeout. Between the releases you noted, I pushed a fix which now causes it to respect the timeout. As for not getting aborted messages, are you listening on the shell channel as well? I'm seeing aborted messages come through properly in the backend. (And the frontend behavior works as well with jupyterlab/notebook as well). https://jupyter-client.readthedocs.io/en/stable/messaging.html#execution-results An example:
Messages:
Run next cell
Messages:
|
A 100ms reject window is a bit of a rough solution to this as it forces interfaces to now be aware of safe windows to send requests depending on underlying kernels / versions of kernels. I could imagine sometimes network latency in a bad connection is worse than 100ms for the underlying problem too. Not saying I see a perfect solution but there is some bleeding of abstraction responsibilities in a fuzzy way here that were unexpected (from my pov).
I somewhat pieced that together, but I wasn't sure what the intended vs actual behavior was given this was not the behavior in 5.4.3. It would have saved a fair bit of debugging time if this had made the changlog for the release fyi.
Hmm I'm not sure why my example didn't pick up the shell channel message off-hand. Apologies I didn't notice that before posting. |
I can see that on your first point RE bleeding abstractions. I think that in a perfect world, this wouldn't be necessary and instead the Jupyter protocol allowed for queueing multiple source/cell requests in a single request, or potentially queueing them up before submitting, but I think that's unlikely given how big of a change it might be. (I wouldn't have the time to dedicate to such a thing) The other thing that I see that could make this less leaky is just removing the feature altogether and instead have clients implement this aborting behavior, but since that's a breaking change it would have to be done in stages. I do think it would also be a better long term solution, as the clients themselves would have more knowledge about what the intended queueing/execution behavior is, and could choose how they wanted to implement the Also my fault on the changelog, I should have made it more clear what the implications of the fix were and what to look out for in tests. @MSeal what do you suggest the best path forward here is? I don't think we should revert anything with regards to the fix, and I think that the fix you implemented in |
Yeah I had similar initial thoughts on the Jupyter server (or other interfaces) should be where the change is actually made to manage queue requests. But that's a bit of a larger upstream change. In the shorter term (maybe for the 6.0 release?) maybe it makes sense to change the timeout value? Would it make sense to set All good on missing changelog, I've made similar misses in the past but I thought I'd bring it up for future changes. I'd recommend retroactively add the behavioral change to the changelog in case someone else hits the issue unexpectedly (e.g. colab folks). |
I created a PR for the changelog here: #613 But in terms of short term fix, I don't think the right option is to change the default |
Thanks for updating the changelog. Maybe it makes sense for issues / conversation to be in the lab / notebook server repos then? There are lots of other kernels (albeit ipykernel is the most common) that run in those environments that don't have a Reducing the delay doesn't really buy any gains in guarantee unless it's turned off completely imho. It just makes the potential races win/lose at different spots where interfaces are not aware or make non-timing based assumptions. |
Actually I was wrong. Now that I think about it, the behavior of how I think if you were to submit a PR to set the default to 0.0 and add a short circuit here to skip the future if the timeout is <=0, that would make sense to me. I would want to see that the new version still has tests pass in |
@glentakahashi Sounds good. Let me take a stab at it this weekend and see how far I can get in testing different interfaces I maintain or work adjacent to with such a change. |
In debugging nteract/testbook#88, which was traced to an issue in testbook behavior causing empty execute response messages with ipykernel 5.5 in several tests, I believe we have tacked down the change in behavior to this commit. It appears to have changed the error abort interaction such that any additional execute requests within
stop_on_error_timeout
after an error occurs are aborted.The docs state
Requests that arrive within this [stop_on_error_timeout] window after an error will be cancelled.
. What was the intention for this feature and why did it change behavior in 5.5? The message sent are also odd if this was intentional behavior because there's no indicator in the execute_result that the intended execution was aborted. I feel an error message or status of sorts would be expected unless I am missing something here?I believe this is unintended behavior, as
5.4.3
does not have this behavior and the5.5.0
changelog has no mention of it. Additionally notebooks withallow-error
flags could also hit this issue unexpectedly, depending on how the notebook executor was queuing / processing the requests. Testbook certainly hits this because it's often testing that something failed or trying optimistic execution paths. For now I think the only way around the issue is to wait after failed executions or set thestop_on_error_timeout
to 0.I've drafted the simplest example I could to demonstrate behavior with
jupyter_client
, please let me know if any additional details are needed / if this was intended and incidental behavior change:Here's the message contents showing the first cell a's error
and here's cell b where no print message is made.
If instead we add a
sleep(0.2)
between the execution requests, the print request correctly outputs a message:The text was updated successfully, but these errors were encountered: