Fixing #659 (output cutoff and iopub timeout) #994
Conversation
Overall the approach looks good. After re-reading it more carefully, I'm not sure we can do much better on the timeout handling than what's here, even compared to the direction I suggested might be a way to go. The code reads clearly and it's a good implementation of the constraints.
You will have a rather annoying merge conflict (apologies for that) with some refactors of run_cell that got merged. If you need help resolving those conflicts I can push the resolution onto your PR branch.
Next steps after resolving the conflicts would be tests: we'll want a few to cover the new timeout conditions as well as the original failure mode you had. Likely we'll have to emulate the original failure by mocking either the zeromq channel setup or by mimicking the behavior of losing messages after a point.
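(As an illustration of that kind of emulation, one rough way to mimic "losing messages after a point" is a mocked message reader that returns a few messages and then times out forever. The helper name and the timeout-keyword signature below are assumptions for this sketch, not the actual test harness.)

from queue import Empty
from unittest.mock import MagicMock

def make_lossy_get_msg(messages):
    # Returns a mock get_msg: yields the given messages in order, then behaves
    # as if the channel has gone silent by raising Empty on every later call.
    remaining = list(messages)

    def get_msg(timeout=None):
        if remaining:
            return remaining.pop(0)
        raise Empty()  # simulate messages being lost after this point

    return MagicMock(side_effect=get_msg)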
I seem to be running into some flakiness with the tests: everything passes on my machine, but Travis randomly passes/fails a few different test methods. It does possibly look related to my changes (or possibly the test is inherently racy too?).
So the python 3 failure you're seeing is some race that only occurs on Travis. I've had to restart many Travis builds to avoid it. I haven't been able to reproduce the race on my local machine yet to figure out why the kernel is dying. It might be related to resources on Travis or slow disks not responding before a timeout. For your PR we'll need to look closely at each kernel timeout since we're touching that logic, but if it passes on a restarted build it's likely a pre-existing test issue. For python 2 you have some 2/3 compatibility issues in the code you'll need to resolve.
Seeing that also on other PRs related to kernels.
@MSeal I merged in latest and fixed the compat issues, what would you like next?
Gah, more failures, I'll fix them.
Thanks a ton for getting the PR updated @djherbis -- and sorry for all the merge conflicts with the 5.5 release stuff. I think the next steps are making some tests that mock the behavioral change, specifically where we pass the deadline in both the exec_reply and get_msgs code paths, for both the exception-raising and non-exception-raising cases.
Figured out that random test failure btw, it's unrelated to your changes. Let me know if you need any help with getting tests added.
@MSeal running into a little bit of pain with tests. I was adding them under test_execute.py::TestRunCell; the main issue seems to be controlling the timeout in there (there's no way to introduce a delay at the moment, and using timeout=0 gets converted to timeout=None internally). Should I be adding these tests under test_execute.py::TestExecute::test_run_notebooks, or somewhere else?
Yea, under TestRunCell is probably right. The way to achieve this test is to mock the get_msg calls. If you need help getting those test mocks set up I can probably help push a starting point onto your PR branch later today.
I tried directly mocking the get_msg calls for some exceptional cases; let me know if this is an alright way to accomplish it.
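(As a sketch of that style of mock: unittest.mock allows side_effect to mix return values and exceptions, so a call sequence can return a couple of messages and then raise Empty to exercise the timeout path. The message contents below are made up purely for illustration.)

from queue import Empty
from unittest.mock import MagicMock

stream_msg = {"msg_type": "stream",
              "content": {"name": "stdout", "text": "hello"}}
idle_msg = {"msg_type": "status",
            "content": {"execution_state": "idle"}}

# First call returns output, the second raises Empty (a timeout), and the
# third ends the cell with an idle status message.
get_msg = MagicMock(side_effect=[stream_msg, Empty(), idle_msg])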
Thanks that makes more sense, wasn't super clear to me how to use the mock that way. Anything else left?
if more_output:
    try:
        timeout = self.iopub_timeout
So now with all the functionality in place, we likely need to address the fact that iopub_timeout no longer acts as it used to. In particular, since we aren't waiting for exec_reply to fully complete, we'll expect to time out here often, which means that raise_on_iopub_timeout will fire when users wouldn't expect.
@mpacer I think we should perhaps remove raise_on_iopub_timeout and iopub_timeout with this change and use timeout = self._timeout_with_deadline(1, deadline) for both conditional branches. Those other configurations just don't make any sense any more and are problematic if defined as they were in earlier versions. What are your thoughts? Even if we do a less severe change, we'd need to address the change in timeout behavior somehow with this PR merging.
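(_timeout_with_deadline is referenced here but not shown in the conversation. A minimal sketch of the idea, assuming deadline is a monotonic-clock timestamp, is to clamp each per-poll timeout to the time remaining before the overall deadline; this is an assumption about its behavior, not the actual nbconvert implementation.)

from time import monotonic

def timeout_with_deadline(timeout, deadline):
    # Assumed behavior: never wait longer than the time left until the overall
    # deadline, so per-poll waits cannot push past the cell's total timeout.
    if deadline is not None:
        remaining = deadline - monotonic()
        if remaining < timeout:
            return remaining
    return timeout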
Actually, I tried to preserve that behavior: it won't hit the raise_on_iopub_timeout unless polling_exec_reply is false, and it's only false if we've passed the deadline or if exec_reply is not None (completed).
If we've passed the deadline we hit handle_timeout, which will either interrupt or throw. If it throws, it won't reach this section; if it interrupts, I suspect we would get an idle message?
Let me know if I got anything wrong here.
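(A condensed, illustrative view of the control flow described above. The variable names follow the PR discussion, but the helpers passed in are stand-ins rather than real ExecutePreprocessor methods, and the body is a simplified sketch of run_cell, not the actual code.)

from queue import Empty
from time import monotonic

def drain_cell(poll_exec_reply, get_iopub_msg, handle_timeout,
               deadline, iopub_timeout, raise_on_iopub_timeout):
    more_output = True
    polling_exec_reply = True
    exec_reply = None

    while more_output or polling_exec_reply:
        if polling_exec_reply:
            if exec_reply is not None or monotonic() > deadline:
                if exec_reply is None:
                    handle_timeout()  # interrupts the kernel or raises
                polling_exec_reply = False
            else:
                # Poll the shell channel briefly; returns None if not ready yet.
                exec_reply = poll_exec_reply(timeout=1)

        if more_output:
            # iopub_timeout / raise_on_iopub_timeout only apply once we have
            # stopped polling for exec_reply (execution finished or timed out).
            timeout = iopub_timeout if not polling_exec_reply else 1
            try:
                msg = get_iopub_msg(timeout=timeout)
            except Empty:
                if not polling_exec_reply:
                    if raise_on_iopub_timeout:
                        raise
                    more_output = False
                continue
            if msg.get("content", {}).get("execution_state") == "idle":
                more_output = False
    return exec_reply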
Ahh yes, you're correct, my end-of-conference brain power failed to read it properly. Then maybe we should deprecate those fields in the next release (5.5.1) so they can be removed in 6.0? That would help clean up this code and make it simpler to control in the future imo.
We should add a test for this case if it's not covered yet, to ensure we won't time out unexpectedly. That might be tricky to write with Mocks, so I can probably make that test if it's not clear how to set it up. After that I think I'm good to merge once @mpacer approves / has any further concerns resolved.
Hey sorry, I was also gone for a week-long work event this week.
Looks like I've got more merge conflicts to fix, I'll take a look Monday.
Sorry there have been so many merge conflicts. I've been trying to be more careful to review / get more opinions on this PR before merging. There shouldn't be too many more conflicts. I asked @mpacer in person to take a look again when she's available so we don't keep this PR open for much longer.
I am not quite following the reasoning behind why the fields need to be deprecated. Could you lay out the situation where these fields cease to make sense?
@MSeal probably has a better answer than me, but my interpretation was that iopub_timeout exists to give a small window after receiving exec_reply (so we know all output should be produced by then) for the output to reach the iopub_channel (in case of network latency etc.). I think @MSeal might feel that, since we are polling for this output more regularly (and not just after exec_reply), the purpose of this flag has changed somewhat. Perhaps I shouldn't be using iopub_timeout at all unless exec_reply has finished, and should use a constant timeout / deadline similar to _poll_for_reply. I could be misinterpreting @MSeal though, so please correct me!
Even with all of that, I still think iopub_timeout is relevant after exec_reply has been received, since you may still need to wait for output produced near the end of execution to arrive.
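(To make that original role concrete, here is a short sketch of the post-exec_reply pattern described above: drain any trailing output, using iopub_timeout as the grace window. This illustrates the idea only; get_iopub_msg is a stand-in callable, not the nbconvert API surface.)

from queue import Empty

def drain_trailing_output(get_iopub_msg, iopub_timeout, raise_on_iopub_timeout):
    outputs = []
    while True:
        try:
            msg = get_iopub_msg(timeout=iopub_timeout)
        except Empty:
            # Nothing more arrived within the grace window after exec_reply.
            if raise_on_iopub_timeout:
                raise
            break
        if msg.get("content", {}).get("execution_state") == "idle":
            break  # the kernel reports it is done producing output
        outputs.append(msg)
    return outputs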
That's pretty much on point. It's odd to have a post-exec_reply, response-specific timeout in addition to an overall timeout when the two channels are being consumed in parallel.
I'll start a separate PR / Issue on the field independent of this PR to avoid leaving this open longer.
@MSeal Thanks, what do you think about using a constant (like 1s) instead of the iopub_timeout in this PR for the poll to iopub_channel while polling_exec_reply=True, and then switching to iopub_timeout once exec_reply is received?
Possibly that would be a way to approach it. In general I'm saying that making it simpler and having one timeout would reduce confusion and be less nuanced.
more_output = True
polling_exec_reply = True

while more_output or polling_exec_reply:
more_output being used as a signaling mechanism feels like a pretty significant piece of internal logic. Could you add a comment explaining why we're taking this approach? I would think a description about the intention of the separate roles of more_output and polling_exec_reply would be enough to address the curious individual.
Great point, I added some detailed comments around this section to explain why it's written this way. Let me know if it's still not clear for future readers.
Overall this looks like a nice set of changes! Thank you @djherbis and @MSeal for working through these details.
I asked for some more documentation and have a question about the deprecation-strategy comment, but I think those could be addressed in later PRs.
If you want to put them in this PR too, I'll give that a review, but otherwise I think this looks good to merge.
@@ -510,6 +515,65 @@ def test_busy_message(self, preprocessor, cell_mock, message_mock):
    # Ensure no outputs were generated
    assert cell_mock.outputs == []

@prepare_cell_mocks({
These all look great!
All the tests are great work by @MSeal (thank you!). It's my first contribution, so I was unfamiliar with how to use these test frameworks.
@djherbis Ty so much for going through all the bumps with the PR. Really appreciate you driving the change despite the time it took.
This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there: https://discourse.jupyter.org/t/nbconvert-5-6-0-release/1867/1
See the following links for details on the issues with IOPub and nbconvert's ExecutePreprocessor:
- nteract/papermill#426 (comment)
- jupyter/nbconvert#994
@MSeal @mpacer This is probably a little rough but figured this might be a good place to start.
I still need to add a test showing that this fixes my example. #659
I'll follow up with changing things towards this #659 (comment), though the reason this seems more complex than the suggestion there is that I was trying to maintain the current behavior in as many cases as possible.
In particular, the test cases seemed to fail with a "The handle is invalid" error if exec_reply is polled too late, but I'll still give a shot at your suggestion and let you know how it goes.
I'm busy today but might try to play with this Sunday or when I get back to work on Monday.