nixos/test-runner: Fix execute() flakiness #142747

dasJ · 2021-10-24T13:49:11Z

Motivation for this change

Things done

roberth · 2021-10-24T14:38:19Z

nixos/lib/test-driver/test-driver.py

+            # (for example when 4094 bytes of output were written, half
+            # of the status code magic would be in one chunk and the other
+            # half in the next one).
+            chunk_to_check = prev_chunk + chunk


This assumes that chunks are sufficiently big: at least half the length of the magic string.

chunk_to_check: it's not really a chunk anymore when they've been combined. Maybe bytes_to_check?

Maybe StreamReader.readuntil could make this code redundant?

Otherwise, see below.

I tried StreamReader but it's async stuff and I'm not going to implement that.

roberth · 2021-10-24T14:57:02Z

nixos/lib/test-driver/test-driver.py

+                return (status_code, (output + output2).decode())
+
+            output += prev_chunk
+            prev_chunk = chunk


This could avoid the minimum chunk size assumption by taking something like chunk_to_check[-max_match_size:] instead of chunk. Also, prev_chunk should probably be renamed to something like search_buffer.
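The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `split_at_marker` and its variables are hypothetical names, and the chunks are taken from any iterable rather than from the real shell socket. Only the last `len(marker) - 1` bytes are carried over between chunks, so no minimum chunk size is assumed.

```python
from typing import Iterable, Optional, Tuple


def split_at_marker(
    chunks: Iterable[bytes], marker: bytes
) -> Optional[Tuple[bytes, bytes]]:
    """Return (bytes before marker, bytes after it in the same window), or None."""
    keep = len(marker) - 1  # longest marker prefix that can straddle a boundary
    done = []               # bytes already known to be plain output
    search_buffer = b""     # short tail carried over to the next chunk
    for chunk in chunks:
        window = search_buffer + chunk
        idx = window.find(marker)
        if idx != -1:
            return b"".join(done) + window[:idx], window[idx + len(marker):]
        # Everything except the last `keep` bytes can no longer start a match.
        cut = max(len(window) - keep, 0)
        done.append(window[:cut])
        search_buffer = window[cut:]
    return None
```

Because the carried tail is bounded by the marker length, the search works even if the data arrives one byte at a time.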

roberth · 2021-10-24T14:58:29Z

nixos/lib/test-driver/test-driver.py

@@ -583,23 +583,44 @@ def require_unit_state(self, unit: str, require_state: str = "active") -> None:
                )

    def execute(self, command: str) -> Tuple[int, str]:
+        status_code_magic = "|!=EOF"


I like the trick used in HTTP chunked encoding. If you use a sufficiently long random string, the probability of a false positive is effectively zero.
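A possible way to get such a marker, as a hedged sketch: `make_marker` and `wrap_command` are hypothetical helpers, and the `EOF-` prefix is an arbitrary choice. The wrapped command has the same shape as the `out_command` in this PR's diff, with the fixed magic string replaced by a per-call random one.

```python
import secrets


def make_marker(nbytes: int = 16) -> str:
    """Return a fresh marker carrying 8 * nbytes bits of randomness."""
    return "EOF-" + secrets.token_hex(nbytes)


def wrap_command(command: str, marker: str) -> str:
    # The marker only contains hex digits and "EOF-", so single quotes are safe.
    return f"( set -euo pipefail; {command} ); echo '{marker}' $?\n"
```

With 128 random bits per call, command output would have to guess the marker to produce a false positive.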

roberth · 2021-10-24T15:00:12Z

nixos/lib/test-driver/test-driver.py

+                status_code = int(status_code_b.strip())
+                return (status_code, (output + output2).decode())
+
+            output += prev_chunk


You could create a queue of chunks instead, joining them in one go before .decode(). That avoids quadratic complexity.
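A minimal sketch of that idea, assuming the chunks come from any iterable of `bytes` (the function name `collect_output` is made up for illustration):

```python
from typing import Iterable, List


def collect_output(chunks: Iterable[bytes]) -> str:
    """Gather chunks in a list and join them once, instead of `output += chunk`."""
    parts: List[bytes] = []
    for chunk in chunks:
        parts.append(chunk)  # amortized O(1) per chunk
    # One join and one decode at the end; decoding once also copes with a
    # multi-byte UTF-8 character that was split across two chunks.
    return b"".join(parts).decode()
```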

dasJ · 2021-10-24T18:07:34Z

I don't care enough about this problem to learn Python, sorry. Maybe someone else will come up with a better understanding of the language and a better solution (cc @K900).

pennae · 2021-10-24T19:21:20Z

nixos/lib/test-driver/test-driver.py

+        # exit code it has, so we execute the command, output a special magic
+        # string afterwards and then the return code.
+        out_command = (
+            f"( set -euo pipefail; {command} ); echo '{status_code_magic}' $?\n"


last time we did this sort of thing we did something like this:

(
  _stdout=$(mktemp)
  _stderr=$(mktemp)
  trap 'rm $_stdout $_stderr' EXIT
  (set -euo pipefail; {command}) >$_stdout 2>$_stderr
  echo RESULT: $? $(stat -c %s $_stdout $_stderr)
  cat $_stdout $_stderr
)

while that is significant extra wrapping it also obviates the need to check for magic substrings, replies can be handled as having one readable line of header and two known-length data blobs following—any output on stderr immediately signals a failure of the wrapper. though that only works well if commands don't echo gigabytes worth of data on stdout/stderr. :/
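On the reading side, such a reply could be parsed along these lines. This is a sketch under assumptions: the header is taken to have the form `RESULT: <status> <stdout bytes> <stderr bytes>`, the whole reply is assumed to be in memory already, and `parse_reply` is a hypothetical name.

```python
from typing import Tuple


def parse_reply(reply: bytes) -> Tuple[int, bytes, bytes]:
    """Split a reply into (exit status, stdout blob, stderr blob)."""
    header, _, body = reply.partition(b"\n")
    fields = header.split()
    if len(fields) != 4 or fields[0] != b"RESULT:":
        raise ValueError(f"unexpected header: {header!r}")
    status, n_out, n_err = (int(field) for field in fields[1:])
    if len(body) != n_out + n_err:
        raise ValueError("body length does not match the header")
    # The two blobs follow the header back to back, stdout first.
    return status, body[:n_out], body[n_out:]
```

Since both lengths are known up front, the output itself never has to be scanned for a delimiter.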

dasJ · 2021-10-24T19:24:13Z

Looks like I do care… I reimplemented everything to use base64. This way we get around this shifting window stuff by just matching \n.

andir · 2021-10-25T09:29:47Z

nixos/lib/test-driver/test-driver.py

+            decoded = chunk.decode()
+            print(f"c={decoded}")
+            output_buffer += [decoded]
+            if decoded[-1] == "\n":


Is this guaranteed to be the last character? What happens if we receive two lines in one recv call?

Then this breaks. But I don't see how we would receive two lines when we base64-encode the output and tell base64 to not add \n after 76 characters

Also, the exit code won't be printed before this method returns, so LGTM.
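To illustrate the base64 approach discussed here, a rough sketch follows. It is not the driver's actual code: `read_base64_line` is a hypothetical name, and `recv` stands in for something like `socket.recv`. It splits at the first newline rather than checking only the last character, so a second line arriving in the same `recv` call would not confuse it.

```python
import base64
from typing import Callable


def read_base64_line(recv: Callable[[int], bytes]) -> bytes:
    """Read until the first newline and decode the base64 line before it."""
    buf = bytearray()
    while b"\n" not in buf:
        chunk = recv(4096)
        if not chunk:  # empty read: the peer went away, stop instead of looping
            break
        buf += chunk
    line, _, _rest = bytes(buf).partition(b"\n")
    # Assumption: the line is complete. A line cut short by a broken pipe may
    # not be valid base64, which this sketch does not handle.
    return base64.b64decode(line)
```

Base64 output never contains a newline when wrapping is disabled, so the newline is an unambiguous end marker.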

dasJ · 2021-10-25T14:43:23Z

@GrahamcOfBorg eval

cole-h · 2021-10-26T19:27:22Z

@ofborg eval

dasJ · 2021-10-26T21:24:00Z

I had to add a flag to execute() for the reasons outlined in the documentation

nixos/doc/manual/development/writing-nixos-tests.section.md

tfc · 2021-10-27T17:27:37Z

nixos/lib/test-driver/test-driver.py

@@ -582,24 +582,37 @@ def require_unit_state(self, unit: str, require_state: str = "active") -> None:
                    + "'{}' but it is in state ‘{}’".format(require_state, state)
                )

-    def execute(self, command: str) -> Tuple[int, str]:
+    def _read_line_from_shell(self) -> str:


from following the notifications i see that there are still some changes happening, although i had no time this week to look deeper into all of them.

when you found the good/best working version, i suggest renaming this function. from a function with the current name i would expect that it reads output line by line so i could call it repeatedly (or use it as an iterable generator, which would also be very fancy if it yielded individual lines). but instead, the function blocks until it receives a long string with potentially multiple lines until the last received portion ends with a newline (which might not even always be the case?)

It was already done (at least that's what I thought), but then I found it breaks the tests I'm now fixing, and then the documentation was wrong and so on… so it should now really be done.

What would you recommend as a name? It's actually correct imo because it reads a single line from the chardev that connects to the backdoor that is called shell but I can see how this generic name is misleading. Maybe something like _read_line_from_shell_pipe or something like this?

While re-reading the text, carefully trying to describe the semantics in my own words, in order to use that for a good name, i got another question:

what does not decoded really mean? does this boil down to an empty string? i did not find an explanation in what cases decode() might return a null object

I have only found one case: when the pipe breaks. You then get an empty string (or whatever the type is). The if not decoded check is there to prevent infinite loops when that happens. We probably cannot recover from that situation anyway (the pipe is already broken), and returning with parts of the output is preferable to hanging.

The case with the broken pipe is where check_return comes into play. When you don't set it to False, execute() tries to write echo ${PIPESTATUS[0]}\n into the shell socket which fails when the pipe is broken, thus making the broken pipe detectable as opposed to just hanging forever.
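The two effects described here can be reproduced with a plain socket pair. This is an illustrative sketch with hypothetical helper names, not the test driver's code: an empty `recv` result signals that the peer is gone, and a later write to the same socket fails with an `OSError`.

```python
import socket


def read_until_newline(sock: socket.socket) -> bytes:
    """Collect bytes until a chunk ends with a newline or the peer closes."""
    parts = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # b"" means the other end is gone; avoid looping forever
            break
        parts.append(chunk)
        if chunk.endswith(b"\n"):
            break
    return b"".join(parts)


def can_write(sock: socket.socket, data: bytes) -> bool:
    """Return False if the write fails, for example with BrokenPipeError."""
    try:
        sock.sendall(data)
        return True
    except OSError:
        return False
```

For example, after `a, b = socket.socketpair()`, sending `b"partial"` on `b` and closing it makes `read_until_newline(a)` return the partial data, and a subsequent `can_write(a, ...)` reports the failure instead of hanging.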

pipe breaks should already be visible at the recv call, shouldn't they? what do you think about putting the check there and then call the function something along the lines next_newline_closed_block or something, because this is really what we do. (one might still take such unusual [but correct] names as a hint that the overall design of this function has not reached the ideal state yet although it works for our relevant cases)

tries to write echo ${PIPESTATUS[0]}\n into the shell socket which fails when the pipe is broken

This implies that the test runner (python) broke the pipe, which I can only see happening when the test fails or crashes, so breaking the pipe in this direction doesn't need to be considered.

But it happens. When the hibernate test for example shuts down the VM by means of systemctl hibernate, the pipe breaks because the qemu process dies. That's exactly where I observed the broken pipes

Ah of course. These aren't normal processes.

Instead of using the magic string, we now just base64-encode everything and check for a newline.

dasJ · 2021-11-02T11:56:41Z

@tfc any updates to this? I'm starting to feel pressure from the soon to be expected feature freeze

fgaz · 2021-11-04T13:09:55Z

Looks like this PR broke at least one nixos test: #144613

vcunat · 2021-11-05T17:31:11Z

I was bisecting nixosTests.xfce now and ended up here (1640359; retried a few times on the commit and its parent to confirm). This currently blocks the nixos-unstable channel.

dasJ · 2021-11-05T17:44:55Z

There's a workaround: #144679

roberth · 2021-11-05T20:26:45Z

@vcunat Fix for xfce is in #144795 now

nixos/tests.plasma5: Fix after #142747

Recently, the implementation behind Machine.execute() and thus also Machine.succeed() has been changed[1] to pipe all the command's output into base64 on the guest machine.

Unfortunately this means that base64 is blocking until stdout is closed, which in turn means that we now need to make sure that whenever we run a program in background via "&" we also need to make sure to close stdout.

In the PSI test, we're doing this by simply redirecting the output to stderr.

[1]: NixOS/nixpkgs#142747

Signed-off-by: aszlig <[email protected]>

In NixOS#142747, the implementation behind Machine.execute() has been changed to pipe all the command's output into base64 on the guest machine.

Unfortunately this means that base64 is blocking until stdout is closed, which in turn means that we now need to make sure that whenever we run a program in background via "&" we also need to make sure to close stdout, which we do by redirecting stdout to stderr.

Signed-off-by: aszlig <[email protected]>
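The blocking behaviour can be reproduced outside the test driver. The sketch below is only an analogy: `subprocess.run` with a stdout pipe plays the role of base64, since both read until EOF, and `run_and_time` is a hypothetical helper. It assumes a POSIX `sh` and `sleep` are available.

```python
import subprocess
import time
from typing import Tuple


def run_and_time(script: str) -> Tuple[bytes, float]:
    """Run `script` with sh, read its stdout to EOF, and time how long that takes."""
    start = time.monotonic()
    result = subprocess.run(
        ["sh", "-c", script],
        stdout=subprocess.PIPE,     # the reader waits for EOF, as base64 does
        stderr=subprocess.DEVNULL,
        check=True,
    )
    return result.stdout, time.monotonic() - start


# The background sleep inherits the stdout pipe and keeps it open for a second.
blocked_out, blocked_secs = run_and_time("sleep 1 & echo started")

# Redirecting the background job's stdout to stderr releases the pipe at once.
free_out, free_secs = run_and_time("sleep 1 >&2 & echo started")
```

Both calls capture the same output; only the first one has to wait for the background process to exit.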

dasJ requested a review from tfc as a code owner October 24, 2021 13:49

github-actions bot added the 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS label Oct 24, 2021

dasJ mentioned this pull request Oct 24, 2021

nixos/test-driver: Fix thread cleanup and execute #142560

Closed

12 tasks

dasJ force-pushed the fix/test-runner-execute branch from 212594b to 0af4309 Compare October 24, 2021 13:51

dasJ added the 6.topic: testing Tooling for automated testing of packages and modules label Oct 24, 2021

ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10 labels Oct 24, 2021

roberth reviewed Oct 24, 2021

pennae reviewed Oct 24, 2021

dasJ force-pushed the fix/test-runner-execute branch from 0af4309 to 71d9e16 Compare October 24, 2021 19:23

andir reviewed Oct 25, 2021

ofborg bot added the ofborg-internal-error Ofborg encountered an error label Oct 25, 2021

dasJ added the backport release-21.05 label Oct 26, 2021

cole-h removed the ofborg-internal-error Ofborg encountered an error label Oct 26, 2021

dasJ force-pushed the fix/test-runner-execute branch from b55613b to cbde380 Compare October 26, 2021 21:23

github-actions bot added the 8.has: documentation This PR adds or changes documentation label Oct 26, 2021

cole-h reviewed Oct 26, 2021

ofborg bot added the ofborg-internal-error Ofborg encountered an error label Oct 26, 2021

dasJ force-pushed the fix/test-runner-execute branch from cbde380 to e1c2abe Compare October 26, 2021 22:31

cole-h removed the ofborg-internal-error Ofborg encountered an error label Oct 27, 2021

ofborg bot added the ofborg-internal-error Ofborg encountered an error label Oct 27, 2021

dasJ force-pushed the fix/test-runner-execute branch from e1c2abe to fd8ecd6 Compare October 27, 2021 08:56

dasJ removed the ofborg-internal-error Ofborg encountered an error label Oct 27, 2021

tfc reviewed Oct 27, 2021

nixos/test-runner: Fix execute() flakiness

1640359

Instead of using the magic string, we now just base64-encode everything and check for a newline.

nixos/switchTest: Make less flakey

c2bdad7

dasJ force-pushed the fix/test-runner-execute branch from fd8ecd6 to c2bdad7 Compare October 28, 2021 09:51

dasJ removed the backport release-21.05 label Oct 29, 2021

roberth approved these changes Nov 2, 2021

tfc merged commit 2ba0732 into NixOS:master Nov 2, 2021

dasJ deleted the fix/test-runner-execute branch November 2, 2021 21:40

fgaz mentioned this pull request Nov 4, 2021

nixosTests.pt2-clone and others hang #144613

Closed

roberth mentioned this pull request Nov 6, 2021

nixosTest stdout blocking is hard to troubleshoot #144875

Open

dasJ added a commit to helsinki-systems/nixpkgs that referenced this pull request Nov 6, 2021

nixos/tests.plasma5: Fix after NixOS#142747

d64f7a7

dasJ added a commit that referenced this pull request Nov 6, 2021

Merge pull request #144924 from helsinki-systems/fix/plasma5-test

c0e4e23

nixos/tests.plasma5: Fix after #142747

chkno mentioned this pull request Jan 24, 2022

nixos/tests/installer: add bcachefs tests #156071

Merged

13 tasks

aszlig mentioned this pull request Mar 21, 2022

nixos/tests/avahi: Fix running background command #165146

Merged

13 tasks
