
pkg/machine/e2e: improve timeout handling #23162

Merged
merged 3 commits into containers:main from Luap99:machine-hang on Jul 5, 2024

Conversation

Luap99
Member

@Luap99 Luap99 commented Jul 2, 2024

There seems to be a very bad flake where podman machine init takes too
long. It doesn't seem to be deadlocked; it just has to copy the entire
image, which is super slow in CI for some reason.

As such, increase the timeout by 1m to 5m total. Hopefully this is good
enough for now. Second, in case we still hit a timeout, send SIGABRT, which
makes Go dump the stack trace so we know whether it is actually doing
something or hanging on a lock.
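
Roughly, that can be done in a Ginkgo/Gomega suite by passing a lazily evaluated func() string description to the assertion; the sketch below uses assumed names and is illustrative, not necessarily the PR's exact implementation:

package e2e_test

import (
	"fmt"
	"syscall"
	"time"

	. "github.com/onsi/gomega"
	. "github.com/onsi/gomega/gexec"
)

// waitWithStackDump waits for a command started via gexec to exit. The
// func() string description is only evaluated when the assertion fails,
// i.e. on timeout; sending SIGABRT there makes the Go runtime print all
// goroutine stacks into the captured output before the test fails.
func waitWithStackDump(session *Session, timeout time.Duration) {
	Eventually(session, timeout).Should(Exit(), func() string {
		session.Signal(syscall.SIGABRT)
		return fmt.Sprintf("timed out waiting for %v", session.Command.Args)
	})
}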

Does this PR introduce a user-facing change?

None

Contributor

openshift-ci bot commented Jul 2, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 2, 2024
@rhatdan
Member

rhatdan commented Jul 2, 2024

LGTM

@Luap99 Luap99 marked this pull request as draft July 2, 2024 12:09
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 2, 2024
@Luap99 Luap99 changed the title pkg/machine/e2e: debug test hangs pkg/machine/e2e: improve timeout handling Jul 2, 2024
@Luap99 Luap99 marked this pull request as ready for review July 2, 2024 13:01
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 2, 2024
@Luap99
Member Author

Luap99 commented Jul 2, 2024

@edsantiago @baude @ashley-cui PTAL

Let's see how this goes. I am not sure why the flake (#22551) is all of a sudden so much worse (maybe #23068?), but from my limited debugging it really looks like it is slow to copy the VM file on init and there is no hang/lock or anything. So let's hope the extra minute is enough.

@ashley-cui
Member

LGTM. Wonder why the other PR made it worse.

@Luap99
Member Author

Luap99 commented Jul 2, 2024

LGTM. Wonder why the other PR made it worse.

Yeah, I don't understand it either; maybe they are located on different disks/volumes, which makes the copy slower, or something like that?

@edsantiago
Member

Doesn't seem to help

@Luap99
Member Author

Luap99 commented Jul 2, 2024

Yeah, really strange... Maybe it's not slow IO after all. I have another idea that I can try.

@Luap99
Member Author

Luap99 commented Jul 2, 2024

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 2, 2024
@Luap99
Member Author

Luap99 commented Jul 2, 2024

// Interrupts the bar goroutine. It's important that
// bar.Abort(false) is called before p.Wait(), otherwise
// can hang.

Found this comment in the progress bar handling code, so I pushed a commit to drop the progress bar on decompress. No idea if this is actually the problem, but worth a try.
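
For reference, a minimal sketch of the ordering that code comment describes, using the vbauerster/mpb API it refers to (the function shape and names here are illustrative, not podman's actual decompression code):

package main

import (
	"io"

	"github.com/vbauerster/mpb/v8"
)

// copyWithBar copies src to dst while an mpb bar tracks progress. The key
// ordering from the quoted comment: make sure the bar is completed or
// aborted before calling p.Wait(), otherwise Wait() can block forever.
func copyWithBar(dst io.Writer, src io.Reader, total int64) error {
	p := mpb.New()
	bar := p.AddBar(total)
	pr := bar.ProxyReader(src) // reads through pr advance the bar
	defer pr.Close()
	_, err := io.Copy(dst, pr)
	if !bar.Completed() {
		bar.Abort(false) // false: keep the bar rendered instead of dropping it
	}
	p.Wait()
	return err
}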

@edsantiago
Member

Sigh. No luck.

@Luap99
Member Author

Luap99 commented Jul 2, 2024

Sigh. No luck.

Looks like it worked though? No Mac flake; it seems to have failed on Linux on start with an entirely different stack trace.

Of course I might just be lucky on this run, so I will repush and see if it passes again.

@Luap99
Member Author

Luap99 commented Jul 2, 2024

<+     > # # podman events --since=the dawn of time...ish
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: exit code is 0; expected 125
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         # # [teardown]
         #

That is a new one to me, https://api.cirrus-ci.com/v1/artifact/task/6224808554266624/html/sys-podman-fedora-39-root-host-boltdb.log.html

But looks like all machine tests passed on the first try so that is good.

@edsantiago
Member

That looks ... impossible? The --filter code is deterministic, isn't it?

@Luap99
Member Author

Luap99 commented Jul 2, 2024

That looks ... impossible? The --filter code is deterministic, isn't it?

Until I saw this I would have said yes; now, not so much...
Anyhow, feel free to file a new flake report for that. It shouldn't be too difficult to follow the code there.

For now I'll focus on the machine madness, and this looks good. I will repush one last time to be sure and even lower the timeout a bit.

@mtrmac
Collaborator

mtrmac commented Jul 3, 2024

If we ran out of space I would assume write would return an error and not block forever?

(At the end-user level, macOS uses quite a bit of disk space (frequently gigabytes) of “purgeable” storage. I don’t quite know how that works — running low on space tends to trigger attempts to delete some of that “purgeable” storage, but IIRC it is also quite possible to hit an “out of space” error when writing too fast. I really have no idea how much of that is happening at the UX “document” abstraction vs. the low-level filesystem syscall API, i.e. whether a write(2) can block while purging is happening. But all of this should only be applicable if we are close enough to running out.)


OK, so macOS uses dynamic volume sizes?!

My reading of https://developer.apple.com/support/downloads/Apple-File-System-Reference.pdf is that there is a single “allocation” layer per the whole container, with each volume allocating storage for a file/directory as necessary. (But there does exist a feature to reserve space for a volume, or to impose a quota on a volume.)

@Luap99
Member Author

Luap99 commented Jul 3, 2024

(But there does exist a feature to reserve space for a volume, or to impose a quota on a volume.)

Yes, I found that too, see containers/automation#204.
I guess there's only one way to find out if this helps here.

@Luap99 Luap99 force-pushed the machine-hang branch 4 times, most recently from e90c3e6 to f352dc3 on July 4, 2024 16:56
@@ -123,7 +123,7 @@ func (sw *sparseWriter) Close() error {
 		return errors.New("file is already closed")
 	}
 	if sw.pendingZeroes != 0 {
-		if holeSize := sw.pendingZeroes - 1; holeSize >= zerosThreshold {
+		if holeSize := sw.pendingZeroes; holeSize >= zerosThreshold {
Collaborator

AFAICS this will corrupt the image: the code will do a seek(+6 GB) (which does not change the file contents) without writing a trailing non-hole byte; the VM image will be 6 GB shorter.
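
For context, this is the usual trailing-hole pattern in a sparse writer's Close path, as a minimal stdlib sketch (assumed names, not the actual sparse_writer.go code): a seek only moves the offset, so the final one-byte write is what extends the file to its full logical size.

package main

import (
	"io"
	"os"
)

// flushTrailingZeroes turns pending trailing zeroes into a hole: seek over
// all but the last zero byte (a seek alone changes neither size nor
// contents), then write a single zero byte so the file is extended to its
// full logical size while the skipped range stays unallocated.
func flushTrailingZeroes(f *os.File, pendingZeroes int64) error {
	if pendingZeroes == 0 {
		return nil
	}
	if _, err := f.Seek(pendingZeroes-1, io.SeekCurrent); err != nil {
		return err
	}
	_, err := f.Write([]byte{0})
	return err
}

With the "- 1" dropped, the seek covers all of the pending zeroes and nothing is left for the trailing write, so the file is never extended to its full length — which matches the "6 GB shorter" image described above.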

Member Author

The passing machine test seems to disagree: https://cirrus-ci.com/task/5791451773140992
I am not going to pretend I know how sparse files work in general; I am sure there is a good reason for the final write, but at least on macOS it appears to work like this.

Collaborator

I meant “corrupt” as in “will have a different length and digest”.

If this is a raw image, I wouldn’t be surprised if the VM using it would (try to?) extend the file by writing to the unallocated parts of the image; i.e. this moves the sparse-file extension / allocation from the decompression time to the VM time.

If this change actually preserves length / digest … I’m extremely confused. (Also, in that case, there should not be any speedup! Creating a 6GB-1 hole and issuing a 1-byte write should not be substantially slower than creating a 6GB hole and not issuing a 1-byte write.)

Member Author

ls -lh did report the full size and du did show the actual size on disk without holes, so I guess it worked fine, unless I tested the wrong binary.

Feel free to test things yourself; I am not saying any of this is correct, I am just trying things until we can get this insane flake fixed.

Collaborator

Testing this on a Mac with a completely default filesystem, with holeSize := sw.pendingZeroes, sparse_writer.go as-is otherwise, and

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// NewSparseWriter comes from sparse_writer.go, compiled into the same package.
func main() {
	srcBytes := [10_000_000]byte{}
	src := bytes.NewReader(srcBytes[:])
	dest, err := os.OpenFile("t", os.O_RDWR|os.O_CREATE, 0o600)
	if err != nil {
		panic(err)
	}
	sw := NewSparseWriter(dest)

	cnt, err := io.Copy(sw, src)
	if err != nil {
		panic(err)
	}
	fmt.Printf("copy count %d\n", cnt)
	err = sw.Close()
	if err != nil {
		panic(err)
	}
	err = dest.Close()
	if err != nil {
		panic(err)
	}
}

I see, per fs_usage:

open              F=3        (RWC________X)  t
fstat64           F=3
lseek             F=3    O=0x00989680         <SEEK_CUR>
write             F=3    B=0x0
close             F=3  

and the created file is 0 bytes long. So the trailing zeroes were not written.


With holeSize := sw.pendingZeroes - 1, this is seek(N-1)+1-byte write, as expected; but the output is

% ls -la t                   
-rw-------  1 …  …  10000000  … t
% du -h t                    
9,5M	t
% stat -f '%z %b %k' t  
10000000 19536 4096

i.e. it does not look sparse.


On further testing, a 10M file is not created sparse, but a ~1G one is:

% stat -f '%z %b %k' t
1000000000 8 4096

This is going to make experiments so confusing.

The precise trigger seems to be 16 MiB of a hole (16MiB +1 total size), at least in this test case.

Collaborator

Yeah, I have absolutely no idea what is going on and if you found a way to avoid the flakes, that’s definitely very valuable.

(I’d probably push for this “don’t write trailing zeroes” mode of SparseWriter (assuming that’s what it does) to be opt-in, because not writing the input data as given would be so surprising. But if we know that’s fine for VM images, sure, let’s do that.)

Member Author

If you follow the machine init code, there is actually a truncate on the raw disk image at the end, based on the actual requested VM disk size. So I guess it doesn't matter whether our sparse writer writes the remaining zeros, because of that. And the truncate doesn't cause the weird write behaviour we are seeing here. So maybe we should try to switch seek over to truncate for the remaining zeros?
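
A minimal sketch of that alternative, as a standalone helper with assumed parameters (not the actual sparseWriter code):

package main

import "os"

// truncateTrailingZeroes sets the final file length directly instead of
// doing a seek plus a trailing 1-byte write; the tail beyond the data that
// was actually written stays unallocated, like a hole.
func truncateTrailingZeroes(f *os.File, written, pendingZeroes int64) error {
	return f.Truncate(written + pendingZeroes)
}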

Collaborator

Sure, why not.

(I’d still like to, eventually, understand what is going on here — the kernel, ~by definition, doesn’t know that the 1-byte write is “the last write”, so I think whatever is happening could also happen in the middle of some other VM image. But that’s not as important as keeping CI sane.)

Member Author

Looks like with truncate it seems to hang even more, and I also noticed a hang within a regular write(), not at the end in Close(), so yes, you are right, it can happen in the middle...

Member

I’d still like to, eventually, understand what is going on here

There's an Apple developer page on writing sparse files. IIRC, there are some unique requirements. The one that struck me was a statement that seemed to say the file would be un-sparsed if it was re-opened. Perhaps I'm reading that wrongly.

@Luap99
Member Author

Luap99 commented Jul 4, 2024

Also FYI, I did some disk speed tests: a 10G (not sparse) file was written in about 12s at the beginning of the test, and 20G took 25s, so I really don't think IO speed is the problem.

@Luap99 Luap99 force-pushed the machine-hang branch 2 times, most recently from 9cd609e to 39cfa34 on July 5, 2024 11:03
@Luap99
Member Author

Luap99 commented Jul 5, 2024

I bumped the timeout up to 10 mins and it passed on the first try; I will rerun several times to check.

podman inspect stop
/Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/inspect_test.go:13](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/inspect_test.go#L13)
  inspect two machines
  /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/inspect_test.go:23](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/inspect_test.go#L23)
  → Enter [BeforeEach] TOP-LEVEL - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/machine_test.go:160](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/machine_test.go#L160) @ 07/05/24 11:18:54.419
  ← Exit  [BeforeEach] TOP-LEVEL - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/machine_test.go:160](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/machine_test.go#L160) @ 07/05/24 11:18:54.42 (0s)
  → Enter [It] inspect two machines - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/inspect_test.go:23](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/inspect_test.go#L23) @ 07/05/24 11:18:54.42
  /Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine-daily.aarch64.applehv.raw foo1
  Machine init complete
  To start your machine run:

  	podman machine start foo1

  /Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine-daily.aarch64.applehv.raw foo2
  Machine init complete
  To start your machine run:

  	podman machine start foo2

  /Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine inspect --format {{.Name}} foo1
  foo1
  ← Exit  [It] inspect two machines - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/inspect_test.go:23](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/inspect_test.go#L23) @ 07/05/24 11:26:08.767 (7m14.196s)
  > Enter [DeferCleanup (Each)] podman inspect stop - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/config_init_test.go:83](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/config_init_test.go#L83) @ 07/05/24 11:26:08.767
  /Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine rm --force foo2
  < Exit [DeferCleanup (Each)] podman inspect stop - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/config_init_test.go:83](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/config_init_test.go#L83) @ 07/05/24 11:26:08.853 (86ms)
  > Enter [DeferCleanup (Each)] podman inspect stop - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/config_init_test.go:83](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/config_init_test.go#L83) @ 07/05/24 11:26:08.853
  /Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine rm --force foo1
  < Exit [DeferCleanup (Each)] podman inspect stop - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/config_init_test.go:83](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/config_init_test.go#L83) @ 07/05/24 11:26:08.935 (82ms)
  > Enter [DeferCleanup (Each)] TOP-LEVEL - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/machine_test.go:162](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/machine_test.go#L162) @ 07/05/24 11:26:08.935
  < Exit [DeferCleanup (Each)] TOP-LEVEL - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/machine_test.go:162](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/machine_test.go#L162) @ 07/05/24 11:26:08.936 (1ms)
• [434.366 seconds]

It took over 7 mins for this one test to run; all others are much faster.

Luap99 added 3 commits July 5, 2024 15:28
In case of timeouts, actually log the command again and make sure to send
SIGABRT to the process, as Go will create a useful stack trace where we
can see where things are hanging. It also kills the process, unlike the
default Eventually().Should(Exit()) call, which leaves the process around.

The output will be captured by default in the log so we just see the
stack trace there.

And while at it, bump the timeout up to 10 mins; we are hitting hard
flakes in CI where machine init takes longer than 5 mins for unknown
reasons, but this seems to be good enough.

Signed-off-by: Paul Holzinger <[email protected]>
To debug the slow machine init command, let's see the disk image size.

Signed-off-by: Paul Holzinger <[email protected]>
Makes it easier to see which tests are slow.

Signed-off-by: Paul Holzinger <[email protected]>
@Luap99 Luap99 marked this pull request as ready for review July 5, 2024 14:53
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 5, 2024
@Luap99
Member Author

Luap99 commented Jul 5, 2024

I think this is good to review; I had several runs without flakes now. There are still some debug commands in here to see the file sizes, but they should not hurt us and make finding patterns easier, so I'd like to keep them for now.

So the fix for now is to bump the timeout to 10 minutes, as somehow it magically takes around 6-7m sometimes???

So far the pattern I can see is: when it is super slow, the file does not seem sparse, and it is always one of the early tests (often the first one). Some tests write sparse files with different sizes (I would have assumed it would be deterministic, but looks like it is not?!). Even tests that do not write the file sparse still write fast enough; it is just sometimes that it starts to hang.

  [du -ah /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv]
0	/private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv/cache
8.1G	/private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv/3f6176645637-arm64.raw
8.1G	/private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv
  [ls -lh /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv]
total 8.1G
-rw-r--r-- 1 MacM1-3-worker staff 11G Jul  5 14:24 3f6176645637-arm64.raw
drwxr-xr-x 2 MacM1-3-worker staff  64 Jul  5 14:17 cache
  [stat /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv/3f6176645637-arm64.raw]
  File: /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv/3f6176645637-arm64.raw
  Size: 11811160064	Blocks: 16914816   IO Block: 4096   regular file
Device: 1,29	Inode: 28584       Links: 1
Access: (0644/-rw-r--r--)  Uid: (  502/MacM1-3-worker)   Gid: (   20/   staff)
Access: 2024-07-05 14:17:08.611603530 +0000
Modify: 2024-07-05 14:24:07.695944514 +0000
Change: 2024-07-05 14:24:07.695944514 +0000
 Birth: 2024-07-05 14:17:08.611603530 +0000
...
 [413.586 seconds]

@mheon
Member

mheon commented Jul 5, 2024

Neat trick with the func() string using SIGABRT. LGTM

@ashley-cui
Member

LGTM, nice work

@cevich
Member

cevich commented Jul 5, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 5, 2024
@cevich
Member

cevich commented Jul 5, 2024

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 5, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 74cfb3c into containers:main Jul 5, 2024
49 checks passed
@Luap99 Luap99 deleted the machine-hang branch July 5, 2024 19:39
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Oct 4, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Oct 4, 2024