
pkg/machine/e2e: improve timeout handling #23162

Merged
merged 3 commits into containers:main from Luap99:machine-hang on Jul 5, 2024

Conversation

Luap99
Member

@Luap99 Luap99 commented Jul 2, 2024

There seems to be a very bad flake where podman machine init takes too
long. It doesn't seem to be deadlocked; it just has to copy the entire
image, which is super slow in CI for some reason.

As such, increase the timeout by 1m to 5m total. Hopefully this is good
enough for now. Second, in case we still hit a timeout, send SIGABRT, which
makes Go dump the stack trace so we know whether it is actually doing
something or hanging on a lock.
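
Roughly, that can be done in a Ginkgo/Gomega suite by passing a lazily evaluated func() string description to the assertion; the sketch below uses assumed names and is illustrative, not necessarily the PR's exact implementation:

package e2e_test

import (
	"fmt"
	"syscall"
	"time"

	. "github.com/onsi/gomega"
	. "github.com/onsi/gomega/gexec"
)

// waitWithStackDump waits for a command started via gexec to exit. The
// func() string description is only evaluated when the assertion fails,
// i.e. on timeout; sending SIGABRT there makes the Go runtime print all
// goroutine stacks into the captured output before the test fails.
func waitWithStackDump(session *Session, timeout time.Duration) {
	Eventually(session, timeout).Should(Exit(), func() string {
		session.Signal(syscall.SIGABRT)
		return fmt.Sprintf("timed out waiting for %v", session.Command.Args)
	})
}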

Does this PR introduce a user-facing change?

None

Contributor

openshift-ci bot commented Jul 2, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 2, 2024
@rhatdan
Member

rhatdan commented Jul 2, 2024

LGTM

@Luap99 Luap99 marked this pull request as draft July 2, 2024 12:09
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 2, 2024
@Luap99 Luap99 changed the title pkg/machine/e2e: debug test hangs pkg/machine/e2e: improve timeout handling Jul 2, 2024
@Luap99 Luap99 marked this pull request as ready for review July 2, 2024 13:01
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 2, 2024
@Luap99
Member Author

Luap99 commented Jul 2, 2024

@edsantiago @baude @ashley-cui PTAL

Let's see how this goes. I am not sure why the flake (#22551) is all of a sudden so much worse (maybe #23068?), but from my limited debugging it really looks like it is slow to copy the VM file on init and there is no hang/lock or anything. So let's hope the extra minute is enough.

@ashley-cui
Member

LGTM. Wonder why the other PR made it worse.

@Luap99
Member Author

Luap99 commented Jul 2, 2024

LGTM. Wonder why the other PR made it worse.

Yeah, I don't understand it either; maybe they are located on different disks/volumes, which makes the copy slower, or something like that?

@edsantiago
Member

Doesn't seem to help

@Luap99
Member Author

Luap99 commented Jul 2, 2024

Yeah, really strange... Maybe it's not slow IO after all. I have another idea that I can try.

@Luap99
Member Author

Luap99 commented Jul 2, 2024

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 2, 2024
@Luap99
Member Author

Luap99 commented Jul 2, 2024

// Interrupts the bar goroutine. It's important that
// bar.Abort(false) is called before p.Wait(), otherwise
// can hang.

Found this comment in the progress bar handling code, so I pushed a commit to drop the progress bar on decompress. No idea if this is actually the problem, but worth a try.
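
For reference, a minimal sketch of the ordering that code comment describes, using the vbauerster/mpb API it refers to (the function shape and names here are illustrative, not podman's actual decompression code):

package main

import (
	"io"

	"github.com/vbauerster/mpb/v8"
)

// copyWithBar copies src to dst while an mpb bar tracks progress. The key
// ordering from the quoted comment: make sure the bar is completed or
// aborted before calling p.Wait(), otherwise Wait() can block forever.
func copyWithBar(dst io.Writer, src io.Reader, total int64) error {
	p := mpb.New()
	bar := p.AddBar(total)
	pr := bar.ProxyReader(src) // reads through pr advance the bar
	defer pr.Close()
	_, err := io.Copy(dst, pr)
	if !bar.Completed() {
		bar.Abort(false) // false: keep the bar rendered instead of dropping it
	}
	p.Wait()
	return err
}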

@edsantiago
Member

Sigh. No luck.

@Luap99
Member Author

Luap99 commented Jul 2, 2024

Sigh. No luck.

Looks like it worked though? No Mac flake; it seems to have failed on Linux on start with an entirely different stack trace.

Of course I might just be lucky on this run, so I will repush and see if it passes again.

@Luap99
Member Author

Luap99 commented Jul 2, 2024

<+     > # # podman events --since=the dawn of time...ish
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: exit code is 0; expected 125
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         # # [teardown]
         #

That is a new one to me, https://api.cirrus-ci.com/v1/artifact/task/6224808554266624/html/sys-podman-fedora-39-root-host-boltdb.log.html

But looks like all machine tests passed on the first try so that is good.

@edsantiago
Member

That looks ... impossible? The --filter code is deterministic, isn't it?

@Luap99
Member Author

Luap99 commented Jul 2, 2024

That looks ... impossible? The --filter code is deterministic, isn't it?

Until I saw this I would have said yes; now, not so much...
Anyhow, feel free to file a new flake report for that. It shouldn't be too difficult to follow the code there.

For now I'll focus on the machine madness, and this looks good. I will repush one last time to be sure and even lower the timeout a bit.

@mtrmac
Collaborator

mtrmac commented Jul 3, 2024

If we ran out of space I would assume write would return an error and not block forever?

(At the end-user level, macOS uses quite a bit of disk space (frequently gigabytes) of “purgeable” storage. I don’t quite know how that works — running low on space tends to trigger attempts to delete some of that “purgeable” storage, but IIRC it is also quite possible to hit an “out of space” error when writing too fast. I really have no idea how much of that is happening at the UX “document” abstraction vs. the low-level filesystem syscall API, i.e. whether a write(2) can block while purging is happening. But all of this should only be applicable if we are close enough to running out.)


OK, so macOS uses dynamic volume sizes?!

My reading of https://developer.apple.com/support/downloads/Apple-File-System-Reference.pdf is that there is a single “allocation” layer per the whole container, with each volume allocating storage for a file/directory as necessary. (But there does exist a feature to reserve space for a volume, or to impose a quota on a volume.)

@Luap99
Member Author

Luap99 commented Jul 3, 2024

(But there does exist a feature to reserve space for a volume, or to impose a quota on a volume.)

Yes, I found that too, see containers/automation#204.
I guess there's only one way to find out if this helps here.

@Luap99 Luap99 force-pushed the machine-hang branch 4 times, most recently from e90c3e6 to f352dc3 on July 4, 2024 16:56
@@ -123,7 +123,7 @@ func (sw *sparseWriter) Close() error {
 		return errors.New("file is already closed")
 	}
 	if sw.pendingZeroes != 0 {
-		if holeSize := sw.pendingZeroes - 1; holeSize >= zerosThreshold {
+		if holeSize := sw.pendingZeroes; holeSize >= zerosThreshold {
Collaborator

AFAICS this will corrupt the image: the code will do a seek(+6 GB) (which does not change the file contents) without writing a trailing non-hole byte; the VM image will be 6 GB shorter.
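
For context, this is the usual trailing-hole pattern in a sparse writer's Close path, as a minimal stdlib sketch (assumed names, not the actual sparse_writer.go code): a seek only moves the offset, so the final one-byte write is what extends the file to its full logical size.

package main

import (
	"io"
	"os"
)

// flushTrailingZeroes turns pending trailing zeroes into a hole: seek over
// all but the last zero byte (a seek alone changes neither size nor
// contents), then write a single zero byte so the file is extended to its
// full logical size while the skipped range stays unallocated.
func flushTrailingZeroes(f *os.File, pendingZeroes int64) error {
	if pendingZeroes == 0 {
		return nil
	}
	if _, err := f.Seek(pendingZeroes-1, io.SeekCurrent); err != nil {
		return err
	}
	_, err := f.Write([]byte{0})
	return err
}

With the "- 1" dropped, the seek covers all of the pending zeroes and nothing is left for the trailing write, so the file is never extended to its full length — which matches the "6 GB shorter" image described above.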

Member Author

The passing machine test seems to disagree: https://cirrus-ci.com/task/5791451773140992
I am not going to pretend I know how sparse files work in general; I am sure there is a good reason for the final write, but at least on macOS it appears to work like this.

Collaborator

I meant “corrupt” as in “will have a different length and digest”.

If this is a raw image, I wouldn’t be surprised if the VM using it would (try to?) extend the file by writing to the unallocated parts of the image; i.e. this moves the sparse-file extension / allocation from the decompression time to the VM time.

If this change actually preserves length / digest … I’m extremely confused. (Also, in that case, there should not be any speedup! Creating a 6GB-1 hole and issuing a 1-byte write should not be substantially slower than creating a 6GB hole and not issuing a 1-byte write.)

Member Author

ls -lh did report the full size and du did show the actual size on disk without holes, so I guess it worked fine, unless I tested the wrong binary.

Feel free to test things yourself; I am not saying any of this is correct, I am just trying things until we can get this insane flake fixed.

Collaborator

Testing this on a Mac with a completely default filesystem, with holeSize := sw.pendingZeroes, sparse_writer.go as-is otherwise, and

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// NewSparseWriter comes from sparse_writer.go, compiled into the same package.
func main() {
	srcBytes := [10_000_000]byte{}
	src := bytes.NewReader(srcBytes[:])
	dest, err := os.OpenFile("t", os.O_RDWR|os.O_CREATE, 0o600)
	if err != nil {
		panic(err)
	}
	sw := NewSparseWriter(dest)

	cnt, err := io.Copy(sw, src)
	if err != nil {
		panic(err)
	}
	fmt.Printf("copy count %d\n", cnt)
	err = sw.Close()
	if err != nil {
		panic(err)
	}
	err = dest.Close()
	if err != nil {
		panic(err)
	}
}

I see, per fs_usage:

open              F=3        (RWC________X)  t
fstat64           F=3
lseek             F=3    O=0x00989680         <SEEK_CUR>
write             F=3    B=0x0
close             F=3  

and the created file is 0 bytes long. So the trailing zeroes were not written.


With holeSize := sw.pendingZeroes - 1, this is seek(N-1)+1-byte write, as expected; but the output is

% ls -la t                   
-rw-------  1 …  …  10000000  … t
% du -h t                    
9,5M	t
% stat -f '%z %b %k' t  
10000000 19536 4096

i.e. it does not look sparse.


On further testing, a 10M file is not created sparse, but a ~1G one is:

% stat -f '%z %b %k' t
1000000000 8 4096

This is going to make experiments so confusing.

The precise trigger seems to be 16 MiB of a hole (16MiB +1 total size), at least in this test case.

Collaborator

Yeah, I have absolutely no idea what is going on and if you found a way to avoid the flakes, that’s definitely very valuable.

(I’d probably push for this “don’t write trailing zeroes” mode of SparseWriter (assuming that’s what it does) to be opt-in, because not writing the input data as given would be so surprising. But if we know that’s fine for VM images, sure, let’s do that.)

Member Author

If you follow the machine init code, there is actually a truncate on the raw disk image at the end, based on the actual requested VM disk size. So I guess it doesn't matter whether our sparse writer writes the remaining zeros, because of that. And the truncate doesn't cause the weird write behaviour we are seeing here. So maybe we should try to switch seek over to truncate for the remaining zeros?
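
A minimal sketch of that alternative, as a standalone helper with assumed parameters (not the actual sparseWriter code):

package main

import "os"

// truncateTrailingZeroes sets the final file length directly instead of
// doing a seek plus a trailing 1-byte write; the tail beyond the data that
// was actually written stays unallocated, like a hole.
func truncateTrailingZeroes(f *os.File, written, pendingZeroes int64) error {
	return f.Truncate(written + pendingZeroes)
}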

Collaborator

Sure, why not.

(I’d still like to, eventually, understand what is going on here — the kernel, ~by definition, doesn’t know that the 1-byte write is “the last write”, so I think whatever is happening could also happen in the middle of some other VM image. But that’s not as important as keeping CI sane.)

Member Author

Looks like with truncate it seems to hang even more, and I also noticed a hang within a regular write(), not at the end in Close(), so yes, you are right, it can happen in the middle...

Member

I’d still like to, eventually, understand what is going on here

There's an Apple developer page on writing sparse files. IIRC, there are some unique requirements. The one that struck me was a statement that seemed to say the file would be un-sparsed if it was re-opened. Perhaps I'm reading that wrongly.

@Luap99
Member Author

Luap99 commented Jul 4, 2024

Also FYI, I did some disk speed tests: a 10G (not sparse) file was written in about 12s at the beginning of the test, and 20G took 25s, so I really don't think IO speed is the problem.

@Luap99 Luap99 force-pushed the machine-hang branch 2 times, most recently from 9cd609e to 39cfa34 on July 5, 2024 11:03
@Luap99
Member Author

Luap99 commented Jul 5, 2024

I bumped the timeout up to 10 mins and it passed on the first try; I will rerun several times to check.

podman inspect stop
/Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/inspect_test.go:13](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/inspect_test.go#L13)
  inspect two machines
  /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/inspect_test.go:23](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/inspect_test.go#L23)
  → Enter [BeforeEach] TOP-LEVEL - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/machine_test.go:160](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/machine_test.go#L160) @ 07/05/24 11:18:54.419
  ← Exit  [BeforeEach] TOP-LEVEL - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/machine_test.go:160](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/machine_test.go#L160) @ 07/05/24 11:18:54.42 (0s)
  → Enter [It] inspect two machines - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/inspect_test.go:23](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/inspect_test.go#L23) @ 07/05/24 11:18:54.42
  /Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine-daily.aarch64.applehv.raw foo1
  Machine init complete
  To start your machine run:

  	podman machine start foo1

  /Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine-daily.aarch64.applehv.raw foo2
  Machine init complete
  To start your machine run:

  	podman machine start foo2

  /Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine inspect --format {{.Name}} foo1
  foo1
  ← Exit  [It] inspect two machines - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/inspect_test.go:23](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/inspect_test.go#L23) @ 07/05/24 11:26:08.767 (7m14.196s)
  > Enter [DeferCleanup (Each)] podman inspect stop - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/config_init_test.go:83](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/config_init_test.go#L83) @ 07/05/24 11:26:08.767
  /Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine rm --force foo2
  < Exit [DeferCleanup (Each)] podman inspect stop - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/config_init_test.go:83](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/config_init_test.go#L83) @ 07/05/24 11:26:08.853 (86ms)
  > Enter [DeferCleanup (Each)] podman inspect stop - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/config_init_test.go:83](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/config_init_test.go#L83) @ 07/05/24 11:26:08.853
  /Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine rm --force foo1
  < Exit [DeferCleanup (Each)] podman inspect stop - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/config_init_test.go:83](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/config_init_test.go#L83) @ 07/05/24 11:26:08.935 (82ms)
  > Enter [DeferCleanup (Each)] TOP-LEVEL - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/machine_test.go:162](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/machine_test.go#L162) @ 07/05/24 11:26:08.935
  < Exit [DeferCleanup (Each)] TOP-LEVEL - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/machine_test.go:162](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/machine_test.go#L162) @ 07/05/24 11:26:08.936 (1ms)
• [434.366 seconds]

It took over 7 mins for this one test to run; all others are much faster.

Luap99 added 3 commits July 5, 2024 15:28
In case of timeouts, actually log the command again and make sure to send
SIGABRT to the process, as Go will create a useful stack trace where we
can see where things are hanging. It also kills the process, unlike the
default Eventually().Should(Exit()) call, which leaves the process around.

The output will be captured by default in the log so we just see the
stack trace there.

And while at it, bump the timeout up to 10 mins; we are hitting hard
flakes in CI where machine init takes longer than 5 mins for unknown
reasons, but this seems to be good enough.

Signed-off-by: Paul Holzinger <[email protected]>
To debug the slow machine init command, let's see the disk image size.

Signed-off-by: Paul Holzinger <[email protected]>
Makes it easier to see which tests are slow.

Signed-off-by: Paul Holzinger <[email protected]>
@Luap99 Luap99 marked this pull request as ready for review July 5, 2024 14:53
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 5, 2024
@Luap99
Member Author

Luap99 commented Jul 5, 2024

I think this is good to review; I had several runs without flakes now. There are still some debug commands in here to see the file sizes, but they should not hurt us and make finding patterns easier, so I'd like to keep them for now.

So the fix for now is to bump the timeout to 10 minutes, as somehow it magically takes around 6-7m sometimes???

So far the pattern I can see is: when it is super slow, the file does not seem sparse, and it is always one of the early tests (often the first one). Some tests write sparse files with different sizes (I would have assumed it would be deterministic, but looks like it is not?!). Even tests that do not write the file sparse still write fast enough; it is just sometimes that it starts to hang.

  [du -ah /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv]
0	/private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv/cache
8.1G	/private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv/3f6176645637-arm64.raw
8.1G	/private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv
  [ls -lh /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv]
total 8.1G
-rw-r--r-- 1 MacM1-3-worker staff 11G Jul  5 14:24 3f6176645637-arm64.raw
drwxr-xr-x 2 MacM1-3-worker staff  64 Jul  5 14:17 cache
  [stat /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv/3f6176645637-arm64.raw]
  File: /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv/3f6176645637-arm64.raw
  Size: 11811160064	Blocks: 16914816   IO Block: 4096   regular file
Device: 1,29	Inode: 28584       Links: 1
Access: (0644/-rw-r--r--)  Uid: (  502/MacM1-3-worker)   Gid: (   20/   staff)
Access: 2024-07-05 14:17:08.611603530 +0000
Modify: 2024-07-05 14:24:07.695944514 +0000
Change: 2024-07-05 14:24:07.695944514 +0000
 Birth: 2024-07-05 14:17:08.611603530 +0000
...
 [413.586 seconds]

@mheon
Member

mheon commented Jul 5, 2024

Neat trick with the func() string using SIGABRT. LGTM

@ashley-cui
Member

LGTM, nice work

@cevich
Member

cevich commented Jul 5, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 5, 2024
@cevich
Member

cevich commented Jul 5, 2024

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 5, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 74cfb3c into containers:main Jul 5, 2024
49 checks passed
@Luap99 Luap99 deleted the machine-hang branch July 5, 2024 19:39
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Oct 4, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Oct 4, 2024