machine config: make write atomic #21857

vrothberg · 2024-02-28T08:48:14Z

As indicated in #21849, loading the machine config can flake/fail with an EOF JSON error indicating an incomplete file. Address the issue by atomically writing the config. This way, it is not possible to load an incomplete or partially written file. The lock can be acquired later on to sync state.

[NO NEW TESTS NEEDED] as it's a hard-to-hit race.

Fixes: #21849

Does this PR introduce a user-facing change?

Fix a race condition in podman machine to avoid loading an incomplete config file.

As indicated in containers#21849, loading the machine config can flake/fail with an EOF JSON error indicating an incomplete file. Address the issue by atomically writing the config. This way, it is not possible to load an incomplete or partially written file. The lock can be acquired later on to sync state. [NO NEW TESTS NEEDED] as it's a hard-to-hit race. Fixes: containers#21849 Signed-off-by: Valentin Rothberg <[email protected]>

packit-as-a-service · 2024-02-28T09:13:38Z

Cockpit tests failed for commit f8abd7f. @martinpitt, @jelly, @mvollmer please check.

giuseppe

LGTM

Luap99

LGTM
This patch is correct and useful to prevent writing corrupted files in case a process gets killed mid write.
However the reason we see this flake is that there are fundamental locking issues in the code that still need to get fixed.

openshift-ci · 2024-02-28T09:45:38Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, Luap99, vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Luap99,giuseppe,vrothberg]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

vrothberg · 2024-02-28T09:47:27Z

However the reason we see this flake is that there are fundamental locking issues in the code that still need to get fixed.

Can you elaborate further? What I noticed is that loading is done without having the lock. Whether that's an issue or not depends on the consistency model which I am not totally sure of.

Luap99 · 2024-02-28T10:13:35Z

However the reason we see this flake is that there are fundamental locking issues in the code that still need to get fixed.

Can you elaborate further? What I noticed is that loading is done without having the lock. Whether that's an issue or not depends on the consistency model which I am not totally sure of.

Reading without the lock held is always broken. How can we know the data is still accurate if we do not hold the lock?
The logic currently reads the file (unlocked) then at some point later takes the lock then acts on data that was read before taking the lock, how do we know the VM is still in the state we think it is? When we take a lock we have to re-read the config file otherwise we may try to do the wrong thing. If someone runs start in parallel we may end up trying to start the VM twice which would be very bad.

vrothberg · 2024-02-28T10:20:44Z

Reading without the lock held is always broken.

That depends on the consistency model. As long as writes are atomic, it is valid to read without holding a lock as long as the data is not considered to be accurate.

How can we know the data is still accurate if we do not hold the lock?
The logic currently reads the file (unlocked) then at some point later takes the lock then acts on data that was read before taking the lock, how do we know the VM is still in the state we think it is?

For querying the state, QEMU does a Refresh() which will re-read the file while holding a lock.

Luap99 · 2024-02-28T10:33:19Z

Reading without the lock held is always broken.

That depends on the consistency model. As long as writes are atomic, it is valid to read without holding a lock as long as the data is not considered to be accurate.

There is no consistency model here, yes you can read the file sure. But the data you have must be consider incorrect.

How can we know the data is still accurate if we do not hold the lock?
The logic currently reads the file (unlocked) then at some point later takes the lock then acts on data that was read before taking the lock, how do we know the VM is still in the state we think it is?

For querying the state, QEMU does a Refresh() which will re-read the file while holding a lock.

Only QEMU does that and Refresh is unlocked (unless a caller locked it) which is the reason we the the flake in the first place because the ls quires the State() which is also unlocked. For something like podman machine ls that is fine I agree and this patch will fix that.

However podman machine start is broken, it can easily start two VMS if you call it at the same time.

Also the only provider using Refresh() at all is qemu which means the other ones are even worse.

Anyhow no point in arguing this here, #21854 should fix most of the issues I think.

vrothberg · 2024-02-28T10:35:17Z

Good to merge?

Luap99

/lgtm

openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note labels Feb 28, 2024

vrothberg marked this pull request as ready for review February 28, 2024 08:48

openshift-ci bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Feb 28, 2024

github-actions bot added the machine label Feb 28, 2024

giuseppe approved these changes Feb 28, 2024

View reviewed changes

Luap99 approved these changes Feb 28, 2024

View reviewed changes

Luap99 reviewed Feb 28, 2024

View reviewed changes

openshift-ci bot assigned Luap99 Feb 28, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 28, 2024

openshift-merge-bot bot merged commit 2d4ef6f into containers:main Feb 28, 2024
92 of 93 checks passed

Luap99 mentioned this pull request Feb 28, 2024

Move locks to shim layer #21854

Merged

stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label May 29, 2024

stale-locking-app bot locked as resolved and limited conversation to collaborators May 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

machine config: make write atomic #21857

machine config: make write atomic #21857

vrothberg commented Feb 28, 2024

packit-as-a-service bot commented Feb 28, 2024

giuseppe left a comment

Luap99 left a comment

openshift-ci bot commented Feb 28, 2024

vrothberg commented Feb 28, 2024

Luap99 commented Feb 28, 2024

vrothberg commented Feb 28, 2024

Luap99 commented Feb 28, 2024

vrothberg commented Feb 28, 2024

Luap99 left a comment

machine config: make write atomic #21857

machine config: make write atomic #21857

Conversation

vrothberg commented Feb 28, 2024

Does this PR introduce a user-facing change?

packit-as-a-service bot commented Feb 28, 2024

giuseppe left a comment

Choose a reason for hiding this comment

Luap99 left a comment

Choose a reason for hiding this comment

openshift-ci bot commented Feb 28, 2024

vrothberg commented Feb 28, 2024

Luap99 commented Feb 28, 2024

vrothberg commented Feb 28, 2024

Luap99 commented Feb 28, 2024

vrothberg commented Feb 28, 2024

Luap99 left a comment

Choose a reason for hiding this comment