Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

machine config: make write atomic #21857

Merged
merged 1 commit into from
Feb 28, 2024

Conversation

vrothberg
Copy link
Member

As indicated in #21849, loading the machine config can flake/fail with an EOF JSON error indicating an incomplete file. Address the issue by atomically writing the config. This way, it is not possible to load an incomplete or partially written file. The lock can be acquired later on to sync state.

[NO NEW TESTS NEEDED] as it's a hard-to-hit race.

Fixes: #21849

Does this PR introduce a user-facing change?

Fix a race condition in podman machine to avoid loading an incomplete config file.

As indicated in containers#21849, loading the machine config can flake/fail with
an EOF JSON error indicating an incomplete file.  Address the issue by
atomically writing the config.  This way, it is not possible to load an
incomplete or partially written file.  The lock can be acquired later on
to sync state.

[NO NEW TESTS NEEDED] as it's a hard-to-hit race.

Fixes: containers#21849
Signed-off-by: Valentin Rothberg <[email protected]>
@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note labels Feb 28, 2024
@vrothberg vrothberg marked this pull request as ready for review February 28, 2024 08:48
@openshift-ci openshift-ci bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Feb 28, 2024
Copy link

Cockpit tests failed for commit f8abd7f. @martinpitt, @jelly, @mvollmer please check.

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
This patch is correct and useful to prevent writing corrupted files in case a process gets killed mid write.
However the reason we see this flake is that there are fundamental locking issues in the code that still need to get fixed.

Copy link
Contributor

openshift-ci bot commented Feb 28, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, Luap99, vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Luap99,giuseppe,vrothberg]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vrothberg
Copy link
Member Author

However the reason we see this flake is that there are fundamental locking issues in the code that still need to get fixed.

Can you elaborate further? What I noticed is that loading is done without having the lock. Whether that's an issue or not depends on the consistency model which I am not totally sure of.

@Luap99
Copy link
Member

Luap99 commented Feb 28, 2024

However the reason we see this flake is that there are fundamental locking issues in the code that still need to get fixed.

Can you elaborate further? What I noticed is that loading is done without having the lock. Whether that's an issue or not depends on the consistency model which I am not totally sure of.

Reading without the lock held is always broken. How can we know the data is still accurate if we do not hold the lock?
The logic currently reads the file (unlocked) then at some point later takes the lock then acts on data that was read before taking the lock, how do we know the VM is still in the state we think it is? When we take a lock we have to re-read the config file otherwise we may try to do the wrong thing. If someone runs start in parallel we may end up trying to start the VM twice which would be very bad.

@vrothberg
Copy link
Member Author

Reading without the lock held is always broken.

That depends on the consistency model. As long as writes are atomic, it is valid to read without holding a lock as long as the data is not considered to be accurate.

How can we know the data is still accurate if we do not hold the lock?
The logic currently reads the file (unlocked) then at some point later takes the lock then acts on data that was read before taking the lock, how do we know the VM is still in the state we think it is?

For querying the state, QEMU does a Refresh() which will re-read the file while holding a lock.

@Luap99
Copy link
Member

Luap99 commented Feb 28, 2024

Reading without the lock held is always broken.

That depends on the consistency model. As long as writes are atomic, it is valid to read without holding a lock as long as the data is not considered to be accurate.

There is no consistency model here, yes you can read the file sure. But the data you have must be consider incorrect.

How can we know the data is still accurate if we do not hold the lock?
The logic currently reads the file (unlocked) then at some point later takes the lock then acts on data that was read before taking the lock, how do we know the VM is still in the state we think it is?

For querying the state, QEMU does a Refresh() which will re-read the file while holding a lock.

Only QEMU does that and Refresh is unlocked (unless a caller locked it) which is the reason we the the flake in the first place because the ls quires the State() which is also unlocked. For something like podman machine ls that is fine I agree and this patch will fix that.

However podman machine start is broken, it can easily start two VMS if you call it at the same time.

Also the only provider using Refresh() at all is qemu which means the other ones are even worse.

Anyhow no point in arguing this here, #21854 should fix most of the issues I think.

@vrothberg
Copy link
Member Author

Good to merge?

Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 28, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 2d4ef6f into containers:main Feb 28, 2024
92 of 93 checks passed
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label May 29, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators May 29, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. machine release-note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

machine-linux: unable to load machine config file: "unexpected end of JSON input"
3 participants