Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Platform-specific file locking for kubeconfig #107354

Closed

Conversation

rosspeoples
Copy link

@rosspeoples rosspeoples commented Jan 5, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it:

Previously when writing the kubeconfig file, it would write out a .lock file which was used to prevent concurrent writes to the file. This PR replaces that mechanism with a more robust platform-specific locking mechanism using LockFileEx for Windows or flock for Linux.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This is my first pass and I made the assumption that we didn't want to change the signature of WriteToFile which is why it's structured this way. Also, while previously we were pre-locking all the files at once, they are now locked on-demand. This maintains the pre-locking behavior.

Does this PR introduce a user-facing change?

This updates the etcd dependency to 3.5.1 and is the first step to transitioning to platform-native file locking for handling concurrent writes to kubeconfig files.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Related to conversation found here: #92513


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 5, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @deejross. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jan 5, 2022
@rosspeoples
Copy link
Author

/assign @soltysh

@k8s-ci-robot k8s-ci-robot added area/dependency Issues or PRs related to dependency changes sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 5, 2022
@k8s-ci-robot k8s-ci-robot requested review from jpbetz, sttts and a team January 5, 2022 22:52
@pandaamanda
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 6, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: deejross
To complete the pull request process, please assign liggitt after the PR has been reviewed.
You can assign the PR to them by writing /assign @liggitt in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/cloudprovider area/code-generation area/kubectl sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. labels Jan 6, 2022
@fedebongio
Copy link
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 6, 2022
@BenTheElder
Copy link
Member

How will this behave when another process uses the current implementation concurrently with this new approach?

More specifically: If I have binary A using client-go from kubernetes 1.23 and I have binary B using client-go from kubernetes 1.24 with this change, and they both attempt to lock the configs, one using the .lock file and the other with this approach, presumably with e.g. flock on linux both processes will "lock" the file with their respective approaches and race reading/writing the configs?

Repeat question for each platform's unique approach in the new implementation.

@rosspeoples
Copy link
Author

If I have binary A using client-go from kubernetes 1.23 and I have binary B using client-go from kubernetes 1.24 with this change, and they both attempt to lock the configs, one using the .lock file and the other with this approach, presumably with e.g. flock on linux both processes will "lock" the file with their respective approaches and race reading/writing the configs?

Another great point, I guess it would make sense to deprecate the .lock files, but continue using both approaches for a few releases. Priority would be to check for platform-specific locks first, then checking for .lock files in order to allow for a smoother migration in the future.

@BenTheElder
Copy link
Member

I think that sounds right, there probably needs to be some decision about how many releases. Not sure on that part ... 🤔

Also, we might need to check on client libraries that aren't client-go and this probably will need a release note to warn that other implementations need to be updated to match? (right now this PR has release-note=NONE).

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 7, 2022
@rosspeoples
Copy link
Author

@BenTheElder any thoughts on the latest changes?

@@ -26,6 +27,7 @@ import (
"strings"

"github.com/imdario/mergo"
"go.etcd.io/etcd/client/pkg/v3/fileutil"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not too happy about this ... our client-go depending on "go.etcd.io/etcd/client/pkg/v3/fileutil"

cc @liggitt

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's not really relevant... adding dependencies to k8s.io/{api,apimachinery,client-go} pushes them onto far more consumers

@liggitt
Copy link
Member

liggitt commented Jan 20, 2022

not too happy about this ... our client-go depending on "go.etcd.io/etcd/client/pkg/v3/fileutil"

agreed. I actually think the lock file behavior is generally problematic and we should figure out how to get rid of it (xref #32606, #67676)

@liggitt
Copy link
Member

liggitt commented Jan 20, 2022

were lock-free alternatives like writing to a tmp file and moving to replace considered?

@rosspeoples
Copy link
Author

The current mechanism uses .lock files, but they are not completely atomic.

@liggitt
Copy link
Member

liggitt commented Jan 20, 2022

The current mechanism uses .lock files, but they are not completely atomic.

right, and they have other problems like requiring the entire directory to be writeable instead of just the kubeconfig file

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 20, 2022
@k8s-ci-robot
Copy link
Contributor

@deejross: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@BenTheElder
Copy link
Member

BenTheElder commented Jan 20, 2022

sorry, I think these changes address my initial comments, but the other concerns mentioned seem valid.

I don't know if there's a good path forward on this. I do think tools are going to race merging changes without some sort of lock (e.g. in my case users concurrently creating KIND clusters all writing to the default $KUBECONFIG), but ...

were lock-free alternatives like writing to a tmp file and moving to replace considered?

Atomic file rename would solve that readers unaware of the lock may try to read during writing, but the locks are also important for the problem of "two processes are reading, mutating, then writing back config", e.g. gcloud container clusters get-credentials, kind create cluster, etc. -- in that case we need to keep the files locked until the read-merge-write cycle is complete, which is something client-go supports today, and atomic rename alone doesn't address.

The approach currently in this PR would still support that and address the "directory needs to be writeable" concern once we remove the old .lock code after all these tools are updated to match after some grace period of supporting both modes.

IMO writing by atomic rename would be great, but it would bring back the "directory must be writeable" problem * and we'd still have the read-write-merge issue if we didn't also use some other technique.

* You can't just write to a temp directory when doing atomic rename because we can only do a rename(2) on the same filesystem/disk, and we can't guarantee that TMPDIR is on the same disk as $KUBECONFIG. Usually you would write to a temp file adjacent to the target path (by appending some random suffix) and then rename, which will still require the directory to be writeable.

@rosspeoples
Copy link
Author

@BenTheElder what would be the next steps on this?

@BenTheElder
Copy link
Member

BenTheElder commented Feb 1, 2022

: thinking :

I don't know if this change needs anything so heavyweight as a KEP, but you might want to at least bring it up with SIG CLI in some way (since it will impact kubectl behavior) and/or api-machinery (client-go).

#107354 (comment) seems like the main blocker in review so far, client-go dependencies can be something of a headache, perhaps the subset we need can be re-implemented without additional dependencies (or forked into one of our existing dependencies?), but before going on that route I'd probably try to make sure OWNERS agree.

I only own this from a dependency review standpoint, I just also happen to have some familiarity with this code and problem ...
https://github.com/kubernetes/community/blob/4026287dc3a2d16762353b62ca2fe4b80682960a/contributors/devel/sig-architecture/godep.md#reviewing-and-approving-dependency-changes

@lavalamp
Copy link
Member

lavalamp commented Feb 8, 2022

/remove-sig api-machinery

I believe this is a question for SIG CLI, not api machinery. I've removed this from tomorrow's api machinery agenda; feel free to put it back if you want to talk about it anyway, but I don't think we can give you permission (and also I personally agree with the above comments on this PR).

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Feb 8, 2022
@kensipe
Copy link
Contributor

kensipe commented Feb 10, 2022

Looks like I'm late to the game here. For awareness, there has been a long standing request to add this feature in Golang... technically it already exists in Go but lives under internal and thus is not exposed. There is a design proposal which merged Mar 20, 2020 however it has not be "Accepted". As noted on the issue, Ian is pushing for "faster consideration", which I don't entirely know what that means but trust him.
Prior to fully reading the proposal, I created an example of what I thought should be exposed from the Go core: https://github.com/kensipe/flock Now that I more fully understand the proposal.. I'm planning on refactoring that project. The intention of my project is to match the proposal close enough to not need any changes (other than imports) to adjust to the addition of it to the golang core. (which we should expect to be months away). The solution that exists there today is a lower level lock, rlock, unlock functional. That is not the level the Go team wants to expose... the expectation is to expose several files from https://github.com/golang/go/tree/f229e70/src/cmd/go/internal/lockedfile. in particular `lockedfile.go and mutex.go

IMO if we are going to properly support this, we should align with coming expectations.

@rosspeoples
Copy link
Author

Closing this due to the general consensus being we don't to bring a new dependency into client-go, and would prefer to utilize locking functionality exposed from the Go standard library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/apiserver area/cloudprovider area/code-generation area/dependency Issues or PRs related to dependency changes area/kubectl cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.