
Find better solution for solving race conditions with MachineSet controller reconciling #245

Closed
k4leung4 opened this issue May 30, 2018 · 2 comments · Fixed by #316
Assignees: k4leung4
Labels: priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.)

Comments

@k4leung4 (Contributor)

The current MachineSet controller watches Machines and determines whether to reconcile the MachineSet that owns the machine.
This causes a race condition: creating machines during one reconciliation can trigger another reconciliation before the system has detected the changes from the first pass, so the second pass ends up creating or deleting additional machines.
A mutex lock was added initially to solve the issue, but back-to-back reconciliations happen quickly enough that the race condition still occurs.
A temporary fix that sleeps before releasing the lock is currently proposed to prevent this race condition.

The ReplicaSet controller solves this using expectations: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/controller_utils.go#L113
This may be something we will want to adopt as a better fix.
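
For context, here is a minimal sketch of the expectations pattern the ReplicaSet controller uses, written as a standalone type rather than the real `ControllerExpectations` from `controller_utils.go`. The names below (`expectations`, `reconcile`, `createMachine`, `deleteMachine`) are hypothetical and only illustrate the idea: record how many creates/deletes a reconcile pass issued, and refuse to act on that key again until the watch handlers have observed them.

```go
// Hypothetical sketch; not the actual cluster-api or controller_utils code.
package controllers

import "sync"

// expectations tracks creations/deletions a reconcile pass has issued but
// whose watch events have not yet been observed, keyed by MachineSet.
type expectations struct {
	mu      sync.Mutex
	pending map[string]int
}

func newExpectations() *expectations {
	return &expectations{pending: map[string]int{}}
}

// Expect records that n machine creations or deletions are in flight for key.
func (e *expectations) Expect(key string, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.pending[key] += n
}

// Observed is called from the Machine watch handler whenever a create or
// delete event for a machine owned by key arrives.
func (e *expectations) Observed(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.pending[key] > 0 {
		e.pending[key]--
	}
}

// Satisfied reports whether every recorded operation has been observed,
// i.e. the cache reflects the previous pass and a fresh diff is safe.
func (e *expectations) Satisfied(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.pending[key] == 0
}

// reconcile shows how the MachineSet controller could consult expectations
// before creating or deleting machines based on a possibly stale cache.
func reconcile(exp *expectations, key string, desired, observed int, createMachine, deleteMachine func()) {
	if !exp.Satisfied(key) {
		return // previous pass not yet visible in the cache; requeue instead
	}
	switch diff := desired - observed; {
	case diff > 0:
		exp.Expect(key, diff)
		for i := 0; i < diff; i++ {
			createMachine()
		}
	case diff < 0:
		exp.Expect(key, -diff)
		for i := 0; i < -diff; i++ {
			deleteMachine()
		}
	}
}
```

The linked `controller_utils.go` implementation additionally adds an expectation timeout and UID tracking for deletions, which this sketch omits.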

k4leung4 added a commit to k4leung4/cluster-api that referenced this issue May 30, 2018
The race condition occurs when the machineset reconciles the same key
too quickly: the creation/deletion of machines is not yet detected by
the second reconciliation, causing it to create/delete additional machines.

I attempted to use WaitForCacheSync, but that is also insufficient in
preventing the race condition.

The fix here is to add a 1 second sleep before releasing the mutex lock
when reconciling, which gives the system a chance to recognize the
changes made by the first reconciliation.

Issue kubernetes-sigs#245 was created to improve this hacky fix.
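
For reference, roughly what this stop-gap amounts to (an assumed shape, not the exact PR diff; the controller and method names here are hypothetical): hold the mutex across the sync and sleep for a second before releasing it, so the informer cache has a chance to observe the machines this pass created or deleted.

```go
// Hypothetical sketch of the temporary workaround; not the actual PR code.
package controllers

import (
	"sync"
	"time"
)

type machineSetController struct {
	mu sync.Mutex
}

// syncMachineSet computes the replica diff from the (possibly stale) cache
// and creates or deletes machines to match spec.replicas.
func (c *machineSetController) syncMachineSet(key string) {
	// ... create/delete machines ...
}

func (c *machineSetController) reconcile(key string) {
	c.mu.Lock()
	defer func() {
		// Hack: give the informer cache a second to observe this pass's
		// creates/deletes before the next reconcile of the same key can run.
		time.Sleep(1 * time.Second)
		c.mu.Unlock()
	}()
	c.syncMachineSet(key)
}
```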
k8s-ci-robot pushed a commit that referenced this issue May 31, 2018
@karan (Contributor) commented Jun 8, 2018

/priority important-soon

@k8s-ci-robot added the priority/important-soon label Jun 8, 2018
@karan (Contributor) commented Jun 8, 2018

/assign @k4leung4

chuckha pushed a commit to chuckha/cluster-api that referenced this issue Oct 2, 2019
…od-workaround

🐛 Copy over hack/tools go mod workaround from cluster-api