
Improve patch helper #8706

Closed · sbueringer opened this issue May 19, 2023 · 18 comments
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • triage/needs-information: Indicates an issue needs more information in order to work on it.

Comments

@sbueringer (Member) commented May 19, 2023

Today the patch helper has the following behaviors, which can lead to surprising behavior / CR state:

  • Status & status conditions are written sequentially (not atomically).
    • This means that when a consumer reads a CR between those writes, the status and the corresponding conditions can be in an inconsistent state.
  • Optimistic locking is not used to write Spec & Status, only for status conditions (except if the resourceVersion changed).
    • Basically this means that the patch helper can calculate a patch based on a stale object.
    • Because of that, the patch might incorrectly change, or fail to change, a status field.
    • Example: the KCP replica fields can temporarily be in an inconsistent state (i.e. the replica numbers don't add up).

Overall this seems to work well enough today, and eventually, after multiple reconciles, CRs should end up in the correct state again.

Nonetheless, I wonder if we should change / re-design the patch helper.

One idea would be to:

  • use optimistic locking for all writes
    • if the write fails we could either:
      • just return the error, or
      • re-calculate the patch based on an up-to-date original object (the downside here is that the whole reconcile ran against a stale object, so the modified object might be wrong)
  • write status and status conditions together
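
For context, here is a minimal sketch of how the patch helper is typically used in a reconciler today, based on the public sigs.k8s.io/cluster-api/util/patch API (the function shape and error aggregation are illustrative):

```go
import (
	"context"

	kerrors "k8s.io/apimachinery/pkg/util/errors"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/patch"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func reconcile(ctx context.Context, c client.Client, cluster *clusterv1.Cluster) (retErr error) {
	// Snapshot the object before mutating it; Patch() later diffs against this snapshot.
	patchHelper, err := patch.NewHelper(cluster, c)
	if err != nil {
		return err
	}
	defer func() {
		// Patch() issues separate, sequential calls for metadata+spec, status, and
		// status conditions; per the discussion above, only the conditions write
		// uses optimistic locking.
		if err := patchHelper.Patch(ctx, cluster); err != nil {
			retErr = kerrors.NewAggregate([]error{retErr, err})
		}
	}()

	// ... mutate cluster.Spec, cluster.Status, and conditions here ...
	return nil
}
```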

Label(s) to be applied

/kind feature

@k8s-ci-robot added the kind/feature and needs-triage labels on May 19, 2023
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer (Member, Author)

Not sure if my suggestion is feasible. For now, I see this issue mostly as a place where we can collect feedback about the patch helper and then, over time, figure out if we want to change something.

cc @fabriziopandini @vincepri

@vincepri (Member)

Eventually we could make use of server-side apply instead of doing our own patching logic? The reason the patching logic isn't using optimistic locking is ultimately to avoid deadlocks or missing data; if we haven't seen problems in reality with the current logic, and the above is mostly an optimization, we could start revisiting this by conditionally using a new patcher that uses server-side apply instead. That said, I think having better support in controller-runtime first would be a prerequisite.
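
For illustration, a minimal sketch of what server-side apply looks like with controller-runtime today (the field owner name and the label are hypothetical; in practice the applied object should contain only the fields the controller wants to own):

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func applyCluster(ctx context.Context, c client.Client, name, namespace string) error {
	// Build a sparse object containing only the fields this controller wants
	// to own; apiVersion and kind are required in apply requests.
	desired := &clusterv1.Cluster{
		TypeMeta: metav1.TypeMeta{
			APIVersion: clusterv1.GroupVersion.String(),
			Kind:       "Cluster",
		},
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
			Labels:    map[string]string{"example.io/rollout": "a"}, // hypothetical field this controller owns
		},
	}

	// With server-side apply the API server merges per field manager and detects
	// conflicts, so there is no client-side diff against a possibly stale base.
	return c.Patch(ctx, desired, client.Apply,
		client.FieldOwner("example-controller"), // hypothetical owner name
		client.ForceOwnership)
}
```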

@sbueringer (Member, Author) commented May 22, 2023

Eventually we could make use of server-side apply instead of doing our own patching logic?

It is an option, although I wonder how to use SSA in a performant way. Executing one SSA call per reconcile seems inefficient compared to what we do today (no call at all if nothing changes). But maybe we could also use some sort of diff logic and only run SSA if something changes. That's probably fine for fields that are only owned by us.

I'm not sure if just using SSA helps with the issues mentioned above. (Does it change anything regarding status conditions?)

The reason the patching logic isn't using optimistic locking is ultimately to avoid deadlocks or missing data

How would we deadlock with optimistic locking? And what do you mean by missing data? (Rather write incomplete/outdated data vs. no data if we requeue?)

if we haven't seen problems in reality with the current logic

The problem is that the status is not reliable. An example:

KCP is going through a rollout

  • T0: New Machine is created
    • calculated KCP state: .spec.replicas = 3, .status.replicas = 4, .status.readyReplicas = 3, .status.updatedReplicas = 3, .status.unavailableReplicas = 0
  • T1: Old Machine is deleted
    • calculated KCP state: .spec.replicas = 3, .status.replicas = 3, .status.readyReplicas = 3, .status.updatedReplicas = 3, .status.unavailableReplicas = 0
  • T2: New Machine is created
    • calculated KCP state: .spec.replicas = 3, .status.replicas = 4, .status.readyReplicas = 3, .status.updatedReplicas = 3, .status.unavailableReplicas = 1

If the patch helper calculates the diff based on T0 instead of T1, .status.replicas won't be patched to 4 and stays at 3.
The result is that KCP then looks like this: .spec.replicas = 3, .status.replicas = 3 (should be 4), .status.readyReplicas = 3, .status.updatedReplicas = 3, .status.unavailableReplicas = 1.
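
To make the failure mode concrete, here is a sketch (hypothetical function and values) of why a merge patch computed against a stale base silently drops the write:

```go
import (
	"context"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// staleBase was read from a cache that still reflects T0, where
// .status.replicas had already been computed as 4, while the object
// currently stored on the server (from T1) has .status.replicas = 3.
func patchFromStaleBase(ctx context.Context, c client.Client, staleBase *controlplanev1.KubeadmControlPlane) error {
	modified := staleBase.DeepCopy()
	modified.Status.Replicas = 4 // the correct value at T2

	// MergeFrom diffs modified against staleBase: .status.replicas is 4 in
	// both, so the generated patch omits the field and the server keeps 3.
	return c.Status().Patch(ctx, modified, client.MergeFrom(staleBase))
}
```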

We had code in the topology controller which started MD upgrades once KCP was rolled out. To check whether KCP was rolled out, we checked .spec.replicas, .status.replicas, .status.readyReplicas, and .status.updatedReplicas. Once we figured out what was going on, we implemented a workaround to also check .status.unavailableReplicas. (#7914)

But this is just one example of how the status can be unreliable, and the "fix" is also very brittle. What it comes down to for me is that our status is basically not trustworthy, and I think there is no way to guard against that. .status.observedGeneration only guarantees that the status was calculated against the latest generation (i.e. spec); status updates don't trigger a new generation. Thus we have no guarantee that the patch to write the status was calculated correctly.

That being said, overall I think it's okay-ish to keep it as is for now. The edge cases are rare, and we haven't had many reports of concrete problems like the one I described above.

@fabriziopandini (Member)

FYI there is also another issue about optimistic locking in the patch helper: #8412.

Optimistic locking seems like something we can investigate at this point; it would also be great to combine this investigation with the work we are doing on stress tests, so we can battle-test the idea and get a good signal that we are not introducing issues with the change.

I think we should wait a little bit before adopting SSA; we are still learning how to deal with it...

@vincepri (Member) commented May 24, 2023

How would we deadlock with optimistic locking? And what do you mean by missing data? (Rather write incomplete/outdated data vs. no data if we requeue?)

Thinking more about the historical reasons: we initially used Update instead of patching, which always updates the entire object. It's always one call for the spec, and one for the status subresource. There were cases where an update in one controller would overwrite the results from a different controller (modifying different parts of the object), effectively losing data.

In the patch case the above scenario shouldn't happen; that said, the immediate concern with locking based on resource version is the sequential calls we make to patch metadata+spec, status, and then conditions.

By enabling optimistic locks everywhere, we'd need to handle the case where the call to metadata+spec goes through, the resource version changes, and we'd have to reapply the status diff on the latest resource version. The conditions code today does exactly that:

```go
// Get a new copy of the object.
if err := h.client.Get(ctx, key, latest); err != nil {
	return false, err
}
```

although it does so by safely applying the diff strictly related to the conditions.

In the generic case, which the patch helper is trying to satisfy, if the metadata+spec patch goes through, retrieving the object again and (forcibly) re-applying the status diff on the current object might run into the same issues we have today; hence why I was suggesting SSA, which in general would be a safer option, but which comes with its own set of limitations.
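
For reference, the generic retry-on-conflict pattern that optimistic locking usually implies could look roughly like this (a sketch built on client-go's retry helper, not the patch helper's actual code; desiredReplicas stands in for whatever the reconcile computed):

```go
import (
	"context"

	"k8s.io/client-go/util/retry"
	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func writeStatusWithRetry(ctx context.Context, c client.Client, key client.ObjectKey, desiredReplicas int32) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Always re-read the latest version so the write is computed against
		// fresh state, then re-apply only our intended change on top of it.
		latest := &controlplanev1.KubeadmControlPlane{}
		if err := c.Get(ctx, key, latest); err != nil {
			return err
		}
		latest.Status.Replicas = desiredReplicas
		// Update sends the resourceVersion, so the server rejects the write
		// with a Conflict error if someone else modified the object meanwhile.
		return c.Status().Update(ctx, latest)
	})
}
```

As the comment above notes, re-reading and re-applying only prevents silently overwriting concurrent writes; it doesn't help if the reconcile computed its result from a stale object in the first place.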

The other point to understand: if the current patching behavior isn't causing real-world problems and has been working well enough (at least for now), is this something that needs to be optimized?

@sbueringer (Member, Author) commented May 25, 2023

The other point to understand: if the current patching behavior isn't causing real-world problems and has been working well enough (at least for now), is this something that needs to be optimized?

For now my goal was mainly to get a discussion started and to collect the limitations/issues of the current patch helper (and make them visible) somewhere that is not my local todo list :).

I don't have any immediate plans (or bandwidth) to optimize it anyway. Happy to wait and see how the SSA story evolves.

@jessehu (Contributor) commented May 26, 2023

We hit bugs in the scenario described by @sbueringer when using the CAPI patchHelper in our controllers, because optimistic locking is not used to write Spec & Status, only for status conditions. I think the CAPI controllers themselves could hit the same bugs too.

We could write another, enhanced PatchHelper based on the CAPI PatchHelper, but it would be great if the CAPI patchHelper could expose optimistic locking as an option for the caller to specify.
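
controller-runtime already exposes a building block that such an option could use under the hood; a sketch (plain controller-runtime, not the CAPI patch helper API):

```go
import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func patchWithOptimisticLock(ctx context.Context, c client.Client, cluster *clusterv1.Cluster) error {
	base := cluster.DeepCopy()

	// ... mutate cluster here ...

	// MergeFromWithOptimisticLock embeds the base's resourceVersion in the
	// merge patch, so the API server rejects the patch with a Conflict error
	// if the object changed since base was read.
	return c.Patch(ctx, cluster,
		client.MergeFromWithOptions(base, client.MergeFromWithOptimisticLock{}))
}
```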

@vincepri (Member)

Are there test cases or examples you can provide to understand the issue a bit more?

@jessehu (Contributor) commented May 27, 2023

This is the case in the CAPI controllers, as PatchHelper patches CR.Status.Conditions -> CR.Spec & CR.Metadata -> CR.Status in sequence.

Optimistic locking is not used to write Spec & Status, only for status conditions (except if the resourceVersion changed)

Basically this means that the patch helper can calculate a patch based on a stale object
Because of that, the patch might incorrectly change, or fail to change, a status field
Example: the KCP replica fields can temporarily be in an inconsistent state (i.e. the replica numbers don't add up)

It's similar in the case of my controller: it depends on an annotation and a status field during reconciliation, but the patch helper could calculate a patch based on a stale object and patch the CR with the stale annotation.

  1. patch CR status conditions
  2. patch CR annotation
  3. patch CR status

@vincepri (Member)

It's similar in the case of my controller: it depends on an annotation and a status field during reconciliation, but the patch helper could calculate a patch based on a stale object and patch the CR with the stale annotation.

In this case I'd expect the entire reconcile loop to be kicked off again. In reality we've seen that having locking in place can cause data loss in most cases, especially when Cluster API (and provider) controllers perform non-re-entrant operations, like creating an external resource, and need to store identifiers that would otherwise be lost. In cases like these, reconciling again would result in duplicated resources, for example.

@vincepri (Member)

/triage needs-information

@k8s-ci-robot added the triage/needs-information label on Jul 10, 2023
@jessehu (Contributor) commented Jul 18, 2023

We hit another problem caused by the CAPI patchHelper not setting resourceVersion. When two ClusterResourceSets are created for a Cluster at the same time, CAPI starts reconciling both, and both reconciles use the patchHelper to patch ClusterResourceSetBinding.Spec.Bindings, which is a slice. These two patch calls can then overwrite each other's data. In most cases this does not cause issues.

In our case specifically, however, the kapp-controller YAML manifest in one of the ClusterResourceSets cannot be applied more than once, because the kapp-controller pod sets the caBundle on the APIService CR in this YAML only when the pod starts.

@vincepri (Member)

CAPI starts reconciling both, and both reconciles use the patchHelper to patch ClusterResourceSetBinding.Spec.Bindings, which is a slice.

This seems like something that could be solved by using SSA, potentially also improving the implementation of ClusterResourceSet and its bindings.
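
For illustration: SSA merges list entries per key, rather than replacing the whole slice, when the list is declared as a map-type list. A hypothetical sketch of what that could look like on the binding types (whether the real ClusterResourceSetBinding API carries these markers is not confirmed here):

```go
// ResourceSetBinding keeps track of resources applied by one ClusterResourceSet.
type ResourceSetBinding struct {
	// ClusterResourceSetName is the merge key for server-side apply.
	ClusterResourceSetName string `json:"clusterResourceSetName"`
	// ... per-resource apply status would go here ...
}

// ClusterResourceSetBindingSpec defines the desired state of ClusterResourceSetBinding.
type ClusterResourceSetBindingSpec struct {
	// Declaring the slice as a map-type list keyed by clusterResourceSetName lets
	// server-side apply merge each binding independently, so two controllers
	// applying different bindings no longer overwrite each other's entries.
	// +listType=map
	// +listMapKey=clusterResourceSetName
	Bindings []*ResourceSetBinding `json:"bindings,omitempty"`
}
```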

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 24, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 23, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to the /close not-planned comment above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned on Mar 24, 2024