
Improve patch helper #8706

Closed · sbueringer opened this issue May 19, 2023 · 18 comments
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • triage/needs-information: Indicates an issue needs more information in order to work on it.

Comments

@sbueringer (Member) commented May 19, 2023

Today the patch helper has the following behaviors, which can lead to surprising behavior / CR state:

  • Status & status conditions are written sequentially (not atomically).
    • This means that when a consumer reads a CR between those writes, the status and the corresponding conditions can be in an inconsistent state.
  • Optimistic locking is not used to write Spec & Status, only for status conditions (except if the resourceVersion changed).
    • Basically this means that the patch helper can calculate a patch based on a stale object.
    • Because of that, the patch might incorrectly change, or fail to change, a status field.
    • Example: the KCP replica fields can temporarily be in an inconsistent state (i.e. the replica numbers don't add up).

Overall this seems to work well enough today, and eventually, after multiple reconciles, CRs should end up in the correct state again.

Nonetheless, I wonder if we should change / re-design the patch helper.

One idea would be to:

  • use optimistic locking for all writes
    • if the write fails we could either:
      • just return the error, or
      • re-calculate the patch based on an up-to-date original object (the downside here is that the whole reconcile ran against a stale object, so the modified object might be wrong)
  • write status and status conditions together
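
For context, here is a minimal sketch of how the patch helper is typically used in a reconciler today, based on the public sigs.k8s.io/cluster-api/util/patch API (the function shape and error aggregation are illustrative):

```go
import (
	"context"

	kerrors "k8s.io/apimachinery/pkg/util/errors"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/patch"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func reconcile(ctx context.Context, c client.Client, cluster *clusterv1.Cluster) (retErr error) {
	// Snapshot the object before mutating it; Patch() later diffs against this snapshot.
	patchHelper, err := patch.NewHelper(cluster, c)
	if err != nil {
		return err
	}
	defer func() {
		// Patch() issues separate, sequential calls for metadata+spec, status, and
		// status conditions; per the discussion above, only the conditions write
		// uses optimistic locking.
		if err := patchHelper.Patch(ctx, cluster); err != nil {
			retErr = kerrors.NewAggregate([]error{retErr, err})
		}
	}()

	// ... mutate cluster.Spec, cluster.Status, and conditions here ...
	return nil
}
```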

Label(s) to be applied

/kind feature

@k8s-ci-robot added the kind/feature and needs-triage labels on May 19, 2023
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer (Member, Author)

Not sure if my suggestion is feasible. For now, I see this issue mostly as a place where we can collect feedback about the patch helper and then, over time, figure out if we want to change something.

cc @fabriziopandini @vincepri

@vincepri (Member)

Eventually we could make use of server-side apply instead of doing our own patching logic? The reason the patching logic isn't using optimistic locking is ultimately to avoid deadlocks or missing data; if we haven't seen problems in reality with the current logic, and the above is mostly an optimization, we could start revisiting this by conditionally using a new patcher that uses server-side apply instead. That said, I think having better support in controller-runtime first would be a prerequisite.
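
For illustration, a minimal sketch of what server-side apply looks like with controller-runtime today (the field owner name and the label are hypothetical; in practice the applied object should contain only the fields the controller wants to own):

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func applyCluster(ctx context.Context, c client.Client, name, namespace string) error {
	// Build a sparse object containing only the fields this controller wants
	// to own; apiVersion and kind are required in apply requests.
	desired := &clusterv1.Cluster{
		TypeMeta: metav1.TypeMeta{
			APIVersion: clusterv1.GroupVersion.String(),
			Kind:       "Cluster",
		},
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
			Labels:    map[string]string{"example.io/rollout": "a"}, // hypothetical field this controller owns
		},
	}

	// With server-side apply the API server merges per field manager and detects
	// conflicts, so there is no client-side diff against a possibly stale base.
	return c.Patch(ctx, desired, client.Apply,
		client.FieldOwner("example-controller"), // hypothetical owner name
		client.ForceOwnership)
}
```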

@sbueringer (Member, Author) commented May 22, 2023

Eventually we could make use of server-side apply instead of doing our own patching logic?

It is an option, although I wonder how to use SSA in a performant way. Executing one SSA call per reconcile seems inefficient compared to what we do today (no call at all if nothing changes). But maybe we could also use some sort of diff logic and only run SSA if something changes. That's probably fine for fields that are only owned by us.

I'm not sure if just using SSA helps with the issues mentioned above. (Does it change anything regarding status conditions?)

The reason the patching logic isn't using optimistic locking is ultimately to avoid deadlocks or missing data

How would we deadlock with optimistic locking? And what do you mean by missing data? (Rather write incomplete/outdated data vs. no data if we requeue?)

if we haven't seen problems in reality with the current logic

The problem is that the status is not reliable. An example:

KCP is going through a rollout

  • T0: New Machine is created
    • calculated KCP state: .spec.replicas = 3, .status.replicas = 4, .status.readyReplicas = 3, .status.updatedReplicas = 3, .status.unavailableReplicas = 0
  • T1: Old Machine is deleted
    • calculated KCP state: .spec.replicas = 3, .status.replicas = 3, .status.readyReplicas = 3, .status.updatedReplicas = 3, .status.unavailableReplicas = 0
  • T2: New Machine is created
    • calculated KCP state: .spec.replicas = 3, .status.replicas = 4, .status.readyReplicas = 3, .status.updatedReplicas = 3, .status.unavailableReplicas = 1

If the patch helper calculates the diff based on T0 instead of T1, .status.replicas won't be patched to 4 and stays at 3.
The result is that KCP then looks like this: .spec.replicas = 3, .status.replicas = 3 (should be 4), .status.readyReplicas = 3, .status.updatedReplicas = 3, .status.unavailableReplicas = 1.
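
To make the failure mode concrete, here is a sketch (hypothetical function and values) of why a merge patch computed against a stale base silently drops the write:

```go
import (
	"context"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// staleBase was read from a cache that still reflects T0, where
// .status.replicas had already been computed as 4, while the object
// currently stored on the server (from T1) has .status.replicas = 3.
func patchFromStaleBase(ctx context.Context, c client.Client, staleBase *controlplanev1.KubeadmControlPlane) error {
	modified := staleBase.DeepCopy()
	modified.Status.Replicas = 4 // the correct value at T2

	// MergeFrom diffs modified against staleBase: .status.replicas is 4 in
	// both, so the generated patch omits the field and the server keeps 3.
	return c.Status().Patch(ctx, modified, client.MergeFrom(staleBase))
}
```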

We had code in the topology controller which started MD upgrades once KCP was rolled out. To check whether KCP was rolled out, we checked .spec.replicas, .status.replicas, .status.readyReplicas, and .status.updatedReplicas. Once we figured out what was going on, we implemented a workaround to also check .status.unavailableReplicas. (#7914)

But this is just one example of how the status can be unreliable, and the "fix" is also very brittle. What it comes down to for me is that our status is basically not trustworthy, and I think there is no way to guard against that. .status.observedGeneration only guarantees that the status was calculated against the latest generation (i.e. spec); status updates don't trigger a new generation. Thus we have no guarantee that the patch to write the status was calculated correctly.

That being said, overall I think it's okay-ish to keep it as is for now. The edge cases are rare, and we haven't had many reports of concrete problems like the one I described above.

@fabriziopandini (Member)

FYI there is also another issue about optimistic locking in the patch helper: #8412.

Optimistic locking seems like something we can investigate at this point; it would also be great to combine this investigation with the work we are doing on stress tests, so we can battle-test the idea and get a good signal that we are not introducing issues with the change.

I think we should wait a little bit before adopting SSA; we are still learning how to deal with it...

@vincepri (Member) commented May 24, 2023

How would we deadlock with optimistic locking? And what do you mean by missing data? (Rather write incomplete/outdated data vs. no data if we requeue?)

Thinking more about the historical reasons: we initially used Update instead of patching, which always updates the entire object. It's always one call for the spec, and one for the status subresource. There were cases where an update in one controller would overwrite the results from a different controller (modifying different parts of the object), effectively losing data.

In the patch case the above scenario shouldn't happen; that said, the immediate concern with locking based on resource version is the sequential calls we make to patch metadata+spec, status, and then conditions.

By enabling optimistic locks everywhere, we'd need to handle the case where the call to metadata+spec goes through, the resource version changes, and we'd have to reapply the status diff on the latest resource version. The conditions code today does exactly that:

```go
// Get a new copy of the object.
if err := h.client.Get(ctx, key, latest); err != nil {
	return false, err
}
```

although it does so by safely applying the diff strictly related to the conditions.

In the generic case, which the patch helper is trying to satisfy, if the metadata+spec patch goes through, retrieving the object again and (forcibly) re-applying the status diff on the current object might run into the same issues we have today; hence why I was suggesting SSA, which in general would be a safer option, but which comes with its own set of limitations.
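
For reference, the generic retry-on-conflict pattern that optimistic locking usually implies could look roughly like this (a sketch built on client-go's retry helper, not the patch helper's actual code; desiredReplicas stands in for whatever the reconcile computed):

```go
import (
	"context"

	"k8s.io/client-go/util/retry"
	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func writeStatusWithRetry(ctx context.Context, c client.Client, key client.ObjectKey, desiredReplicas int32) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Always re-read the latest version so the write is computed against
		// fresh state, then re-apply only our intended change on top of it.
		latest := &controlplanev1.KubeadmControlPlane{}
		if err := c.Get(ctx, key, latest); err != nil {
			return err
		}
		latest.Status.Replicas = desiredReplicas
		// Update sends the resourceVersion, so the server rejects the write
		// with a Conflict error if someone else modified the object meanwhile.
		return c.Status().Update(ctx, latest)
	})
}
```

As the comment above notes, re-reading and re-applying only prevents silently overwriting concurrent writes; it doesn't help if the reconcile computed its result from a stale object in the first place.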

The other point to understand: if the current patching behavior isn't causing real-world problems and has been working well enough (at least for now), is this something that needs to be optimized?

@sbueringer (Member, Author) commented May 25, 2023

The other point to understand: if the current patching behavior isn't causing real-world problems and has been working well enough (at least for now), is this something that needs to be optimized?

For now my goal was mainly to get a discussion started and to collect the limitations/issues of the current patch helper (and make them visible) somewhere that is not my local todo list :).

I don't have any immediate plans (or bandwidth) to optimize it anyway. Happy to wait and see how the SSA story evolves.

@jessehu (Contributor) commented May 26, 2023

We hit bugs in the scenario described by @sbueringer when using the CAPI patchHelper in our controllers, because optimistic locking is not used to write Spec & Status, only for status conditions. I think the CAPI controllers themselves could hit the same bugs too.

We could write another, enhanced PatchHelper based on the CAPI PatchHelper, but it would be great if the CAPI patchHelper could expose optimistic locking as an option for the caller to specify.
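
controller-runtime already exposes a building block that such an option could use under the hood; a sketch (plain controller-runtime, not the CAPI patch helper API):

```go
import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func patchWithOptimisticLock(ctx context.Context, c client.Client, cluster *clusterv1.Cluster) error {
	base := cluster.DeepCopy()

	// ... mutate cluster here ...

	// MergeFromWithOptimisticLock embeds the base's resourceVersion in the
	// merge patch, so the API server rejects the patch with a Conflict error
	// if the object changed since base was read.
	return c.Patch(ctx, cluster,
		client.MergeFromWithOptions(base, client.MergeFromWithOptimisticLock{}))
}
```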

@vincepri (Member)

Are there test cases or examples you can provide to understand the issue a bit more?

@jessehu (Contributor) commented May 27, 2023

This is the case in the CAPI controllers, as PatchHelper patches CR.Status.Conditions -> CR.Spec & CR.Metadata -> CR.Status in sequence.

Optimistic locking is not used to write Spec & Status, only for status conditions (except if the resourceVersion changed)

Basically this means that the patch helper can calculate a patch based on a stale object
Because of that, the patch might incorrectly change, or fail to change, a status field
Example: the KCP replica fields can temporarily be in an inconsistent state (i.e. the replica numbers don't add up)

It's similar in the case of my controller: it depends on an annotation and a status field during reconciliation, but the patch helper could calculate a patch based on a stale object and patch the CR with the stale annotation.

  1. patch CR status conditions
  2. patch CR annotation
  3. patch CR status

@vincepri (Member)

It's similar in the case of my controller: it depends on an annotation and a status field during reconciliation, but the patch helper could calculate a patch based on a stale object and patch the CR with the stale annotation.

In this case I'd expect the entire reconcile loop to be kicked off again. In reality we've seen that having locking in place can cause data loss in most cases, especially when Cluster API (and provider) controllers perform non-re-entrant operations, like creating an external resource, and need to store identifiers that would otherwise be lost. In cases like these, reconciling again would result in duplicated resources, for example.

@vincepri (Member)

/triage needs-information

@k8s-ci-robot added the triage/needs-information label on Jul 10, 2023
@jessehu (Contributor) commented Jul 18, 2023

We hit another problem caused by the CAPI patchHelper not setting resourceVersion. When two ClusterResourceSets are created for a Cluster at the same time, CAPI starts reconciling both, and both reconciles use the patchHelper to patch ClusterResourceSetBinding.Spec.Bindings, which is a slice. These two patch calls can then overwrite each other's data. In most cases this does not cause issues.

In our case specifically, however, the kapp-controller YAML manifest in one of the ClusterResourceSets cannot be applied more than once, because the kapp-controller pod sets the caBundle on the APIService CR in this YAML only when the pod starts.

@vincepri (Member)

CAPI starts reconciling both, and both reconciles use the patchHelper to patch ClusterResourceSetBinding.Spec.Bindings, which is a slice.

This seems like something that could be solved by using SSA, potentially also improving the implementation of ClusterResourceSet and its bindings.
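
For illustration: SSA merges list entries per key, rather than replacing the whole slice, when the list is declared as a map-type list. A hypothetical sketch of what that could look like on the binding types (whether the real ClusterResourceSetBinding API carries these markers is not confirmed here):

```go
// ResourceSetBinding keeps track of resources applied by one ClusterResourceSet.
type ResourceSetBinding struct {
	// ClusterResourceSetName is the merge key for server-side apply.
	ClusterResourceSetName string `json:"clusterResourceSetName"`
	// ... per-resource apply status would go here ...
}

// ClusterResourceSetBindingSpec defines the desired state of ClusterResourceSetBinding.
type ClusterResourceSetBindingSpec struct {
	// Declaring the slice as a map-type list keyed by clusterResourceSetName lets
	// server-side apply merge each binding independently, so two controllers
	// applying different bindings no longer overwrite each other's entries.
	// +listType=map
	// +listMapKey=clusterResourceSetName
	Bindings []*ResourceSetBinding `json:"bindings,omitempty"`
}
```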

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 24, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 23, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to the /close not-planned comment above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned on Mar 24, 2024