refactor: set defaults in Deployment, else k8s sets them for you, creating infinite reconciliation loop #1594

Merged
merged 30 commits into from
Jul 27, 2023

Conversation

qicz
Member

@qicz qicz commented Jun 26, 2023

Which issue(s) this PR fixes:
Fixes the Envoy proxy resource update bug: the resources rendered by the resourceProvider differ in several fields from the resources that were actually applied to the cluster.

Fixes

@qicz qicz requested a review from a team as a code owner June 26, 2023 10:48
@qicz qicz added kind/bug Something isn't working area/infra-mgr Issues related to the provisioner used for provisioning the managed Envoy Proxy fleet. labels Jun 26, 2023
@qicz qicz added this to the 0.5.0-rc1 milestone Jun 26, 2023
@qicz qicz changed the title fix: envoy proxy resource compare bug. fix: envoy proxy resource apply bug. Jun 26, 2023
@codecov

codecov bot commented Jun 26, 2023

Codecov Report

Merging #1594 (ff25ed3) into main (89d16a6) will decrease coverage by 0.09%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1594      +/-   ##
==========================================
- Coverage   64.90%   64.82%   -0.09%     
==========================================
  Files          83       83              
  Lines       11912    11924      +12     
==========================================
- Hits         7732     7730       -2     
- Misses       3701     3711      +10     
- Partials      479      483       +4     
Files Changed Coverage Δ
...ternal/infrastructure/kubernetes/proxy/resource.go 93.22% <100.00%> (+0.21%) ⬆️
...frastructure/kubernetes/proxy/resource_provider.go 86.39% <100.00%> (+0.16%) ⬆️
...al/infrastructure/kubernetes/ratelimit/resource.go 96.21% <100.00%> (+0.04%) ⬆️
...tructure/kubernetes/ratelimit/resource_provider.go 97.16% <100.00%> (+0.04%) ⬆️

... and 2 files with indirect coverage changes

@arkodg
Contributor

arkodg commented Jun 26, 2023

@qicz can you please elaborate on the bug?

@qicz
Member Author

qicz commented Jun 26, 2023

@qicz can you please elaborate on the bug?

The controller watches all resources. When any resource changes, r.InfraIR stores the new Infra:

for key, val := range result.InfraIR {
	if err := val.Validate(); err != nil {
		r.Logger.Error(err, "unable to validate infra ir, skipped sending it")
	} else {
		r.InfraIR.Store(key, val)
		newKeys = append(newKeys, key)
	}
}

The subscriber to r.InfraIR then receives the change and calls CreateOrUpdateProxyInfra:

func (r *Runner) subscribeToProxyInfraIR(ctx context.Context) {
	// Subscribe to resources
	message.HandleSubscription(r.InfraIR.Subscribe(ctx),
		func(update message.Update[string, *ir.Infra]) {
			val := update.Value
			if update.Delete {
				if err := r.mgr.DeleteProxyInfra(ctx, val); err != nil {
					r.Logger.Error(err, "failed to delete infra")
				}
			} else {
				// Manage the proxy infra.
				if err := r.mgr.CreateOrUpdateProxyInfra(ctx, val); err != nil {
					r.Logger.Error(err, "failed to create new infra")
				}
			}
		},
	)
	r.Logger.Info("infra subscriber shutting down")
}

func (i *Infra) createOrUpdateDeployment(ctx context.Context, r ResourceRender) error {
	deployment, err := r.Deployment()
	if err != nil {
		return err
	}
	current := &appsv1.Deployment{}
	key := types.NamespacedName{
		Namespace: deployment.Namespace,
		Name:      deployment.Name,
	}
	return i.Client.CreateOrUpdate(ctx, key, current, deployment, func() bool {
		return !reflect.DeepEqual(deployment.Spec, current.Spec)
	})
}

When the deployment is created, Kubernetes fills some fields with default values, such as DefaultMode (420), DeprecatedServiceAccount (the service account name), RevisionHistoryLimit (&10), and ProgressDeadlineSeconds (&600). The HPA also changes Replicas when autoscaling occurs. We should align these fields so that updateChecker() works correctly; otherwise the deployment is updated on every reconcile, and more and more pods end up terminating.

	return i.Client.CreateOrUpdate(ctx, key, current, deployment, func() bool {
		// Once applied, k8s fills in DeprecatedServiceAccount, so copy it from the current object.
		deployment.Spec.Template.Spec.DeprecatedServiceAccount = current.Spec.Template.Spec.DeprecatedServiceAccount
		// Once applied, k8s fills in SecurityContext with default settings.
		deployment.Spec.Template.Spec.SecurityContext = current.Spec.Template.Spec.SecurityContext
		// Account for replica changes made by the HPA.
		deployment.Spec.Replicas = current.Spec.Replicas
		return !reflect.DeepEqual(deployment.Spec, current.Spec)
	})
}

	if updateChecker() {
		specific.SetUID(current.GetUID())
		if err := cli.Client.Update(ctx, specific); err != nil {
			return errors.Wrap(err, "for Update")
		}
	}
}
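
The mismatch described above can be reproduced without a cluster. The following is only a minimal sketch under stated assumptions: `DeploymentSpec`, `applyServerDefaults`, `needsUpdate`, and `int32Ptr` are hypothetical stand-ins (not Envoy Gateway or Kubernetes code), and the default values 10 and 600 are taken from the comment above. It shows why a `reflect.DeepEqual` check keeps reporting a difference until the rendered object sets the same defaults itself.

```go
package main

import (
	"fmt"
	"reflect"
)

// DeploymentSpec is a hypothetical, trimmed-down stand-in for appsv1.DeploymentSpec.
type DeploymentSpec struct {
	Replicas                *int32
	RevisionHistoryLimit    *int32
	ProgressDeadlineSeconds *int32
}

// int32Ptr is a small helper that returns a pointer to v.
func int32Ptr(v int32) *int32 { return &v }

// applyServerDefaults imitates the API server filling unset fields on create.
// The values 10 and 600 are the defaults quoted in the comment above.
func applyServerDefaults(in DeploymentSpec) DeploymentSpec {
	out := in
	if out.RevisionHistoryLimit == nil {
		out.RevisionHistoryLimit = int32Ptr(10)
	}
	if out.ProgressDeadlineSeconds == nil {
		out.ProgressDeadlineSeconds = int32Ptr(600)
	}
	return out
}

// needsUpdate mirrors the shape of the updateChecker closure: any difference
// between the rendered and the stored spec triggers an update.
func needsUpdate(desired, current DeploymentSpec) bool {
	return !reflect.DeepEqual(desired, current)
}

func main() {
	// Rendered without defaults: the stored object always differs.
	rendered := DeploymentSpec{Replicas: int32Ptr(2)}
	fmt.Println(needsUpdate(rendered, applyServerDefaults(rendered))) // true

	// Rendered with the defaults set explicitly: the comparison settles.
	withDefaults := DeploymentSpec{
		Replicas:                int32Ptr(2),
		RevisionHistoryLimit:    int32Ptr(10),
		ProgressDeadlineSeconds: int32Ptr(600),
	}
	fmt.Println(needsUpdate(withDefaults, applyServerDefaults(withDefaults))) // false
}
```

Setting the defaults in the rendered object, as the PR title suggests, keeps the comparison simple; copying fields from the current object, as in the snippet above, is the alternative and needs one line per defaulted field.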

@qicz
Member Author

qicz commented Jun 27, 2023

Maybe we should have each controller watch its own resources, as in the previous version. @arkodg

Signed-off-by: qicz <[email protected]>
@qicz
Member Author

qicz commented Jun 27, 2023

@zirain the e2e tests need review: they sometimes pass on v1.27.0, v1.25.8, or v1.26.3, but they do not pass consistently on all of them.

@zirain
Member

zirain commented Jun 27, 2023

@zirain the e2e tests need review: they sometimes pass on v1.27.0, v1.25.8, or v1.26.3, but they do not pass consistently on all of them.

see #1599

qicz added 2 commits June 27, 2023 14:27
@arkodg
Contributor

arkodg commented Jun 27, 2023

Should the logic be reverted to update only the fields that EG cares about, instead of updating the entire object?
Or should we skip comparing some fields (those that are changed by other controllers)?

@qicz
Member Author

qicz commented Jun 27, 2023

Should the logic be reverted to update only the fields that EG cares about, instead of updating the entire object? Or should we skip comparing some fields (those that are changed by other controllers)?

Maybe we need a dedicated controller for envoyproxy; then we would only need to compare envoyproxy.

At present we either compare only the fields that EG cares about or ignore some fields. As features are added later, this approach makes it easy to miss cases in the comparison logic.

The HPA is affected as well. For example, if the original replica count is 2 and the HPA changes it to 3, then modifying envoyproxy to set it to 4 will not take effect. We need to compare the replicas of the rendered (hard-coded) deployment with those of the current deployment in k8s, and trigger an update when the rendered value is greater than the current one.
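
The replica rule proposed in this comment can be written down as a small predicate. This is only a sketch of the proposal as worded here, not code from the PR (the commit list further down includes "rm deploy cmp logic", so the merged change may differ); `shouldUpdateReplicas` and its nil handling are assumptions made for illustration.

```go
package main

import "fmt"

// shouldUpdateReplicas is a hypothetical helper: it reports whether the replica
// count rendered from envoyproxy should overwrite the live count in k8s.
// Following the proposal above, an update fires only when rendered > current.
func shouldUpdateReplicas(rendered, current *int32) bool {
	if rendered == nil {
		// Nothing configured: leave the live value (possibly set by the HPA) alone.
		return false
	}
	if current == nil {
		// Assumption: an unset live value is always replaced.
		return true
	}
	return *rendered > *current
}

func main() {
	original, hpaScaled, raised := int32(2), int32(3), int32(4)

	// The HPA scaled 2 -> 3; the rendered value is still 2, so do not fight the HPA.
	fmt.Println(shouldUpdateReplicas(&original, &hpaScaled)) // false

	// envoyproxy is changed to 4, which is above the live count, so update.
	fmt.Println(shouldUpdateReplicas(&raised, &hpaScaled)) // true
}
```

Note that under this rule a rendered value lower than the live count never takes effect, which may or may not be the intended behavior.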

@qicz
Member Author

qicz commented Jul 26, 2023

@arkodg PTAL; this has been updated per our offline discussion.

Signed-off-by: qicz <[email protected]>
arkodg
arkodg previously approved these changes Jul 26, 2023
Contributor

@arkodg arkodg left a comment

LGTM, thanks!

@arkodg arkodg requested a review from a team July 26, 2023 17:50
@arkodg arkodg modified the milestones: 0.5.0, 0.5.0-rc1 Jul 26, 2023
@arkodg arkodg requested review from a team, kflynn, zhaohuabing and chauhanshubham and removed request for a team July 26, 2023 18:06
@arkodg arkodg modified the milestones: 0.5.0-rc1, 0.5.0 Jul 26, 2023
Signed-off-by: qicz <[email protected]>
@zirain
Member

zirain commented Jul 27, 2023

cc @arkodg do we need to cherry-pick this to release-0.5?

@zirain
Member

zirain commented Jul 27, 2023

/wait #1714

@zirain zirain added the cherrypick/release-v0.5 cherrypick to release/v0.5 label Jul 27, 2023
Member

@Xunzhuo Xunzhuo left a comment

Good catch and thanks for digging into the problem : )

@Xunzhuo Xunzhuo merged commit 9ba9103 into envoyproxy:main Jul 27, 2023
@Xunzhuo
Member

Xunzhuo commented Jul 27, 2023

@zirain
Member

zirain commented Jul 27, 2023

I have no idea about the failure; maybe something underlying changed?

arkodg pushed a commit to arkodg/gateway that referenced this pull request Aug 2, 2023
…ating infinite reconciliation loop (envoyproxy#1594)

* fix: envoy proxy resource apply bug.

Signed-off-by: qicz <[email protected]>

* update pointer.

Signed-off-by: qicz <[email protected]>

* add comment

Signed-off-by: qicz <[email protected]>

* update cm cmp logic.

Signed-off-by: qicz <[email protected]>

* fix lint

Signed-off-by: qicz <[email protected]>

* add probe field default value.

Signed-off-by: qicz <[email protected]>

* fix uts

Signed-off-by: qicz <[email protected]>

* align probe

Signed-off-by: qicz <[email protected]>

* optimize deploy compare logic

Signed-off-by: qicz <[email protected]>

* add compare deploy uts

Signed-off-by: qicz <[email protected]>

* rm cm binarydata cmp

Signed-off-by: qicz <[email protected]>

* rm deploy cmp logic

Signed-off-by: qicz <[email protected]>

* fix ut

Signed-off-by: qicz <[email protected]>

* fix lint

Signed-off-by: qicz <[email protected]>

---------

Signed-off-by: qicz <[email protected]>
Signed-off-by: qi <[email protected]>
(cherry picked from commit 9ba9103)
arkodg added a commit that referenced this pull request Aug 2, 2023
* refactor: set defaults in Deployment, else k8s sets them for you, creating infinite reconciliation loop (#1594)

* fix: envoy proxy resource apply bug.

Signed-off-by: qicz <[email protected]>

* update pointer.

Signed-off-by: qicz <[email protected]>

* add comment

Signed-off-by: qicz <[email protected]>

* update cm cmp logic.

Signed-off-by: qicz <[email protected]>

* fix lint

Signed-off-by: qicz <[email protected]>

* add probe field default value.

Signed-off-by: qicz <[email protected]>

* fix uts

Signed-off-by: qicz <[email protected]>

* align probe

Signed-off-by: qicz <[email protected]>

* optimize deploy compare logic

Signed-off-by: qicz <[email protected]>

* add compare deploy uts

Signed-off-by: qicz <[email protected]>

* rm cm binarydata cmp

Signed-off-by: qicz <[email protected]>

* rm deploy cmp logic

Signed-off-by: qicz <[email protected]>

* fix ut

Signed-off-by: qicz <[email protected]>

* fix lint

Signed-off-by: qicz <[email protected]>

---------

Signed-off-by: qicz <[email protected]>
Signed-off-by: qi <[email protected]>
(cherry picked from commit 9ba9103)

* DeepCopy resources that require status updates (#1723)

* Was seeing constant churn between provider runner publishing resources
and gateway-api runner receiving them.

* Tried to debug it by printing the o/p of `cmp.Diff` between current
  and previous values
```
diff --git a/internal/gatewayapi/runner/runner.go b/internal/gatewayapi/runner/runner.go
index 050394ba..50d09f6f 100644
--- a/internal/gatewayapi/runner/runner.go
+++ b/internal/gatewayapi/runner/runner.go
@@ -8,6 +8,7 @@ package runner
 import (
        "context"

+       "github.com/google/go-cmp/cmp"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "sigs.k8s.io/gateway-api/apis/v1beta1"
        "sigs.k8s.io/yaml"
@@ -49,6 +50,7 @@ func (r *Runner) Start(ctx context.Context) error {
 }

 func (r *Runner) subscribeAndTranslate(ctx context.Context) {
+       prev := &gatewayapi.Resources{}
        message.HandleSubscription(r.ProviderResources.GatewayAPIResources.Subscribe(ctx),
                func(update message.Update[string, *gatewayapi.Resources]) {
                        val := update.Value
@@ -56,6 +58,9 @@ func (r *Runner) subscribeAndTranslate(ctx context.Context) {
                        if update.Delete || val == nil {
                                return
                        }
+                       diff := cmp.Diff(prev, val)
+                       r.Logger.WithValues("output", "diff").Info(diff)
+                       prev = val.DeepCopy()

                        // Translate and publish IRs.
                        t := &gatewayapi.Translator{
```

Here's the output, and it's empty
```
2023-07-27T23:55:29.795Z	INFO	gateway-api	runner/runner.go:62		{"runner": "gateway-api", "output": "diff"}
```

* Using a DeepCopy for resources that were updating the `Status`
  subresource seems to have solved the issue, which implies that
  watchable doesn't like clients to mutate the value, even though they
  are meant to be a `DeepCopy`

Fixes: #1715

Signed-off-by: Arko Dasgupta <[email protected]>
(cherry picked from commit 5b72451)

* observability: add container port for metrics (#1736)

container port

Signed-off-by: zirain <[email protected]>
(cherry picked from commit 4bba03a)

* docs: Add user docs for EnvoyPatchPolicy (#1733)

* Add user docs for EnvoyPatchPolicy

Relates to #24

Signed-off-by: Arko Dasgupta <[email protected]>

* nits

Signed-off-by: Arko Dasgupta <[email protected]>

* wrap up

Signed-off-by: Arko Dasgupta <[email protected]>

* lint

Signed-off-by: Arko Dasgupta <[email protected]>

* address comments && fix config

Signed-off-by: Arko Dasgupta <[email protected]>

---------

Signed-off-by: Arko Dasgupta <[email protected]>
(cherry picked from commit 27b0939)

* e2e & misc fixes for EnvoyPatchPolicy (#1738)

* Add E2E for EnvoyPatchPolicy

* Use LocalReplyConfig to return a custom
status code `406` when there is no valid route match

Signed-off-by: Arko Dasgupta <[email protected]>
(cherry picked from commit a7784c5)

---------

Signed-off-by: Arko Dasgupta <[email protected]>
Co-authored-by: qi <[email protected]>
Co-authored-by: zirain <[email protected]>
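
The churn described in the DeepCopy change (#1723) in the commit message above is consistent with plain pointer aliasing. The sketch below is an illustration only: `Resource` and `Store` are hypothetical types, and `Store` is not watchable's API. It assumes a store that notifies subscribers only when a newly published value differs from the last one it holds, and shows how mutating the shared pointer in place defeats that check, while mutating a copy does not.

```go
package main

import (
	"fmt"
	"reflect"
)

// Resource is a hypothetical object with a status field.
type Resource struct {
	Name   string
	Status string
}

// DeepCopy returns an independent copy (the struct has no nested pointers).
func (r *Resource) DeepCopy() *Resource {
	c := *r
	return &c
}

// Store is a hypothetical value store that notifies only on change.
type Store struct {
	last *Resource
}

// Put records r and reports whether subscribers would be notified.
func (s *Store) Put(r *Resource) bool {
	changed := !reflect.DeepEqual(s.last, r)
	s.last = r
	return changed
}

func main() {
	// Case 1: a subscriber mutates the shared pointer in place.
	s := &Store{}
	s.Put(&Resource{Name: "gw"})
	s.last.Status = "Accepted" // stands in for a subscriber writing status on the shared value
	fmt.Println(s.Put(&Resource{Name: "gw"})) // true: the same input looks like a change

	// Case 2: the subscriber works on a DeepCopy instead.
	s2 := &Store{}
	s2.Put(&Resource{Name: "gw"})
	c := s2.last.DeepCopy()
	c.Status = "Accepted"
	fmt.Println(s2.Put(&Resource{Name: "gw"})) // false: the stored value was untouched
}
```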
This was referenced Aug 12, 2023
Labels
area/infra-mgr Issues related to the provisioner used for provisioning the managed Envoy Proxy fleet. cherrypick/release-v0.5 cherrypick to release/v0.5 kind/refactor
4 participants