Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(EG K8S Provider): Improve EG Gateway xDS & startup reliability #2918

Closed
wants to merge 20 commits into from
Closed

feat(EG K8S Provider): Improve EG Gateway xDS & startup reliability #2918

wants to merge 20 commits into from

Conversation

alexwo
Copy link
Contributor

@alexwo alexwo commented Mar 13, 2024

What type of PR is this?

feat(EG K8S Provider): Improve EG Gateway xDS & startup reliability

What this PR does / why we need it:

  • Service Readiness: By waiting for the xDS service to start up, we ensure that the service discovery mechanism is operational, which is crucial for dynamically managing service configurations and routing in microservices architectures.
  • Stability and Reliability: This cautious approach prioritizes the stability and reliability of the system by preventing any unconfigured or misconfigured instances of Envoy from intercepting traffic, which could lead to service disruptions or degraded user experiences.
  • Observability: Completing a successful reconcile cycle before startup can make it easier to monitor and debug the system, as it sets a clear baseline for the operational state of the Envoy instances.

Which issue(s) this PR fixes:

This enhancement facilitates the safer upgrades of EG gateways and features improved xDS consistency on startup.

High Level Changes

A custom health check is implemented to ensure that EG controller is deemed ready only once the xDS snapshot is persisted, or after at least one reconciliation is completed for empty or new clusters.
A custom watcher is added to initiate an initial dummy reconciliation, this ensures empty or news deployments can start (as a successful reconcile is a trigger for startup)

  • Configuration Validation: Ensuring a successful reconcile cycle before startup means that the initial configuration passed to Envoy proxies is validated and error-free. This minimizes the chances of runtime errors related to misconfigurations.

#2810

! Draft !

arkodg and others added 14 commits March 13, 2024 21:56
Fixes: #2875

Signed-off-by: Arko Dasgupta <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
…om conflicting (#2786)

* * Validate that multiple policies that affect listener configuration don't map to
  the same listener filter chain.
* Change the XDS listener generation so that instead of
  defaultFilterChain for non-TLS routes, a filterChain with a
  destinationPort matcher is used.
  This allows multiple policies attached to non-TLS listeners that
  differ on the destination port to provide different policies without
  conflicting.

Signed-off-by: Lior Okman <[email protected]>

* Make hostname based routing work again for non-TLS listeners

Signed-off-by: Lior Okman <[email protected]>

* Fixed testdata for egctl

Signed-off-by: Lior Okman <[email protected]>

* Make the linter happy

Signed-off-by: Lior Okman <[email protected]>

* Added a unit-test

Signed-off-by: Lior Okman <[email protected]>

* Make the linter happy

Signed-off-by: Lior Okman <[email protected]>

* Update an e2e test with the new filterChain patch

Signed-off-by: Lior Okman <[email protected]>

* Revert changing the XDS translation, since a new listener is created
anyways for each port.

Signed-off-by: Lior Okman <[email protected]>

* Also revert the xds change in the e2e test.

Signed-off-by: Lior Okman <[email protected]>

* Don't need to go over the full XDSIR map - just the current gateway.

Signed-off-by: Lior Okman <[email protected]>

* Refactored to separate the validation and the translation.

Renamed the helper method to a more generic name.

Signed-off-by: Lior Okman <[email protected]>

---------

Signed-off-by: Lior Okman <[email protected]>
Co-authored-by: Guy Daich <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
* remove ProcessBackendTLSPoliciesAncestorRef

Signed-off-by: huabing zhao <[email protected]>

* address comments

Signed-off-by: huabing zhao <[email protected]>

---------

Signed-off-by: huabing zhao <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
* Change the Merge behavior to Replace for BackendTrafficPolicy

Signed-off-by: huabing zhao <[email protected]>

* address comments

Signed-off-by: huabing zhao <[email protected]>

---------

Signed-off-by: huabing zhao <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
* skip publishing empty status for policies

* #2802 skips computing status
if a target resource cannot be found, mainly because that target maybe
irrelevant to this specific translation, its hard to proactively find
that out in the provider layer

* This fix ensures that any empty status is not published and resets any
existing status for a policy

Signed-off-by: Arko Dasgupta <[email protected]>

* also fix for envoypatchpolicy

Signed-off-by: Arko Dasgupta <[email protected]>

* also discard status for backendtlspolicy

Signed-off-by: Arko Dasgupta <[email protected]>

---------

Signed-off-by: Arko Dasgupta <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
* docs: multiple gatewayclass and merge gateways deployment mode

Signed-off-by: Karol Szwaj <[email protected]>

* add merged-gateways example

Signed-off-by: Karol Szwaj <[email protected]>

* md lint

Signed-off-by: Karol Szwaj <[email protected]>

* yaml lint

Signed-off-by: Karol Szwaj <[email protected]>

* add user guides

Signed-off-by: Karol Szwaj <[email protected]>

---------

Signed-off-by: Karol Szwaj <[email protected]>
Co-authored-by: Xunzhuo <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
* add PolicyStatus for CTP

Signed-off-by: shawnh2 <[email protected]>

* fix gen-check

Signed-off-by: shawnh2 <[email protected]>

* revert discard policy status

Signed-off-by: shawnh2 <[email protected]>

---------

Signed-off-by: shawnh2 <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
* use gwapiv1a2.PolicyStatus for SecurityPolicy Status

Signed-off-by: huabing zhao <[email protected]>

* fix lint

Signed-off-by: huabing zhao <[email protected]>

* add test for cross-ns refs

Signed-off-by: huabing zhao <[email protected]>

* add todo

Signed-off-by: huabing zhao <[email protected]>

* Update internal/gatewayapi/securitypolicy.go

Co-authored-by: sh2 <[email protected]>
Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: huabing zhao <[email protected]>

---------

Signed-off-by: huabing zhao <[email protected]>
Signed-off-by: Huabing Zhao <[email protected]>
Co-authored-by: sh2 <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
fix oidc doc

Signed-off-by: huabing zhao <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
* add v1.0.0 release note

Signed-off-by: bitliu <[email protected]>

* generate v1.0 release page

Signed-off-by: bitliu <[email protected]>

* add v1.0.0 release announcement

Signed-off-by: bitliu <[email protected]>

* generate v1.0.0 docs

Signed-off-by: bitliu <[email protected]>

* update site links

Signed-off-by: bitliu <[email protected]>

* fix linter

Signed-off-by: bitliu <[email protected]>

---------

Signed-off-by: bitliu <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
Copy link

codecov bot commented Mar 13, 2024

Codecov Report

Attention: Patch coverage is 51.61290% with 30 lines in your changes are missing coverage. Please review.

Project coverage is 64.60%. Comparing base (29946b0) to head (7afa2db).
Report is 122 commits behind head on main.

❗ Current head 7afa2db differs from pull request most recent head 1505793. Consider uploading reports for the commit 1505793 to get more accurate results

Files Patch % Lines
internal/probs/types.go 40.00% 14 Missing and 1 partial ⚠️
internal/cmd/server.go 0.00% 5 Missing ⚠️
internal/provider/kubernetes/controller.go 75.00% 2 Missing and 1 partial ⚠️
internal/provider/kubernetes/kubernetes.go 33.33% 0 Missing and 2 partials ⚠️
internal/provider/kubernetes/sources.go 85.71% 2 Missing ⚠️
internal/xds/server/runner/runner.go 0.00% 2 Missing ⚠️
internal/provider/runner/runner.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2918      +/-   ##
==========================================
- Coverage   66.51%   64.60%   -1.91%     
==========================================
  Files         161      123      -38     
  Lines       22673    21179    -1494     
==========================================
- Hits        15080    13683    -1397     
+ Misses       6720     6646      -74     
+ Partials      873      850      -23     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alexwo alexwo marked this pull request as ready for review March 13, 2024 22:26
@alexwo alexwo requested a review from a team as a code owner March 13, 2024 22:26
@arkodg
Copy link
Contributor

arkodg commented Mar 13, 2024

@alexwo is this necessary ?
based on https://github.com/envoyproxy/gateway/blob/main/internal/xds/cache/snapshotcache.go I believe we only push a snapshot if the IR exists, so if a IR hasn't been translated yet, it shouldnt result in a xds response (this needs to be cross checked)

@alexwo
Copy link
Contributor Author

alexwo commented Mar 13, 2024

@alexwo is this necessary ? based on https://github.com/envoyproxy/gateway/blob/main/internal/xds/cache/snapshotcache.go I believe we only push a snapshot if the IR exists, so if a IR hasn't been translated yet, it shouldnt result in a xds response (this needs to be cross checked)

Sure,

There are essentially two type of conditions that indicate that an EG instance is healthy to serve traffic:

1. Empty or new cluster:
If there are no items requiring translation, we don't anticipate an XDS response. Our focus is solely on verifying that the XDS server has successfully initiated and that a reconciliation loop, albeit inactive, has been executed. In this case EG just startup without xDS having to persist anything.

To figure out that a cluster is empty or new and can start due to this conditions, we trigger the "initial reconcile loop" which will signal to start the instance if xDS / initial loop figures out there is nothing to do.

2. Cluster with resources for xDS:
If there are any IRs translated due to a reconciliation loop containing items needing processing, we verify that a snapshot has indeed been persisted before deeming the instance ready. (In this case the reconcile ha items, so we wait until the loop complete and write the first snapshot, only after mark the instance as healthy.)

@alexwo alexwo mentioned this pull request Apr 30, 2024
@alexwo alexwo closed this by deleting the head repository May 29, 2024
@aoledk
Copy link
Contributor

aoledk commented May 29, 2024

@alexwo You closed this PR in favor of @arkodg 's suggestion (#2810 (comment)) to use longer initialDelaySeconds ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants