Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve EG Gateway xDS & startup Reliability (custom k8s health prob) #2810

Open
alexwo opened this issue Mar 6, 2024 · 17 comments
Open

Improve EG Gateway xDS & startup Reliability (custom k8s health prob) #2810

alexwo opened this issue Mar 6, 2024 · 17 comments
Assignees
Labels
kind/feature new feature
Milestone

Comments

@alexwo
Copy link
Contributor

alexwo commented Mar 6, 2024

The proposed enhancement involves modifying the controller's "ready" status to accurately reflect the completion and synchronization of xds discovery processes.

Specifically, the "ready" status indicator can transition to "true" when the xDS discovery has fully completed to store it's initial snapshot or when there is no reconciliation required. (empty or new deployment).

This can ensure that envoy proxies are always in sync with latest xDS service and that an EG that has started is able to reconcile.

-> If there is nothing to reconcile -> ready = true
-> if there are changes to reconcile, wait for xDS to complete -> ready = true
-> other wise -> ready = false

Currently, there may be certain cases where xDS is not completely synchronized at startup, which could cause new Envoy proxies to work with an incomplete xDS.

This can provide better guarantees that an operational EG consistently maintains an updated xDS, potentially also can allow avoiding situations where instances startup but fail during the initial reconcile.

Leader Election and multiple instances use case:
Will improve consistency in environments where multiple instances of EG run simultaneously by ensuring they start only once xDS server has persisted the latest state snapshot. #1953

@alexwo alexwo added the triage label Mar 6, 2024
@alexwo alexwo changed the title Improve EG Gateway xDS & startup Reliability (via custom k8s health check impl) Improve EG Gateway xDS & startup Reliability (custom k8s health probs) Mar 6, 2024
@alexwo alexwo changed the title Improve EG Gateway xDS & startup Reliability (custom k8s health probs) Improve EG Gateway xDS & startup Reliability (custom k8s health prob) Mar 6, 2024
@arkodg
Copy link
Contributor

arkodg commented Mar 6, 2024

makes sense, a workaround for this until this is implemented is to wake up slowly i.e. set initialDelaySeconds to a higher value

@arkodg arkodg added help wanted Extra attention is needed kind/feature new feature and removed triage labels Mar 6, 2024
@arkodg arkodg added this to the Backlog milestone Mar 6, 2024
@alexwo
Copy link
Contributor Author

alexwo commented Mar 7, 2024

please assign to me

Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@aoledk
Copy link
Contributor

aoledk commented May 29, 2024

makes sense, a workaround for this until this is implemented is to wake up slowly i.e. set initialDelaySeconds to a higher value

@arkodg But currently EG hasn't exposed initialDelaySeconds via Chart Values in

readinessProbe:
httpGet:
path: /readyz
port: 8081
initialDelaySeconds: 5
periodSeconds: 10

@github-actions github-actions bot removed the stale label May 29, 2024
@arkodg
Copy link
Contributor

arkodg commented May 29, 2024

I was referring to initialDelaySeconds for envoy proxy (data plane) so it gets enough time to receive xds before receiving traffic (from external LB) once signaling its ready

@aoledk
Copy link
Contributor

aoledk commented May 30, 2024

Since EG is xDS resources provider and envoy is consumer, setting longer initialDelaySeconds for envoy can't prevent envoy from receiving incomplete xDS resources when EG is starting. This will lead to frequent envoy listener draining.

Workaround here maybe set longer initialDelaySeconds for EG to make sure envoy always consume complete xDS resources from EG.

@aoledk
Copy link
Contributor

aoledk commented Jun 6, 2024

@arkodg As a workaround, could we make this EG readiness initialDelaySeconds configurable via helm chart. I can help with this.

@arkodg
Copy link
Contributor

arkodg commented Jun 6, 2024

@aoledk if EG's cache is not ready yet but the xds server is ready, we send an empty response

// If no snapshot has been generated yet, we can't do anything, so don't mess with this request.

will this cause the proxy listeners to drain ?

@aoledk
Copy link
Contributor

aoledk commented Jun 7, 2024

@arkodg if cache is not ready at all (non-exist), EG will return nil and not set snapshot for envoy, only set snapshot will trigger sending xDS resources to envoy. So envoy will use its current active xDS resources instead of empty xDS resources, no listener drain.

if cache is partly ready (starting EG is still reconciling objects), EG will set partly snapshot for envoy. Then envoy will receive partly xDS resources to replace its current active complete xDS resources, maybe leading to listener drain 1, specially when there are ClientTrafficPolicies not be reconciled.

// If no snapshot has been generated yet, we can't do anything, so don't mess with this request.
// go-control-plane will respond with an empty response, then send an update when a snapshot is generated.
if s.lastSnapshot[cluster] == nil {
return nil
}
_, err := s.GetSnapshot(nodeID)
if err != nil {
err = s.SetSnapshot(context.TODO(), nodeID, s.lastSnapshot[cluster])
if err != nil {
return err
}
}

Footnotes

  1. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/listeners/listener_filters#filter-chain-only-update

@arkodg
Copy link
Contributor

arkodg commented Jun 8, 2024

ah, so afaik the part ready case shouldnt happen because there is one big reconciler,

func (r *gatewayAPIReconciler) Reconcile(ctx context.Context, _ reconcile.Request) (reconcile.Result, error) {
so if we reconcile once, we should have state of the world of all resources, until then, there are no resources and no xds o/p in the cache. Is there an incorrect assumption made here ?

@aoledk
Copy link
Contributor

aoledk commented Jun 11, 2024

@arkodg It's my mistake, and you're right, partly ready xDS resources will never be generated under this big reconciler.

@arkodg
Copy link
Contributor

arkodg commented Jun 11, 2024

thanks for cross checking and brainstorming the edge cases @aoledk !

Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Jul 11, 2024
@arkodg
Copy link
Contributor

arkodg commented Aug 16, 2024

can we close this @alexwo

@github-actions github-actions bot removed the stale label Aug 16, 2024
@alexwo alexwo closed this as not planned Won't fix, can't repro, duplicate, stale Aug 16, 2024
@guydc
Copy link
Contributor

guydc commented Jan 10, 2025

To consider: move the current healthcheck listener and filter chain from static config (bootstrap) to dynamic config generated by EG. This would mean that readdiness checks would only pass once the proxy was programmed at least once.
cc @arkodg, @liorokman @alexwo

@guydc guydc reopened this Jan 10, 2025
@arkodg
Copy link
Contributor

arkodg commented Jan 10, 2025

great idea @guydc , big +1

@zirain
Copy link
Member

zirain commented Jan 11, 2025

+1, this will also reduce the size of bootstrap configuration.

@arkodg arkodg modified the milestones: Backlog, v1.3.0-rc.1 Jan 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature new feature
Projects
None yet
Development

No branches or pull requests

5 participants