Improve EG Gateway xDS & startup Reliability (custom k8s health prob) #2810

alexwo · 2024-03-06T22:29:14Z

The proposed enhancement modifies the controller's "ready" status so that it accurately reflects the completion and synchronization of the xDS discovery process.

Specifically, the "ready" status can transition to "true" once xDS discovery has fully completed and stored its initial snapshot, or when no reconciliation is required (an empty or new deployment).

This can ensure that Envoy proxies are always in sync with the latest xDS state and that an EG that has started is able to reconcile.

-> If there is nothing to reconcile -> ready = true
-> If there are changes to reconcile, wait for xDS to complete -> ready = true
-> Otherwise -> ready = false
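
The three rules above could be sketched roughly as follows. This is only a minimal illustration using the Go standard library, not EG code: the `readiness` type, the `syncState` values and the `Check` method are all hypothetical names, and the extra `stateStarting` value is an assumption covering the time before EG knows whether there is anything to reconcile.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// syncState is a hypothetical enum describing where startup currently stands.
type syncState int32

const (
	stateStarting           syncState = iota // resources not inspected yet (assumed extra state)
	stateNothingToReconcile                  // empty or new deployment
	stateReconciling                         // changes found, xDS snapshot not stored yet
	stateSnapshotStored                      // initial xDS snapshot has been stored
)

// readiness holds the current state; the zero value means stateStarting.
type readiness struct{ state atomic.Int32 }

// Set records a new startup state.
func (r *readiness) Set(s syncState) { r.state.Store(int32(s)) }

// Check returns nil when the controller may report ready = true.
func (r *readiness) Check() error {
	switch syncState(r.state.Load()) {
	case stateNothingToReconcile, stateSnapshotStored:
		return nil
	case stateReconciling:
		return errors.New("waiting for the initial xDS snapshot")
	default:
		return errors.New("startup not complete")
	}
}

func main() {
	var r readiness
	fmt.Println("starting:", r.Check())
	r.Set(stateReconciling)
	fmt.Println("reconciling:", r.Check())
	r.Set(stateSnapshotStored)
	fmt.Println("snapshot stored:", r.Check())
}
```

A check of this shape could back the controller's readiness endpoint, so the probe only succeeds in the two "ready = true" cases.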

Currently, there may be certain cases where xDS is not completely synchronized at startup, which could cause new Envoy proxies to work with incomplete xDS configuration.

This can provide better guarantees that an operational EG consistently maintains up-to-date xDS, and may also help avoid situations where instances start up but fail during the initial reconcile.

Leader election and multiple instances use case:
This will improve consistency in environments where multiple instances of EG run simultaneously, by ensuring they start only once the xDS server has persisted the latest state snapshot. #1953

arkodg · 2024-03-06T22:40:32Z

makes sense, a workaround for this until this is implemented is to wake up slowly i.e. set initialDelaySeconds to a higher value

alexwo · 2024-03-07T15:11:45Z

please assign to me

github-actions · 2024-04-13T12:01:57Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

aoledk · 2024-05-29T07:36:53Z

makes sense, a workaround for this until this is implemented is to wake up slowly i.e. set initialDelaySeconds to a higher value

@arkodg But EG currently doesn't expose initialDelaySeconds via chart values in

gateway/charts/gateway-helm/templates/envoy-gateway-deployment.yaml

Lines 66 to 71 in 78fe57a

    
    readinessProbe:
      httpGet:
        path: /readyz
        port: 8081
      initialDelaySeconds: 5
      periodSeconds: 10

arkodg · 2024-05-29T18:12:48Z

I was referring to initialDelaySeconds for the Envoy proxy (data plane), so that it gets enough time to receive xDS before it starts receiving traffic (from the external LB) once it signals that it is ready

aoledk · 2024-05-30T09:17:13Z

Since EG is the xDS resource provider and Envoy is the consumer, setting a longer initialDelaySeconds for Envoy can't prevent Envoy from receiving incomplete xDS resources while EG is starting. This will lead to frequent Envoy listener draining.

A workaround here may be to set a longer initialDelaySeconds for EG, to make sure Envoy always consumes complete xDS resources from EG.

aoledk · 2024-06-06T10:31:07Z

@arkodg As a workaround, could we make the EG readiness initialDelaySeconds configurable via the Helm chart? I can help with this.

arkodg · 2024-06-06T22:10:22Z

@aoledk if EG's cache is not ready yet but the xds server is ready, we send an empty response

gateway/internal/xds/cache/snapshotcache.go

Line 202 in 92760c8

    
           // If no snapshot has been generated yet, we can't do anything, so don't mess with this request.

will this cause the proxy listeners to drain ?

aoledk · 2024-06-07T06:03:37Z

@arkodg If the cache is not ready at all (non-existent), EG returns nil and does not set a snapshot for Envoy; only setting a snapshot triggers sending xDS resources to Envoy. So Envoy keeps using its currently active xDS resources instead of empty ones, and no listener drains.

If the cache is partly ready (a starting EG is still reconciling objects), EG sets a partial snapshot for Envoy. Envoy then receives partial xDS resources that replace its currently active, complete xDS resources, which may lead to listener drain ¹, especially when there are ClientTrafficPolicies that have not been reconciled yet.

gateway/internal/xds/cache/snapshotcache.go

Lines 202 to 214 in 33fceb0

    
    // If no snapshot has been generated yet, we can't do anything, so don't mess with this request.
    // go-control-plane will respond with an empty response, then send an update when a snapshot is generated.
    if s.lastSnapshot[cluster] == nil {
    	return nil
    }
    _, err := s.GetSnapshot(nodeID)
    if err != nil {
    	err = s.SetSnapshot(context.TODO(), nodeID, s.lastSnapshot[cluster])
    	if err != nil {
    		return err
    	}
    }

¹ https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/listeners/listener_filters#filter-chain-only-update

arkodg · 2024-06-08T00:11:45Z

ah, so afaik the partly ready case shouldn't happen because there is one big reconciler,

gateway/internal/provider/kubernetes/controller.go

Line 160 in 33fceb0

    
           func (r *gatewayAPIReconciler) Reconcile(ctx context.Context, _ reconcile.Request) (reconcile.Result, error) {

so if we reconcile once, we should have the state of the world for all resources; until then, there are no resources and no xDS output in the cache. Is there an incorrect assumption made here?
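
To illustrate the "one big reconciler" argument, here is a minimal sketch under stated assumptions. It is not the EG reconciler or the go-control-plane cache: the `cache` and `snapshot` types, the `reconcile` and `get` functions, and the "listener/" and "route/" naming are hypothetical. The only point is that the full snapshot is built in one pass and swapped in at the end, so a reader sees either nothing or a complete snapshot.

```go
package main

import (
	"fmt"
	"sync"
)

// snapshot is a stand-in for a complete set of xDS resources.
type snapshot struct {
	listeners []string
	routes    []string
}

// cache holds the last complete snapshot; it is nil until the first reconcile finishes.
type cache struct {
	mu   sync.RWMutex
	last *snapshot
}

// reconcile translates every known resource in a single pass and only then
// publishes the result, so no partially built snapshot is ever visible.
func (c *cache) reconcile(gateways, httpRoutes []string) {
	next := &snapshot{}
	for _, g := range gateways {
		next.listeners = append(next.listeners, "listener/"+g)
	}
	for _, r := range httpRoutes {
		next.routes = append(next.routes, "route/"+r)
	}
	c.mu.Lock()
	c.last = next
	c.mu.Unlock()
}

// get returns nil before the first reconcile and a complete snapshot afterwards.
func (c *cache) get() *snapshot {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.last
}

func main() {
	c := &cache{}
	fmt.Println("before first reconcile, snapshot present:", c.get() != nil)
	c.reconcile([]string{"gw-a"}, []string{"route-1", "route-2"})
	s := c.get()
	fmt.Println("after first reconcile:", len(s.listeners), "listener(s),", len(s.routes), "route(s)")
}
```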

aoledk · 2024-06-11T07:22:20Z

@arkodg It's my mistake, and you're right: partly ready xDS resources will never be generated under this big reconciler.

arkodg · 2024-06-11T18:59:45Z

thanks for cross checking and brainstorming the edge cases @aoledk !

github-actions · 2024-07-11T20:01:56Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

arkodg · 2024-08-16T00:30:50Z

can we close this @alexwo

guydc · 2025-01-10T18:31:49Z

To consider: move the current health check listener and filter chain from static config (bootstrap) to dynamic config generated by EG. This would mean that readiness checks would only pass once the proxy has been programmed at least once.
cc @arkodg, @liorokman @alexwo
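
A toy model of the idea, as a sketch only: it does not use any Envoy or EG API, and the `proxy` type, the `program` and `probe` functions, and the listener names "readiness" and "http-80" are hypothetical. It contrasts a readiness listener that ships in the static bootstrap with one that arrives through dynamic config.

```go
package main

import (
	"errors"
	"fmt"
)

// proxy is a toy data-plane model that only tracks which listeners exist.
type proxy struct{ listeners map[string]bool }

// newProxy creates a proxy whose static (bootstrap) config contains the given listeners.
func newProxy(static ...string) *proxy {
	p := &proxy{listeners: map[string]bool{}}
	for _, name := range static {
		p.listeners[name] = true
	}
	return p
}

// program applies dynamic config pushed by the control plane.
func (p *proxy) program(dynamic ...string) {
	for _, name := range dynamic {
		p.listeners[name] = true
	}
}

// probe models a readiness probe aimed at a named listener.
func (p *proxy) probe(name string) error {
	if !p.listeners[name] {
		return errors.New("listener " + name + " not present")
	}
	return nil
}

func main() {
	// Readiness listener in bootstrap: the probe passes before any programming.
	static := newProxy("readiness")
	fmt.Println("static, before programming:", static.probe("readiness"))

	// Readiness listener delivered dynamically: the probe fails until programmed.
	dynamic := newProxy()
	fmt.Println("dynamic, before programming:", dynamic.probe("readiness"))
	dynamic.program("readiness", "http-80")
	fmt.Println("dynamic, after programming:", dynamic.probe("readiness"))
}
```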

arkodg · 2025-01-10T18:37:26Z

great idea @guydc , big +1

zirain · 2025-01-11T01:41:10Z

+1, this will also reduce the size of bootstrap configuration.

alexwo added the triage label Mar 6, 2024

alexwo changed the title ~~Improve EG Gateway xDS & startup Reliability (via custom k8s health check impl)~~ Improve EG Gateway xDS & startup Reliability (custom k8s health probs) Mar 6, 2024

alexwo changed the title ~~Improve EG Gateway xDS & startup Reliability (custom k8s health probs)~~ Improve EG Gateway xDS & startup Reliability (custom k8s health prob) Mar 6, 2024

arkodg added the help wanted and kind/feature labels and removed the triage label Mar 6, 2024

arkodg added this to the Backlog milestone Mar 6, 2024

liorokman assigned alexwo Mar 7, 2024

arkodg removed the help wanted label Mar 7, 2024

alexwo mentioned this issue Mar 13, 2024

feat(EG K8S Provider): Improve EG Gateway xDS & startup reliability #2918

Closed

github-actions bot added the stale label Apr 13, 2024

alexwo mentioned this issue Apr 30, 2024

fix e2e EnvoyShutdown #3283

Closed

github-actions bot removed the stale label May 29, 2024

guydc mentioned this issue Jun 21, 2024

fix: envoy shutdown flaky test #3646

Merged

github-actions bot added the stale label Jul 11, 2024

github-actions bot removed the stale label Aug 16, 2024

alexwo closed this as not planned Aug 16, 2024

guydc reopened this Jan 10, 2025

arkodg modified the milestones: Backlog, v1.3.0-rc.1 Jan 11, 2025
