
fix: recover from panics that occur during envoy gateway's reconciliation #4643

Merged
5 commits merged into envoyproxy:main on Nov 14, 2024

Conversation

liorokman (Contributor)

What this PR does / why we need it:
This PR catches panics that occur during calls into the watchable infrastructure. The change covers all of the runners: a panic raised inside any of them is now logged and reported instead of crashing Envoy Gateway.

Note that the Kubernetes client already recovers from panics raised during calls to the provider-specific reconcile; that is the default behavior, and this PR does not change it.
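
For illustration, here is a minimal, self-contained sketch of the recovery pattern (generic names only; this is not the actual watchable or runner API):

```go
package main

import "log"

// handleWithRecover invokes a per-update handler and turns a panic into a
// logged error instead of letting it take down the whole process.
func handleWithRecover(runner, update string, handler func(string)) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("runner %s: recovered from panic while handling %q: %+v", runner, update, r)
		}
	}()
	handler(update)
}

func main() {
	handleWithRecover("xds-translator", "bad-config", func(string) {
		panic("unexpected nil field") // simulated bug in the handler
	})
	log.Println("still running") // the process survives the panic
}
```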

Which issue(s) this PR fixes:
Fixes #4332

Release Notes: No

@liorokman requested a review from a team as a code owner on November 6, 2024 11:03

codecov bot commented Nov 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 65.54%. Comparing base (e68d573) to head (db04d3a).
Report is 33 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4643      +/-   ##
==========================================
- Coverage   65.54%   65.54%   -0.01%     
==========================================
  Files         211      211              
  Lines       31945    31962      +17     
==========================================
+ Hits        20939    20950      +11     
- Misses       9761     9768       +7     
+ Partials     1245     1244       -1     


@liorokman (Contributor, Author)

/retest

@guydc (Contributor) commented Nov 7, 2024

Can we also emit some sort of metric, so that users can easily set up alerts on these issues?

@liorokman (Contributor, Author)

Can we also emit some sort of metric, so that users can easily set up alerts on these issues?

Done.
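
For context, a rough sketch of what the counter looks like in use. This uses client_golang directly and a hypothetical "runner" label; the actual change wires a panics_recovered_total counter through Envoy Gateway's internal metrics package (see the diff further down):

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustration only: the real counter lives in Envoy Gateway's metrics
// package; the "runner" label here is a hypothetical addition.
var panicsRecovered = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "panics_recovered_total",
		Help: "Total number of panics recovered during reconciliation.",
	},
	[]string{"runner"},
)

func handle(runner string, fn func()) {
	defer func() {
		if r := recover(); r != nil {
			panicsRecovered.WithLabelValues(runner).Inc()
			log.Printf("runner %s recovered from panic: %+v", runner, r)
		}
	}()
	fn()
}

func main() {
	handle("xds-translator", func() { panic("boom") })
	// Users could then alert on, for example:
	//   increase(panics_recovered_total[5m]) > 0
}
```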

HandleSubscription handler function,

Signed-off-by: Lior Okman <[email protected]>
@liorokman (Contributor, Author)

/retest

@arkodg (Contributor) commented Nov 7, 2024

hey @liorokman thanks for adding this

would you be able to manually simulate a scenario where this code kicks in (a panic in the provider, gateway-api, or xds-translator runner) and confirm that the process no longer crashes, thereby allowing the xds server to continue running, which should let the envoy proxy fleet continue to scale out and fetch xds?
a. does scaling out envoy proxies work (can be done using kubectl or hpa) ?
b. after recovery do the watchable runners stabilize, until an update ?
c. if the offending config is fixed, and more config is applied, does new xds get generated ?
d. does any of the above change with multiple replicas of Envoy Gateway

@liorokman (Contributor, Author)

would you be able to manually simulate a scenario where this code kicks in (a panic in the provider, gateway-api, or xds-translator runner) and confirm that the process no longer crashes, thereby allowing the xds server to continue running, which should let the envoy proxy fleet continue to scale out and fetch xds?

Manually simulating means writing code that panics on purpose. This isn't something that I would want to commit.

a. does scaling out envoy proxies work (can be done using kubectl or hpa) ?

As long as Envoy Gateway doesn't crash, in theory there should be no problem scaling out Envoy Proxy.

b. after recovery do the watchable runners stabilise, until an update ?

As the unit test shows, if code called inside the handler provided to HandleSubscription panics, that panic is contained and recovered. The handler will be called again for the next item in the queue.

Whether a runner stabilises depends on the nature of the panic. A bad configuration would probably cause the panic to recur in that runner for as long as its queue contains an item that triggers it. But a panic in the gateway-api runner wouldn't crash any of the other runners, so the xds-translator would continue working.
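
To make that concrete, a tiny test-style sketch (generic code, not the PR's actual unit test) showing that a panic on one item doesn't stop later items from being handled:

```go
package recovery_test

import "testing"

// process wraps a per-item handler with recover, so one panicking item does
// not prevent later items from being handled.
func process(items []int, handler func(int)) (handled int) {
	for _, it := range items {
		func() {
			defer func() { _ = recover() }()
			handler(it)
			handled++
		}()
	}
	return handled
}

func TestPanicIsContainedPerItem(t *testing.T) {
	got := process([]int{1, 2, 3, 4}, func(i int) {
		if i == 2 {
			panic("bad item") // simulated bug triggered by one queue entry
		}
	})
	// Item 2 panics before handled++ runs, so only items 1, 3 and 4 count.
	if got != 3 {
		t.Fatalf("expected 3 handled items, got %d", got)
	}
}
```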

c. if the offending config is fixed, and more config is applied, does new xds get generated ?

Probably? At some point the queue would trigger a reconcile cycle that doesn't panic.

But I guess it would depend on the specific bug - if the panic causes some other unexpected effect before it is recovered, then all bets are off.

d. does any of the above change with multiple replicas of Envoy Gateway

If all the running copies of Envoy Gateway have the same bug causing a panic, then all of them should panic in the same place given the same configuration. The number of replicas shouldn't matter if the bug is deterministic.

@arkodg (Contributor) commented Nov 8, 2024

thanks @liorokman, it would be great if we could simulate this in a fork to make sure it's actually happening

@liorokman (Contributor, Author)

thanks @liorokman, it would be great if we could simulate this in a fork to make sure it's actually happening

@arkodg

I created a local branch where I could trigger panics on demand in both the gatewayAPI and xds-translator runners.

I verified that causing a panic in either of these doesn't crash the XDS server: I scaled the Envoy Proxy deployment while these runners were panicking and confirmed that all Envoy Proxy instances were configured correctly.

I verified that once the configuration is changed so that the panic no longer occurs, everything resumes working as expected.

However:

If there are panics in either of these runners and the Envoy Gateway deployment is scaled during that time, then the new Envoy Gateway instances are unable to build a locally cached, working XDS configuration. If the Envoy Proxy deployment is also scaled during this window and the new Envoy Proxy instances happen to reach one of the new Envoy Gateway instances for their XDS configuration, those proxies will not have a valid XDS configuration and traffic routed through them will not work.

We can work around this by making sure that the Envoy Gateway pods are not considered "healthy" until the XDS translation has run at least once. @alexwo suggested a PR (#2918) to this effect at one point, but it wasn't merged.
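
A rough sketch of that kind of readiness gating (hypothetical; this is not what Envoy Gateway or #2918 actually implements):

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// translatedOnce would be flipped to true by the translation path after the
// first successful XDS snapshot has been produced and cached.
var translatedOnce atomic.Bool

func markTranslated() { translatedOnce.Store(true) }

func main() {
	// Readiness endpoint: report 503 until at least one translation has
	// succeeded, so a fresh replica with an empty cache is not marked ready.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !translatedOnce.Load() {
			http.Error(w, "xds cache not populated yet", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	_ = http.ListenAndServe(":8081", nil)
}
```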

@arkodg (Contributor) commented Nov 11, 2024


thanks for testing this!
let's discuss the best way to solve the new case that arises with this change (running xds servers that have an empty cache) in a new GH issue

@@ -13,6 +13,11 @@ var (
"Current depth of watchable queue.",
)

panicCounter = metrics.NewCounter(
"panics_recovered_total",
Contributor (review comment):

should this have the watchable prefix ?
cc @shawnh2

) {
defer func() {
if r := recover(); r != nil {
logger.WithValues("runner", meta.Runner).Error(fmt.Errorf("%+v", r), "observed an panic",
Contributor (review comment):

Suggested change
logger.WithValues("runner", meta.Runner).Error(fmt.Errorf("%+v", r), "observed an panic",
logger.WithValues("runner", meta.Runner).Error(fmt.Errorf("%+v", r), "observed a panic",

@liorokman (Contributor, Author)

/retest

@arkodg (Contributor) left a comment

LGTM thanks !

@guydc guydc merged commit 1c29f66 into envoyproxy:main Nov 14, 2024
24 checks passed
@liorokman deleted the panic-recover branch on November 14, 2024 17:33