
fix: recover from panics that occur during envoy gateway's reconciliation #4643

Merged
5 commits merged into envoyproxy:main on Nov 14, 2024

Conversation

liorokman (Contributor)

What this PR does / why we need it:
This PR catches panics that occur during calls into the watchable infrastructure. The change covers all of the runners: a panic raised inside any of them is now logged and reported instead of crashing Envoy Gateway.

Note that the Kubernetes client already recovers from panics raised during calls to the provider-specific reconcile; that is the default behavior, and this PR does not change it.
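
For illustration, here is a minimal, self-contained sketch of the recovery pattern (generic names only; this is not the actual watchable or runner API):

```go
package main

import "log"

// handleWithRecover invokes a per-update handler and turns a panic into a
// logged error instead of letting it take down the whole process.
func handleWithRecover(runner, update string, handler func(string)) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("runner %s: recovered from panic while handling %q: %+v", runner, update, r)
		}
	}()
	handler(update)
}

func main() {
	handleWithRecover("xds-translator", "bad-config", func(string) {
		panic("unexpected nil field") // simulated bug in the handler
	})
	log.Println("still running") // the process survives the panic
}
```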

Which issue(s) this PR fixes:
Fixes #4332

Release Notes: No

@liorokman requested a review from a team as a code owner on November 6, 2024 11:03

codecov bot commented Nov 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 65.54%. Comparing base (e68d573) to head (db04d3a).
Report is 33 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4643      +/-   ##
==========================================
- Coverage   65.54%   65.54%   -0.01%     
==========================================
  Files         211      211              
  Lines       31945    31962      +17     
==========================================
+ Hits        20939    20950      +11     
- Misses       9761     9768       +7     
+ Partials     1245     1244       -1     


@liorokman (Contributor, Author)

/retest

@guydc (Contributor) commented Nov 7, 2024

Can we also emit some sort of metric, so that users can easily set up alerts on these issues?

@liorokman (Contributor, Author)

Can we also emit some sort of metric, so that users can easily set up alerts on these issues?

Done.
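
For context, a rough sketch of what the counter looks like in use. This uses client_golang directly and a hypothetical "runner" label; the actual change wires a panics_recovered_total counter through Envoy Gateway's internal metrics package (see the diff further down):

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustration only: the real counter lives in Envoy Gateway's metrics
// package; the "runner" label here is a hypothetical addition.
var panicsRecovered = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "panics_recovered_total",
		Help: "Total number of panics recovered during reconciliation.",
	},
	[]string{"runner"},
)

func handle(runner string, fn func()) {
	defer func() {
		if r := recover(); r != nil {
			panicsRecovered.WithLabelValues(runner).Inc()
			log.Printf("runner %s recovered from panic: %+v", runner, r)
		}
	}()
	fn()
}

func main() {
	handle("xds-translator", func() { panic("boom") })
	// Users could then alert on, for example:
	//   increase(panics_recovered_total[5m]) > 0
}
```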

HandleSubscription handler function,

Signed-off-by: Lior Okman <[email protected]>
@liorokman (Contributor, Author)

/retest

@arkodg (Contributor) commented Nov 7, 2024

hey @liorokman thanks for adding this

would you be able to manually simulate a scenario where this code kicks in (a panic in the provider, gateway-api, or xds-translator runner) and confirm that the process no longer crashes, thereby allowing the xds server to continue running, which should let the envoy proxy fleet continue to scale out and fetch xds?
a. does scaling out envoy proxies work (can be done using kubectl or hpa) ?
b. after recovery do the watchable runners stabilize, until an update ?
c. if the offending config is fixed, and more config is applied, does new xds get generated ?
d. does any of the above change with multiple replicas of Envoy Gateway

@liorokman (Contributor, Author)

would you be able to manually simulate a scenario where this code kicks in (a panic in the provider, gateway-api, or xds-translator runner) and confirm that the process no longer crashes, thereby allowing the xds server to continue running, which should let the envoy proxy fleet continue to scale out and fetch xds?

Manually simulating means writing code that panics on purpose. This isn't something that I would want to commit.

a. does scaling out envoy proxies work (can be done using kubectl or hpa) ?

As long as Envoy Gateway doesn't crash, in theory there should be no problem scaling out Envoy Proxy.

b. after recovery do the watchable runners stabilise, until an update ?

As the unit test shows, if code called inside the handler provided to HandleSubscription panics, that panic is contained and recovered. The handler will be called again for the next item in the queue.

Whether a runner stabilises depends on the nature of the panic. A bad configuration would probably cause the panic to recur in that runner for as long as its queue contains an item that triggers it. But a panic in the gateway-api runner wouldn't crash any of the other runners, so the xds-translator would continue working.
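
To make that concrete, a tiny test-style sketch (generic code, not the PR's actual unit test) showing that a panic on one item doesn't stop later items from being handled:

```go
package recovery_test

import "testing"

// process wraps a per-item handler with recover, so one panicking item does
// not prevent later items from being handled.
func process(items []int, handler func(int)) (handled int) {
	for _, it := range items {
		func() {
			defer func() { _ = recover() }()
			handler(it)
			handled++
		}()
	}
	return handled
}

func TestPanicIsContainedPerItem(t *testing.T) {
	got := process([]int{1, 2, 3, 4}, func(i int) {
		if i == 2 {
			panic("bad item") // simulated bug triggered by one queue entry
		}
	})
	// Item 2 panics before handled++ runs, so only items 1, 3 and 4 count.
	if got != 3 {
		t.Fatalf("expected 3 handled items, got %d", got)
	}
}
```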

c. if the offending config is fixed, and more config is applied, does new xds get generated ?

Probably? At some point the queue would trigger a reconcile cycle that doesn't panic.

But I guess it would depend on the specific bug - if the panic causes some other unexpected effect before it is recovered, then all bets are off.

d. does any of the above change with multiple replicas of Envoy Gateway

If all the running copies of Envoy Gateway have the same bug causing a panic, then all of them should panic in the same place given the same configuration. The number of replicas shouldn't matter if the bug is deterministic.

@arkodg (Contributor) commented Nov 8, 2024

thanks @liorokman, it would be great if we could simulate this in a fork to make sure it's actually happening

@liorokman (Contributor, Author)

thanks @liorokman, it would be great if we could simulate this in a fork to make sure it's actually happening

@arkodg

I created a local branch where I could trigger panics on demand in both the gatewayAPI and xds-translator runners.

I verified that causing a panic in either of these doesn't crash the XDS server: I scaled the Envoy Proxy deployment while these runners were panicking and confirmed that all Envoy Proxy instances were configured correctly.

I verified that once the configuration is changed so that the panic no longer occurs, everything resumes working as expected.

However:

If there are panics in either of these runners and the Envoy Gateway deployment is scaled during that time, then the new Envoy Gateway instances are unable to build a locally cached, working XDS configuration. If the Envoy Proxy deployment is also scaled during this window and the new Envoy Proxy instances happen to reach one of the new Envoy Gateway instances for their XDS configuration, those proxies will not have a valid XDS configuration and traffic routed through them will not work.

We can work around this by making sure that the Envoy Gateway pods are not considered "healthy" until the XDS translation has run at least once. @alexwo suggested a PR (#2918) to this effect at one point, but it wasn't merged.
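
A rough sketch of that kind of readiness gating (hypothetical; this is not what Envoy Gateway or #2918 actually implements):

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// translatedOnce would be flipped to true by the translation path after the
// first successful XDS snapshot has been produced and cached.
var translatedOnce atomic.Bool

func markTranslated() { translatedOnce.Store(true) }

func main() {
	// Readiness endpoint: report 503 until at least one translation has
	// succeeded, so a fresh replica with an empty cache is not marked ready.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !translatedOnce.Load() {
			http.Error(w, "xds cache not populated yet", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	_ = http.ListenAndServe(":8081", nil)
}
```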

@arkodg (Contributor) commented Nov 11, 2024


thanks for testing this!
let's discuss the best way to solve the new case that arises with this change (running xds servers that have an empty cache) in a new GH issue

@@ -13,6 +13,11 @@ var (
"Current depth of watchable queue.",
)

panicCounter = metrics.NewCounter(
"panics_recovered_total",
Contributor (review comment):

should this have the watchable prefix ?
cc @shawnh2

) {
defer func() {
if r := recover(); r != nil {
logger.WithValues("runner", meta.Runner).Error(fmt.Errorf("%+v", r), "observed an panic",
Contributor (review comment):

Suggested change
logger.WithValues("runner", meta.Runner).Error(fmt.Errorf("%+v", r), "observed an panic",
logger.WithValues("runner", meta.Runner).Error(fmt.Errorf("%+v", r), "observed a panic",

@liorokman (Contributor, Author)

/retest

@arkodg (Contributor) left a comment

LGTM thanks !

@guydc guydc merged commit 1c29f66 into envoyproxy:main Nov 14, 2024
24 checks passed
@liorokman deleted the panic-recover branch on November 14, 2024 17:33