fix: recover from panics that occur during envoy gateway's reconciliation #4643
Conversation
Signed-off-by: Lior Okman <[email protected]>
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #4643      +/-   ##
==========================================
- Coverage   65.54%   65.54%   -0.01%
==========================================
  Files         211      211
  Lines       31945    31962      +17
==========================================
+ Hits        20939    20950      +11
- Misses       9761     9768       +7
+ Partials     1245     1244       -1

☔ View full report in Codecov by Sentry.
/retest
Can we also emit some sort of metric, so that users can easily set up alerts on these issues?
Signed-off-by: Lior Okman <[email protected]>
Done.
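For anyone looking at this from the alerting side, below is a minimal, self-contained sketch of the same idea using plain client_golang; the project itself uses its internal metrics helpers (see the diff further down), so the registration, port, and handler shown here are illustrative only.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// panicsRecovered mirrors the idea behind the panics_recovered_total counter
// added by this PR, expressed with client_golang instead of the project's
// internal metrics helpers.
var panicsRecovered = promauto.NewCounter(prometheus.CounterOpts{
	Name: "panics_recovered_total",
	Help: "Total number of panics recovered during reconciliation.",
})

// guarded runs work and converts a panic into a logged event plus a counter
// increment, so an alert can be defined on the metric.
func guarded(work func()) {
	defer func() {
		if r := recover(); r != nil {
			panicsRecovered.Inc()
			log.Printf("recovered from panic: %+v", r)
		}
	}()
	work()
}

func main() {
	guarded(func() { panic("simulated reconcile failure") })

	// Expose the counter so Prometheus can scrape it.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```

With something like this exposed, an alert on `increase(panics_recovered_total[5m]) > 0` notifies operators as soon as a panic is recovered.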
HandleSubscription handler function, Signed-off-by: Lior Okman <[email protected]>
/retest
hey @liorokman, thanks for adding this. Would you be able to manually simulate a scenario where this code kicks in (when there is an exception in the handler passed to HandleSubscription)?
Manually simulating means writing code that panics on purpose. This isn't something that I would want to commit.
As long as Envoy Gateway doesn't crash, in theory there should be no problem scaling out Envoy Proxy.
As the unit test shows, if code that is called inside the handler provided to HandleSubscription panics, the panic is recovered and the subscription keeps processing subsequent updates (the sketch after this reply illustrates the behaviour). Whether the runners stabilise would depend on the nature of the panic: a bad configuration would probably mean that the panic occurs again in that runner for as long as that runner's queue contains something that triggers it, while an error that is not tied to a particular configuration item would not necessarily recur.
Probably? At some point the queue would trigger a reconcile cycle that doesn't panic. But I guess it would depend on the specific bug - if the panic causes some other unexpected effect before it is recovered, then all bets are off.
If all the copies of Envoy Gateway that are running have the same bug causing a panic, then all of them should panic in the same place given the same configuration. It shouldn't matter how many copies are running if the bug is deterministic.
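A minimal sketch of the behaviour described in this reply; the loop, runner name, and update values are invented for illustration, and this is not the real HandleSubscription signature:

```go
package main

import (
	"fmt"
	"log"
)

// processUpdate wraps a single handler invocation so that a panic in the
// handler is logged instead of terminating the subscription loop. This is a
// simplified stand-in for the recover added to HandleSubscription.
func processUpdate(runner string, handle func()) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("observed a panic in runner %q: %+v", runner, r)
		}
	}()
	handle()
}

func main() {
	// Update 2 stands in for a "bad configuration" item in the runner's queue.
	for _, update := range []int{1, 2, 3, 4} {
		u := update
		processUpdate("xds-translator", func() {
			if u == 2 {
				panic(fmt.Sprintf("cannot translate update %d", u))
			}
			log.Printf("processed update %d", u)
		})
	}
}
```

Running it prints one recovered-panic log line for the simulated bad item and normal output for the rest, which matches the observation that a runner stabilises once its queue no longer contains the trigger.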
thanks @liorokman, it would be great if we could simulate this in a fork to make sure it's happening as described.
I created a local branch where I could trigger panics on demand in both the gatewayAPI and xds-translator runners. I verified that causing a panic in either of these doesn't crash the XDS server, by scaling the Envoy Proxy deployment while these runners were panicking and making sure that all instances of Envoy Proxy were configured correctly. I also verified that if the configuration is changed so that the panic no longer occurs, everything resumes working as expected.

However: if there are panics in either of these runners, and during that time the Envoy Gateway deployment is scaled, then the new Envoy Gateway instances are unable to generate a locally cached working XDS configuration. If, during this time, the Envoy Proxy deployment is also scaled, and the new Envoy Proxy instances happen to reach one of the new Envoy Gateway instances for their XDS configuration, then those proxies will not have a valid XDS configuration and traffic routed through them will not work.

We can work around this by making sure that the Envoy Gateway pods are not considered "healthy" until after the XDS translation has occurred at least once; a sketch of that idea follows below. @alexwo suggested a PR (#2918) to this effect at one point, but it wasn't merged.
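For reference, a hedged sketch of that workaround, gating readiness on the first completed XDS translation; the flag, endpoint, and function names are hypothetical and are not taken from #2918:

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// xdsTranslated is flipped once the first XDS translation has completed, so a
// readiness probe keeps new Envoy Gateway pods out of rotation until they can
// serve a valid configuration. All names here are illustrative only.
var xdsTranslated atomic.Bool

// markXDSTranslated would be called by the xds-translator runner after its
// first successful run.
func markXDSTranslated() { xdsTranslated.Store(true) }

func readyz(w http.ResponseWriter, _ *http.Request) {
	if !xdsTranslated.Load() {
		http.Error(w, "xds translation has not completed yet", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/readyz", readyz)
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```

Pointing the Deployment's readinessProbe at /readyz would then keep new Envoy Gateway pods out of rotation until they can hand out a valid XDS configuration.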
thanks for testing this!
internal/message/metrics.go
Outdated
@@ -13,6 +13,11 @@ var (
 		"Current depth of watchable queue.",
 	)
 
+	panicCounter = metrics.NewCounter(
+		"panics_recovered_total",
should this have the watchable prefix?

cc @shawnh2
internal/message/watchutil.go
Outdated
 ) {
+	defer func() {
+		if r := recover(); r != nil {
+			logger.WithValues("runner", meta.Runner).Error(fmt.Errorf("%+v", r), "observed an panic",
-			logger.WithValues("runner", meta.Runner).Error(fmt.Errorf("%+v", r), "observed an panic",
+			logger.WithValues("runner", meta.Runner).Error(fmt.Errorf("%+v", r), "observed a panic",
Signed-off-by: Lior Okman <[email protected]>
/retest
LGTM, thanks!
What this PR does / why we need it:
This PR catches panics that occur during calls into the watchable infrastructure. All the runners are covered, so panics that occur in any of them are logged and reported instead of crashing Envoy Gateway.
Note that the Kubernetes client already recovers from panics raised during calls to the provider-specific reconcile; this is its default behavior, and that default was not changed.
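For completeness, the default being referred to is controller-runtime's per-controller panic recovery. A minimal sketch of where that knob lives; the reconciler and the watched type are placeholders rather than Envoy Gateway's actual provider wiring, and RecoverPanic is spelled out explicitly here even though it is the default being relied on:

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

type reconciler struct{}

// Reconcile is where provider-specific reconciliation happens; a panic raised
// here is recovered by controller-runtime and surfaced as an error when
// RecoverPanic is enabled.
func (r *reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func setup(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}).
		WithOptions(controller.Options{RecoverPanic: ptr.To(true)}).
		Complete(&reconciler{})
}
```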
Which issue(s) this PR fixes:
Fixes #4332
Release Notes: No