-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Envoy crash logging #7300
Comments
@alyssawilk at a high level this sounds interesting and useful to me. My only concern would potentially be around perf but I think we can make it light weight enough or potentially make it opt-in with a null implementation or something like that. Also, I think we should sync up with @htuch before you start doing significant coding. I have been discussing with him some ways we might due per-tenant accounting in the future, and there is some overlap I think on the plumbing required. It would be good to make sure we cover both cases so we don't have to do it twice... |
I think the base implementation is setting and clearing a pointer on alarm entry, so should be fine from a perf perspective. I agree once we have scoped "entering a session" objects we can also use them for attribution and that'd be more perf heavy. I'll start on sessing logging utils and printable APIs which shouldn't overlap, and try to save the scoped entry points for when htuch@ is back online. |
Chatting with @yanavlasov today, he had the idea that GDB scripts to crawl the core would provide both the information of interest ^^ and also provide a form of documentation on "how Envoy works", i.e. the key relationship between data structures. I think what @alyssawilk is working on is generally interesting, but just thought I'd point out the potential confluence between these. |
Envoy specific GDB scripts would be amazing! |
+1 - if we could "auto" generate debug info from crashes that'd be awesome. Bonus points for flags to redact obvious PII. |
Tracking the active stream on the encode path, for crash logging. Risk Level: Medium (touching the router) Testing: new unit tests Docs Changes: n/a Release Notes: n/a #7300 Signed-off-by: Alyssa Wilk <[email protected]>
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions. |
…nal handlers (#12062) Add hooks for calling fatal error handlers from non-Envoy signal handlers. Move register/removeFatalErrorHandler from SignalAction into a new FatalErrorHandler namespace, and add a new function, callFatalErrorHandlers, which runs the registered error handlers. This makes the crash logging from issue #7300 available for builds that don't use ENVOY_HANDLE_SIGNALS, as long as they do use ENVOY_OBJECT_TRACE_ON_DUMP. Risk Level: Low Testing: bazel test //test/... Docs Changes: N/A Release Notes: Added Fixes #11984 Signed-off-by: Michael Behr <[email protected]>
While core dumps are often better for in-depth debugging we've found that a high percentage of bugs can be debugged a bit more quickly with a combination of stack trace and information about the stream which caused the crash.
To start with I'd love to replicate what we have for L7 debugging in-house, where we track and dump active session. Essentially each active stream in each worker thread registers itself with a scoped thread-local object on all dispatcher entry points, and the crash signal handler logs a bunch of information about active stream on segfault. This is super useful for debug but for consistent state dumping, every alarm and IO entry point has to create a scoped tracker for traces (worse case you just miss out on debug info)
It'd be a bunch of code churn but I don't think it's terrible to have an Printable interface with dumpState() function which various stream / hcm / connection interfaces can implement, and have L7 alarms and IO entry points (interested folks could implement for L4 as well, if inclined, we'd likely not need that for some time) which latch a thread-local-storage Printable interface for the stream. the dumpState functions can also be quite helpful in debug/error logging, especially for ASSERTs/RELEASE_ASSERT as they generally have enough information about state to help assess what went wrong.
Checking in with @envoyproxy/maintainers before I go off code spelunking, both for if we're up for the extra APIage and plumbing, and if you all have lower hanging fruit which might make sense to tackle first.
The text was updated successfully, but these errors were encountered: