Improve Envoy crash logging #7300

alyssawilk · 2019-06-17T15:52:39Z

While core dumps are often better for in-depth debugging, we've found that a high percentage of bugs can be debugged a bit more quickly with a combination of a stack trace and information about the stream which caused the crash.

To start with, I'd love to replicate what we have for L7 debugging in-house, where we track and dump active sessions. Essentially, each active stream in each worker thread registers itself with a scoped thread-local object on all dispatcher entry points, and the crash signal handler logs a bunch of information about the active stream on segfault. This is super useful for debugging, but for consistent state dumping, every alarm and IO entry point has to create a scoped tracker for traces (worst case, you just miss out on debug info).

It'd be a bunch of code churn, but I don't think it's terrible to have a Printable interface with a dumpState() function which various stream / hcm / connection interfaces can implement, and to have L7 alarms and IO entry points latch a thread-local-storage Printable interface for the stream (interested folks could implement this for L4 as well if inclined; we'd likely not need that for some time). The dumpState functions can also be quite helpful in debug/error logging, especially for ASSERTs/RELEASE_ASSERTs, as they generally have enough information about state to help assess what went wrong.
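A minimal sketch of what such an interface could look like, assuming a `dumpState` that writes to a `std::ostream` with an indent level. The signature, the `ExampleStream` class, its fields, and the `describe` helper are all hypothetical names for illustration, not Envoy's actual API:

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical interface: anything that can describe its own state.
class Printable {
public:
  virtual ~Printable() = default;
  // Writes a human-readable snapshot of the object's state to `os`.
  virtual void dumpState(std::ostream& os, int indent_level) const = 0;
};

// Illustrative stream type; the fields are made up for this example.
class ExampleStream : public Printable {
public:
  ExampleStream(uint64_t id, std::string path) : id_(id), path_(std::move(path)) {}

  void dumpState(std::ostream& os, int indent_level) const override {
    assert(indent_level >= 0);
    // Two spaces per indent level so nested objects line up in the log.
    for (int i = 0; i < indent_level; ++i) {
      os << "  ";
    }
    os << "ExampleStream " << id_ << ": path=" << path_
       << " headers_complete=" << headers_complete_ << "\n";
  }

  void onHeadersComplete() { headers_complete_ = true; }

private:
  const uint64_t id_;
  const std::string path_;
  bool headers_complete_{false};
};

// Helper for debug/error logging: renders the state as a string.
std::string describe(const Printable& object) {
  std::ostringstream os;
  object.dumpState(os, 0);
  return os.str();
}
```

The same `dumpState` output could be appended to an assertion failure message. A real crash handler would have to be more careful than this sketch, since stream formatting is not guaranteed to be async-signal-safe.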

Checking in with @envoyproxy/maintainers before I go off code spelunking, both for if we're up for the extra APIage and plumbing, and if you all have lower hanging fruit which might make sense to tackle first.

mattklein123 · 2019-06-17T16:44:17Z

@alyssawilk at a high level this sounds interesting and useful to me. My only concern would potentially be around perf, but I think we can make it lightweight enough or potentially make it opt-in with a null implementation or something like that.

Also, I think we should sync up with @htuch before you start doing significant coding. I have been discussing with him some ways we might do per-tenant accounting in the future, and I think there is some overlap in the plumbing required. It would be good to make sure we cover both cases so we don't have to do it twice...

alyssawilk · 2019-06-17T18:27:23Z

I think the base implementation is setting and clearing a pointer on alarm entry, so it should be fine from a perf perspective. I agree that once we have scoped "entering a session" objects we can also use them for attribution, and that'd be more perf heavy. I'll start on session logging utils and printable APIs, which shouldn't overlap, and try to save the scoped entry points for when htuch@ is back online.
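The "setting and clearing a pointer" cost can be illustrated with a small RAII sketch. All names here (`Printable`, `ScopedTracker`, `tracked_object`, `dumpTrackedObject`, `runTimerCallback`) are hypothetical, and restoring the previous pointer rather than clearing to null is an assumption made so that nested scopes behave sensibly:

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical interface for objects that can dump their state.
class Printable {
public:
  virtual ~Printable() = default;
  virtual void dumpState(std::ostream& os) const = 0;
};

// Per-thread slot holding whatever this worker is currently processing.
static thread_local const Printable* tracked_object = nullptr;

// RAII guard: one pointer store on scope entry, one on scope exit.
class ScopedTracker {
public:
  explicit ScopedTracker(const Printable* object) : previous_(tracked_object) {
    assert(object != nullptr);
    tracked_object = object;
  }
  ~ScopedTracker() { tracked_object = previous_; }
  ScopedTracker(const ScopedTracker&) = delete;
  ScopedTracker& operator=(const ScopedTracker&) = delete;

private:
  const Printable* const previous_;
};

// What a crash handler would call: dump the latched object, if any.
void dumpTrackedObject(std::ostream& os) {
  if (tracked_object != nullptr) {
    tracked_object->dumpState(os);
  } else {
    os << "no tracked object\n";
  }
}

// Toy stream that prints only its name.
class NamedStream : public Printable {
public:
  explicit NamedStream(std::string name) : name_(std::move(name)) {}
  void dumpState(std::ostream& os) const override { os << name_ << "\n"; }

private:
  const std::string name_;
};

// Stand-in for a timer or IO callback entry point.
std::string runTimerCallback(const Printable& stream) {
  ScopedTracker tracker(&stream);
  std::ostringstream os;
  dumpTrackedObject(os); // simulates a crash happening inside the callback
  return os.str();
}
```

An entry point that forgets to create the guard costs nothing in correctness; the crash handler simply finds no tracked object and the debug info is missing for that crash.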

htuch · 2019-06-24T19:39:44Z

Chatting with @yanavlasov today, he had the idea that GDB scripts to crawl the core would provide both the information of interest ^^ and a form of documentation on "how Envoy works", i.e. the key relationships between data structures. I think what @alyssawilk is working on is generally interesting, but just thought I'd point out the potential confluence between these.

mattklein123 · 2019-06-24T19:42:33Z

Envoy specific GDB scripts would be amazing!

alyssawilk · 2019-06-24T20:04:36Z

+1 - if we could "auto" generate debug info from crashes that'd be awesome. Bonus points for flags to redact obvious PII.

Tracking the active stream on the encode path, for crash logging.
Risk Level: Medium (touching the router)
Testing: new unit tests
Docs Changes: n/a
Release Notes: n/a
#7300
Signed-off-by: Alyssa Wilk <[email protected]>

stale · 2019-07-24T20:25:16Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

…nal handlers (#12062)
Add hooks for calling fatal error handlers from non-Envoy signal handlers. Move register/removeFatalErrorHandler from SignalAction into a new FatalErrorHandler namespace, and add a new function, callFatalErrorHandlers, which runs the registered error handlers. This makes the crash logging from issue #7300 available for builds that don't use ENVOY_HANDLE_SIGNALS, as long as they do use ENVOY_OBJECT_TRACE_ON_DUMP.
Risk Level: Low
Testing: bazel test //test/...
Docs Changes: N/A
Release Notes: Added
Fixes #11984
Signed-off-by: Michael Behr <[email protected]>
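The register / remove / call pattern described in that commit can be sketched as a toy registry. This is not Envoy's code: the `toy_fatal` namespace, the `Handler` interface, and every signature below are assumptions for illustration, and the sketch ignores thread safety and async-signal-safety:

```cpp
#include <algorithm>
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

namespace toy_fatal {

// Hypothetical callback interface invoked when the process is crashing.
class Handler {
public:
  virtual ~Handler() = default;
  virtual void onFatalError(std::ostream& os) const = 0;
};

// Process-wide list of handlers (toy version: no locking).
std::vector<const Handler*>& handlers() {
  static std::vector<const Handler*> list;
  return list;
}

void registerHandler(const Handler* handler) {
  assert(handler != nullptr);
  handlers().push_back(handler);
}

void removeHandler(const Handler* handler) {
  auto& list = handlers();
  list.erase(std::remove(list.begin(), list.end(), handler), list.end());
}

// Entry point that any signal handler, including one not owned by this
// library, could call to run every registered handler.
void callHandlers(std::ostream& os) {
  for (const Handler* handler : handlers()) {
    handler->onFatalError(os);
  }
}

} // namespace toy_fatal

// Example handler that reports a fixed message.
class WorkerStateHandler : public toy_fatal::Handler {
public:
  void onFatalError(std::ostream& os) const override { os << "worker state\n"; }
};
```

Separating "run the handlers" from "install the signal handler" is what lets an embedding application keep its own signal handling and still get the crash dump.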

alyssawilk added the enhancement Feature requests. Not bugs or questions. label Jun 17, 2019

alyssawilk self-assigned this Jun 17, 2019

alyssawilk mentioned this issue Jun 24, 2019

http: dumping session state on the decode path #7390

Merged

alyssawilk mentioned this issue Jul 16, 2019

http: tracking object scope on the encode path #7603

Merged

stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 24, 2019

alyssawilk added the no stalebot Disables stalebot from closing an issue label Jul 24, 2019

stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Jul 24, 2019

alyssawilk mentioned this issue Jul 31, 2019

http: tracking active session under L7 timers #7782

Merged

mattklein123 added this to the 1.12.0 milestone Aug 6, 2019

mattklein123 closed this as completed in #7782 Aug 23, 2019

eziskind mentioned this issue Dec 5, 2019

Support per-session accounting #9239

Open

antoniovicente mentioned this issue Jun 3, 2020

[http] Further improvements to Envoy crash logging. #11432

Open

This was referenced Jul 9, 2020

Allow ScopeTrackedObject crash logging without Envoy signal handler #11984

Closed

signal: add hooks for calling fatal error handlers from non-envoy signal handlers. #12062

Merged

sio4 mentioned this issue Jun 5, 2024

Allow multiplexed upstream servers to half close the stream before the downstream #34461

Merged
