Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Envoy crash logging #7300

Closed
alyssawilk opened this issue Jun 17, 2019 · 6 comments · Fixed by #7782
Closed

Improve Envoy crash logging #7300

alyssawilk opened this issue Jun 17, 2019 · 6 comments · Fixed by #7782
Assignees
Labels
enhancement Feature requests. Not bugs or questions. no stalebot Disables stalebot from closing an issue
Milestone

Comments

@alyssawilk
Copy link
Contributor

While core dumps are often better for in-depth debugging we've found that a high percentage of bugs can be debugged a bit more quickly with a combination of stack trace and information about the stream which caused the crash.

To start with I'd love to replicate what we have for L7 debugging in-house, where we track and dump active session. Essentially each active stream in each worker thread registers itself with a scoped thread-local object on all dispatcher entry points, and the crash signal handler logs a bunch of information about active stream on segfault. This is super useful for debug but for consistent state dumping, every alarm and IO entry point has to create a scoped tracker for traces (worse case you just miss out on debug info)

It'd be a bunch of code churn but I don't think it's terrible to have an Printable interface with dumpState() function which various stream / hcm / connection interfaces can implement, and have L7 alarms and IO entry points (interested folks could implement for L4 as well, if inclined, we'd likely not need that for some time) which latch a thread-local-storage Printable interface for the stream. the dumpState functions can also be quite helpful in debug/error logging, especially for ASSERTs/RELEASE_ASSERT as they generally have enough information about state to help assess what went wrong.

Checking in with @envoyproxy/maintainers before I go off code spelunking, both for if we're up for the extra APIage and plumbing, and if you all have lower hanging fruit which might make sense to tackle first.

@alyssawilk alyssawilk added the enhancement Feature requests. Not bugs or questions. label Jun 17, 2019
@alyssawilk alyssawilk self-assigned this Jun 17, 2019
@mattklein123
Copy link
Member

@alyssawilk at a high level this sounds interesting and useful to me. My only concern would potentially be around perf but I think we can make it light weight enough or potentially make it opt-in with a null implementation or something like that.

Also, I think we should sync up with @htuch before you start doing significant coding. I have been discussing with him some ways we might due per-tenant accounting in the future, and there is some overlap I think on the plumbing required. It would be good to make sure we cover both cases so we don't have to do it twice...

@alyssawilk
Copy link
Contributor Author

I think the base implementation is setting and clearing a pointer on alarm entry, so should be fine from a perf perspective. I agree once we have scoped "entering a session" objects we can also use them for attribution and that'd be more perf heavy. I'll start on sessing logging utils and printable APIs which shouldn't overlap, and try to save the scoped entry points for when htuch@ is back online.

@htuch
Copy link
Member

htuch commented Jun 24, 2019

Chatting with @yanavlasov today, he had the idea that GDB scripts to crawl the core would provide both the information of interest ^^ and also provide a form of documentation on "how Envoy works", i.e. the key relationship between data structures. I think what @alyssawilk is working on is generally interesting, but just thought I'd point out the potential confluence between these.

@mattklein123
Copy link
Member

Envoy specific GDB scripts would be amazing!

@alyssawilk
Copy link
Contributor Author

+1 - if we could "auto" generate debug info from crashes that'd be awesome. Bonus points for flags to redact obvious PII.

alyssawilk added a commit that referenced this issue Jul 23, 2019
Tracking the active stream on the encode path, for crash logging.

Risk Level: Medium (touching the router)
Testing: new unit tests
Docs Changes: n/a
Release Notes: n/a
#7300

Signed-off-by: Alyssa Wilk <[email protected]>
@stale
Copy link

stale bot commented Jul 24, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 24, 2019
@alyssawilk alyssawilk added the no stalebot Disables stalebot from closing an issue label Jul 24, 2019
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Jul 24, 2019
@mattklein123 mattklein123 added this to the 1.12.0 milestone Aug 6, 2019
dio pushed a commit that referenced this issue Aug 7, 2020
…nal handlers (#12062)

Add hooks for calling fatal error handlers from non-Envoy signal handlers. Move register/removeFatalErrorHandler from SignalAction into a new FatalErrorHandler namespace, and add a new function, callFatalErrorHandlers, which runs the registered error handlers. This makes the crash logging from issue #7300 available for builds that don't use ENVOY_HANDLE_SIGNALS, as long as they do use ENVOY_OBJECT_TRACE_ON_DUMP.

Risk Level: Low
Testing: bazel test //test/...
Docs Changes: N/A
Release Notes: Added
Fixes #11984 

Signed-off-by: Michael Behr <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests. Not bugs or questions. no stalebot Disables stalebot from closing an issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants