-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
http: dumping session state on the decode path #7390
Conversation
Signed-off-by: Alyssa Wilk <[email protected]>
Signed-off-by: Alyssa Wilk <[email protected]>
Signed-off-by: Alyssa Wilk <[email protected]>
Signed-off-by: Alyssa Wilk <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good structurally, I have a bunch of specific areas on the leafs we could tackle to make this a bit cleaner. Thanks.
/wait
os << spaces << #member ": "; \ | ||
if ((member) != nullptr) { \ | ||
os << "\n"; \ | ||
(member)->logState(os, indent_level + 1); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a huge fan of this style of printing and indenting TBH. I get that you want to make this efficient and have no allocations, but it's leaking through above into a duck typing of a pure interface (the header map), which seems like a pattern best avoided across if it's going to appear in more places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, to your point I think there is value in having a structured data print type, and I can think of several use cases where it'd be valuable internally. From (possibly outdated?) priors I believe there is also value in a minimal allocation state drop, which is what I planned on implementing first. I think for out-process communication and machine-consumption we'll eventually want to tackle the latter, but I don't think we'd want to use that version in the fatal error handler and I'd like the "core dumps are hard" session state to be human readable where possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fair point. I think it would be useful to document (i.e. provide the citations) on how allocations during fatal handling are a problem. The main one I can imagine is that we're in a soft OOM situation, and allocation would lead to a hard OOM, would be good to have some more intuition on this one.
source/exe/signal_action.cc
Outdated
|
||
ABSL_CONST_INIT static absl::Mutex failure_mutex(absl::kConstInit); | ||
using FailureFunctionList = std::list<const CrashHandlerInterface*>; | ||
ABSL_CONST_INIT std::atomic<FailureFunctionList*> crash_handlers{nullptr}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, all the code here to me seems pretty problematic, but maybe there isn't a better way. We have mutexes, atomics with relaxed memory ordering, manual memory allocations, a singleton, etc. Oh my :)
One way we could be cleaner about this is to have a per-thread local optional signal handler for any signal. On a signal, the global handler will iterate over the list of threads and try and invoke the signal handler if registered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'd started by trying to implement this as a single handler in the server, and doing the plumbing through to snag dispatcher state, and it was messy enough I thought this was cleaner. I think if we get to a point where more than one class wants to register signal handlers it'd might be cleaner to do a single entry point per thread but I think for now it's just moving the mess to a different location.
Signed-off-by: Alyssa Wilk <[email protected]>
Signed-off-by: Alyssa Wilk <[email protected]>
I'm generally good with this modulo the wish to have better clarity on how significant the "no allocation during fatal handling" policy is, since it is the main constraint on the solution space here. Will leave it to @mattklein123 for review/merge. |
Signed-off-by: Alyssa Wilk <[email protected]>
Yeah, checked in with other proxy devs and one of our kernel guys, and new/malloc in signal handlers is pretty bad form. Added more detail to the comments. FWIW cerr is not without risk either - I got a suggestion to switch to ABSL_RAW_LOG which looks like it'd be safer but I'm not sure how to pass that through the Envoy stack in a plausible way. Both folks I checked with independently suggested proto arena as a viable compromise. I think it's a bit riskier than what we have mainly because I'm unfamiliar with how it works under the hood, but I think it's a pretty viable compromise where you get your structured data and I get my (trusting jeremy) minimal-to-no allocations. I suspect (still trying to avoid memory allocations) that I'd have to write my own proto printer, but I don't think that's hard to do with the existing reflection, and it'd keep all the pretty printing ascii logic to a single utility function rather than a widely included set of macros. WDYT? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for checking with folks. I do like the idea of protbuf arenas for future consideration.
@mattklein123 I think this is ready for your pass then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, cool stuff and will definitely be useful. Flushing out some comments.
General question: I wonder if this behavior should be opt-in, opt-out, or have some type of compile control? The reason I ask is that there is some risk of doubt fault and I'm not sure if core dumps will be useful if that happens?
/wait
include/envoy/common/scope_tracker.h
Outdated
|
||
/* | ||
* A class for tracking the scope of work. | ||
* Currently this is only used for best-effort tracking the any L7 stream doing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/the any/any ?
source/common/common/scope_tracker.h
Outdated
|
||
namespace Envoy { | ||
|
||
class ScopeTrackerImpl { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I might call this RAIIScopeTracker
or something like that? Since this isn't really an implementation of anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this name is better future-proofed for when we use the scope object for customer memory/cpu accounting (the htuch@ feature) - WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel strongly about it, other than to say that Impl
usually is an implementation of an interface which this is not, so I would probably try to find a more descriptive name which IMO is some type of context or something along those lines. e.g. ScopeTrackerScopeState
or something. Anyway, not a big deal either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I guess a small class comment on what it is for would be nice either way)
@@ -25,6 +25,10 @@ | |||
|
|||
#include "event2/event.h" | |||
|
|||
#ifdef ENVOY_HANDLE_SIGNALS | |||
#include "exe/signal_action.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reaching into exe here seems wrong from a layering perspective. Would it be better to have some other type of injected interface for this purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for layering, honestly I'd think all of the signal handling code could be moved out of exe since we use it in test, so it's somewhat core to the server handling. maybe source/server/signal/?
I'm also not sure if that addresses your concerns, or if you still think there's value in the dispatcher not reaching directly to signal handling code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm regarding moving. I might even go as far as to move some of it into source/common/signal but up to you.
@@ -416,6 +417,8 @@ ConnectionManagerImpl::ActiveStream::ActiveStream(ConnectionManagerImpl& connect | |||
connection_manager_.stats_.named_.downstream_rq_time_, connection_manager_.timeSource())), | |||
stream_info_(connection_manager_.codec_->protocol(), connection_manager_.timeSource()), | |||
upstream_options_(std::make_shared<Network::Socket::Options>()) { | |||
ScopeTrackerImpl scope(this, connection_manager_.read_callbacks_->connection().dispatcher()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you pick the places where you setup the tracked scope? Can we document in comments somewhere? It seems like we should be potentially tracking the encode side as well? Filter continuations? Etc.? I wonder if there is some more generic way we could do this? Nothing immediately comes to mind but I figured I would throw it out there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's in PR description - I'm going to do encode side (from the router filter) as well as alarms for L7 in follow-ups :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK sounds good.
@@ -20,6 +46,14 @@ void SignalAction::sigHandler(int sig, siginfo_t* info, void* context) { | |||
} | |||
tracer.logTrace(); | |||
|
|||
FailureFunctionList* list = fatal_error_handlers.exchange(nullptr, std::memory_order_relaxed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some comments on why we are dealing with thread safety in this way?
test/exe/signals_test.cc
Outdated
"backtrace.*Segmentation fault"); | ||
SignalAction::removeFatalErrorHandler(handler); | ||
|
||
/* SignalAction actions; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
Signed-off-by: Alyssa Wilk <[email protected]>
Signed-off-by: Alyssa Wilk <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I think I got all of your comments. Added a compile option since signals already had one, but do you think we should have DS style config as well?
Signed-off-by: Alyssa Wilk <[email protected]>
IMO compile time is probably fine and easier, but I don't feel strongly about it if you do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks looks great. 1 comment about build options, docs, etc.
/wait
docs/root/intro/version_history.rst
Outdated
@@ -35,6 +35,7 @@ Version history | |||
* http: added support for :ref:`preserve_external_request_id<envoy_api_field_config.filter.network.http_connection_manager.v2.HttpConnectionManager.preserve_external_request_id>` that represents whether the x-request-id should not be reset on edge entry inside mesh | |||
* http: changed `sendLocalReply` to send percent-encoded `GrpcMessage`. | |||
* http: added :ref:`dynamic forward proxy <arch_overview_http_dynamic_forward_proxy>` support. | |||
* http: tracking the active stream and dumping state in Envoy crash handlers. This can be disabled by building with -DENVOY_SKIP_OBJECT_TRACE_ON_DUMP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a bazel config setting similar to https://github.com/envoyproxy/envoy/blob/master/bazel/BUILD#L98? And then also document here: https://github.com/envoyproxy/envoy/tree/master/bazel#disabling-optional-features? And maybe also define this in the compile time options CI build? Also, does disabling signal traces implicitly disable this? I think so but maybe I'm not reading the code correctly? If that's the case can we document that?
Signed-off-by: Alyssa Wilk <[email protected]>
Signed-off-by: Alyssa Wilk <[email protected]>
Signed-off-by: Alyssa Wilk <[email protected]>
Signed-off-by: Alyssa Wilk <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice!
Adding per-stream pretty printing, and tracking the active stream when decoding data and dumping it on crash
If we like how this looks, I'll do the same for the encode path (track scope when entering the router) and add an optional scope tracker to alarms and plumb existing resumption alarms.
Risk Level: Medium
Testing: new unit tests, manual testing.
Docs Changes: n/a
Release Notes: n/a
Part of #7300