-
Notifications
You must be signed in to change notification settings - Fork 453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dbnode] Avoid crashing on stack overflow #3401
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3401 +/- ##
=======================================
Coverage 72.4% 72.4%
=======================================
Files 1100 1100
Lines 102700 102700
=======================================
Hits 74375 74375
Misses 23214 23214
Partials 5111 5111
Flags with carried forward coverage won't be shown. Click here to find out more. Continue to review full report at Codecov.
|
@robskillington @wesleyk any chance one of you can review this small PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the contribution!
@@ -340,6 +359,9 @@ func (c *ctx) StartSampledTraceSpan(name string) (Context, opentracing.Span, boo | |||
|
|||
child := c.newChildContext() | |||
child.SetGoContext(childGoCtx) | |||
if child.DistanceFromRootContext() == maxDistanceFromRootContext { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we only want an equality check here and not >=
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done to signify we only mark a single span with an error (the 100th one). Starting at 100th and any deeper, we will not permit to create additional child spans and return a no-op: it's happening in the beginning of the method.
Thanks @wesleyk for the review! What is the next step? |
@asafm can you update the branch to latest master? We can get it merged afterwards |
@wesleyk done |
What this PR does / why we need it:
Fixes #3312
When tracing is enabled, in a situation unknown yet, the server crashes on stack overflow, due to very deep parent-child relationship of Context object - i.e. very deep trace transaction.
This PR prevents the crash, and will help detect when such a case occurs and present the call-stack to help pinpoint the real bug by:
depth
property to Context to know how deep the context object is.StartSampledTraceSpan
), prevent creating trace "chains" deeper than 100 (seems like reasonably high depth), by marking a special error on the 100th trace span, which we can search for in our traces back-end, avoid creating a child context and return No Op span on any subsequent span (deeper than 100)The error marked on the 100th span, will be easily searchable and in effect details the call stack leading to creating this odd long chain, making it easier to detect and pin point the problem.
The motivation and more details are described in detail in the issue #3312
Does this PR require updating code package or user-facing documentation?:
I'm not sure yet if documentation is required for this change