[dbnode] Avoid crashing on stack overflow #3401

Merged
merged 6 commits into from
Apr 22, 2021

Conversation

asafm
Contributor

@asafm asafm commented Apr 6, 2021

What this PR does / why we need it:
Fixes #3312

When tracing is enabled, the server can crash with a stack overflow under conditions that are not yet understood. The overflow is caused by a very deep parent-child chain of Context objects, i.e. a very deep trace transaction.

This PR prevents the crash and helps detect when such a case occurs, exposing the call stack so the real bug can be pinpointed. It does so by:

  1. Adding a depth property to Context that records how deep the context object is.
  2. Capping trace "chains" at a depth of 100 (which seems like a reasonably high limit) when creating a parent-child relationship, which only happens in StartSampledTraceSpan. The 100th trace span is marked with a special error that we can search for in our tracing back end; for any subsequent span (deeper than 100), no child context is created and a no-op span is returned.
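
As a rough illustration of the first point, the sketch below tracks depth on a simplified parent-child chain and refuses to grow past a cap. The names `node`, `newChild`, `chainDepth`, and `maxDepth` are hypothetical stand-ins, not the PR's actual identifiers; only the limit of 100 comes from the description above.

```go
package main

import "fmt"

// maxDepth is a hypothetical constant mirroring the limit of 100
// mentioned in the PR description.
const maxDepth = 100

// node is a simplified stand-in for the real Context type (assumption).
type node struct {
	parent *node
	depth  int // distance from the root; the root has depth 0
}

// newChild returns a child that sits one level deeper than n.
func (n *node) newChild() *node {
	return &node{parent: n, depth: n.depth + 1}
}

// chainDepth tries to nest `requested` children below a fresh root,
// stops growing the chain once maxDepth is reached, and returns the
// depth of the deepest node that was created.
func chainDepth(requested int) int {
	cur := &node{}
	for i := 0; i < requested; i++ {
		if cur.depth >= maxDepth {
			break // cap reached: no deeper child is created
		}
		cur = cur.newChild()
	}
	return cur.depth
}

func main() {
	fmt.Println(chainDepth(5))    // shallow chains are unaffected
	fmt.Println(chainDepth(1000)) // deep chains stop at maxDepth
}
```

Storing the depth on each node makes the check constant-time, so the cap can be enforced without walking the parent chain.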

The error marked on the 100th span is easily searchable and in effect records the call stack that led to this unusually long chain, making the problem easier to detect and pinpoint.

The motivation and further details are described in issue #3312.

Does this PR require updating code package or user-facing documentation?:
I'm not sure yet whether documentation is required for this change.

@asafm asafm changed the title [dbnode] Avoid crashing on stackoverflow [dbnode] Avoid crashing on stack overflow Apr 6, 2021
@codecov

codecov bot commented Apr 6, 2021

Codecov Report

Merging #3401 (a4cee97) into master (a4cee97) will not change coverage.
The diff coverage is n/a.

❗ Current head a4cee97 differs from pull request most recent head ec5cf5a. Consider uploading reports for the commit ec5cf5a to get more accurate results

@@           Coverage Diff           @@
##           master    #3401   +/-   ##
=======================================
  Coverage    72.4%    72.4%           
=======================================
  Files        1100     1100           
  Lines      102700   102700           
=======================================
  Hits        74375    74375           
  Misses      23214    23214           
  Partials     5111     5111           
Flag Coverage Δ
aggregator 76.9% <0.0%> (ø)
cluster 84.9% <0.0%> (ø)
collector 84.3% <0.0%> (ø)
dbnode 79.0% <0.0%> (ø)
m3em 74.4% <0.0%> (ø)
m3ninx 73.5% <0.0%> (ø)
metrics 19.7% <0.0%> (ø)
msg 74.3% <0.0%> (ø)
query 66.9% <0.0%> (ø)
x 80.3% <0.0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a4cee97...ec5cf5a. Read the comment docs.

@asafm
Contributor Author

asafm commented Apr 19, 2021

@robskillington @wesleyk any chance one of you can review this small PR?

Collaborator

@wesleyk wesleyk left a comment

LGTM, thanks for the contribution!

@@ -340,6 +359,9 @@ func (c *ctx) StartSampledTraceSpan(name string) (Context, opentracing.Span, boo

child := c.newChildContext()
child.SetGoContext(childGoCtx)
if child.DistanceFromRootContext() == maxDistanceFromRootContext {
Collaborator

why do we only want an equality check here and not >=?

Contributor Author

The equality check signals that only a single span, the 100th one, is marked with an error. From the 100th span onward we no longer permit creating child spans and return a no-op instead; that check happens at the beginning of the method.
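
A minimal sketch of the flow described in this reply, assuming a guard at the top of the method and an equality check after the child is created. The types `ctx` and `span`, the functions `startSpan` and `simulate`, and the choice to return the same context on the no-op path are illustrative assumptions, not the PR's actual code.

```go
package main

import "fmt"

// maxDistanceFromRoot is assumed to be 100, per "the 100th one" above.
const maxDistanceFromRoot = 100

// ctx and span are simplified stand-ins for the real types (assumption).
type ctx struct {
	distance int // distance from the root context
}

type span struct {
	noop   bool // true when no real span was started
	errTag bool // true when the span carries the special error
}

// startSpan mimics the described shape: guard first, then create the
// child, then tag the error only when the limit is hit exactly.
func (c *ctx) startSpan() (*ctx, span) {
	// Guard at the beginning: at the limit, no child is created and a
	// no-op span is returned (returning c itself is an assumption).
	if c.distance >= maxDistanceFromRoot {
		return c, span{noop: true}
	}
	child := &ctx{distance: c.distance + 1}
	s := span{}
	// Equality check: only the span that reaches the limit is tagged.
	if child.distance == maxDistanceFromRoot {
		s.errTag = true
	}
	return child, s
}

// simulate nests n spans under a fresh root and reports how many were
// error-tagged, how many were no-ops, and the final distance reached.
func simulate(n int) (marked, noops, distance int) {
	cur := &ctx{}
	for i := 0; i < n; i++ {
		var s span
		cur, s = cur.startSpan()
		if s.errTag {
			marked++
		}
		if s.noop {
			noops++
		}
	}
	return marked, noops, cur.distance
}

func main() {
	marked, noops, distance := simulate(250)
	fmt.Println(marked, noops, distance)
}
```

In this sketch, nesting 250 spans tags exactly one span with the error and turns the remaining 150 requests into no-ops, which is why a single searchable error is enough to locate the chain.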

@asafm
Contributor Author

asafm commented Apr 22, 2021

Thanks @wesleyk for the review! What is the next step?

@wesleyk
Collaborator

wesleyk commented Apr 22, 2021

@asafm can you update the branch to latest master? We can get it merged afterwards

@asafm
Contributor Author

asafm commented Apr 22, 2021

@wesleyk done

@wesleyk wesleyk merged commit 66e6140 into m3db:master Apr 22, 2021
Development

Successfully merging this pull request may close these issues.

Nodes crashing on stack overflow due to context finalizer endless recursive call
2 participants