Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ddtrace/tracer: Don't drop trace if some spans must be kept #963

Merged
merged 12 commits into from
Jul 22, 2021

Conversation

piochelepiotr
Copy link
Collaborator

The current behavior is:
Drop the trace except if all spans should be kept.

This doesn't work if:

  1. a downstream span has no error, is closed (so should be dropped) but then the rest of the trace ends up having an error so should be kept.
  2. a span should be kept for app analytics (event sampling), but the trace is a p0 so the whole trace is dropped.

The PR inverses the behavior:
Keep the trace except if all spans should be dropped.

ddtrace/tracer/spancontext.go Outdated Show resolved Hide resolved
ddtrace/tracer/spancontext.go Outdated Show resolved Hide resolved
@piochelepiotr piochelepiotr force-pushed the piotr-wolski/fix-app-analytics-sampling-when-dropping-p0 branch from 0ccd340 to 7336829 Compare July 15, 2021 10:45
@piochelepiotr piochelepiotr force-pushed the piotr-wolski/fix-app-analytics-sampling-when-dropping-p0 branch from 7336829 to 8ccbae4 Compare July 15, 2021 10:46
@piochelepiotr piochelepiotr added this to the 1.32.0 milestone Jul 15, 2021
@piochelepiotr piochelepiotr marked this pull request as ready for review July 15, 2021 10:50
ddtrace/tracer/tracer_test.go Outdated Show resolved Hide resolved
ddtrace/tracer/span.go Outdated Show resolved Hide resolved
ddtrace/tracer/span.go Show resolved Hide resolved
ddtrace/tracer/sampler.go Outdated Show resolved Hide resolved
@piochelepiotr piochelepiotr force-pushed the piotr-wolski/fix-app-analytics-sampling-when-dropping-p0 branch from 699790e to 49851a7 Compare July 16, 2021 13:56
@piochelepiotr piochelepiotr force-pushed the piotr-wolski/fix-app-analytics-sampling-when-dropping-p0 branch from 49851a7 to 834b0c0 Compare July 16, 2021 13:58
ddtrace/tracer/spancontext.go Outdated Show resolved Hide resolved
// we have a tracer that can receive completed traces.
atomic.AddInt64(&tr.spansFinished, int64(len(t.spans)))
sd := samplingDecision(atomic.LoadInt64((*int64)(&t.samplingDecision)))
if sd == decisionNone {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ties decisionNone to p0 which is unfair. What if we just did this instead?

Suggested change
if sd == decisionNone {
if p, ok := t.samplingPriority(); ok && p == ext.PriorityAutoReject {

You should move this condition below, inside if sd != decisionKeep

The reason is: let's not make decisionNone be related to priority sampling. Let's just have it mean "undecided".

Copy link
Contributor

@knusbaum knusbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this pass? I think it probably should.

func TestSamplingDecision(t *testing.T) {
...
	t.Run("ratesampler", func(t *testing.T) {
		tracer, _, _, stop := startTestTracer(t)
		defer stop()
		tracer.config.serviceName = "test_service"
		tracer.config.sampler = NewRateSampler(0)
		span := tracer.StartSpan("name_1").(*span)
		child := tracer.StartSpan("name_2", ChildOf(span.context))
		child.Finish()
		span.Finish()
		assert.Equal(t, float64(ext.PriorityAutoReject), span.Metrics[keySamplingPriority])
		assert.Equal(t, decisionKeep, span.context.trace.samplingDecision)
	})
...

Is there a reason we need 3 states (None, Drop, Keep)? It seems more complicated than necessary. I'd prefer to just have 2 (Drop and Keep), with the default being Drop. Then the Sampler or any span can call keep() and the whole trace will be kept. It should be possible for any trace to transition from Drop -> Keep, but not from Keep -> Drop.

@piochelepiotr
Copy link
Collaborator Author

Is there a reason we need 3 states (None, Drop, Keep)? It seems more complicated than necessary. I'd prefer to just have 2 (Drop and Keep), with the default being Drop. Then the Sampler or any span can call keep() and the whole trace will be kept. It should be possible for any trace to transition from Drop -> Keep, but not from Keep -> Drop.

So the goal is not to change the current logic (except fixing the bug where we drop P0s with a span containing events or errors).

But the current logic is that if the root span is not sampled by the rate sampler, the trace is dropped (no matter what else happens).

Client sampling is for P0 and P1 traces, the only moment you would enable it when the agent / tracer can't cope with the volume.
So now that we can compute stats in the tracer, we will never enable the client sampling.

ddtrace/tracer/spancontext.go Outdated Show resolved Hide resolved
@piochelepiotr
Copy link
Collaborator Author

piochelepiotr commented Jul 19, 2021

@gbbr , so we need 3 states because:
trace is sampled if : root span sampled by rate sampler AND at least one span sampled by the other samplers.
We need one state to store the results of root span sampled by rate sampler, and 1 state to store if at least one span was sampled by other samplers.

In other words, the root span being sampled by the rate sampler doesn't imply that the trace is sent to the agent. Just that the trace is not dropped for sure.

t.mu.RLock()
defer t.mu.RUnlock()
// returns the sampling priority of the trace. Not thread safe.
func (t *trace) samplingPriorityUnsafe() (p int, ok bool) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (t *trace) samplingPriorityUnsafe() (p int, ok bool) {
func (t *trace) samplingPriorityLocked() (p int, ok bool) {

We use this pattern in multiple places in the repository, and generally the suffix is *Locked. Can you please rename it? (it's even used right here as setSamplingPriorityLocked)

@gbbr
Copy link
Contributor

gbbr commented Jul 20, 2021

I agree with @knusbaum that this is unfortunately still confusing with the 3 states.

  • decisionNone - means that the local rate sampler chose to keep this trace. In this state, a trace may still be dropped if the feature flags are on and it's a P0.
  • decisionDrop - the rate sampler chose to drop the trace. It will be dropped no matter what.
  • decisionKeep - keep it.

Both decisionNone and decisionKeep mean to keep the trace. The only difference is that this decision can be overridden if the trace is a P0 only for decisionNone. A trace with decisionKeep must be kept regardless of factors, even if it's a P0 (in cases where it has a span with an error for example).

Even writing this, I have a hard time wrapping my head around the logic. Even though this may be frustrating for such a small change, I think it's worthwhile we explore the easiest to understand solution. Please share ideas. Here is one (going a bit backwards to 2 booleans):

1. Use two boolean fields (again)

  • drop - reports whether the trace was dropped by the local rate sampler
  • important - reports whether this trace is important to the user (e.g. it contains an error or an event)

Then:

  1. If a trace is drop == true it gets dropped no matter what (important or not).
  2. If a trace is drop == false it may get dropped if the feature flag is enabled and it's a P0
  3. If a trace is drop == false and important == true it will not get dropped regardless of feature flag and priority.

2. Use a state

We modify the current type to be more descriptive. We add a field called state of the type traceState. Then we can have 3 states:

  • stateLocallySampled - sampled by the rate sampler (default state for 99% of use-cases). Includes P0 and P1.
  • stateLocallyDropped - dropped by the rate sampler.
  • stateImportant - this is a P0 trace that contains an event or an error

Note that you may think that this is the same as today with different names but notice that stateImportant is not valid for P1's by definition (whereas today stateKeep is).

3. Use bitmask for all booleans

This is an improvement over the first proposal. It implies merging all booleans in the trace structure that we currently have: locked, full, drop and important into one bitmask. It could be called flags as the field name.

type traceFlag uint8

const (
    // flagDropped reports whether the trace was dropped by the local sampler.
    flagDropped traceFlag = 1 << iota
    // flagImportant reports whether the trace is important (e.g. has an error or an event).
    flagImportant
    // flagPriorityLocked reports whether the sampling priority can no longer be changed.
    flagPriorityLocked
    // flagFullCapacity reports whether this trace is at full capacity and can buffer no more spans.
    flagFullCapacity
)

We can then use bit operations or add helper methods.

4. We keep what we have today.

We avoid the above and keep the current solution, since none of them aren't necessarily less confusing in practice.

What do you guys think?

P.S. I do realise that this is a small change in theory, and that it's a lot of philosophy for a bug fix, but I think we should carefully consider future maintenance of this code because if this is hard to understand now while it is fresh in our minds, it will be much harder later on once it fades.

Copy link
Contributor

@knusbaum knusbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
I'm OK with this, since it seems to be necessary to manage the sampling decision state.

@gbbr Thanks for the alternatives. I think they're all about the same complexity level as this. Unless you think we should pick one over this one, I think we can leave it as-is.

@knusbaum knusbaum merged commit 9a5c7ca into v1 Jul 22, 2021
@knusbaum knusbaum deleted the piotr-wolski/fix-app-analytics-sampling-when-dropping-p0 branch July 22, 2021 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants