ddtrace/tracer: Don't drop trace if some spans must be kept #963
ddtrace/tracer/spancontext.go
Outdated
// we have a tracer that can receive completed traces.
atomic.AddInt64(&tr.spansFinished, int64(len(t.spans)))
sd := samplingDecision(atomic.LoadInt64((*int64)(&t.samplingDecision)))
if sd == decisionNone {
This ties decisionNone to p0, which is unfair. What if we just did this instead?
- if sd == decisionNone {
+ if p, ok := t.samplingPriority(); ok && p == ext.PriorityAutoReject {
You should move this condition below, inside if sd != decisionKeep.
The reason is: let's not make decisionNone be related to priority sampling. Let's just have it mean "undecided".
Should this pass? I think it probably should.
func TestSamplingDecision(t *testing.T) {
	...
	t.Run("ratesampler", func(t *testing.T) {
		tracer, _, _, stop := startTestTracer(t)
		defer stop()
		tracer.config.serviceName = "test_service"
		tracer.config.sampler = NewRateSampler(0)
		span := tracer.StartSpan("name_1").(*span)
		child := tracer.StartSpan("name_2", ChildOf(span.context))
		child.Finish()
		span.Finish()
		assert.Equal(t, float64(ext.PriorityAutoReject), span.Metrics[keySamplingPriority])
		assert.Equal(t, decisionKeep, span.context.trace.samplingDecision)
	})
	...
Is there a reason we need 3 states (None, Drop, Keep)? It seems more complicated than necessary. I'd prefer to just have 2 (Drop and Keep), with the default being Drop. Then the Sampler or any span can call keep() and the whole trace will be kept. It should be possible for any trace to transition from Drop -> Keep, but not from Keep -> Drop.
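A minimal sketch of the two-state idea described above, assuming a stand-in trace type with a single atomically accessed field. The names decision, decisionDrop, keep and kept are hypothetical and are not taken from the PR:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// decision is a hypothetical two-state sampling decision.
type decision uint32

const (
	decisionDrop decision = iota // zero value: a new trace defaults to Drop
	decisionKeep
)

// trace is a stand-in for the tracer's trace type; only the field
// needed for this sketch is included.
type trace struct {
	decision uint32 // accessed atomically
}

// keep marks the whole trace as kept. No method stores decisionDrop,
// so a trace can move from Drop to Keep but never back.
func (t *trace) keep() {
	atomic.StoreUint32(&t.decision, uint32(decisionKeep))
}

// kept reports whether the trace has been marked as kept.
func (t *trace) kept() bool {
	return decision(atomic.LoadUint32(&t.decision)) == decisionKeep
}

func main() {
	tr := &trace{}
	fmt.Println(tr.kept()) // false
	tr.keep()
	fmt.Println(tr.kept()) // true
}
```

Because the only write is an unconditional store of decisionKeep, concurrent calls from the sampler and from individual spans cannot race the trace back to Drop.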
So the goal is not to change the current logic (except for fixing the bug where we drop P0s with a span containing events or errors). But the current logic is that if the root span is not sampled by the rate sampler, the trace is dropped (no matter what else happens). Client sampling is for P0 and P1 traces; the only time you would enable it is when the agent / tracer can't cope with the volume.
@gbbr, so we need 3 states because: In other words, the root span being sampled by the rate sampler doesn't imply that the trace is sent to the agent. Just that the trace is not dropped for sure.
ddtrace/tracer/spancontext.go
Outdated
t.mu.RLock()
defer t.mu.RUnlock()
// returns the sampling priority of the trace. Not thread safe.
func (t *trace) samplingPriorityUnsafe() (p int, ok bool) {
- func (t *trace) samplingPriorityUnsafe() (p int, ok bool) {
+ func (t *trace) samplingPriorityLocked() (p int, ok bool) {
We use this pattern in multiple places in the repository, and generally the suffix is *Locked. Can you please rename it? (It's even used right here as setSamplingPriorityLocked.)
I agree with @knusbaum that this is unfortunately still confusing with the 3 states.
Both
Even writing this, I have a hard time wrapping my head around the logic. Even though this may be frustrating for such a small change, I think it's worthwhile that we explore the easiest-to-understand solution. Please share ideas. Here is one (going a bit backwards to 2 booleans):
1. Use two boolean fields (again)
Then:
2. Use a state
We modify the current type to be more descriptive. We add a field called
Note that you may think that this is the same as today with different names, but notice that
3. Use bitmask for all booleans
This is an improvement over the first proposal. It implies merging all booleans in the
type traceFlag uint8
const (
	// flagDropped reports whether the trace was dropped by the local sampler.
	flagDropped traceFlag = 1 << iota
	// flagImportant reports whether the trace is important (e.g. has an error or an event).
	flagImportant
	// flagPriorityLocked reports whether the sampling priority can no longer be changed.
	flagPriorityLocked
	// flagFullCapacity reports whether this trace is at full capacity and can buffer no more spans.
	flagFullCapacity
)
We can then use bit operations or add helper methods.
4. We keep what we have today
We avoid the above and keep the current solution, since none of them is necessarily less confusing in practice.
What do you guys think?
P.S. I do realise that this is a small change in theory, and that it's a lot of philosophy for a bug fix, but I think we should carefully consider future maintenance of this code, because if this is hard to understand now while it is fresh in our minds, it will be much harder later on once it fades.
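As a rough illustration of the "helper methods" mentioned for the traceFlag proposal, here is one possible shape. The method names has, set and clear are assumptions made for this sketch and do not come from the PR; the sketch is also not safe for concurrent use without a lock or atomics around the field:

```go
package main

import "fmt"

type traceFlag uint8

const (
	flagDropped traceFlag = 1 << iota
	flagImportant
	flagPriorityLocked
	flagFullCapacity
)

// has reports whether every bit in f is set in s (hypothetical helper).
func (s traceFlag) has(f traceFlag) bool { return s&f == f }

// set turns on the bits in f (hypothetical helper).
func (s *traceFlag) set(f traceFlag) { *s |= f }

// clear turns off the bits in f (hypothetical helper).
func (s *traceFlag) clear(f traceFlag) { *s &^= f }

func main() {
	var flags traceFlag
	flags.set(flagDropped | flagImportant)
	fmt.Println(flags.has(flagImportant)) // true
	flags.clear(flagDropped)
	fmt.Println(flags.has(flagDropped)) // false
	fmt.Printf("%04b\n", flags)         // 0010
}
```

With helpers like these, a check such as "dropped by the sampler but still important" becomes flags.has(flagDropped | flagImportant) rather than a combination of separate boolean fields.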
+1
I'm OK with this, since it seems to be necessary to manage the sampling decision state.
@gbbr Thanks for the alternatives. I think they're all about the same complexity level as this. Unless you think we should pick one over this one, I think we can leave it as-is.
The current behavior is:
Drop the trace except if all spans should be kept.
This doesn't work if:
The PR inverts the behavior:
Keep the trace except if all spans should be dropped.
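The difference between the two rules can be sketched as an "all" versus "any" check over per-span verdicts. This is a simplified model, assuming each span contributes a single boolean (true meaning "this span must be kept"); the function names are hypothetical and the real tracer logic also involves the sampler and sampling priority:

```go
package main

import "fmt"

// keepIfAllKept models the old rule: drop the trace unless every
// span should be kept.
func keepIfAllKept(spans []bool) bool {
	for _, keep := range spans {
		if !keep {
			return false
		}
	}
	return true
}

// keepIfAnyKept models the new rule: keep the trace unless every
// span should be dropped.
func keepIfAnyKept(spans []bool) bool {
	for _, keep := range spans {
		if keep {
			return true
		}
	}
	return false
}

func main() {
	// One span out of three must be kept (e.g. it carries an error).
	spans := []bool{false, true, false}
	fmt.Println(keepIfAllKept(spans)) // false
	fmt.Println(keepIfAnyKept(spans)) // true
}
```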