Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable priority sampling #229

Merged
merged 11 commits into from
Nov 28, 2017
Merged

Enable priority sampling #229

merged 11 commits into from
Nov 28, 2017

Conversation

p-lambert
Copy link
Member

@p-lambert p-lambert commented Oct 23, 2017

Overview

Introduces Priority Sampling to force the sampler to keep a trace (or to discard it). This includes Distributed Tracing propagation for the priority sampling attribute.

@p-lambert p-lambert requested a review from palazzem October 23, 2017 16:40
@p-lambert p-lambert force-pushed the pedro/service-sampler branch from a634c32 to 4656856 Compare October 23, 2017 16:52
@palazzem palazzem added this to the 0.10.0 milestone Oct 24, 2017
@palazzem palazzem added the core Involves Datadog core libraries label Oct 24, 2017
@p-lambert p-lambert changed the title Add RateByServiceSampler Enable priority sampling Nov 3, 2017
@p-lambert p-lambert added the do-not-merge/WIP Not ready for merge label Nov 3, 2017
@ufoot ufoot self-requested a review November 9, 2017 19:45
Copy link
Contributor

@ufoot ufoot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the approach is fine, left a few comments but need to see this in action. Basically, the missing brick is getting the actual rates from the agent before injecting it into the priority sampler. But this could be done in another PR.

def initialize(rate = 1.0, opts = {})
@env = opts.fetch(:env, Datadog.tracer.tags[:env])
@mutex = Mutex.new
@fallback = RateSampler.new(rate)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's perfectly fine to have a fallback and identify it as such, however keep in mind that the agent will report a value for this (typically with a key service:,env:) and this should be used as the library can't really have a clue of this default "fallback" value unless it queries the agent. Eg: https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/sampler.py#L90

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see! OK, I think I'll have to change this. Thanks for spotting that 👍

end

def sample(span)
span.sampling_priority = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally speaking, while writing the Python impl. we acknowledged, at some point, that it was easier and made more sense API-wise to have all the sample() family of methods return true or false and not modify the object in place. As in: https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/sampler.py#L22 It's easier to test and the caller can make whatever decision on what to do, also, span becomes a read-only parameter, which has some advantages. This involves quite a rewriting of code at this step (because the existing sampler should be updated, not only the Priority one, but I think it's worth it). Might want to ask for @palazzem advice here.

Copy link
Member Author

@p-lambert p-lambert Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Humn, I guess I'll have to wait for @palazzem, because I might be drifting too much from the Python implementation then.

I agree that the #sample return value should be a boolean, but I'd rather have a function with side effects here than breaking encapsulation by having different code paths in the caller for each concrete implementation of the Sampler – I think if we have a common API for samplers, we should leverage that and make them transparent to the caller.

@p-lambert p-lambert force-pushed the pedro/service-sampler branch from 9114773 to 05c893a Compare November 14, 2017 15:19
@p-lambert p-lambert removed the do-not-merge/WIP Not ready for merge label Nov 14, 2017
ufoot
ufoot previously approved these changes Nov 14, 2017
Copy link
Contributor

@ufoot ufoot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GTM, as a side note I think we might need some more high-level integration tests here, as at this stage most tests are relying on mocks. But this could be saved for the upcoming PR glueing all the priority sampling parts together.

@@ -118,11 +118,17 @@ def configure(options = {})
hostname = options.fetch(:hostname, nil)
port = options.fetch(:port, nil)
sampler = options.fetch(:sampler, nil)
priority_sampling = options[:priority_sampling]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@enabled = enabled unless enabled.nil?
@writer.transport.hostname = hostname unless hostname.nil?
@writer.transport.port = port unless port.nil?
@sampler = sampler unless sampler.nil?

if priority_sampling
@sampler = PrioritySampler.new(base_sampler: @sampler)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, cool, this way, we can chain them.

true
end

def_delegators :@post_sampler, :update
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so that's the entry point to udate data from JSON agent output.

@p-lambert p-lambert force-pushed the pedro/service-sampler branch 2 times, most recently from 6a60209 to ff8f009 Compare November 15, 2017 20:05
Copy link
Contributor

@palazzem palazzem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add some documentation for that such as: http://pypi.datadoghq.com/trace/docs/#priority-sampling

end

def sample(span)
span.sampling_priority = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the sampling_priority attribute must be in the Context rather than Span. In that way you can set it with something similar to:

span.context.sampling_priority = 2

And at the end when the Context is finished, you just attach that metric to the root span (first element of the trace because it's a list). In Python we do: https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/context.py#L148-L167

@@ -133,12 +133,14 @@ def sampled?
# can be re-used immediately.
#
# This operation is thread-safe.
def get
def get(priority_sampling = false)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I turned this into an argument because it would be tricky to get the value of priority_sampling from the Tracer on every instantiation of Context objects. If you have any other ideas, let me know.

@@ -6,6 +6,7 @@ module DistributedTracing
HTTP_HEADER_TRACE_ID = 'x-datadog-trace-id'.freeze
HTTP_HEADER_PARENT_ID = 'x-datadog-parent-id'.freeze
HTTP_HEADER_SAMPLING_PRIORITY = 'x-datadog-sampling-priority'.freeze
SAMPLING_PRIORITY_KEY = '_sampling_priority_v1'.freeze
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there is a better place for this constant. Let me know your thoughts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is fine; when moving to OpenTracing we'll use directly their constants.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with it being here, also not totally happy, but think it's OK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we can simply use OpenTracing tags later on, when our clients have a proper migration.

headers[HTTP_HEADER_SAMPLING_PRIORITY] = span.sampling_priority.to_s
end
env.merge! headers
env.delete(HTTP_HEADER_SAMPLING_PRIORITY) unless span.sampling_priority
Copy link
Member Author

@p-lambert p-lambert Nov 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ufoot I know it would be good to keep this #delete. Any ideas how we could do this now? Perhaps this would be easier once we have the a globally accessible value for the priority_sampling flag.

@p-lambert p-lambert force-pushed the pedro/service-sampler branch from 63238b8 to 85d7ba0 Compare November 21, 2017 21:46
Copy link
Contributor

@ufoot ufoot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few questions, let's talk about this. You raise interesting issues BTW.

@mutex.synchronize do
return nil, nil unless check_finished_spans

trace = @trace
sampled = @sampled
attach_sampling_priority if sampled && priority_sampling
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why you need to test priority_sampling here, at all. This "is priority sampling enabled" is useful when creating the tracer, so that we know wether start_span should add a priority sampling arg, but then, once it's been set, it's been set, there's, I think, no point to re-test it here again. What is the use case for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

humn, I thought we had to check it here so we can avoid a scenario like this:

In my custom instrumentation, I have a code line like this:

span.context.sampling_priority = 3

But at some point, I want to disable the priority sampling altogether:

Datadog.tracer.configure(priority_sampling: false)

If we don't check for the priority sampling flag before flushing the trace, that line setting a sampling_priority will still affect the sampling upstream, right?

I don't understand what happens outside the client, so I might be wrong. But if that's not the case, I would also prefer to remove this check. We can take a stance as "if you want to disable priority_sampling altogether you shouldn't set it."

BTW, I thought this was the scenario you were trying to prevent in that delete call inside the HTTPPropagator.inject! method.

@@ -6,6 +6,7 @@ module DistributedTracing
HTTP_HEADER_TRACE_ID = 'x-datadog-trace-id'.freeze
HTTP_HEADER_PARENT_ID = 'x-datadog-parent-id'.freeze
HTTP_HEADER_SAMPLING_PRIORITY = 'x-datadog-sampling-priority'.freeze
SAMPLING_PRIORITY_KEY = '_sampling_priority_v1'.freeze
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with it being here, also not totally happy, but think it's OK.

end
env.merge! headers
env.delete(HTTP_HEADER_SAMPLING_PRIORITY) unless span.sampling_priority
def self.inject!(span, context, env)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I would not add he context as an arg to this function. Because, we're going to remove it some day, for sure. The final state is something like: context has it all, trace_id, current span_id, and sampling priority. As in https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/context.py#L25

So in the current state, the Ruby impl. does not have full context support. Initially, I thought it was OK to keep the legacy "span-based" way of doing things. Which, actually, works, only you copy the information many times in all spans when once in the context should be enough. My point is: we need to choose, either keep it all in span, or move it all in context, but something in between is really uncomfortable. AFAIK in different places in Python/Ruby code there's this "it could be a span or a context, let's test at run-time" as in https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/tracer.py#L159 and I think this is really fine. But two args (span+context) will be troublesome to migrate I fear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep the compatibility that simplify our migration to OpenTracing (while not breaking a lot of stuff). In that case, the public API expects a span_context that in our case is our Context (this is an approximation like @ufoot said). Eventually the Span will have its SpanContext and our (Trace)Context will only be an implementation detail. For instance, In Python we're providing the propagator in that way: https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/propagation/http.py

Reference for OpenTracing: https://github.com/opentracing/basictracer-python/blob/master/basictracer/tracer.py#L87

I'd suggest to expect a Context (so span.context that will be easy to migrate later) so:

  • the API is compliant with our Python implementation + OpenTracing
  • no confusion so users always pass a Context
  • since it's a new functionality, no breaking changes in the future

Copy link
Member Author

@p-lambert p-lambert Nov 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ufoot @palazzem OK, makes total sense. I will change things so we receive only the context then.

@p-lambert p-lambert force-pushed the pedro/service-sampler branch from 85d7ba0 to 2eeacfa Compare November 22, 2017 15:45
ufoot
ufoot previously approved these changes Nov 22, 2017
Copy link
Contributor

@ufoot ufoot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the CI issues are fixed, looks good to me.

@@ -22,7 +22,7 @@ class Span
:start_time, :end_time,
:span_id, :trace_id, :parent_id,
:status, :sampled,
:tracer, :context, :sampling_priority
:tracer, :context
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -8,18 +8,20 @@ def test_inject!

tracer.trace('caller') do |span|
env = { 'something' => 'alien' }
Datadog::HTTPPropagator.inject!(span, env)
Datadog::HTTPPropagator.inject!(span.context, env)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, span.context, perfect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes always use the context attached to the Span

ufoot
ufoot previously approved these changes Nov 23, 2017
@palazzem
Copy link
Contributor

@ufoot @p-lambert before merging this one we should provide some documentation:

We should add some documentation for that such as:
http://pypi.datadoghq.com/trace/docs/#priority-sampling

* 1: The sampler automatically decided to keep the trace.
* 2: The user asked the keep the trace.

For now, priority sampling is disabled by default. Enabling it ensures that your sampled distributed traces will be complete. To enable the priorty sampling:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo priorty -> priority

@@ -719,6 +719,32 @@ overhead.
sampler = Datadog::RateSampler.new(0.5) # sample 50% of the traces
Datadog.tracer.configure(sampler: sampler)

#### Priority sampling

Priority sampling consists in deciding if a trace will be kept by using a priority attribute that will be propagated for distributed traces. Its value gives indication to the Agent and to the backend on how important the trace is.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's important to outline that priority sampling is interesting in the case of distributed tracing. Something as Priority sampling should be turned on when using [distributed tracing](#Distributed_Tracing), as it ensures all parts of a given trace are kept together and sent to the backend. maybe ? Just a thought.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ufoot I think that's a good idea. Do you think we should add that paragraph after the current one or replaced it altogether?

@p-lambert p-lambert force-pushed the pedro/service-sampler branch from 8397d4f to fd3a914 Compare November 27, 2017 16:18
@@ -50,6 +50,7 @@ def sampling_priority

def sampling_priority=(priority)
@mutex.synchronize do
@sampled = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not work, for 2 reasons:

  • I think it would put sampled to true even if the arg (priority) is nil or 0
  • in all cases it breaks the stats, because it alters the numbers of traces we send without changing the sample_rate metric (from which we deduce the weight). In short, it breaks the stats.

@p-lambert p-lambert force-pushed the pedro/service-sampler branch from f1e7723 to fd3a914 Compare November 27, 2017 16:53
@p-lambert p-lambert force-pushed the pedro/service-sampler branch from fd3a914 to 1038ae1 Compare November 27, 2017 17:01
Copy link
Contributor

@ufoot ufoot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GTG

@palazzem palazzem merged commit 2078ab4 into master Nov 28, 2017
@palazzem palazzem deleted the pedro/service-sampler branch November 28, 2017 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Involves Datadog core libraries
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants