Goal

Secondary sampling means participants choose trace data from a request even when it is not sampled with B3. This is particularly important for customer support or triage in large deployments. For example:

  • I want to see 10% of gateway requests with a path expression /play/*. However, I only want data from the gateway and playback services.
  • I want to see 15% of authUser() gRPC requests, but only data between the auth service and its cache.

This design allows multiple participants to perform investigations that possibly overlap, while only incurring overhead at most once. For example, if B3 is sampled, all investigations reuse its data. If B3 is not sampled, investigations only record data at their trigger points in the request.

The fundamentals of this design are the following; a concrete example follows the list:

  • A function of the request creates zero or more "sampling keys". This function can trigger anywhere in the service graph.
  • A header samplingkeys, containing these labels, is co-propagated with B3.
  • A delimited span tag sampled_keys is added to all recorded spans. sampled_keys is the subset of samplingkeys relevant for this hop. Notably, it may include the keyword 'b3' if the span was B3 sampled.
  • A "trace forwarder" routes data to relevant participants by parsing the sampled_keys tag.

Background

Typically, a Zipkin trace is sampled up front and before any activity is recorded. B3 propagation conveys the sampling decision downwards consistently. In other words, a "no" stays a "no": the decision never changes from unsampled to sampled on the same request.

Many large sites use random sampling to ensure only a small percentage (<5%) of requests result in a trace. While nuanced, it is important to note that even when sampling randomly, sites often have blacklists which prevent instrumentation from triggering at all. A prime example is health checks, which are usually never recorded even if everything else is randomly sampled.

Many conflate Zipkin and B3 with pure random sampling, because initially that was the only choice. However, times have changed. Sites often use conditions, such as properties of an HTTP request, to choose data. For example, record 100% of traffic at a specific endpoint (while randomly sampling other traffic). Choosing what to record based on context, including request and node-specific state, is called conditional sampling.

Whether sampling is random or conditional, there are other guards as well. For example, decisions may be subject to a rate limit: up to 1000 traces per second for an endpoint effectively means 100% until/unless that cap is reached. Further concepts are available in William Louth's Scaling Distributed Tracing talk.
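
As an illustration only (not any particular site's policy or a real library API), a primary sampler combining a blacklist, a conditional rule, a rate limit and a random fallback might look like the sketch below; the paths and limits are made up, and the once-per-second reset of the counter is omitted:

  import java.util.concurrent.ThreadLocalRandom;
  import java.util.concurrent.atomic.AtomicInteger;

  class PrimarySampler {
    // Hypothetical cap: traces started this second (assume something else resets it each second)
    static final AtomicInteger TRACES_THIS_SECOND = new AtomicInteger();

    /** Makes the up-front B3 decision for a request, before any activity is recorded. */
    static boolean sample(String httpPath) {
      if (httpPath.startsWith("/health")) return false;     // blacklist: never trace health checks
      if (httpPath.startsWith("/play/")) {                   // conditional: 100% of this endpoint...
        return TRACES_THIS_SECOND.incrementAndGet() <= 1000; // ...until the 1000 traces/second cap
      }
      return ThreadLocalRandom.current().nextInt(100) < 5;   // otherwise random sampling, under 5%
    }
  }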

The important takeaway is that existing Zipkin sites select traces based on criteria visible at the beginning of the request. Once selected, this data is expected to be recorded into Zipkin consistently even if the request crosses 300 services.

For the rest of this document, we'll call this up-front, consistent decision the "primary sampling decision". We'll understand that this primary decision is propagated in-process in a trace context and across nodes using B3 propagation.

Sampling Keys

Sampling keys are human readable labels corresponding to a trace participant. There's no established registry or mechanism for choosing these labels, as it is site-specific. An example might be auth15pct.

The samplingkeys header

Secondary sampling decisions can happen anywhere in the trace and can trigger recording anywhere also. For example, a gateway could add a sampling key that is triggered only upon reaching a specific service. The samplingkeys header (or specifically propagated field) carries the sampling keys and any state associated with them (such as TTL values).

The naming convention samplingkeys follows the same design concern as b3 single. Basically, hyphens cause problems across messaging links. By avoiding them, we allow the same system to work with message traces as opposed to just RPC ones, and with no conversion concerns.
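
For example, JMS property names must be legal Java identifiers, so a hyphenated name could not be set as a message property while samplingkeys can. A minimal sketch, with illustrative values:

  import javax.jms.JMSException;
  import javax.jms.Message;

  class MessagingPropagation {
    /** Copies trace state onto a JMS message; note neither property name contains a hyphen. */
    static void inject(Message message, String b3Single, String samplingKeys) throws JMSException {
      message.setStringProperty("b3", b3Single);               // e.g. the single-header B3 value
      message.setStringProperty("samplingkeys", samplingKeys); // e.g. "play10pct,auth15pct"
    }
  }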

Non-interference

The application is unaware of secondary sampling. It is critical that this design and tooling in no way change the API surface of instrumentation libraries, such as those used by frameworks like Spring Boot. This reduces implementation risk and also allows the feature to be enabled or disabled without affecting production code.

Moreover, the fact that there are multiple participants choosing data differently should not be noticeable by instrumentation. All participants use the same trace and span IDs, which means log correlation is not affected. Sharing instrumentation and reporting means we are not burdening the application with redundant overhead. It also means we are not requiring engineering effort to re-instrument each time a participant triggers recording.

The Trace Forwarder

Each participant in the trace could have different capacities, retention rates and billing implications. Accounting for these differences is the responsibility of a Zipkin-compatible endpoint which routes the same data to the participants associated with each sampling key. We'll call this the trace forwarder. Some examples are PitchFork and Zipkin Forwarder.

If the trace forwarder sees two keys b3 and gateway, it knows to forward the same span to the standard Zipkin backend as well as the API Gateway team's Zipkin.
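
A minimal sketch of that routing (not PitchFork's or Zipkin Forwarder's actual code): split the sampled_keys tag and re-post the same JSON to each participant's POST /api/v2/spans endpoint. The participant registry and hostnames below are hypothetical:

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.util.Map;

  class TraceForwarder {
    // Hypothetical registry: sampling key -> that participant's Zipkin endpoint
    static final Map<String, URI> PARTICIPANTS = Map.of(
        "b3", URI.create("http://zipkin:9411/api/v2/spans"),               // standard backend
        "gateway", URI.create("http://gateway-zipkin:9411/api/v2/spans")); // API Gateway team

    static final HttpClient CLIENT = HttpClient.newHttpClient();

    /** Sends the same span JSON to every participant named in its sampled_keys tag. */
    static void forward(String spanJson, String sampledKeys) throws Exception {
      for (String key : sampledKeys.split(",")) {
        URI endpoint = PARTICIPANTS.get(key.trim());
        if (endpoint == null) continue; // no participant registered for this key
        HttpRequest request = HttpRequest.newBuilder(endpoint)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString("[" + spanJson + "]"))
            .build();
        CLIENT.send(request, HttpResponse.BodyHandlers.discarding());
      }
    }
  }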

The sampled_keys Tag

Trace data is completely out-of-band: it is decoupled from request headers. So, if the forwarder needs to see the sampled keys, they must be encoded into a tag, sampled_keys.

The naming convention sampled_keys has two important facets. One is that it is encoded in lower_snake_case, which allows straight-forward JSON path expressions like tags.sampled_keys. Secondly, it uses the word "sampled" to differentiate it from the samplingkeys header: the keys sampled are a subset of all sampling keys, hence "sampled" not "sampling". The value is comma-separated as that is easy to tokenize; it isn't a list because Zipkin's data format only allows string tag values.
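
For example, a span recorded at the gateway under both the primary decision and the gateway team's key might look like this (IDs and names are illustrative):

  {
    "traceId": "86154a4ba6e91385",
    "id": "4d1e00c0db9010db",
    "name": "get /play",
    "localEndpoint": { "serviceName": "gateway" },
    "tags": { "sampled_keys": "b3,gateway" }
  }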

The b3 sampling key

The special sampling key b3 ensures secondarily sampled data are not confused with B3 sampled data. Remember, in normal Zipkin installs, the presence of spans at all implies they were B3 sampled. Now that there are multiple destinations, we need to back-fill a tag to indicate the base case. This ensures the standard install doesn't accidentally receive more data than was B3 sampled. b3 should never appear in the samplingkeys header: it is a pointer to the sampling state of the B3 headers.

Impact of skipping a service

It is possible that some sampling keys skip hops, or services, when recording. When this happens, parent IDs will be wrong, and any dependency links will be wrong as well. It may be heuristically possible to reconnect the spans, but this would push complexity into the forwarder, at least requiring it to buffer a trace per sampling key.

There are a couple of ways to mitigate this. One is to never skip nodes: this is by far the easiest. Another is more complex state management which propagates the upstream context in the samplingkeys header, similar to how tracestate was originally designed.

Implementation requirements

Not all tracing libraries have the same features. The following are required for this design to work:

  • ability to trigger a "local sampled" decision independent of the B3 decision, which propagates to child contexts
  • propagation components must be extensible such that the samplingkeys field can be extracted and injected
  • trace context extractors must see request objects, to allow for secondary request sampling decisions
    • often they can only see headers, but they now need to see the entire request object (e.g. the HTTP path)
  • ability to attach extra data to the trace context, in order to store sampling key state
  • a span finished hook needs to be able to write the sampled_keys tag based on this state (see the sketch after this list)
  • the span reporter needs to be able to see all spans, not just B3 sampled ones
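
A minimal sketch of the last three requirements follows; the SecondarySamplingState class and the hook shape are hypothetical, not any tracing library's actual API:

  import java.util.LinkedHashSet;
  import java.util.Set;

  /** Hypothetical per-request state, carried as extra data on the trace context. */
  class SecondarySamplingState {
    final Set<String> triggeredKeys = new LinkedHashSet<>(); // sampling keys that triggered on this hop
    boolean b3Sampled;                                       // the primary (B3) decision
  }

  /** Hypothetical span-finished hook: computes the sampled_keys tag before the span is reported. */
  class SampledKeysTagWriter {
    /** Returns the sampled_keys tag value, or null when nothing sampled this span. */
    static String sampledKeysTag(SecondarySamplingState state) {
      Set<String> keys = new LinkedHashSet<>();
      if (state.b3Sampled) keys.add("b3");                   // back-fill the base case
      keys.addAll(state.triggeredKeys);
      return keys.isEmpty() ? null : String.join(",", keys);
    }
  }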

Example participants

The following fictitious application is used for use case scenarios. It highlights that only sites with 10 or more applications will benefit from the added complexity of secondary sampling. Small sites may be fine just recording everything.

In most scenarios, the gateway provisions sampling keys even if they are triggered downstream.

gateway -> api -> auth -> cache  -> authdb
               -> recommendations -> cache -> recodb
               -> playback -> license -> cache -> licensedb
                           -> moviemetadata
                           -> streams

auth15pct

I want to see 15% of authUser() gRPC requests, but only data between the auth service and its cache.

This scenario is interesting as the decision happens well past the gateway. As the auth service only interacts with the database via the cache, and the cache is its only downstream, it is easiest to implement this with ttl=1.
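
A minimal sketch of that ttl handling, assuming ttl counts the remaining downstream hops that should record the key and is decremented each time the key is injected into an outgoing request (the ";ttl=" value syntax is illustrative):

  class SamplingKeyTtl {
    /** Hypothetical helper: returns the samplingkeys entry to inject downstream, or null when the ttl is spent. */
    static String forward(String key, int ttl) {
      int remaining = ttl - 1;        // one hop is consumed by the outgoing request
      if (remaining < 0) return null; // spent: the key is not propagated further
      // With ttl=1 at auth, the cache receives "auth15pct;ttl=0", records, and propagates nothing further.
      return key + ";ttl=" + remaining;
    }
  }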

play10pct

I want to see 10% of gateway requests with a path expression /play/*. However, I only want data from the gateway and playback services.

This use case is interesting because the trigger occurs at the same node that provisions the sampling key. Also, it involves skipping the api service, which may cause some technical concerns at the forwarding layer.
