Adrian Cole edited this page Sep 11, 2019 · 13 revisions

Background

Typically, a Zipkin trace is sampled up front, before any activity is recorded. B3 propagation conveys the sampling decision downstream consistently. In other words, the decision never changes from unsampled to sampled on the same request.
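As a minimal sketch of that consistency rule: a downstream node should honor an incoming B3 sampling decision rather than re-deciding. The `X-B3-Sampled` header name is the real B3 one; the helper functions and the 1% edge rate are illustrative assumptions, not Brave's API.

```python
import random

def extract_sampled(headers):
    """Return True/False if upstream already decided, else None."""
    value = headers.get("X-B3-Sampled")
    if value == "1":
        return True
    if value == "0":
        return False
    return None  # no decision yet; this node must decide

def sampling_decision(headers, rate=0.01):
    upstream = extract_sampled(headers)
    if upstream is not None:
        return upstream            # never flip an existing decision
    return random.random() < rate  # head-based decision at the edge

# An upstream "no" stays "no" downstream:
assert sampling_decision({"X-B3-Sampled": "0"}) is False
assert sampling_decision({"X-B3-Sampled": "1"}) is True
```

The key design point is the early return: once any node has written a decision into the headers, every later node reuses it verbatim.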

Many large sites use random sampling to ensure only a small percentage (<5%) of requests result in a trace. While nuanced, it is important to note that even when randomly sampling, sites often have blacklists which prevent instrumentation from triggering at all. A prime example is health checks, which are usually never recorded even when everything else is randomly sampled.
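A sketch of random sampling combined with such a blacklist might look like the following. The blacklist entries and the 5% rate are assumptions for illustration; the `rng` parameter only exists to make the behavior testable.

```python
import random

BLACKLIST = {"/health"}  # illustrative: endpoints never recorded

def should_record(path, probability=0.05, rng=random.random):
    if path in BLACKLIST:
        return False             # instrumentation never triggers here
    return rng() < probability   # everything else is randomly sampled

# Health checks are dropped even at a 100% rate:
assert should_record("/health", probability=1.0, rng=lambda: 0.0) is False
```

Note the blacklist check runs before the random draw, matching the point above: blacklisted requests never reach the sampler at all.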

Many conflate Zipkin and B3 with pure random sampling, because initially that was the only choice. However, times have changed. Sites often use conditions, such as properties of an HTTP request, to choose what to record. For example, record 100% of traffic at a specific endpoint while randomly sampling other traffic. Choosing what to record based on context, including request and node-specific state, is called conditional sampling.
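The endpoint example above can be sketched as a conditional sampler. The `/api/checkout` path and the 5% fallback rate are hypothetical, not Zipkin defaults:

```python
import random

def conditional_sample(path, default_rate=0.05, rng=random.random):
    """Record 100% of one endpoint; randomly sample everything else."""
    if path == "/api/checkout":
        return True              # condition matched: always record
    return rng() < default_rate  # fall back to random sampling
```

A real implementation would typically take a rule list rather than a hard-coded path, but the shape is the same: evaluate conditions against the request, then fall through to the default policy.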

Whether sampling is random or conditional, there are other guards as well. For example, decisions can be subject to a rate limit: up to 1000 traces per second for an endpoint effectively means 100% until/unless that cap is reached. Further concepts are covered in William Louth's Scaling Distributed Tracing talk.
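A rate-limit guard like the one described can be sketched as a per-second counter. This is an illustrative simplification (one fixed window per second, no smoothing), not how any particular Zipkin sampler is implemented; the `clock` parameter is only there to make it testable.

```python
import time

class RateLimitedSampler:
    """Allow up to `limit` sampled traces per second:
    effectively 100% until the cap is reached."""

    def __init__(self, limit, clock=time.time):
        self.limit = limit
        self.clock = clock
        self.window = None  # the current one-second window
        self.count = 0      # decisions granted in that window

    def is_sampled(self):
        now = int(self.clock())
        if now != self.window:   # new second: reset the counter
            self.window = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False             # cap hit; drop until the next second

# With a fixed clock, the third decision in the same second is rejected:
sampler = RateLimitedSampler(limit=2, clock=lambda: 100.0)
assert [sampler.is_sampled() for _ in range(3)] == [True, True, False]
```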

The important takeaway is that existing Zipkin sites select traces based on criteria visible at the beginning of the request. Once selected, this data is expected to be recorded into Zipkin consistently even if the request crosses 300 services.

For the rest of this document, we'll call this up-front, consistent decision the "primary sampling decision". We'll understand that this primary decision is propagated in-process in a trace context and across nodes using B3 propagation.
