Added more metadata to Series tracing; Improved tracing docs, tests; Added example.

Signed-off-by: Bartlomiej Plotka <[email protected]>
bwplotka committed Sep 1, 2021
1 parent 8862ad5 commit ac16cea
Showing 11 changed files with 279 additions and 84 deletions.
Binary file added docs/img/tracing.png
Binary file added docs/img/tracing2.png
32 changes: 30 additions & 2 deletions docs/tracing.md
Expand Up @@ -10,7 +10,7 @@ You can either pass YAML file defined below in `--tracing.config-file` or pass t

Don't be afraid of multiline flags!

In Kubernetes it is as easy as (on Thanos sidecar example):
In Kubernetes it is as easy as (using Thanos sidecar example):

```yaml
- args:
Expand Down Expand Up @@ -38,9 +38,37 @@ In Kubernetes it is as easy as (on Thanos sidecar example):

At that point, anyone can use your provider by spec.

See [this issue](https://github.com/thanos-io/thanos/issues/1972) to check our progress on moving to the OpenTelemetry Go client library.

## Usage

Once tracing is enabled and sampling per backend is configured, Thanos generates traces for all gRPC and HTTP APIs thanks to generic "middlewares". Some of the more interesting APIs to observe, like `query` or `query_range`, have additional low-level spans with focused metadata showing the latency of important functionality. For example, the Jaeger view of an HTTP `query_range` API call might look as follows:

![view](img/tracing2.png)

As you can see, it contains both the HTTP request and the spans around gRPC requests, since [Querier](components/query.md) calls gRPC services to fetch series data.

Each Thanos component generates spans related to its work and sends them to a central place, e.g. Jaeger or an OpenTelemetry collector. That place is then responsible for tying all spans into a single trace, showing the request's execution path.
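
To make the span/trace relationship concrete, here is a minimal sketch of how a collector might group incoming spans by trace ID. The `Span` type, its fields and the sample data are hypothetical and heavily simplified; this is not how Jaeger or Thanos model spans internally.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a hypothetical, simplified span record used only for illustration.
type Span struct {
	TraceID   string // ID shared by all spans of one request.
	Component string // Component that emitted the span.
	Operation string // What the span measured.
}

// groupByTrace collects spans that share a trace ID into one trace.
func groupByTrace(spans []Span) map[string][]Span {
	traces := make(map[string][]Span)
	for _, s := range spans {
		traces[s.TraceID] = append(traces[s.TraceID], s)
	}
	return traces
}

// componentsOf returns the sorted, de-duplicated component names in a trace.
func componentsOf(trace []Span) []string {
	seen := make(map[string]bool)
	var out []string
	for _, s := range trace {
		if !seen[s.Component] {
			seen[s.Component] = true
			out = append(out, s.Component)
		}
	}
	sort.Strings(out)
	return out
}

// sampleSpans returns made-up spans, in the arbitrary order a collector
// might receive them from different components.
func sampleSpans() []Span {
	return []Span{
		{TraceID: "131da78f02aa3525", Component: "query", Operation: "HTTP query_range"},
		{TraceID: "aaaaaaaaaaaaaaaa", Component: "query", Operation: "HTTP query"},
		{TraceID: "131da78f02aa3525", Component: "store", Operation: "gRPC Series"},
		{TraceID: "131da78f02aa3525", Component: "sidecar", Operation: "gRPC Series"},
	}
}

func main() {
	for id, trace := range groupByTrace(sampleSpans()) {
		fmt.Printf("trace %s: %d spans from %v\n", id, len(trace), componentsOf(trace))
	}
}
```

A real backend additionally uses parent/child span references and timestamps to rebuild the call tree; the sketch only shows the grouping step.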

### Obtaining Trace ID

A single trace is tied to a single, unique request to the system and is composed of many spans from different components. A trace is identified by its `Trace ID`, which is a unique hash, e.g. `131da78f02aa3525`. Other systems may refer to this as a `request id` or `operation id`. To use trace data, you need to find the trace IDs of the requests you are interested in, e.g. a request with an interesting error, a request with high latency, or a debug call you just made.

When using tracing with Thanos, you can obtain a trace ID in multiple ways:

* Search by labels/attributes/tags/time/component/latency, e.g. using Jaeger indexing.
* [Exemplars](https://www.bwplotka.dev/2021/correlations-exemplars/)
* If the request was sampled, the response will have an `X-Thanos-Trace-Id` header with the trace ID of this request as its value.

![view](img/tracing.png)
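
As a minimal sketch of the response-header option, the Go snippet below reads `X-Thanos-Trace-Id` from an HTTP response. It runs against a local stand-in server (`httptest`) that always returns the example ID from above, so it works without a Thanos deployment. The helper names and the request path are illustrative assumptions, not part of Thanos.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// traceIDHeader is the response header named in the text above.
const traceIDHeader = "X-Thanos-Trace-Id"

// fetchTraceID (hypothetical helper) issues a GET request and returns the
// trace ID found in the response header. An empty string means the header
// was absent, e.g. because the request was not sampled.
func fetchTraceID(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	return resp.Header.Get(traceIDHeader), nil
}

// demoTraceID runs fetchTraceID against a stand-in server that pretends
// every request was sampled. Replace the server URL with a real Thanos
// HTTP endpoint to try it against a live setup.
func demoTraceID() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set(traceIDHeader, "131da78f02aa3525")
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	// The path is illustrative; the stand-in server ignores it.
	return fetchTraceID(srv.Client(), srv.URL+"/api/v1/query?query=up")
}

func main() {
	id, err := demoTraceID()
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("trace ID:", id)
}
```

The returned ID can then be pasted into the tracing backend's search, e.g. the Jaeger UI.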

### Forcing Sampling

Every request against any Thanos component's API that carries the `X-Thanos-Force-Tracing` header will be sampled, provided a tracing backend is configured.
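
A minimal sketch of forcing sampling from a Go client follows. The header value `1` is an assumption, since the text only requires the header to be present. The stand-in server imitates the documented behaviour (it returns a trace ID only when the force header is set) and is not Thanos code. The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

const (
	forceTracingHeader = "X-Thanos-Force-Tracing"
	traceIDHeader      = "X-Thanos-Trace-Id"
)

// forceSampling (hypothetical helper) marks a request so that it should be
// sampled. The value "1" is an assumption; any non-empty value is used here.
func forceSampling(req *http.Request) {
	req.Header.Set(forceTracingHeader, "1")
}

// demoForcedSampling sends one request to a stand-in server and returns
// the trace ID from the response, or "" if the request was not sampled.
func demoForcedSampling(force bool) (string, error) {
	// Stand-in for a Thanos component with a sampler that would otherwise
	// drop the request: it "samples" only when the force header is present.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get(forceTracingHeader) != "" {
			w.Header().Set(traceIDHeader, "131da78f02aa3525")
		}
	}))
	defer srv.Close()

	req, err := http.NewRequest(http.MethodGet, srv.URL, nil)
	if err != nil {
		return "", err
	}
	if force {
		forceSampling(req)
	}

	resp, err := srv.Client().Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	return resp.Header.Get(traceIDHeader), nil
}

func main() {
	for _, force := range []bool{false, true} {
		id, err := demoForcedSampling(force)
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		fmt.Printf("force=%v trace ID=%q\n", force, id)
	}
}
```

Combined with the `X-Thanos-Trace-Id` response header, this gives a way to trace one specific debug call even when the configured sampler would normally skip it.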

## Configuration

Current tracing supported backends:
Currently supported tracing backends:

### Jaeger

Expand Down
48 changes: 42 additions & 6 deletions examples/interactive/interactive_test.go
Expand Up @@ -18,6 +18,8 @@ import (
"github.com/thanos-io/thanos/pkg/objstore/client"
"github.com/thanos-io/thanos/pkg/objstore/s3"
"github.com/thanos-io/thanos/pkg/testutil"
tracingclient "github.com/thanos-io/thanos/pkg/tracing/client"
"github.com/thanos-io/thanos/pkg/tracing/jaeger"
"gopkg.in/yaml.v2"
)

Expand Down Expand Up @@ -84,7 +86,8 @@ func createData() (perr error) {
return nil
}

// Test args: -test.timeout 9999m
// TestReadOnlyThanosSetup runs read only Thanos setup that has data from `maxTimeFresh - 2w` to `maxTimeOld`, with extra monitoring and tracing for full playground experience.
// Run with test args `-test.timeout 9999m`.
func TestReadOnlyThanosSetup(t *testing.T) {
t.Skip("This is an interactive test - it will run until you kill it or curl the 'finish' endpoint. Uncomment and run as a normal test to use it!")

Expand Down Expand Up @@ -121,6 +124,21 @@ func TestReadOnlyThanosSetup(t *testing.T) {
testutil.Ok(t, exec("cp", "-r", store1Data+"/.", filepath.Join(m1.Dir(), "bkt1")))
testutil.Ok(t, exec("cp", "-r", store2Data+"/.", filepath.Join(m1.Dir(), "bkt2")))

// Setup Jaeger.
j := e.Runnable("tracing").WithPorts(map[string]int{"http-front": 16686, "jaeger.thrift": 14268}).Init(e2e.StartOptions{Image: "jaegertracing/all-in-one:1.25"})
testutil.Ok(t, e2e.StartAndWaitReady(j))

jaegerConfig, err := yaml.Marshal(tracingclient.TracingConfig{
Type: tracingclient.JAEGER,
Config: jaeger.Config{
ServiceName: "thanos",
SamplerType: "const",
SamplerParam: 1,
Endpoint: "http://" + j.InternalEndpoint("jaeger.thrift") + "/api/traces",
},
})
testutil.Ok(t, err)

// Create two store gateways, one for each bucket (access point to long term storage).
// ┌───────────┐
// │ │
Expand All @@ -144,7 +162,13 @@ func TestReadOnlyThanosSetup(t *testing.T) {
},
})
testutil.Ok(t, err)
store1 := e2edb.NewThanosStore(e, "store1", bkt1Config, e2edb.WithImage("thanos:latest"))
store1 := e2edb.NewThanosStore(
e,
"store1",
bkt1Config,
e2edb.WithImage("thanos:latest"),
e2edb.WithFlagOverride(map[string]string{"--tracing.config": string(jaegerConfig)}),
)

bkt2Config, err := yaml.Marshal(client.BucketConfig{
Type: client.S3,
Expand All @@ -157,7 +181,14 @@ func TestReadOnlyThanosSetup(t *testing.T) {
},
})
testutil.Ok(t, err)
store2 := e2edb.NewThanosStore(e, "store2", bkt2Config, e2edb.WithImage("thanos:latest"))

store2 := e2edb.NewThanosStore(
e,
"store2",
bkt2Config,
e2edb.WithImage("thanos:latest"),
e2edb.WithFlagOverride(map[string]string{"--tracing.config": string(jaegerConfig)}),
)

// Create two Prometheus replicas in HA, and one separate one (short term storage + scraping).
// Add a Thanos sidecar.
Expand Down Expand Up @@ -189,8 +220,8 @@ func TestReadOnlyThanosSetup(t *testing.T) {
promHA1 := e2edb.NewPrometheus(e, "prom-ha1")
prom2 := e2edb.NewPrometheus(e, "prom2")

sidecarHA0 := e2edb.NewThanosSidecar(e, "sidecar-prom-ha0", promHA0, e2edb.WithImage("thanos:latest"))
sidecarHA1 := e2edb.NewThanosSidecar(e, "sidecar-prom-ha1", promHA1, e2edb.WithImage("thanos:latest"))
sidecarHA0 := e2edb.NewThanosSidecar(e, "sidecar-prom-ha0", promHA0, e2edb.WithImage("thanos:latest"), e2edb.WithFlagOverride(map[string]string{"--tracing.config": string(jaegerConfig)}))
sidecarHA1 := e2edb.NewThanosSidecar(e, "sidecar-prom-ha1", promHA1, e2edb.WithImage("thanos:latest"), e2edb.WithFlagOverride(map[string]string{"--tracing.config": string(jaegerConfig)}))
sidecar2 := e2edb.NewThanosSidecar(e, "sidecar2", prom2, e2edb.WithImage("thanos:latest"))

testutil.Ok(t, exec("cp", "-r", prom1Data+"/.", promHA0.Dir()))
Expand Down Expand Up @@ -273,7 +304,9 @@ global:
sidecarHA0.InternalEndpoint("grpc"),
sidecarHA1.InternalEndpoint("grpc"),
sidecar2.InternalEndpoint("grpc"),
}, e2edb.WithImage("thanos:latest"),
},
e2edb.WithImage("thanos:latest"),
e2edb.WithFlagOverride(map[string]string{"--tracing.config": string(jaegerConfig)}),
)
testutil.Ok(t, e2e.StartAndWaitReady(query1))

Expand All @@ -285,6 +318,9 @@ global:
testutil.Ok(t, e2einteractive.OpenInBrowser(fmt.Sprintf("http://%s/%s", query1.Endpoint("http"), path)))
testutil.Ok(t, e2einteractive.OpenInBrowser(fmt.Sprintf("http://%s/%s", prom2.Endpoint("http"), path)))

// Tracing endpoint.
testutil.Ok(t, e2einteractive.OpenInBrowser("http://"+j.Endpoint("http-front")))
// Monitoring Endpoint.
testutil.Ok(t, m.OpenUserInterfaceInBrowser())
testutil.Ok(t, e2einteractive.RunUntilEndpointHit())
}
2 changes: 1 addition & 1 deletion pkg/api/api.go
Expand Up @@ -195,7 +195,7 @@ func GetRuntimeInfoFunc(logger log.Logger) RuntimeInfoFn {

type InstrFunc func(name string, f ApiFunc) http.HandlerFunc

// Instr returns a http HandlerFunc with the instrumentation middleware.
// GetInstr returns a http HandlerFunc with the instrumentation middleware.
func GetInstr(
tracer opentracing.Tracer,
logger log.Logger,
Expand Down