Implement Elasticsearch query tracing to a source in Kibana #101587

mshustov · 2021-06-08T13:10:44Z

While this issue aims to address the main concern of #97934 (providing the ability to trace an ES query back to the source in Kibana code that initiated the request), we also want to lay the foundation for e2e tracing across the whole Stack. To make this happen, Kibana will rely on the built-in capabilities of the APM RUM and Node.js APM agents and their integration with the Elasticsearch service.

High-level picture

Kibana Frontend

Context should allow Kibana users to unambiguously identify the source of a query in the Kibana App in the browser, Kibana server, or the task manager.

interface KibanaExecutionContext {
  // kibana entity type
  type: 'visualization' | 'actions' | 'alert' | ..;
  // kibana entity id
  id: string;
  // human readable description, a vis title, action name,
  description: string;
  // in browser - url to navigate to a current page, on server - endpoint path, for task: task SO url
  url?: string;
}

The APM RUM agent doesn't provide support for async context propagation in the browser, so Kibana will have to implement manual context passing.

A plugin creates an execution context object with an API provided by Core. The returned value is opaque to the plugin.

const executionContext: KibanaExecutionContext = createExecutionContext({ .. })

The obtained execution context should be passed to the Kibana server manually through all the layers of abstraction in Kibana. Kibana sets it as a custom request header before issuing a request to the Kibana server:

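// pass the context as a custom request header
await fetch('/api/something', {
  headers: {
    'kbn-context': executionContext.toString(),
  },
});
// or pass the context as part of a JSON request body
await fetch('/api/something', {
  method: 'post',
  body: JSON.stringify({
    context: executionContext.toJSON(),
  }),
});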
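As a rough illustration of what the opaque value could look like, here is a minimal sketch. The `ExecutionContextContainer` class, the `parseExecutionContext` helper, and the percent-encoded JSON header format are all assumptions made for this example; the proposal only states that the value exposes `toString()` and `toJSON()`.

```typescript
// Fields mirror the KibanaExecutionContext interface from the proposal;
// `type` is widened to string here to keep the sketch short.
interface KibanaExecutionContext {
  type: string;
  id: string;
  description: string;
  url?: string;
}

// Hypothetical container: not the actual Core implementation.
class ExecutionContextContainer {
  private readonly context: KibanaExecutionContext;

  constructor(context: KibanaExecutionContext) {
    this.context = context;
  }

  // Header-safe form (assumption): JSON, then percent-encoded so that
  // non-ASCII titles survive transport in an HTTP header.
  toString(): string {
    return encodeURIComponent(JSON.stringify(this.context));
  }

  // Plain object form for embedding in a JSON request body.
  toJSON(): KibanaExecutionContext {
    return { ...this.context };
  }
}

function createExecutionContext(context: KibanaExecutionContext): ExecutionContextContainer {
  return new ExecutionContextContainer(context);
}

// Hypothetical inverse of toString(), as a server might apply it to the header value.
function parseExecutionContext(headerValue: string): KibanaExecutionContext {
  return JSON.parse(decodeURIComponent(headerValue)) as KibanaExecutionContext;
}

const executionContext = createExecutionContext({
  type: 'visualization',
  id: '1234-5678',
  description: 'Gauge: CPU usage',
  url: '/app/visualize#/edit/1234-5678',
});
const roundTripped = parseExecutionContext(executionContext.toString());
const bodyPayload = JSON.stringify({ context: executionContext });
```

Because the container defines `toJSON()`, `JSON.stringify` serializes it as the plain context object, so the same value can travel either in the header or in the body.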

For the first implementation, we start by capturing a single context level: visualizations.
In the next iteration, we can add support for nested execution contexts, which can be used to compose execution context relationships across different apps:
Application service context --> Dashboard context --> Visualization context.
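One possible way to model such a chain is a `parent` reference on each context. This is only a sketch: the `NestedExecutionContext` shape and the `nest` and `describeChain` helpers are hypothetical names, since the proposal does not yet define how nesting is represented.

```typescript
// Hypothetical shape for a nested context.
interface NestedExecutionContext {
  type: string;
  id: string;
  parent?: NestedExecutionContext;
}

// Returns a new context whose parent is `parent`; neither input is mutated.
function nest(
  parent: NestedExecutionContext,
  child: { type: string; id: string }
): NestedExecutionContext {
  return { ...child, parent };
}

// Walks from the innermost context up to the root and renders outermost first.
function describeChain(context: NestedExecutionContext): string {
  const parts: string[] = [];
  for (let current: NestedExecutionContext | undefined = context; current; current = current.parent) {
    parts.unshift(`${current.type}:${current.id}`);
  }
  return parts.join(' --> ');
}

const application: NestedExecutionContext = { type: 'application', id: 'dashboards' };
const dashboard = nest(application, { type: 'dashboard', id: 'abc' });
const visualization = nest(dashboard, { type: 'visualization', id: '1234-5678' });
const chain = describeChain(visualization);
```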

Server-side

Depends on: APM agents can be used without APM server elastic/apm-agent-nodejs#2101

The APM Node.js agent intercepts all the incoming requests and creates an APM transaction.
The APM Node.js agent instruments all the requests to the Elasticsearch server to pass the current transaction id via the traceparent header.
The Elasticsearch team is working on adding support for tracing headers: Adds minimal traceparent header support to Elasticsearch elasticsearch#74210
We need to get their commitment to ship it in v7.15.
This traceparent header will be used for log correlation across Kibana and Elasticsearch server. To make it possible, Kibana should add trace.id to the log records.
TODO: discuss with the Elasticsearch team in what form they are going to include it in the Elasticsearch logs. It will likely be present in ECS-JSON logs by default. Presence in the text logs is open for discussion.
Kibana intercepts all the incoming requests and retrieves the execution context from the 'kbn-context' header. The context + trace.id are emitted to Kibana logs. The minimal subset of the execution context data, in the form kibana:type:name:id (kibana:visualization:gauge:1234-5678, for example), is attached to the current APM transaction as the kibanaContext label.
Kibana server plugins may create execution context on the server-side as well. The context passing works in the same way as for the client-side counterpart.
Whenever Kibana sends a request to the Elasticsearch server, it adds the kibanaContext label to the x-opaque-id header. This allows Stack users to identify the source of a query in the slowlogs without having to inspect Kibana logs.
TODO: discuss with the Elasticsearch team whether trace.id is included in the slowlogs as well.
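The server-side flow above (read the 'kbn-context' header, reduce it to the kibana:type:name:id label, attach the label to the x-opaque-id header of the outgoing Elasticsearch request) might be sketched as follows. The helper names, the `name` field, and the percent-encoded JSON header format are assumptions for illustration, not the actual Core implementation.

```typescript
// Hypothetical minimal subset; `name` is an assumption, since the proposed
// interface has no such field but the label format mentions one.
interface ContextLabelParts {
  type: string;
  name: string;
  id: string;
}

// Builds the kibana:type:name:id label described in the proposal.
function toKibanaContextLabel(parts: ContextLabelParts): string {
  return ['kibana', parts.type, parts.name, parts.id].join(':');
}

// Reads the custom header from an incoming request. Assumes percent-encoded
// JSON; returns undefined when the header is absent or malformed, so that a
// bad header never fails the request itself.
function readContextHeader(
  headers: Record<string, string | undefined>
): ContextLabelParts | undefined {
  const raw = headers['kbn-context'];
  if (raw === undefined) return undefined;
  try {
    const parsed = JSON.parse(decodeURIComponent(raw));
    const valid =
      parsed !== null &&
      typeof parsed === 'object' &&
      typeof parsed.type === 'string' &&
      typeof parsed.name === 'string' &&
      typeof parsed.id === 'string';
    return valid ? { type: parsed.type, name: parsed.name, id: parsed.id } : undefined;
  } catch {
    return undefined;
  }
}

// Headers to add to the outgoing Elasticsearch request.
function buildElasticsearchHeaders(
  context: ContextLabelParts | undefined
): Record<string, string> {
  return context ? { 'x-opaque-id': toKibanaContextLabel(context) } : {};
}

const incoming = {
  'kbn-context': encodeURIComponent(
    JSON.stringify({ type: 'visualization', name: 'gauge', id: '1234-5678' })
  ),
};
const outgoing = buildElasticsearchHeaders(readContextHeader(incoming));
const outgoingForBadHeader = buildElasticsearchHeaders(readContextHeader({ 'kbn-context': '%' }));
```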

Instrumentation

The list of instrumentation points should be discussed with every team separately. We are primarily interested in instrumenting plugins that may cause performance problems in Elasticsearch:

During the initial implementation, the Core team will instrument several plugins and implement integration tests as an example. Later, we will create separate issues for code owners to help us with this work.

List of sub-tasks

Context propagation

Implement context management service on the client-side: Implement execution context management service #102626
Implement manual context propagation for Kibana Entities: [Meta] Implement context propagation for Kibana entities #102629
Provide recommendations on debugging Kibana with data sent to Elasticsearch slowlogs

Log correlation

Update APM nodejs agent: update APM nodejs agent to a version usable without APM-server #102624
Refactor logging system to include trace.id in the logs for log correlation purposes: Include tracing information in the log records #102699
- align with the Elasticsearch team on the logging format
Provide settings to run Kibana with APM agent enabled, APM agent disabled, APM agent working in the tracing mode (without sending data to APM server): Support configuring APM modes #102704
Measure the solution overhead and its influence on the Kibana performance: Measure performance overhead of tracing solution for Kibana server #102706
Updated APM RUM agent: updated APM RUM agent to version instrumenting tracing headers for custom transactions #102625
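For the log correlation part, the trace id can be read from a W3C traceparent header value, which has the form version-traceid-parentid-flags. The sketch below extracts it and attaches it to a log record. The `LogRecord` shape and the function names are hypothetical, and only the four-field layout is handled.

```typescript
// W3C Trace Context layout: version "-" trace-id "-" parent-id "-" trace-flags,
// all lowercase hex (2, 32, 16 and 2 characters respectively).
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function traceIdFromTraceparent(header: string): string | undefined {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return undefined;
  const traceId = match[2];
  // An all-zero trace-id is invalid per the specification.
  if (traceId === undefined || /^0+$/.test(traceId)) return undefined;
  return traceId;
}

// Hypothetical log record shape, loosely following the ECS trace.id field.
interface LogRecord {
  message: string;
  trace?: { id: string };
}

// Adds trace.id to the record only when a valid traceparent is available.
function createLogRecord(message: string, traceparent?: string): LogRecord {
  const traceId = traceparent === undefined ? undefined : traceIdFromTraceparent(traceparent);
  return traceId === undefined ? { message } : { message, trace: { id: traceId } };
}

const traceId = traceIdFromTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
const correlated = createLogRecord(
  'search request sent',
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
);
const uncorrelated = createLogRecord('search request sent', 'not-a-traceparent');
```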


elasticmachine · 2021-06-08T13:10:46Z

Pinging @elastic/kibana-core (Team:Core)

mshustov · 2021-06-15T11:36:42Z

@jtibshirani @imotov I have a couple of questions about the integration with ES slow query logs:

Which header should Kibana use to propagate the context? We can start with x-opaque-id or stats and switch to the standard observability headers (baggage, for example) later.
In what format should the data be transmitted? We can start with something as simple as JSON.

{ "x-opaque-id": "{ \"id\": .... }" }

But I'm open to any suggestions. Right now, Kibana passes the x-opaque-id value as a UUID string: x-opaque-id: 6c4e0436-86d7-4c55-bb21-e522a5afc0f2

Could you confirm that Elasticsearch includes the x-opaque-id header in the slow logs and that no additional work is required from your side?

danhermann · 2021-06-15T12:53:10Z

I don't have information on the slow logs, but the recently-added HTTP client stats in ES report the first observed x-opaque-id for each HTTP client. Additional context from Kibana would be great and if implemented in the x-opaque-id field as proposed here, the HTTP client stats in ES should be changed to report the most recently observed ID rather than just the first observed id.

imotov · 2021-06-15T18:02:34Z

Could you confirm that Elasticsearch includes the x-opaque-id header in the slow logs and that no additional work is required from your side?

Yes, since elastic/elasticsearch#31539

felixbarny · 2021-06-17T14:49:32Z

In the next iteration, we will use the APM RUM agent for context propagation to get rid of the custom 'kbn-context' header.

Adding a bit of background on why we've decided not to use the RUM agent for that particular part.

This would require support for baggage, and either the ability to manually inject headers (elastic/apm-agent-rum-js#468) or a context management API (elastic/apm-agent-rum-js#1040). As that's a lot of dependencies and a non-trivial amount of work for the RUM agent, I think it's easier to manually propagate context from the Kibana frontend to the Kibana backend using a custom header.
An important factor is that this custom header is contained within Kibana (frontend -> backend). This means it's an internal implementation detail that can change later on. The custom Kibana context will not be sent to Elasticsearch.

Whenever Kibana sends a request to the Elasticsearch server, it adds the kibanaContext label to the x-opaque-id header. This allows Stack users to identify the source of a query in the slowlogs without having to inspect Kibana logs.

I'm ok with that but I hope we can view that as a stretch goal. One thing that we might want to discuss is whether we even want the labels to store data when tracing is turned off in the Node.js agent vs labels acting as a noop in that setting. In the future, we probably want to remove X-Opaque-Id completely in favor of traceparent and baggage.

trentm · 2021-06-17T17:10:46Z

One thing that we might want to discuss is whether we even want the labels to store data when tracing is turned off in the Node.js agent vs labels acting as a noop in that setting.

@felixbarny We aren't turning off agent tracing here, though, are we? We are just not sending trace data on to an APM server.

felixbarny · 2021-06-18T06:40:33Z

No, we wouldn't turn off tracing completely when setting disable_send=true. We'd work in a mode that's similar to 0% sampling, where we may want to noop some things: for example, not storing labels, not collecting the ES query, and other things that reduce memory and runtime overhead. If we expose getters for labels, that may not be possible anymore. OTel doesn't expose getters for its attributes for that reason.
The semantics for baggage are defined differently. IINM, you can set and get values no matter the sampling decision.
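To make the distinction between the three setups concrete (agent enabled, agent disabled, tracing without sending data), here is a small sketch that maps a mode to agent options. The mode names are hypothetical, and `active` and `disableSend` are assumed to be the Node.js agent's camelCase option names for these behaviours; check the agent's configuration reference before relying on them.

```typescript
// Hypothetical mode names for the three setups discussed in this thread.
type ApmMode = 'enabled' | 'disabled' | 'tracing-only';

// Subset of agent options this sketch cares about (assumed option names).
interface ApmAgentOptions {
  active: boolean;
  disableSend: boolean;
}

const OPTIONS_BY_MODE: Record<ApmMode, ApmAgentOptions> = {
  // Full APM: trace and ship data to the APM server.
  enabled: { active: true, disableSend: false },
  // Tracing stays on (so traceparent is still propagated), but nothing is sent.
  'tracing-only': { active: true, disableSend: true },
  // Agent is off: no tracing and no context propagation via the agent.
  disabled: { active: false, disableSend: false },
};

function agentOptionsFor(mode: ApmMode): ApmAgentOptions {
  return { ...OPTIONS_BY_MODE[mode] };
}

const tracingOnly = agentOptionsFor('tracing-only');
```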

pmuellr · 2021-08-05T18:09:14Z

Have we looked at "higher-level" ways of passing the execution context, rather than just on specific requests? For alerting and action tasks, we provide the task with an es client, and that would be a place we could add an execution context to be associated with all the calls made with it - no changes to the actual es call sites within all the rule/action types would be required. Or maybe this is something we could do already with the existing es client?

pmuellr · 2021-08-05T18:17:35Z

Wondering if it would be possible to associate multiple "things" with a request. For example, for an alerting rule execution, it might be nice to mark a request as "from alerting" and then also "from rule type XYZ", and then you could even imagine a rule type adding additional "markers" to differentiate multiple requests it's making.

We'll definitely want to associate es queries with specific rule types, but I'm curious, once we start collecting this data, if it would also be useful to see requests in the scope of "all alerting uses", without having to add up a bunch of numbers ourselves.

mshustov · 2021-08-09T10:40:30Z

Have we looked at "higher-level" ways of passing the execution context, rather than just on specific requests?
we provide the task with an es client, and that would be a place we could add an execution context to be associated with all the calls made with it

@pmuellr in #107523 I added withContext wrapper

kibana/src/core/server/execution_context/execution_context_service.ts

Lines 59 to 63 in 27aca6c

    
 * Keeps track of execution context while the passed function is executed.
 * Data are carried over all async operations spawned by the passed function.
 * The nested calls stack the registered context on top of each other.
 **/
withContext<R>(context: KibanaExecutionContext | undefined, fn: (...args: any[]) => R): R;

I'd expect we wrap task.run with withContext to provide a task-specific context

kibana/x-pack/plugins/task_manager/server/task_running/task_runner.ts

Line 264 in 27aca6c

    
           const result = await withSpan({ name: 'run', type: 'task manager' }, () => this.task!.run());

Or maybe this is something we could do already with the existing es client?

Core will inject context details into the ES client calls automatically. What we need from the alerting plugin is to provide the context data with the withContext wrapper.

Wondering if it would be possible to associate multiple "things" with a request. For example, for an alerting rule execution, it might be nice to mark a request as "from alerting" and then also "from rule type XYZ", and then you could even imagine a rule type adding additional "markers" to differentiate multiple requests it's making.

That makes sense. withContext is similar to withSpan here: it creates a nested record, so alerting can specify the details of the context for some operations:

// ctx: undefined
withContext({ type: 'a' }, () => { // ctx: {type: 'a'}
  // ...
  withContext({ type: 'b' }, () => { // ctx: {type: 'b', parent: {type: 'a'}}
  }); // ctx: {type: 'a'}
}); // ctx: undefined
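The stacking behaviour can be reproduced with a deliberately simplified sketch. It only covers synchronous nesting through a module-level variable, and `getContext` is a hypothetical accessor added for the example. The real service has to carry the context across async operations, which on Node.js needs an async-aware store rather than a plain variable, and its `fn` parameter has a wider signature than the one used here.

```typescript
interface ExecutionContext {
  type: string;
  parent?: ExecutionContext;
}

// Module-level "current context". Synchronous only: this is NOT preserved
// across awaits or timers, unlike the behaviour described for the real service.
let currentContext: ExecutionContext | undefined;

// Hypothetical accessor for reading the active context.
function getContext(): ExecutionContext | undefined {
  return currentContext;
}

function withContext<R>(context: { type: string } | undefined, fn: () => R): R {
  if (context === undefined) return fn();
  const previous = currentContext;
  // Stack the new context on top of the one that was active before the call.
  currentContext = previous ? { ...context, parent: previous } : { ...context };
  try {
    return fn();
  } finally {
    // Restore the outer context even if fn throws.
    currentContext = previous;
  }
}

const seen: string[] = [];
withContext({ type: 'a' }, () => {
  seen.push(JSON.stringify(getContext()));
  withContext({ type: 'b' }, () => {
    seen.push(JSON.stringify(getContext()));
  });
  seen.push(JSON.stringify(getContext()));
});
const afterAll = getContext();
```

A task runner could wrap each task run in the same way, passing a task-specific context as the first argument.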

if it would also be useful to see requests in the scope of "all alerting uses".

Sorry, I'm not quite following. Could you elaborate on it, please?

pmuellr · 2021-08-09T14:09:00Z

if it would also be useful to see requests in the scope of "all alerting uses".

Sorry, I'm not quite following. Could you elaborate on it, please?

The reason I asked about associating multiple "things" with an ES call is so that we can run some aggs over the logs (assuming they are ingested into ES), looking for "all ES calls associated with alerting" as well as "all ES calls associated with this alert type" etc. Basically, have a super-general "this is from an alerting rule" but also associate the specific rule types, or if the rule type has different types of queries that it wants to do special accounting for, via aggs.

So, however we store these "multiple contexts", we'd like to be able to query on specific ones. I assume this won't be a problem, just wanted to mention it. I'd be happy even if the mappings for these aren't available (type object / enabled false), or hard to access (type nested | type flattened), as long as we can access via runtime fields.
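To illustrate the kind of accounting described here, the sketch below counts requests per "marker" from a nested context, so that a request tagged alerting and then a rule type is counted under both. It runs in memory rather than as an Elasticsearch aggregation, and the context shape and rule type names are hypothetical.

```typescript
interface ExecutionContext {
  type: string;
  parent?: ExecutionContext;
}

// Flattens the parent chain into a list of markers, outermost first.
function markersOf(context: ExecutionContext): string[] {
  const result: string[] = [];
  for (let c: ExecutionContext | undefined = context; c; c = c.parent) {
    result.unshift(c.type);
  }
  return result;
}

// Counts each request once under every distinct marker in its chain.
function countByMarker(requests: ExecutionContext[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const request of requests) {
    const distinct = markersOf(request).filter((m, i, all) => all.indexOf(m) === i);
    for (const marker of distinct) {
      counts[marker] = (counts[marker] ?? 0) + 1;
    }
  }
  return counts;
}

const alerting: ExecutionContext = { type: 'alerting' };
const counts = countByMarker([
  { type: 'rule-type-x', parent: alerting },
  { type: 'rule-type-x', parent: alerting },
  { type: 'rule-type-y', parent: alerting },
]);
```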

lizozom · 2022-03-03T15:23:34Z

This issue is mostly resolved by #124996
All executed searches now have their context set to the one provided by the application, or to the app name if the application didn't provide a top-level context by calling useExecutionContext.

We'll use #102629 to track solutions' use of useExecutionContext.

mshustov added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Jun 8, 2021

mshustov self-assigned this Jun 14, 2021

mshustov mentioned this issue Jun 15, 2021

Trace Elasticsearch query to the origin #97934

Closed

mshustov mentioned this issue Jun 18, 2021

Use APM RUM agent to pass Kibana execution context #102615

Closed

trentm mentioned this issue Jun 18, 2021

add support for explicitly disabling sending to APM server elastic/apm-agent-nodejs#2101

Closed

mshustov mentioned this issue Jun 21, 2021

Measure performance overhead of tracing solution for Kibana server #102706

Closed

jtibshirani mentioned this issue Jun 24, 2021

Include tracing headers in hot_threads output? elastic/elasticsearch#74580

Open

mshustov mentioned this issue Jun 28, 2021

[Search service] Add search IDs to requests made to Elasticsearch #16493

Closed

mshustov mentioned this issue Jul 12, 2021

Instrument vis_type_vislib, lens and vis_type_timeseries with execution context service #105206

Merged

lizozom added the performance label Nov 11, 2021

lukasolson mentioned this issue Nov 16, 2021

[data.search] Make execution context required #118760

Closed

11 tasks

droberts195 mentioned this issue Dec 6, 2021

X-Opaque-ID contains UUID causing ES deduplication to fail #120124

Closed

lizozom closed this as completed Mar 3, 2022
