Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trace Elasticsearch query to the origin #97934

Closed
kobelb opened this issue Apr 21, 2021 · 51 comments
Closed

Trace Elasticsearch query to the origin #97934

kobelb opened this issue Apr 21, 2021 · 51 comments
Labels
enhancement New value added to drive a business result performance Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:Performance

Comments

@kobelb
Copy link
Contributor

kobelb commented Apr 21, 2021

Kibana executes a lot of queries against Elasticsearch. Some of them originate from internal sources (task-manager claiming tasks, queries that are run on Kibana startup, etc.) and some of them are a result of a user's interaction with Kibana (opening a Visualization, using the Index-Management UI, etc).

Some of these queries can be slow and show up in the Elasticsearch slow log. However, when they do, it's not obvious what caused them to occur in the first place. This makes it very difficult for administrators to do anything to prevent them from occurring in the future.

It'd be incredibly helpful for us to be able to correlate a slow Elasticsearch query to the specific usage of a feature. For example, it should be possible to correlate an Elasticsearch query to a specific Visualization that's embedded in a Dashboard. This would allow the administrator to take corrective action.

Orthogonal or related issues

#77214 proposes custom request headers for the sake of usage analysis; however, there's potentially some overlap with a solution here.

#60122 proposes including the "execution context" to all Kibana log entries, but it doesn't propose forwarding these to Elasticsearch for correlation.

@kobelb kobelb changed the title Trace Elasticsearch Query to the origin Kibana feature Trace Elasticsearch query to the origin Kibana feature Apr 21, 2021
@kobelb kobelb added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Apr 21, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

@kobelb kobelb added the enhancement New value added to drive a business result label Apr 21, 2021
@pgayvallet
Copy link
Contributor

#77214 proposes custom request headers for the sake of usage analysis; however, there's potentially some overlap with a solution here.

To reverse the question: Which additional information would you like us to send to ES here? I don't think we can automate/detect more than the calling plugin from Core. Do you have something particular in mind? Should we let callers of the ES API send their own custom 'meta' headers in addition to the x-elastic-plugin-origin planned to be added in #77214? Should they be able to specify an additional x-elastic-feature-origin or something?

@mshustov
Copy link
Contributor

mshustov commented Apr 22, 2021

#60122 proposes including the "execution context" to all Kibana log entries, but it doesn't propose forwarding these to Elasticsearch for correlation.

Do you want Kibana to forward them to the Elasticsearch server? execution context + Kibana slow logs #58086 would be enough to find a culprit plugin on Kibana side?

@kobelb
Copy link
Contributor Author

kobelb commented Apr 22, 2021

Which additional information would you like us to send to ES here? I don't think we can automate/detect more than the calling plugin from Core. Do you have something particular in mind? Should we let callers of the ES API send their own custom 'meta' headers in addition to the x-elastic-plugin-origin planned to be added in #77214? Should they be able to specify an additional x-elastic-feature-origin or something?

To be honest, I was intentionally being vague here because I don't have a solution in mind. Users are primarily looking for a way to answer the question of "What in the world is causing this expensive query to be executed?". I'm open to all solutions that fully or partially allow this question to be answered based on the feasibility and maintainability.

@VijayDoshi
Copy link

I believe we need to be more specific than the "feature" if I understand that correctly. It would be incredibly useful to pinpoint the "thing" that called Elasticsearch like "Vijay's Visualization" in dashboard "foo" vs. "Lens". @kobelb Can you clarify the level of detail you are proposing for "feature"?

The goal is for an admin looking at an ES slow/expensive query to rapidly find the culprit in Kibana so they can address the stability of their environment.

@kobelb
Copy link
Contributor Author

kobelb commented Apr 27, 2021

I believe we need to be more specific than the "feature" if I understand that correctly. It would be incredibly useful to pinpoint the "thing" that called Elasticsearch like "Vijay's Visualization" in dashboard "foo" vs. "Lens". @kobelb Can you clarify the level of detail you are proposing for "feature"?

For sure, happy to revise the original description to make this more clear.

@kobelb kobelb changed the title Trace Elasticsearch query to the origin Kibana feature Trace Elasticsearch query to the origin Apr 27, 2021
@afharo
Copy link
Member

afharo commented Apr 27, 2021

It would be incredibly useful to pinpoint the "thing" that called Elasticsearch like "Vijay's Visualization" in dashboard "foo" vs. "Lens".

I agree it would be useful. However, how does this play with PII concerns? If we send the name of the visualization/dashboard and/or even the user who performed the action could unintentionally leak private information.

What do you think? @kobelb @VijayDoshi

@pgayvallet
Copy link
Contributor

However, how does this play with PII concerns? If we send the name of the visualization/dashboard and/or even the user who performed the action could unintentionally leak private information.

We could just send the id of the SO instead (when working with SO's)

Note that tracking the feature/app that performed initiating SO requests doesn't seems like the part with the most value. I would think that it's currently harder to correlate requests performed against data indices.

@VijayDoshi
Copy link

VijayDoshi commented Apr 29, 2021

We could just send the id of the SO instead (when working with SO's)

@mikecote my understanding is that alerts are not SOs in the traditional sense, would we need special handling for alerts if we used SO ids?

@mikecote
Copy link
Contributor

We could just send the id of the SO instead (when working with SO's)

@mikecote my understanding is that alerts are not SOs in the traditional sense, would we need special handling for alerts if we used SO ids?

If we're using the new terminology; correct, alerts are not SOs but do have unique identifiers. We can however use the rule ID because rules are SOs.

@mikecote
Copy link
Contributor

@VijayDoshi ^^

@mshustov
Copy link
Contributor

mshustov commented May 3, 2021

The challenging part here is not only the async nature of js but also a necessity to trace an operation across different environments.

browser --> (proxy) --> Kibana server --> (proxy) --> elasticsearch
browser --> Kibana server --> elasticsearch
Kibana server --> elasticsearch
a stack solution (fleet, e.g.) --> Kibana server --> elasticsearch
3rd party app --> Kibana server --> elasticsearch

Ideally, there is a single mechanism covering all aforementioned use cases.

It looks like a classical Observability problem when we want to answer questions:

  • what service initiated a request?
  • what services a request go through?
  • what time we spend in every intermediate service?
  • where an error happened?

Given that we already ship the built-in APM agent in the product, how realistic does it look to use it for debugging?
It may be a separate debugging mode, not enabled by default.

It's not a simple task including several steps (I hope, I didn't miss anything):

  • spin up APM server
  • configure Kibana to enable APM agent
  • perform an operation(s)
  • identify a bottleneck
  • tear down the APM server
  • cleanup APM data in Elasticsearch

But it sounds like we could provide a debug tool that automates all these steps.
Not only it allows us not to re-invent the wheel but also dogfood the company solution.

opening a Visualization, using the Index-Management UI,

Almost all the requests to the Kibana server go through data plugin. That means there is a single point to the instrument the code on the client-side. For example, it can require a unique identifier for every operation, which can be used to correlate async operations in the logs.

task-manager claiming tasks, queries that are run on Kibana startup

The number of sources is more diverse on the server-side: SO client, ES client, data plugin, task manager. With multiple abstraction layers in between, the passing context through several layers might be tedious. That's why I'm thinking about reusing the APM agent infrastructure or at least borrow implementation ideas from it.

However, how does this play with PII concerns? If we send the name of the visualization/dashboard and/or even the user who performed the action could unintentionally leak private information.

Not sure all that information is needed. We are interested in an operation source (for example, type: dashboard) and its unique identifier mostly. The rest of the data can be obtained from these.

@kobelb
Copy link
Contributor Author

kobelb commented May 4, 2021

Given that we already ship the built-in APM agent in the product, how realistic does it look to use it for debugging?

That's a great question and one that'll require additional effort to get an answer to. I'm on board with investigating the feasibility of using APM for these types of situations, as it's the ideal solution for distributed tracing. @joshdover and I just recently discussed this, and there's a cross-team effort that we're trying to get going to answer this question.

But it sounds like we could provide a debug tool that automates all these steps.

Long-term, absolutely! For the time being, I think we should figure out whether we're comfortably recommending that users enable APM integration in Kibana and Elasticsearch to a dedicated cluster to diagnose these difficult isssues.

@graphaelli
Copy link
Member

whether we're comfortably recommending that users enable APM integration in Kibana and Elasticsearch to a dedicated cluster to diagnose these difficult isssues

We're certainly comfortable enabling our agents in any rum+nodejs application. That said, the results are probably not well suited for user consumption as currently instrumented - the results are most useful to the developers of the application, mostly us in the case of Kibana.

I'd be happy to help coordinate pulling the right folks in from APM depending your goals. @dgieselaar in particular has put a lot of effort into improving both kibana instrumentation and node agent capabilities and can likely provide expertise across topics.

@joshdover
Copy link
Contributor

That said, the results are probably not well suited for user consumption as currently instrumented - the results are most useful to the developers of the application, mostly us in the case of Kibana.

I think we need to be adding human-readable transactions and spans for key user interactions. For example, visualization renders, dashboard renders, async search, discover search, alert executions, action executions etc. With these in place, we should be able to give users much more detailed information about what user-created data may be causing problems in their cluster.

Some folks from the Kibana Core team and APM team are meeting today to discuss what's possible and craft a plan of next steps for leveraging APM more widely. I'll report back here with our results.

@joshdover
Copy link
Contributor

While APM doesn't yet provide a curated UI for doing this type of analysis, the raw data is indeed present and with a few manual steps in Discover we can easily trace a slow Elasticsearch query to a specific HTTP route or Task Manager task in Kibana today, without any changes to Elasticsearch. For many use cases, this should solve the problem.

APM doesn't have a UI that allows you to filter for specific span types, however it's possible to drop into Discover and find Elasticsearch queries originating from Kibana that are slower than 60s with a filter like: span.subtype: "elasticsearch" and span.duration.us >= 60000000. You can then pull the trace.id field from a query you're interested in and filter by that trace ID in the APM UI to find the specific Kibana transaction (eg. HTTP route, Task Manager task) this query originated from, along with any other traces of the same transaction.

Because a large portion of our HTTP routes are plugin-specific, this should allow us to easily determine which plugin is triggering ES slow queries in many cases.

For Kibana HTTP routes that are used by many different UIs, we'll need to do some additional work in adding instrumentation to the client-side in order to be able to tease out which UI triggered the underlying slow query. I think we can solve this with a combination of Core-level changes (eg. adding the plugin ID information to the HTTP client exposed to plugins on core.http.fetch) and plugin-specific changes to add more detailed information about the calling UI (eg. the data plugin's search utilities).

All this said, this is dependent on the customer being able to turn on the APM instrumentation. There's plenty of things we could do to make this easier for customers (eg. add a --send-support-diagnostics CLI flag that enables the APM instrumentation and sends data to a central cluster that we control). Other options include writing the APM output data to disk and having the customer send us the logs.

Is there any reason to believe that we'd still need an alternative non-APM solution for this? If not, I think focusing our efforts on improving the existing APM instrumentation would give us the best ROI on our time.

@mshustov
Copy link
Contributor

Yesterday we had a meeting with the Elasticsearch team about improving request traceability across the Stack.
In short-team, we decided to start leveraging existing APM infrastructure for code instrumentation, request correlation, and traceability UI. It allows Kibana users to debug performance bottle-neck in the user data flow already in the next release.
However, it comes with the cost:

  • instrumentation overhead, which negatively impacts Kibana performance
  • an additional APM server to process data is required
  • an additional APM cluster to store data is required
  • Kibana server should be re-configured to enable or disable the APM agent

In the long term, we consider building a custom solution to overcome the problem listed above.
This solution would not be a substitute for all the existing APM functionality, but rather it will complement it with a focus on a specific use-case to correlate a slow Elasticsarch query to a source in Kibana (visualization, SIEM detection engine, alert, etc.) without the necessity to spin up additional servers and with a smaller runtime overhead.

@kobelb
Copy link
Contributor Author

kobelb commented May 28, 2021

During a standup today, @cyrille-leclerc mentioned the W3C distributed trace context: Baggage as a potential solution for communicating the plugin-specific information from the client-side in the distributed traces. I was told that @dgieselaar knows the details of this. Have we looked into taking advantage of this? If it'd solve the APM solution, could we use it for the long-term solution where this context is included in the appropriate Elasticsearch logs?

@cyrille-leclerc
Copy link
Contributor

Thanks @kobelb . @jasonrhodes and I will be happy to help.
I'm not clear on the comments regarding the overhead of instrumentation. Do we have this cost as well if tracing is disabled? I think we could reuse as many distributed tracing standards as possible (W3C Trace Context, Propagation format for distributed trace context: Baggage, a standard/durable tracing API) and enable traces just when needed if the performance impact is too high. This should propagate all the context even when tracing is disabled.

@joshdover
Copy link
Contributor

joshdover commented May 31, 2021

Have we looked into taking advantage of this? If it'd solve the APM solution, could we use it for the long-term solution where this context is included in the appropriate Elasticsearch logs?

The primary challenge we have right now is not the transmission of the context from the client to the server, but the actual passing of the context data itself through all the application layers to the client-side code that initiates requests to the server.

This is a challenge in JavaScript due to it's async, event-driven nature which prevents us from easily being able to determine which user action triggered which Elasticsearch queries without manual code changes. On the server, Node.js's async_hooks API allows us to associate data with asynchronous "stack frames" in order to connect the initiating context with any asynchronous service calls. The Node.js APM agent uses this API already to provide context correlation.

On the browser, no such API exists so we'll need to manually instrument each call path to ensure context is traced through each layer in our JavaScript services on the client. There is some degree of out-of-the-box global context tracing we can achieve, but for heavily used common services like the data plugin (which backs many Kibana applications), doing this in an automatic way is not currently possible.

We can definitely leverage the Baggage proposal for transmission of this context, however I don't believe the Baggage proposal will help us with actually delivering this context from the UI layers the JS service layers where we make the requests to Kibana's server side. The OpenTelemetry (aka OpenTracing) protocol that APM already supports does use the Baggage proposal, but at this time APM uses it's own format for sending key-value pairs.

I do think using Baggage for sending key-value pair context information for direct use by Elasticsearch's slow log would be a better option over overloading the x-opaque-id header. However, I think it'd be acceptable to use x-opaque-id until the Elasticsearch team can schedule implementing support for the propagating the Baggage header to the slow log.

I'm not clear on the comments regarding the overhead of instrumentation. Do we have this cost as well if tracing is disabled?

This initial conversation about instrumentation overhead was in reference the overhead of using APM as a storage output for this context data. Now that we know we can leverage Elasticsearch's slow log directly without necessarily needing to use APM, we need to investigate whether or not we can implement this tracing without the same overhead as the APM agent.

It's likely that we'll want to use that async_hooks API on the server-side for any non-APM tracing. Using this API allows us to avoid a lot of manual instrumentation for context passing, however it does incur some overhead. @mshustov did just find that this API saw a considerable performance increase in Node.js v16.2 (nodejs/node#36394). We plan to upgrade to v16 as soon as it is in the LTS phase, which is targeted for release in October.

In summary:

  1. There's a considerable amount of effort needed to trace an Elasticsearch query initiated by Kibana back to its source in the UI.
  2. Either an APM-based or slow log-based solution will require that we implement this tracing on the client-side. An APM-based solution gives us one benefit of not needing to implement as much context tracing fundamentals on the server-side.
  3. The performance overhead of adding this context tracing using our own async_hooks implementation is unknown at this time. It's likely not zero, but it may be low enough to be acceptable to enable this tracing by default. The alternative is more manual tracing work within Kibana.
  4. In the short-term, overloading the x-opaque-id header on requests to Elasticsearch seems like a fine approach until we can get support for the Baggage header from the Elasticsearch team, which would serve as a better long-term solution to make consumption of this data much more practical + searchable.

@kobelb
Copy link
Contributor Author

kobelb commented Jun 1, 2021

Thanks for the great explanation, @joshdover. I appreciate you and your team putting such careful thought into this.

@jasonrhodes
Copy link
Member

jasonrhodes commented Jun 1, 2021

@joshdover do you have everything you need from the observability side on this? I know @graphaelli has been involved, but feel free to bring myself and/or @Mpdreamz (APM tech lead) in at any point if we can help support this effort. Thanks for the update!

@mshustov
Copy link
Contributor

mshustov commented Jun 2, 2021

Problem statement

Elastic Stack has multiple components, while Kibana has got multiple layers of abstraction.
In case of slowness, it might be hard to understand whether the problem originated from Kibana itself (excessive number of requests to Elasticsearch or another service, complex computations in Kibana applications, etc) or Elasticsearch server (a complex query, a large time range, misconfigured indices, etc).

When a Kibana user wants to understand why it takes a minute to render a dashboard or why requests to Elasticsearch queries stuck in the request queue, they need to answer the following questions:

  • What service initiated a request?
    It might be triggered by user interaction in UI, a task scheduled by Kibana itself, a user hitting a Kibana endpoint, a Stack solution hitting a Kibana endpoint, etc.
    A Kibana user should be able to track an operation to a source triggering it.

  • What services did the request go through?
    Often Kibana plugins do not query the Elasticsearch server directly but use other plugins abstract access to data storage. Without tracing the operation through the services it might be impossible to understand which plugin issues a request or defines query parameters causing performance problems.

  • What time did we spend in every intermediate service?
    A single request retrieving data from Elasticsearch might cause several requests performed by Kibana for internal needs (checking the permission level for a given user, storing data in the access log, etc.). These requests can have an impact both on perceived performance for the end-user and the Elasticsearch server handling a large number of incoming requests.

  • Where did an error happen?
    An exception might be raised at any level deep in the Kibana code, it should be possible to unwind the error stack trace to the source of the operation and collect parameters causing the problem.

In the current issue, we focus on the use case covering 80% of slowness problems in the Stack: detecting a slow Elasticsarch query and tracing it back to a Kibana plugin triggering the query.
Thereby, we are trying to answer the single question: What service initiated a request?

In the long term, Kibana should provide tooling providing insights about Stack performance, Stack component interactions with the ability to drill to every component in the system. This functionality is out of the scope of the current issue.

Functional requirements

A Kibana user should be able:

  • to detect Elasticsearch slow queries.
  • to trace the slow query to a user-facing feature in Kibana
    • Visualization (vis type + id)
    • Task (type + id)
    • Request handler (endpoint path)
  • to detect what domain-specific parameters causing the performance problem. Optional if it can be derived from a query.
  • to distinguish between several sources initiating an operation in parallel

Implementation options

APM agent

Kibana is already shipped with a built-in Elastic APM agent. The agent instruments Kibana code to collect runtime stats. The runtime stats are sent to an APM server to be processed and stored in the APM cluster within Elasticsearch.

Use case
What service initiated a request? No, requires manual instrumentation to correlate a transaction with a user actions in UI or plugin triggering a request
What services did the request go through?
  • - doesn’t support async tracing in the browser
  • - doesn’t correlate browser requests to the kibana server with transactions in UI
  • + supports async tracing on Kibana server
  • + correlates Kibana server requests to Elasticsearch with server transactions.
What time did we spend in every intermediate service? Requires manual instrumentation to mark plugin-specific operations in an async stack trace.
Where did an error happen? Requires manual instrumentation to add plugin-specific context to the async stack trace.

Cons:

  • Requires the APM server running
  • Requires deployment of APM clusters in Elasticsearch
  • Management overhead for APM data stored in Elasticsearch
  • Doesn’t support async tracing in the browser
  • Doesn’t integrate with Elasticserch slow query logging
  • Doesn’t support alternative output destinations (Kibana logs, Kibana slow logs, telemetry, etc.)

Custom solution

Right now, the APM agent doesn’t support integration with Elasticserch slow query logging, async tracing in the browser. These are critical requirements for the current issue and most likely, it will not be possible to fix these drawbacks quickly That’s why it may make sense to start with a custom solution and consider switching to the APM agent later.
Below I will use execution context term with the meaning of a container for runtime execution meta data to emphasize it's not the same as an APM transaction.

Use case
What service initiated a request? Supports creating execution context manually in server- and client-side environments, carry over execution context information between Kibana parts and Stack components.
What services did the request go through? not supported
What time did we spend in every intermediate service? not supported
Where did an error happen? not supported

Cons:

  • the necessity to support another custom solution
  • duplicates some of the existing APM functionality on the server
  • requires additional code instrumentation. esp. even for places already instrumented with APM agent (Core, task_manager)

Client-side

  • A plugin creates an execution context with api provided by Core. Returned value is opaque to the plugin.
const executionContext = createExecutionContext({
  name: plugin name,
  type: plugin element name,
  url?: /path/to/element/in/ui
  description?: human readable description
})
  • Obtained execution context should be passed to the Kibana server. Right now we have two options:
    • Use an equivalent of async_hook API in the browser environment (for example, zone.js) to store and retrieve execution context before transmitting a request to the Kibana server over the wire. To investigate: how much overhead the package adds and what network libs should be instrumented (kibana.http, XMLHttpRequest, axios, anything else?)
    • Pass execution context manually through all the layers of abstractions in Kibana and set it as a request header before issuing a request to the Kibana server. To investigate: how much refactoring it would require to pass the execution context meta information, maybe it’s not that hard considering expressions plugin passes request parameters via ExecutionContext for every request.
  • execution context is serialized and added as request's Baggage header every time Kibana issues a request to the Kibana server.

Server-side

  • Kibana Core intercepts all the incoming requests and retrieves execution context from Baggage header. The value is stored in the Asynchronous Local Storage until the server responds to the request. As a bonus, Core could create an APM span populated with data retrieved from execution context
  • Kibana Core retrieves the execution context from Asynchronous Local Storage whenever a plugin issues an Elasticsearch request and attaches the value to x-opaque-id header. execution context can be emitted to Kibana elasticsearch service logs for debugging purposes to avoid the necessity to enable slow logs in the Elasticsearch server.
  • A Kibana Plugin can use createExecutionContext on the server-side as well, it can be used by server-side plugin initiating an asynchronous operation (such as task_manager). In this case, Core stores the created execution context value in Asynchronous Local Storage and attaches it to Elasticsearch request as described above.

Open questions:

  • Should Kibana enable tracing by default, or should it be configurable?
  • Where should Kibana write the tracking information?
    • Logging
    • Elasticsearch storage
    • Request header (x-opaque-id, baggage)
  • Should Kibana provide UI for the tracing functionality?
  • What minimal set of data allows tracing the request back to a source? See an example of ES slow query log record?
  • What plugins should be reported as sources?
  • Should Kibana support creating nested execution contexts?
  • Who is responsible for instrumenting the Kibana code?

@kobelb @joshdover and others. I'd appreciate your input on the implementation options and the Open questions section.

My thoughts:

Should Kibana enable tracing by default, or should it be configurable?

Depends on the final implementation overhead. Maybe it should be enabled by default to be accessible by Kibana or Elastcisearch. Elasticsearch slow logging is configured via logging level

Where should Kibana write the tracking information?

Required: x-opaque-id or another header when requesting Elasticserch service.
Optional: Kibana logs

Should Kibana provide UI for the tracing functionality?

not at the current stage

What minimal set of data allows tracing the request back to a source?

  • a Kibana component name. should be part of public API.
  • A link to component (link to a dashboard, visualization, alert) to inspect it in UI.
  • Human-readable description of component (a dashboard title, alert name)
  • What plugins should be reported as sources

in order of priority:

  • Visualizations
  • Lens
  • TSVB
  • Line charts
  • Discover
  • Tasks
    • Actions
    • Alerts
    • Reporting
  • Request handler
  • Observability / APM / Security solutions
  • Should Kibana support creating nested execution contexts?

If we encounter such a use case.

  • Who is responsible for instrumenting the Kibana code?

To speed up the process, it makes sense to have Core team do it, in advance consulting with the owners of the code.


@jasonrhodes @graphaelli @Mpdreamz does it make sense for you if the Core team implements a custom solution instead of relying on APM for this specific use case of tracing an Elasticsearch query to a Kibana source? What do we lose by giving up the APM solution?

@Mpdreamz
Copy link
Member

Mpdreamz commented Jun 2, 2021

Short reply will go in depth deeper tomorrow but I feel strongly we should be eachother's worst/best customer.

If we built something custom now it'll be a missed opportunity to invest in a deep stack integration.

For instance @SylvainJuge explored first class support for APM within Elasticsearch at several levels. Even if Elasticsearch does not ship with APM anytime soon we could look at it supporting trace headers and persisting these to Elasticsearch logs in addition to x-opaque-id.

This way even if APM is not installed kibana can still manually correlate traces rather then building a custom scheme for this.

@trentm
Copy link
Member

trentm commented Jun 4, 2021

Let me back up a bit to make sure I understand some goals/decisions.

A large part of this effort is about getting some tracing info ("where did this slow ES query originate?") without relying on having an APM server setup. Is this based on the assumption/understanding/agreement that in many scenarios it isn't practical or likely that the owner of that stack deployment will setup to use the APM system? And that is still true for something like --send-support-diagnostics that @joshdover mentioned?

add a --send-support-diagnostics CLI flag that enables the APM instrumentation and sends data to a central cluster that we control

I am not challenging, just trying to understand.


Add support for explicitly disabling sending to APM server, which will save on time spent serializing those events.

I don't know the apm agent internals, but we might need to refactor buffer logic to FIFO as well. From the maxQueueSize description: If the queue is full, events are rejected which means you will lose transactions and spans.

No. At least the way I think we'd implement "hey APM agent, do your thing, but there is no APM server for now", there would be no buffering of events at all.


2\. As soon as the APM nodejs client is able to work without having to run the APM server, we can switch from the custom solution for the Kibana server to the solution based on `APM nodejs agent`. ...

To be clear, the current Node.js APM agent will currently work without an APM server, it will just log.error a bit and have some wasted overhead.

Are you thinking of using baggage (both the header, and generic Baggage-related APM agent APIs) similar to what I described above to transfer context info between client and server? If so, that would require that the client-side (Kibana browser app) be sending context info in the baggage header and Baggage format. I'm not sure if that would require some support added to the RUM APM agent. (There is otel baggage util TypeScript code here, if that would help.)

There are two separate Node.js APM agent pieces of work here:

  • Make the agent have a nicer run mode where it doesn't expect to have an APM server.
  • Add baggage support.

Given @trentm's notes, this seems pretty close already. Trent, is it possible to get a rough idea of how implementing this could fit into your roadmap:

  • Add support for explicitly disabling sending to APM server, which will save on time spent serializing those events.
  • Look into a Async Local Storage-based context manager for the APM agent. It should eventually grow one.

I will bring it up with my team so we can discuss prioritization and get back here. Running without APM server and baggage support I expect will be fine. Adding Async Local Storage-based context management is probably unlikely for the short term.


given that the RUM agent can commit to providing this feature at some point

@joshdover, did you mean "the RUM agent cannot commit"?

@mshustov
Copy link
Contributor

mshustov commented Jun 7, 2021

add a --send-support-diagnostics CLI flag that enables the APM instrumentation and sends data to a central cluster that we control

Yes, Elasticsearch already provides slow query tracing. From the UX point of view, it makes sense to focus on end-to-end integration for the entire stack because the proposed alternative has some drawbacks:

  • users have to enable debug mode explicitly - it's hard to sport and investigate sporadic problems
  • users are limited to use APM UI and cannot use other alternative communication channels: logging, REST API
  • someone has to spin up and maintain that APM infrastructure. Even if we provide centralized storage, some users don't want to provide their data for analysis or cannot do that from an internal corporate network.

To be clear, the current Node.js APM agent will currently work without an APM server, it will just log.error a bit and have some wasted overhead.

As I understand, the APM client hasn't been created with this use case in mind. I would wait until this mode is officially supported. Even if if just log.error a bit and have some wasted overhead., it adds unnecessary noise in logging output, which might increase log storage size and hurts the usefulness of the logs.

Are you thinking of using baggage (both the header, and generic Baggage-related APM agent APIs) similar to what I described above to transfer context info between client and server?

Probably, the mechanism of data transfer between the Kibana browser app and the server app is an internal implementation detail. I don't think that Add baggage support. is a mandatory requirement for the first stage. We can switch to it later, once the APM nodejs agent supports it.

@mshustov
Copy link
Contributor

mshustov commented Jun 7, 2021

On the Kibana side, I'd rather we opt towards the manual context passing approach until we have something more sophisticated available from the RUM agent,

I started looking at zone.js, even though it's leveraged by serious projects like opentelemetry there are a few problems that might be considered as critical for this option:

  • it's reported to add a significant runtime overhead. Kibana doesn't have the tooling to measure how bad it affects the runtime performance in the browser.
  • it doesn't support native async/await syntax. Of course, we can work it around with downgrading the compilation target, but the problem doesn't seem to be fixable until zone.js-like functionality is provided natively.
  • it seriously affects the debuggability of the code.

@joshdover
Copy link
Contributor

@joshdover, did you mean "the RUM agent cannot commit"?

No, I meant that we should probably avoid building our own async context tracing solution in the browser if the RUM agent can commit to providing one for us in the future. That way we don't duplicate any efforts.

However, given the current state of the art though, it doesn't appear that there may be any viable solution to this yet. I'd love to find out more about why the TC39 proposal for Zones was withdrawn and if there's a replacement proposal in the works.

@trentm
Copy link
Member

trentm commented Jun 8, 2021

Regarding changes for the Node.js agent:

  1. "Add support for explicitly disabling sending to APM server, which will save on time spent serializing those events." This sounds reasonable. I'll get a ticket created for this today (or @mshustov feel free to create one). I'll see about getting that on the 7.14 (or is 7.15 the timeframe you are working on here?) milestone. -> add support for explicitly disabling sending to APM server apm-agent-nodejs#2101 (targeting 7.15)
  2. "Look into a Async Local Storage-based context manager for the APM agent. It should eventually grow one." I don't believe this is something we'll get to in the short term, so I don't have a strong opinion whether or not we create an issue for it now.
  3. "Add baggage support." I can create an issue for this as well. I'll then create a draft PR for my PoC from earlier. Given this is optional and that there is likely a straightforward workaround (Kibana server-side code could use an apm agent "label" and manually add the x-opaque-id header itself), this will probably get a lower priority. -> add W3C baggage support apm-agent-nodejs#2102

@felixbarny
Copy link
Member

  • Requires the APM server running
  • Requires deployment of APM clusters in Elasticsearch
  • Management overhead for APM data stored in Elasticsearch
  • Doesn’t integrate with Elasticserch slow query logging
  • Doesn’t support alternative output destinations (Kibana logs, Kibana slow logs, telemetry, etc.)

These should be addressed with elastic/apm-agent-nodejs#2101 and by adding traceparent-based log correlation to both Kibana and Elasticsearch.

Using log correlation, it will be possible to correlate the logs form Kibana and Elasticsearch to a particular request.
The way I see it, this is more powerful than propagating proprietary/custom context to the baggage header.
As mentioned by others already, I'd also like to see ES to move away from X-Opaque-Id in favor of adding the trace.id from the traceparent header to the logs.

If and when APM is enabled (I'm advocating for turning it on by default on cloud but that's a different discussion), we wouldn't even necessarily need the slow log but could query the APM index for slow ES queries. It would then be possible to view the full request from frontend to backend so that we exactly know where a slow query originates from, rather than just which component.

  • Doesn’t support async tracing in the browser

Seems like this needs manual context propagation in the short to mid-term. But instead of doing something completely custom for Kibana, I'd suggest using the RUM agent's API to do the context propagation. I believe those APIs don't exist today but I think it's worth investing in adding them rather than doing something outside of APM.

@mshustov
Copy link
Contributor

These should be addressed with elastic/apm-agent-nodejs#2101 and by adding traceparent-based log correlation to both Kibana and Elasticsearch.

@felixbarny we are going to switch to APM solution as soon elastic/apm-agent-nodejs#2101 is done. As I can see, the APM agent team is planning to work on it during v7.15. In the meantime, Kibana can start with a custom solution not being blocked by APM agent improvements. I believe the APM-based solution will be able to reuse most of the logic introduced during #101587, namely: Core code instrumentation, Kibana plugin code instrumentation, integration tests.

I believe those APIs don't exist today but I think it's worth investing in adding them rather than doing something outside of APM.

Agree, until it's provided by the APM RUM agent, we will have to use a custom solution. Is there an issue to track the progress of the enhancement?

I'd also like to see ES to move away from X-Opaque-Id in favor of adding the trace.id from the traceparent header to the logs.

It's planned to be done at the later stage as well #101708
Let me know if are concerned about the order of improvements or you can see a better way to organize this work.

@felixbarny
Copy link
Member

As I can see, the APM agent team is planning to work on it during v7.15. In the meantime, Kibana can start with a custom solution not being blocked by APM agent improvements.

What's your timeline to add that to Kibana? If we'd add that in the early 7.14 or early 7.15 time frame, can we avoid the interim solution?

Let me know if are concerned about the order of improvements or you can see a better way to organize this work.

I fear that working on a custom solution now will make it much harder to migrate to an APM-based solution in the future.

Agree, until it's provided by the APM RUM agent, we will have to use a custom solution. Is there an issue to track the progress of the enhancement?

I don't think it's completely clear what functionality you'd need from the RUM agent. Could you open a ticket on https://github.com/elastic/apm-agent-rum-js or another Kibana issue to elaborate on the things you need?

@AlexanderWert
Copy link
Member

AlexanderWert commented Jun 16, 2021

@mshustov we reprioritized elastic/apm-agent-nodejs#2101 and will address this next (after current in-progress stuff on Node.js Agent), so you can hopefully move forward with that very soon.

@mshustov
Copy link
Contributor

We've spent several days talking with @felixbarny about implementing this functionality and improving the observability for the whole Stack at the same time. I updated the #101587 description to reflect the current plan. Feel free to chime in.

@mshustov
Copy link
Contributor

mshustov commented Jul 8, 2021

The Elasticsearch performance team is interested in analyzing the timing of requests issued by Kibana to make sure any changes in the Elasticsearch server do not lead to regression in Kibana UI performance.
Their current approach with relying on APM RUM agent capturing queries after a page load does not work well, since Kibana data management might perform a search request with a delay after the page load event, so APM agent cannot associate a request with page load transaction. Also data plugin sends search requests in a batch, which might break the APM RUM agent instrumentation of the outbound requests.

Yesterday we had a meeting with @gingerwizard, @danielmitterdorfer, @vigneshshanmugam, @wylieconlon, @ppisljar about improving the Kibana client-side instrumentation with the APM RUM agent.
@gingerwizard is working on collecting requirements for performance metrics the Elasticsearch performance team wants to get from the Kibana side.
Depending on the requirements we will start an investigation for the Kibana side instrumentation.

Some notes from the meeting:
If we are interested in tracing of search queries only, the AppServices team might instrument the search session mechanism to create a parent transaction for all the spans wrapping search requests issued while a search session is active. The result won't be very precise but might be good enough to reflect the influence of ES performance on Kibana UI.

A more detailed tracing mechanism might require Kibana plugins to instrument their data management code manually.

@arisonl
Copy link
Contributor

arisonl commented Nov 3, 2022

@lukeelmers @rudolf @kobelb

  • what is the current state of conversation on this?
  • on the custom vs APM debate: would a shorter-term solution based on the traceparent standard be usable in the longer term "right solution" (distributed tracing)? My high level understanding is that this is possible, in which case, awesome!

Context: Recent important escalation:

@lukeelmers
Copy link
Member

lukeelmers commented Nov 3, 2022

@arisonl I think it depends on what specific question we are trying to answer.

If the question is "Today can we trace slow Elasticsearch queries to their origin in Kibana?", the answer is yes, in 8.1+, you can trace queries back to Dashboards/Visualizations/Lens/Discover/Alerts/Maps using Kibana's execution_context feature (enabled by default), which passes this info along in the x-opaque-id header. (Docs)

If the question is "Today can we trace slow Elasticsearch queries back to a specific user?", the answer is yes, with some caveats (You need to be using audit logs, and can't be using server.requestId.allowFromAnyIp or server.requestId.ipAllowlist settings in Kibana. Long story, but I have a writeup on this I can point you to if it helps.)

If the question is "Today can we trace slow Elasticsearch queries using traceparent?", the answer is no, we're still using x-opaque-id with the execution_context feature. To support this, APM would need to be enabled on a Kibana deployment, and we'd have to do some work on our end to make sure everything still works as expected and audit log correlation doesn't break.

(@rudolf @kobelb keep me honest if I've missed anything on this) 😉

@felixbarny
Copy link
Member

What are the main blockers that prevent you to use APM for these use cases? We've made several improvements in the Node.js agent to improve its performance and to make it possible to use it only for context propagation purposes, without it reaching out to the server. In the meantime, Elasticsearch has also adopted APM so that we can have full distributed tracing across Kibana and Elasticsearch, which should IMO be superior to manual correlation of a slowlog file.
We also do have the vision of enabling APM on cloud by default. If we don't all agree on that we should have another discussion to align our vision.

The APM Team has a big interest in making this a success and we're here to support you.

@arisonl
Copy link
Contributor

arisonl commented Nov 7, 2022

@felixbarny thank you, this sounds great. Some problems related to the APM approach were summarized earlier in this thread but this is over a year old now. And APM distributed tracing does sound like the right end state. We have many requests asking to provide a way for ES queries to be traceable to user actions in Kibana or system initiated tasks (such as alerts) and so I think that we should pick this conversation up and review where we stand and what the most reasonable next steps would be. In addition, we also have use cases that involve being able to trace back in cross cluster search scenarios.

@felixbarny
Copy link
Member

we should pick this conversation up and review where we stand and what the most reasonable next steps would be

I agree. It's hard to keep track of this long thread. Maybe we should close this and create a new one which describes the current state, the remaining issues, and the next steps. Who can own doing that?

we also have use cases that involve being able to trace back in cross cluster search scenarios.

Generally, CCS should just work™️ with APM.

@lukeelmers
Copy link
Member

At this point I don't think we are blocked on anything here. It mostly comes down to:

  • Needing to resolve this issue on the core side to ensure the switch to traceparent doesn't break audit logs
  • Making sure we have product alignment on the requirement of having APM enabled
    • the existing execution_context service is enabled by default, so we would need to ask customers to set up APM if they want to do this type of query tracing moving forward (I'm not really too worried about that, but @arisonl maybe you have thoughts on it)

@rudolf
Copy link
Contributor

rudolf commented Mar 15, 2023

I think we have already achieved the outcome in the issue description. As per https://www.elastic.co/guide/en/kibana/master/kibana-troubleshooting-trace-query.html we can attribute an Elasticsearch slowlog query to the dashboard/visualisation that caused it. Not all apps capture the execution context but there is a generic solution being tracked in #132629

While APM can provide even richer data I'm not sure this is the right tool for most users use case:

  1. APM does not sample every request so unlike the slowlogs it might not always have the context of that one query that brought the cluster to its knees
  2. APM spans are really technical and include a lot of detail that would overwhelm the average user. Users typically just want to know which dashboard and user caused the problem, not every single span associated with this trace id.

There is one piece of critical information that's currently missing which is who created the query, the Elasticsearch team is investigating exposing the user id's from the slowlog. So I'm going ahead and close this issue, feel free to re-open if there's shortcomings in the current approach that I missed that we should address.

Having said that, there might be benefit in Kibana switching to APM's execution context instead of our own implementation.
@felixbarny #101711 lists the APM blockers that prevents us from doing that.

@rudolf rudolf closed this as completed Mar 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New value added to drive a business result performance Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:Performance
Projects
None yet
Development

No branches or pull requests