Instrument Elasticsearch with APM #84369
Comments
Pinging @elastic/es-delivery (Team:Delivery)
Pinging @elastic/es-core-infra (Team:Core/Infra)
Part of elastic#84369. Split out from elastic#87696. Rework how some operations are executed by creating child tasks for them, so that when traced by APM, they show up as more meaningful parent and child tasks in the UI. This also improves how Elasticsearch models the work.
Do you have a rough estimate of the initial release version for this?
Split out from elastic#88443. Part of elastic#84369. Use the tracing API that was added in elastic#87921 in TaskManager. This won't actually do anything until we provide a tracer with an actual implementation.
Part of elastic#84369. Split out from elastic#88443. This PR wraps parts of the code in a new tracing context. This is necessary so that a tracing implementation can use the thread context to propagate tracing headers, but without the code attempting to set the same key twice in the thread context, which is illegal. Note that in some places we actually clear the tracing context completely. This is done where the operation to be performed should have no association with the current trace context. For example, when creating a new index via a REST request, the resulting background tasks for the index should not be associated with the REST request in perpetuity.
Part of #84369. Split out from #88443. This PR wraps parts of the logic in `InternalExecutePolicyAction` in a new tracing context. This is necessary so that a tracing implementation can use the thread context to propagate tracing headers, but without the code attempting to set the same key twice in the thread context, which is illegal.
Part of #84369. Split out from #88443. This PR wraps parts of the logic in `AsyncTaskManagementService` in a new tracing context. This is necessary so that a tracing implementation can use the thread context to propagate tracing headers, but without the code attempting to set the same key twice in the thread context, which is illegal.
Part of #84369. Split out from #88443. This PR wraps parts of the logic in `TransportSubmitAsyncSearchAction` in a new tracing context. This is necessary so that a tracing implementation can use the thread context to propagate tracing headers, but without the code attempting to set the same key twice in the thread context, which is illegal.
Part of #84369. ML uses the task framework to register a task for each loaded model. These tasks are not executed in the usual sense, and it does not make sense to trace them using APM. Therefore, make it possible to register a task without also starting tracing.
Part of #84369. Split out from #88443. This PR wraps parts of the code in a new tracing context. This is necessary so that a tracing implementation can use the thread context to propagate tracing headers, but without the code attempting to set the same key twice in the thread context, which is illegal. In order to avoid future diff noise, the wrapped code has mostly been refactored into methods. Note that in some places we actually clear the tracing context completely. This is done where the operation to be performed should have no association with the current trace context. For example, when creating a new index via a REST request, the resulting background tasks for the index should not be associated with the REST request in perpetuity.
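For readers following along, the wrapping pattern these PRs describe might look roughly like the sketch below. This is a hedged illustration only: the `newTraceContext()` and `clearTraceContext()` method names, their return types, and the helper methods `runChildWork()` and `startBackgroundTask()` are assumptions based on the descriptions in this thread, not a verified reference to the actual Elasticsearch API.

```java
import org.elasticsearch.common.util.concurrent.ThreadContext;

// Hedged sketch of the trace-context handling described above; the exact
// ThreadContext methods and semantics are assumptions, not the real code.
class TraceContextSketch {
    void dispatch(ThreadContext threadContext) {
        // Open a fresh (child) trace context so the tracing implementation can
        // write its own tracing headers without attempting to set the same
        // thread-context key twice, which is illegal.
        try (ThreadContext.StoredContext ignored = threadContext.newTraceContext()) {
            runChildWork();
        } // the previous trace context is restored when the try block exits

        // Where background work should have no association with the originating
        // request (e.g. long-lived index tasks spawned by a REST call), drop the
        // trace context entirely instead of creating a child of it.
        try (ThreadContext.StoredContext ignored = threadContext.clearTraceContext()) {
            startBackgroundTask();
        }
    }

    private void runChildWork() { /* placeholder for the traced child work */ }

    private void startBackgroundTask() { /* placeholder for untraced background work */ }
}
```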
Part of #84369. Implement the `Tracer` interface by providing a module that uses OpenTelemetry, along with Elastic's APM agent for Java. See the file `TRACING.md` for background on the changes and the reasoning behind some of the implementation decisions.

The configuration mechanism is the most fiddly part of this PR. The Security Manager permissions required by the APM Java agent make it prohibitive to start an agent from within Elasticsearch programmatically, so it must be configured when the ES JVM starts. That means the startup CLI needs to assemble the required JVM options.

To complicate matters further, the APM agent needs a secret token in order to ship traces to the APM server. We can't use Java system properties to configure this, since otherwise the secret would be readable by all code in Elasticsearch. It therefore has to be configured in a dedicated config file. This is itself awkward, since we don't want to leave secrets in config files. Therefore, we pull the APM secret token from the keystore, write it to a config file, then delete the config file after ES starts.

There's a further issue with the config file. Any options we set in the APM agent config file cannot later be reconfigured via system properties, so we need to make sure that only "static" configuration goes into the config file.

I generated most of the files under `qa/apm` using an APM test utility (I can't remember which one now, unfortunately). The goal is to set up a complete system so that traces can be captured in the APM server and the results inspected in Elasticsearch.
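To make the startup flow above a little more concrete, here is a hedged sketch of the kind of JVM options and temporary agent config file the startup CLI might assemble. The agent path, file locations, and property values are illustrative assumptions rather than the actual implementation; `TRACING.md` is the authoritative description.

```
# Hypothetical JVM options added by the startup CLI (agent path is illustrative)
-javaagent:modules/apm/elastic-apm-agent.jar
-Delastic.apm.config_file=/path/to/es-tmp/apm-agent.properties

# Hypothetical contents of the temporary, "static-only" agent config file,
# populated from the Elasticsearch keystore and deleted once ES has started
server_url=https://apm.example.org:8200
secret_token=<value read from the keystore>
service_name=elasticsearch
```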
@lizozom It should be in 8.5.0.
I think we've done everything that we intended to be covered by this issue 🎉
@pugnascotia is this ever going to be released publicly, or will customers ever be able to take advantage of tracing/APM metrics for monitoring our Elasticsearch clusters?
It depends on what you mean by "publicly" - on-prem customers could use this today. Cloud customers cannot, since APM data collection is not multi-tenant. If a Cloud customer had an issue that required APM data to resolve, they'd have to engage with Support. This would likely be necessary in any case, since the APM data is a very low-level tool for investigating Elasticsearch issues.
We are an enterprise customer with several on-prem installations (aws/govcloud) and have a use for this.
In that case you can definitely configure APM yourselves. We don't have user-facing documentation yet, but you can consult TRACING.md for how to get started.
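For anyone who finds this thread before user-facing docs exist, the configuration is roughly of the following shape. This is a hedged example only: the setting names below are assumptions that have changed between releases, so treat the TRACING.md in your version's source tree as the source of truth.

```yaml
# elasticsearch.yml - illustrative sketch; setting names are assumptions
# and have been renamed across versions (consult TRACING.md for your release)
tracing.apm.enabled: true
tracing.apm.agent.server_url: "https://apm.example.org:8200"
```

The APM secret token is expected to live in the Elasticsearch keystore rather than in `elasticsearch.yml` (added with `bin/elasticsearch-keystore add`, under whatever setting name your version's TRACING.md specifies), which matches the keystore-to-temporary-file flow described earlier in this thread.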
Hi @pugnascotia, is there a similar feature available for Kibana? I run Elasticsearch on my own, and I want to understand whether it is possible to pass additional resource attributes like data_stream.dataset, data_stream.namespace, etc., so that I could send these traces to a separate data stream in Elasticsearch.
@ramdaspotale, yes, you can. Just keep in mind that changing the data_stream dataset and namespace can have negative consequences if you do not do it correctly (index templates, component templates). Use the reroute processor (https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html).
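As a concrete illustration of the reroute-processor suggestion above, a custom ingest pipeline could look something like the sketch below; the dataset and namespace values are placeholders, and the `traces@custom` pipeline name is the one mentioned later in this thread, so adapt both to your own index and component templates.

```
PUT _ingest/pipeline/traces@custom
{
  "processors": [
    {
      "reroute": {
        "dataset": "myapp",
        "namespace": "production"
      }
    }
  ]
}
```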
Hi @philippkahr - I am using the elastic/apm-data#201 feature from APM Server to segregate traces coming from different applications in our environment, so that searches against each application can be done separately without overloading ES. Using a traces@custom ingest pipeline with the reroute processor would add some load on ES, given how busy our prod Elasticsearch is, and since this feature is already there in APM 8.13, I was curious whether I could use it in this scenario as well.
Thanks very much for your interest in Elasticsearch @ramdaspotale. This appears to be a user question, and we'd like to direct these kinds of things to the Elasticsearch forum. If you can move this conversation there, we'd appreciate it. This allows us to use GitHub for verified bug reports, feature requests, and pull requests. There's an active community in the forum that should be able to help get an answer to your question. As such, I hope you don't mind that I am marking this thread as resolved.
NOTE: this issue will evolve as we scope out this work.
Description
"Why is Elasticsearch slow?" is a common question from users. We have tools to investigate certain aspects of this question already, for instance the search slowlog (good if the shard-level searches are slow) and the hot threads API (good if the slowness is an ongoing thing) but there are many gaps too. For instance, how would we discover that a Kibana dashboard triggers unreasonably many searches if each of those searches completes fairly quickly? How would we discover that requests are spending unexpectedly long in queues? How do we see if the slow steps all involve a particular node? What if that node is on a remote cluster? It's hard to take a structured approach to performance questions with the tools we have today.
Distributed tracing is a great way to answer questions of this nature. Elastic has a distributed tracing product, APM, which sits on top of Elasticsearch, but today Elasticsearch itself is opaque to APM: we cannot trace the execution of a request through Elasticsearch. Let's fix that.
This work will build on an existing exploratory project that instrumented a number of "tasks" in Elasticsearch. More types of tasks will be instrumented, as well as requests / responses at the REST level.
Tasks
- Make sampling rate configurable - handled by APM agent
- More flexible configuration of connection to APM server (TLS features, proxy support, protocol selection etc.) - handled by APM agent

Out-of-scope
The focus of this work is instrumenting Elasticsearch for Elastic's own purposes. Making it available to users and licensing it for that purpose is not currently in scope.