Instrument Elasticsearch with APM #84369

Closed
17 of 21 tasks
pugnascotia opened this issue Feb 24, 2022 · 13 comments
Assignees
Labels
:Core/Infra/Core Core issues without another label :Delivery/Tooling Developer tooliing and automation >feature Team:Core/Infra Meta label for core/infra team Team:Delivery Meta label for Delivery team

Comments

@pugnascotia
Contributor

pugnascotia commented Feb 24, 2022

NOTE: this issue will evolve as we scope out this work.

Description

"Why is Elasticsearch slow?" is a common question from users. We have tools to investigate certain aspects of this question already, for instance the search slowlog (good if the shard-level searches are slow) and the hot threads API (good if the slowness is an ongoing thing) but there are many gaps too. For instance, how would we discover that a Kibana dashboard triggers unreasonably many searches if each of those searches completes fairly quickly? How would we discover that requests are spending unexpectedly long in queues? How do we see if the slow steps all involve a particular node? What if that node is on a remote cluster? It's hard to take a structured approach to performance questions with the tools we have today.

Distributed tracing is a great way to answer questions of this nature. Elastic has a distributed tracing product, APM, which sits on top of Elasticsearch, but today Elasticsearch itself is opaque to APM: we cannot trace the execution of a request through Elasticsearch. Let's fix that.

This work will build on an existing exploratory project that instrumented a number of "tasks" in Elasticsearch. More types of tasks will be instrumented, as well as requests / responses at the REST level.

Tasks

Out-of-scope

The focus of this work is instrumenting Elasticsearch for Elastic's own purposes. Making it available to users and licensing it for that purpose is not currently in scope.

@pugnascotia pugnascotia added >feature :Core/Infra/Core Core issues without another label :Delivery/Tooling Developer tooliing and automation labels Feb 24, 2022
@pugnascotia pugnascotia self-assigned this Feb 24, 2022
@elasticmachine elasticmachine added Team:Delivery Meta label for Delivery team Team:Core/Infra Meta label for core/infra team labels Feb 24, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

pugnascotia added a commit to pugnascotia/elasticsearch that referenced this issue Jun 22, 2022
Part of elastic#84369. Split out from elastic#87696. Rework how some work is executed
by creating child tasks for them, so that when traced by APM, it results
in more meaningful parent and child tasks in the UI. It also improves
how Elasticsearch is modelling the work.
pugnascotia added a commit that referenced this issue Jul 5, 2022
Part of #84369. Split out from #87696. Rework how some work is executed
by creating child tasks for them, so that when traced by APM, it results
in more meaningful parent and child tasks in the UI. It also improves
how Elasticsearch is modelling the work.
@lizozom

lizozom commented Jul 20, 2022

Do you have a rough estimate what would be the initial release version of this?

elasticsearchmachine pushed a commit that referenced this issue Jul 25, 2022
Part of #84369. Split out from #87696. Introduce tracing interfaces in
advance of adding APM support to Elasticsearch. The only implementation
at this point is a no-op class.
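
The commit message above describes an interface-first approach: callers code against a tracing interface whose only implementation is a no-op. A minimal sketch of that pattern follows. The names `Tracer`, `startTrace`, `stopTrace` and `NOOP` are illustrative assumptions here, not necessarily the signatures used in Elasticsearch.

```java
import java.util.Map;

class NoopTracerDemo {
    /** Runs some work inside a span; with Tracer.NOOP the tracing calls do nothing. */
    static int tracedSum(Tracer tracer, int a, int b) {
        tracer.startTrace("task-1", "sum", Map.of("a", a, "b", b));
        try {
            return a + b;
        } finally {
            tracer.stopTrace("task-1");
        }
    }

    public static void main(String[] args) {
        System.out.println(tracedSum(Tracer.NOOP, 2, 3));
    }
}

/** Hypothetical tracing interface; the method names are assumptions for illustration. */
interface Tracer {
    /** Begins a span identified by spanId, with a readable name and attributes. */
    void startTrace(String spanId, String name, Map<String, Object> attributes);

    /** Ends the span previously started under spanId. */
    void stopTrace(String spanId);

    /** Implementation that does nothing, so callers never need a null check. */
    Tracer NOOP = new Tracer() {
        @Override
        public void startTrace(String spanId, String name, Map<String, Object> attributes) {}

        @Override
        public void stopTrace(String spanId) {}
    };
}
```

Because call sites depend only on the interface, a real implementation can later be swapped in without touching them.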
pugnascotia added a commit to pugnascotia/elasticsearch that referenced this issue Jul 28, 2022
Split out from elastic#88443. Part of elastic#84369. Use the tracing API that was
added in elastic#87921 in TaskManager. This won't actually do anything until we
provide a tracer with an actual implementation.
pugnascotia added a commit that referenced this issue Jul 28, 2022
Split out from #88443. Part of #84369. Use the tracing API that was
added in #87921 in TaskManager. This won't actually do anything until we
provide a tracer with an actual implementation.
pugnascotia added a commit to pugnascotia/elasticsearch that referenced this issue Jul 28, 2022
Part of elastic#84369. Split out from elastic#88443. This PR wraps parts of the code
in a new tracing context. This is necessary so that a tracing
implementation can use the thread context to propagate tracing headers,
but without the code attempting to set the same key twice in the thread
context, which is illegal.

Note that in some places we actually clear the tracing context
completely. This is done where the operation to be performed should have
no association with the current trace context. For example, when
creating a new index via a REST request, the resulting background tasks
for the index should not be associated with the REST request in
perpetuity.
pugnascotia added a commit that referenced this issue Aug 2, 2022
Part of #84369. Split out from #88443. This PR wraps parts of the logic in
`InternalExecutePolicyAction` in a new tracing context. This is
necessary so that a tracing implementation can use the thread context
to propagate tracing headers, but without the code attempting to set the
same key twice in the thread context, which is illegal.
pugnascotia added a commit that referenced this issue Aug 2, 2022
Part of #84369. Split out from #88443. This PR wraps parts of the logic in
`AsyncTaskManagementService` in a new tracing context. This is
necessary so that a tracing implementation can use the thread context
to propagate tracing headers, but without the code attempting to set the
same key twice in the thread context, which is illegal.
pugnascotia added a commit that referenced this issue Aug 2, 2022
Part of #84369. Split out from #88443. This PR wraps parts of the logic in
`TransportSubmitAsyncSearchAction` in a new tracing context. This is
necessary so that a tracing implementation can use the thread context
to propagate tracing headers, but without the code attempting to set the
same key twice in the thread context, which is illegal.
pugnascotia added a commit that referenced this issue Aug 2, 2022
Part of #84369.

ML uses the task framework to register a task for each loaded model.
These tasks are not executed in the usual sense, and it does not make
sense to trace them using APM. Therefore, make it possible to register
a task without also starting tracing.
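
One way to picture "register a task without also starting tracing" is an overload with an opt-out flag. The sketch below is hypothetical: `TaskRegistry`, `RecordingTracer` and the `traceRequest` parameter are names invented for illustration and are not taken from the Elasticsearch code base.

```java
import java.util.ArrayList;
import java.util.List;

class TaskRegistrationDemo {
    /** Stand-in for a real tracer: records which task names had tracing started. */
    static class RecordingTracer {
        final List<String> started = new ArrayList<>();

        void startTrace(String taskName) {
            started.add(taskName);
        }
    }

    /** Hypothetical task registry with an opt-out flag for tracing. */
    static class TaskRegistry {
        private final RecordingTracer tracer;
        private final List<String> tasks = new ArrayList<>();

        TaskRegistry(RecordingTracer tracer) {
            this.tracer = tracer;
        }

        /** Registers a task and starts tracing it (the default behaviour). */
        int register(String name) {
            return register(name, true);
        }

        /** Registers a task; tracing starts only when traceRequest is true. */
        int register(String name, boolean traceRequest) {
            tasks.add(name);
            if (traceRequest) {
                tracer.startTrace(name);
            }
            return tasks.size(); // number of registered tasks, used here as a simple id
        }
    }

    public static void main(String[] args) {
        RecordingTracer tracer = new RecordingTracer();
        TaskRegistry registry = new TaskRegistry(tracer);
        registry.register("search");            // traced
        registry.register("ml-model", false);   // registered but never traced
        System.out.println(tracer.started);
    }
}
```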
pugnascotia added a commit that referenced this issue Aug 3, 2022
Part of #84369. Split out from #88443. This PR wraps parts of the code
in a new tracing context. This is necessary so that a tracing
implementation can use the thread context to propagate tracing headers,
but without the code attempting to set the same key twice in the thread
context, which is illegal. In order to avoid future diff noise, the wrapped
code has mostly been refactored into methods.

Note that in some places we actually clear the tracing context
completely. This is done where the operation to be performed should have
no association with the current trace context. For example, when
creating a new index via a REST request, the resulting background tasks
for the index should not be associated with the REST request in
perpetuity.
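
The constraint described in the commit message above (a thread-context key may be set only once, so child work needs a fresh tracing context) can be sketched with a simplified stand-in. `SimpleThreadContext`, `newTraceContext`, `clearTraceContext` and the `parent_traceparent` key are assumptions made for illustration; only the `traceparent` header name comes from the W3C Trace Context convention.

```java
import java.util.HashMap;
import java.util.Map;

class TraceContextDemo {
    static final String TRACE_HEADER = "traceparent";               // W3C trace header name
    static final String PARENT_TRACE_HEADER = "parent_traceparent"; // hypothetical key

    /** Handle that restores the previous headers when closed. */
    interface StoredContext extends AutoCloseable {
        @Override
        void close();
    }

    /** Simplified stand-in for a thread context: each header may be set only once. */
    static class SimpleThreadContext {
        private Map<String, String> headers = new HashMap<>();

        void putHeader(String key, String value) {
            if (headers.putIfAbsent(key, value) != null) {
                throw new IllegalArgumentException("value for key [" + key + "] already present");
            }
        }

        String getHeader(String key) {
            return headers.get(key);
        }

        /** Moves the current trace header aside as the parent, so a child span can set its own. */
        StoredContext newTraceContext() {
            final Map<String, String> saved = headers;
            Map<String, String> fresh = new HashMap<>(saved);
            String parent = fresh.remove(TRACE_HEADER);
            if (parent != null) {
                fresh.put(PARENT_TRACE_HEADER, parent);
            }
            headers = fresh;
            return () -> headers = saved;
        }

        /** Drops all tracing headers, for work that should not be tied to the current trace. */
        StoredContext clearTraceContext() {
            final Map<String, String> saved = headers;
            Map<String, String> fresh = new HashMap<>(saved);
            fresh.remove(TRACE_HEADER);
            fresh.remove(PARENT_TRACE_HEADER);
            headers = fresh;
            return () -> headers = saved;
        }
    }

    public static void main(String[] args) {
        SimpleThreadContext ctx = new SimpleThreadContext();
        ctx.putHeader(TRACE_HEADER, "rest-request-span");
        try (StoredContext ignored = ctx.newTraceContext()) {
            ctx.putHeader(TRACE_HEADER, "child-task-span"); // legal: the key was moved aside
            System.out.println(ctx.getHeader(PARENT_TRACE_HEADER));
        }
        System.out.println(ctx.getHeader(TRACE_HEADER)); // original header restored on close
    }
}
```

In this sketch `clearTraceContext` corresponds to the index-creation case: background work started inside it carries no trace headers at all.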
pugnascotia added a commit that referenced this issue Aug 3, 2022
Part of #84369. Implement the `Tracer` interface by providing a
module that uses OpenTelemetry, along with Elastic's APM
agent for Java.

See the file `TRACING.md` for background on the changes and the
reasoning for some of the implementation decisions.

The configuration mechanism is the most fiddly part of this PR. The
Security Manager permissions required by the APM Java agent make
it prohibitive to start an agent from within Elasticsearch
programmatically, so it must be configured when the ES JVM starts.
That means that the startup CLI needs to assemble the required JVM
options.

To complicate matters further, the APM agent needs a secret token
in order to ship traces to the APM server. We can't use Java system
properties to configure this, since otherwise the secret will be
readable to all code in Elasticsearch. It therefore has to be
configured in a dedicated config file. This in itself is awkward,
since we don't want to leave secrets in config files. Therefore,
we pull the APM secret token from the keystore, write it to a config
file, then delete the config file after ES starts.

There's a further issue with the config file. Any options we set
in the APM agent config file cannot later be reconfigured via system
properties, so we need to make sure that only "static" configuration
goes into the config file.

I generated most of the files under `qa/apm` using an APM test
utility (I can't remember which one now, unfortunately). The goal
is to set up a complete system so that traces can be captured in
APM server, and the results in Elasticsearch inspected.
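
The configuration flow in the commit message above (assemble JVM options at startup, write the secret token to a dedicated config file, delete that file once the JVM is up) might look roughly like the sketch below. This is not the code from the PR. The class and method names and the agent jar path are hypothetical, and the option names `elastic.apm.config_file`, `server_url` and `secret_token` reflect my understanding of the APM Java agent's settings, so treat them as assumptions. The secret is passed in as a plain parameter here; in the real flow it would come from the keystore.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class ApmAgentConfigSketch {
    /** Writes only "static" agent settings, including the secret, to a temporary file. */
    static Path writeAgentConfig(Path dir, String serverUrl, String secretToken) {
        try {
            Path file = Files.createTempFile(dir, "apm-agent-", ".properties");
            Files.write(file, List.of("server_url=" + serverUrl, "secret_token=" + secretToken));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds JVM options that point the agent at the file; the secret never appears here. */
    static List<String> jvmOptions(Path agentJar, Path configFile) {
        return List.of("-javaagent:" + agentJar, "-Delastic.apm.config_file=" + configFile);
    }

    /** Removes the config file once the server JVM has started and read it. */
    static void deleteAgentConfig(Path configFile) {
        try {
            Files.deleteIfExists(configFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path tmp = Path.of(System.getProperty("java.io.tmpdir"));
        Path config = writeAgentConfig(tmp, "http://localhost:8200", "example-token");
        // The jar path is a placeholder, not the real location of the agent.
        System.out.println(jvmOptions(Path.of("modules/apm/elastic-apm-agent.jar"), config));
        deleteAgentConfig(config); // in the real flow this happens after ES has started
    }
}
```

Keeping the secret out of the `-D` options matters because system properties are readable by any code running in the JVM, which is the concern the commit message raises.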
@pugnascotia
Contributor Author

> Do you have a rough estimate what would be the initial release version of this?

@lizozom It should be in 8.5.0.

@pugnascotia
Contributor Author

I think we've done everything that we intended to be covered by this issue 🎉

@nicholas-r-king

@pugnascotia is this ever going to be released publicly or will customers ever be able to take advantage of tracing/APM metrics for monitoring our Elasticsearch clusters?

@pugnascotia
Contributor Author

It depends what you mean by "publicly" - on-prem customers could use this today. Cloud customers cannot, since APM data collection is not multi-tenant. If a Cloud customer has an issue that requires APM data to resolve, they'd have to engage with Support. This would likely be necessary in any case, since the APM data is a very low-level tool for investigating Elasticsearch issues.

@nicholas-r-king

We are an enterprise customer with several on-prem installations (aws/govcloud) and have a use for this.

@pugnascotia
Contributor Author

In that case you can definitely configure APM yourselves. We don't have user-facing documentation yet, but you can consult TRACING.md for how to get started.

@ramdaspotale
Contributor

ramdaspotale commented Sep 25, 2024

Hi @pugnascotia, is there a similar feature available for Kibana?

I run Elasticsearch on our own, and I want to understand whether it is possible to pass additional resource attributes such as data_stream.dataset and data_stream.namespace, so that I can send these traces to a separate data stream in Elasticsearch.

@philippkahr
Contributor

@ramdaspotale, yes: you have traces@custom as an ingest pipeline, and there you can do whatever you want to the data (the same goes for metrics and logs). You would do that after the data is sent, not inside APM.

Just keep in mind that changing the data_stream and namespace can have negative consequences if you do not do it correctly (index templates, component templates).

Use the reroute processor (https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html).
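
As a rough illustration of that suggestion, the sketch below builds (but does not send) a request that defines `traces@custom` with a single `reroute` processor. The condition, the `checkout` service name, the `team_a` namespace and the base URL are made-up placeholders, and the exact processor options should be checked against the reroute processor documentation linked above.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class RerouteSketch {
    /** Pipeline body: reroute traces of one (hypothetical) service to its own namespace. */
    static final String PIPELINE_BODY = """
        {
          "processors": [
            {
              "reroute": {
                "if": "ctx.service?.name == 'checkout'",
                "namespace": "team_a"
              }
            }
          ]
        }
        """;

    /** Builds the PUT request for the traces@custom pipeline without sending it. */
    static HttpRequest putPipeline(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_ingest/pipeline/traces@custom"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(PIPELINE_BODY))
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = putPipeline("http://localhost:9200");
        System.out.println(request.method() + " " + request.uri());
        System.out.print(PIPELINE_BODY);
    }
}
```

Sending the request would additionally need an `HttpClient` and whatever authentication the cluster requires.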

@ramdaspotale
Contributor

Hi @philippkahr - I am using the elastic/apm-data#201 feature from APM Server to segregate traces coming from different applications in our environment, so that searches against each application can be done separately without overloading ES.

Using the traces@custom ingest pipeline with the reroute processor would add some load on ES, given how busy our prod Elasticsearch is.

As this feature is already there in APM 8.13, I was curious whether I could use it in this scenario as well: add data_stream.dataset and data_stream.namespace, and rest assured that the routing will happen automatically.

@DaveCTurner
Contributor

DaveCTurner commented Sep 25, 2024

Thanks very much for your interest in Elasticsearch @ramdaspotale.

This appears to be a user question, and we'd like to direct these kinds of things to the Elasticsearch forum. If you can move this conversation there, we'd appreciate it. This allows us to use GitHub for verified bug reports, feature requests, and pull requests.

There's an active community in the forum that should be able to help get an answer to your question. As such, I hope you don't mind that I am marking this thread as resolved.

@elastic elastic locked as resolved and limited conversation to collaborators Sep 25, 2024
7 participants