Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate ES with APM #87696

Closed
wants to merge 109 commits into from

Conversation

pugnascotia
Copy link
Contributor

Supersedes #84631. Part of #84369.

This PR takes the proof-of-concept work for adding tracing to Elasticsearch and carries it to an initial completion. It adds an apm-integration X-Pack plugin, which uses the OpenTelemetry API to create spans. Also if APM is enabled, Elasticsearch is started using the APM java agent as a Java agent. This is responsible for ships spans and other metrics to an APM server.

There are also a number of changes that are concerned with expressing some activity in ES more clearly in terms of tasks, since this work maps tasks to spans.

Notes to reviewers

Start with the file TRACING.md, it provides useful context for this work. If you want to see it in action, I can point you at an existing APM server, or give you a docker-compose.yml which will let you run your own stack with distributed tracing.

I have run several benchmarks with APM disabled, and could see nothing consistently slower. I still need to run some benchmarks with APM enabled, though my hope is that this isn't a blocker to merging, and that we can iterate post-merge if there are hot areas.

I would particularly like feedback about the config mechanism.

Todo:

  • Run benchmarks with APM enabled
  • Docs

tlrx and others added 30 commits November 15, 2021 08:50
…c#80762)

This pull request adds support for HTPP/gRPC with the OpenTelemetry SDK
in the `apm-integration` plugin. It adds the required security
permissions to the plugin to make it work - we should try to reduce this
list one day. The plumbing to pass Elastic's Cloud APM server
credentials is not fully there yet but the current `ApmIT` test can be
executed with `-Dtests.apm.endpoint=https://... -Dtests.apm.token=ABC`
and traces should be sent to Elastic APM. _Works on my machine™_
Task now implements the Traceable interface. This allows other traceable
entities to be developed.
…lastic#80758)

Pass a task's span context via the ThreadContext to parent nested tasks.
Adds a test case that creates and populates a couple of multi-shard
indices and executes a search against them.
* Can be turned off with xpack.security.authz.tracing: false
* AuthZ is tied to relevant task by traceparent (it however does not
  cover the first authZ)
* x-opaque-id is auto configured at rest layer if not already exists.
  This helps chain all relevant actions together.
* ApmIT now has security enabled.
…80871)

This commit adds a dynamic cluster setting called `xpack.apm.tracing.names.include`
which allows the user to filter on the names of the transactions for which
tracing is enabled.
@pugnascotia pugnascotia requested a review from DaveCTurner June 15, 2022 13:47
@elasticmachine elasticmachine added Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. Team:Core/Infra Meta label for core/infra team Team:Delivery Meta label for Delivery team labels Jun 15, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@pugnascotia pugnascotia changed the title Apm integration with agent Integration ES with APM Jun 15, 2022
@pugnascotia pugnascotia changed the title Integration ES with APM Integrate ES with APM Jun 15, 2022
Files.copy(file, dest);
}
}
final Path distConfigDir = getDistroDir().resolve("config");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes were ultimately unnecessary, but I've kept them because the make it possible to retain the config file hierarchy when starting a test cluster.

@@ -42,7 +42,7 @@ public static List<String> bootstrapJvmOptions(Path plugins) throws IOException
private static List<PluginInfo> getPluginInfo(Path plugins) throws IOException {
final List<PluginInfo> pluginInfo = new ArrayList<>();

final List<Path> pluginDirs = Files.list(plugins).collect(Collectors.toList());
final List<Path> pluginDirs = listFiles(plugins);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes just ensure that the Stream from Files.list(Path) is closed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you split this out to a separate PR?

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is quite large. Can it be split out into a few PRs? For example, working through the Tracing interface and initialization before anything actually uses it, then adding the task/etc uses of tracer.

Some overall thoughts after a skim of the code:

  • The tracer looks very similar to logging. I wonder if we could configure it statically like logging, so that we could have eg TraceManager.getTracer(...). This would seem closer to the openTelemetry object in the open telemetry api, ie the thing to get a tracer from.
  • The tracer in open telemetry just has a single method to get a span. Why not replicate this API?

@@ -42,7 +42,7 @@ public static List<String> bootstrapJvmOptions(Path plugins) throws IOException
private static List<PluginInfo> getPluginInfo(Path plugins) throws IOException {
final List<PluginInfo> pluginInfo = new ArrayList<>();

final List<Path> pluginDirs = Files.list(plugins).collect(Collectors.toList());
final List<Path> pluginDirs = listFiles(plugins);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you split this out to a separate PR?

try (KeyStoreWrapper keyStoreWrapper = KeyStoreWrapper.load(config)) {
if (keyStoreWrapper != null) {
try {
keyStoreWrapper.decrypt(args.keystorePassword().clone().getChars());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need the keystore here, we should rework ServerArgs so we load the keystore just once (see where we load it in ServerCli to find the password).

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ on splitting into multiple PRs, otherwise getting the right folks to review the individual parts will be difficult. I'd think all the libs and the Tracer interface itself could go in one PR and then each usage more or less in its own PR.

@pugnascotia
Copy link
Contributor Author

I'll work on splitting this up. It grew over time.

The tracer looks very similar to logging. I wonder if we could configure it statically like logging, so that we could have eg TraceManager.getTracer(...). This would seem closer to the openTelemetry object in the open telemetry api, ie the thing to get a tracer from.

@rjernst at the moment we can't configure it statically, because the tracer implementation lives in a module. We could move this into server, if you'd be OK with that? Thinking about it, this is now more practical with the modularization work, since we can avoid exporting anything that we want to keep private.

The tracer in open telemetry just has a single method to get a span. Why not replicate this API?

I'm not sure I follow, can you elaborate. A better comparator would be the SpanBuilder in the OTel API, since having started building a span, you need to make various calls on the builder before finally getting a Span out of it.

pugnascotia added a commit to pugnascotia/elasticsearch that referenced this pull request Jun 22, 2022
Part of elastic#84369. Split out from elastic#87696. Rework how some work is executed
by creating child tasks for them, so that when traced by APM, it results
in more meaningful parent and child tasks in the UI. It also improves
how Elasticsearch is modelling the work.
@rjernst
Copy link
Member

rjernst commented Jun 30, 2022

the tracer implementation lives in a module. We could move this into server, if you'd be OK with that?

I could see a small part of this being in server.

The tracer in open telemetry just has a single method to get a span. Why not replicate this API?

I'm not sure I follow, can you elaborate. A better comparator would be the SpanBuilder in the OTel API

I was looking at the Tracer class in the opentelemetry java sdk. That class has a single method to return a SpanBuilder. It looks stuff from their SpanBuilder is much of what is on the Tracer in this PR. Why the collapse? ie why is there no SpanBuilder here?

Something else that I could not understand from my brief reading of this PR is what thread safety or multi threaded expectations exist in in this new api (mostly on the Tracer class, which appears to be a single instance for the node).


Instead, when Elasticsearch bootstraps itself, it compiles all APM settings
together, including any `secret_key` or `api_key` values from the ES keystore,
and writes out a temporary APM config file containin all static configuration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
and writes out a temporary APM config file containin all static configuration
and writes out a temporary APM config file containing all static configuration

pugnascotia added a commit that referenced this pull request Jul 5, 2022
Part of #84369. Split out from #87696. Rework how some work is executed
by creating child tasks for them, so that when traced by APM, it results
in more meaningful parent and child tasks in the UI. It also improves
how Elasticsearch is modelling the work.
@pugnascotia
Copy link
Contributor Author

@rjernst

I was looking at the Tracer class in the opentelemetry java sdk. That class has a single method to return a SpanBuilder. It looks stuff from their SpanBuilder is much of what is on the Tracer in this PR. Why the collapse? ie why is there no SpanBuilder here?

The Tracer isn't trying to replicate the OTel API, just present enough of a surface area to allow useful tracing. The actual business of using the OTel API with how we model work with tasks in ES is actually quite fiddly, so even if we did have a builder class, we'd still need to centralise some of the code to make sure we were always building spans correctly.

Part of the problem is that we might start a span on one thread and end it on another, which makes using the OTel (or OTel-like) API harder. It expects you to use thread-locals to propagate context automatically, and so we have to do something different. There's also all the faffing about to get distributed tracing to work across nodes. Encapsulating away the details of the OTel API is better, IMO.

If we later found it useful to be able to return a builder from the tracer then we can add that, but right now the idea is that we pass something that is traceable to the tracer, and it takes care of translating that to whatever tracing API we're using.

Something else that I could not understand from my brief reading of this PR is what thread safety or multi threaded expectations exist in in this new api (mostly on the Tracer class, which appears to be a single instance for the node).

We expect there to be a single APMTracer per node. The only point of concurrency in that class is the spans property, which is a ConcurrentHashMap. After that, we rely on the OTel API implementation to properly handle concurrency, which in our case is the APM Java Agent.

@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:06
elasticsearchmachine pushed a commit that referenced this pull request Jul 25, 2022
Part of #84369. Split out from #87696. Introduce tracing interfaces in
advance of adding APM support to Elasticsearch. The only implementation
at this point is a no-op class.
@pugnascotia
Copy link
Contributor Author

This PR has been broken up into smaller PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cloud-deploy Publish cloud docker image for Cloud-First-Testing :Core/Infra/Core Core issues without another label :Delivery/Tooling Developer tooliing and automation :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. >feature Team:Core/Infra Meta label for core/infra team Team:Delivery Meta label for Delivery team Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.