Integrate ES with APM #87696

pugnascotia · 2022-06-15T13:47:26Z

Supersedes #84631. Part of #84369.

This PR takes the proof-of-concept work for adding tracing to Elasticsearch and carries it to an initial completion. It adds an apm-integration X-Pack plugin, which uses the OpenTelemetry API to create spans. Also if APM is enabled, Elasticsearch is started using the APM java agent as a Java agent. This is responsible for ships spans and other metrics to an APM server.

There are also a number of changes that are concerned with expressing some activity in ES more clearly in terms of tasks, since this work maps tasks to spans.

Notes to reviewers

Start with the file TRACING.md, it provides useful context for this work. If you want to see it in action, I can point you at an existing APM server, or give you a docker-compose.yml which will let you run your own stack with distributed tracing.

I have run several benchmarks with APM disabled, and could see nothing consistently slower. I still need to run some benchmarks with APM enabled, though my hope is that this isn't a blocker to merging, and that we can iterate post-merge if there are hot areas.

I would particularly like feedback about the config mechanism.

Todo:

Run benchmarks with APM enabled
Docs

…c#80762) This pull request adds support for HTPP/gRPC with the OpenTelemetry SDK in the `apm-integration` plugin. It adds the required security permissions to the plugin to make it work - we should try to reduce this list one day. The plumbing to pass Elastic's Cloud APM server credentials is not fully there yet but the current `ApmIT` test can be executed with `-Dtests.apm.endpoint=https://... -Dtests.apm.token=ABC` and traces should be sent to Elastic APM. _Works on my machine™_

Task now implements the Traceable interface. This allows other traceable entities to be developed.

…lastic#80758) Pass a task's span context via the ThreadContext to parent nested tasks.

Adds a test case that creates and populates a couple of multi-shard indices and executes a search against them.

…tion

* Can be turned off with xpack.security.authz.tracing: false * AuthZ is tied to relevant task by traceparent (it however does not cover the first authZ) * x-opaque-id is auto configured at rest layer if not already exists. This helps chain all relevant actions together. * ApmIT now has security enabled.

…earch into feature/apm-integration

…80871) This commit adds a dynamic cluster setting called `xpack.apm.tracing.names.include` which allows the user to filter on the names of the transactions for which tracing is enabled.

…ration

elasticmachine · 2022-06-15T13:47:33Z

Pinging @elastic/es-delivery (Team:Delivery)

elasticmachine · 2022-06-15T13:47:33Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

elasticmachine · 2022-06-15T13:47:33Z

Pinging @elastic/es-distributed (Team:Distributed)

pugnascotia · 2022-06-15T13:51:41Z

build-tools/src/main/java/org/elasticsearch/gradle/testclusters/ElasticsearchNode.java

-                    Files.copy(file, dest);
-                }
-            }
+            final Path distConfigDir = getDistroDir().resolve("config");


These changes were ultimately unnecessary, but I've kept them because the make it possible to retain the config file hierarchy when starting a test cluster.

pugnascotia · 2022-06-15T13:52:24Z

...ibution/tools/server-cli/src/main/java/org/elasticsearch/server/cli/BootstrapJvmOptions.java

@@ -42,7 +42,7 @@ public static List<String> bootstrapJvmOptions(Path plugins) throws IOException
    private static List<PluginInfo> getPluginInfo(Path plugins) throws IOException {
        final List<PluginInfo> pluginInfo = new ArrayList<>();

-        final List<Path> pluginDirs = Files.list(plugins).collect(Collectors.toList());
+        final List<Path> pluginDirs = listFiles(plugins);


These changes just ensure that the Stream from Files.list(Path) is closed.

Can you split this out to a separate PR?

rjernst

This PR is quite large. Can it be split out into a few PRs? For example, working through the Tracing interface and initialization before anything actually uses it, then adding the task/etc uses of tracer.

Some overall thoughts after a skim of the code:

The tracer looks very similar to logging. I wonder if we could configure it statically like logging, so that we could have eg TraceManager.getTracer(...). This would seem closer to the openTelemetry object in the open telemetry api, ie the thing to get a tracer from.
The tracer in open telemetry just has a single method to get a span. Why not replicate this API?

rjernst · 2022-06-21T22:32:54Z

...ibution/tools/server-cli/src/main/java/org/elasticsearch/server/cli/BootstrapJvmOptions.java

@@ -42,7 +42,7 @@ public static List<String> bootstrapJvmOptions(Path plugins) throws IOException
    private static List<PluginInfo> getPluginInfo(Path plugins) throws IOException {
        final List<PluginInfo> pluginInfo = new ArrayList<>();

-        final List<Path> pluginDirs = Files.list(plugins).collect(Collectors.toList());
+        final List<Path> pluginDirs = listFiles(plugins);


Can you split this out to a separate PR?

rjernst · 2022-06-21T22:37:02Z

distribution/tools/server-cli/src/main/java/org/elasticsearch/server/cli/JvmOptionsParser.java

+        try (KeyStoreWrapper keyStoreWrapper = KeyStoreWrapper.load(config)) {
+            if (keyStoreWrapper != null) {
+                try {
+                    keyStoreWrapper.decrypt(args.keystorePassword().clone().getChars());


If we need the keystore here, we should rework ServerArgs so we load the keystore just once (see where we load it in ServerCli to find the password).

henningandersen

++ on splitting into multiple PRs, otherwise getting the right folks to review the individual parts will be difficult. I'd think all the libs and the Tracer interface itself could go in one PR and then each usage more or less in its own PR.

pugnascotia · 2022-06-22T10:18:45Z

I'll work on splitting this up. It grew over time.

The tracer looks very similar to logging. I wonder if we could configure it statically like logging, so that we could have eg TraceManager.getTracer(...). This would seem closer to the openTelemetry object in the open telemetry api, ie the thing to get a tracer from.

@rjernst at the moment we can't configure it statically, because the tracer implementation lives in a module. We could move this into server, if you'd be OK with that? Thinking about it, this is now more practical with the modularization work, since we can avoid exporting anything that we want to keep private.

The tracer in open telemetry just has a single method to get a span. Why not replicate this API?

I'm not sure I follow, can you elaborate. A better comparator would be the SpanBuilder in the OTel API, since having started building a span, you need to make various calls on the builder before finally getting a Span out of it.

Part of elastic#84369. Split out from elastic#87696. Rework how some work is executed by creating child tasks for them, so that when traced by APM, it results in more meaningful parent and child tasks in the UI. It also improves how Elasticsearch is modelling the work.

rjernst · 2022-06-30T05:19:28Z

the tracer implementation lives in a module. We could move this into server, if you'd be OK with that?

I could see a small part of this being in server.

The tracer in open telemetry just has a single method to get a span. Why not replicate this API?

I'm not sure I follow, can you elaborate. A better comparator would be the SpanBuilder in the OTel API

I was looking at the Tracer class in the opentelemetry java sdk. That class has a single method to return a SpanBuilder. It looks stuff from their SpanBuilder is much of what is on the Tracer in this PR. Why the collapse? ie why is there no SpanBuilder here?

Something else that I could not understand from my brief reading of this PR is what thread safety or multi threaded expectations exist in in this new api (mostly on the Tracer class, which appears to be a single instance for the node).

breskeby · 2022-06-30T10:50:58Z

TRACING.md

+
+Instead, when Elasticsearch bootstraps itself, it compiles all APM settings
+together, including any `secret_key` or `api_key` values from the ES keystore,
+and writes out a temporary APM config file containin all static configuration


Suggested change

and writes out a temporary APM config file containin all static configuration

and writes out a temporary APM config file containing all static configuration

Part of #84369. Split out from #87696. Rework how some work is executed by creating child tasks for them, so that when traced by APM, it results in more meaningful parent and child tasks in the UI. It also improves how Elasticsearch is modelling the work.

pugnascotia · 2022-07-07T13:48:17Z

@rjernst

I was looking at the Tracer class in the opentelemetry java sdk. That class has a single method to return a SpanBuilder. It looks stuff from their SpanBuilder is much of what is on the Tracer in this PR. Why the collapse? ie why is there no SpanBuilder here?

The Tracer isn't trying to replicate the OTel API, just present enough of a surface area to allow useful tracing. The actual business of using the OTel API with how we model work with tasks in ES is actually quite fiddly, so even if we did have a builder class, we'd still need to centralise some of the code to make sure we were always building spans correctly.

Part of the problem is that we might start a span on one thread and end it on another, which makes using the OTel (or OTel-like) API harder. It expects you to use thread-locals to propagate context automatically, and so we have to do something different. There's also all the faffing about to get distributed tracing to work across nodes. Encapsulating away the details of the OTel API is better, IMO.

If we later found it useful to be able to return a builder from the tracer then we can add that, but right now the idea is that we pass something that is traceable to the tracer, and it takes care of translating that to whatever tracing API we're using.

Something else that I could not understand from my brief reading of this PR is what thread safety or multi threaded expectations exist in in this new api (mostly on the Tracer class, which appears to be a single instance for the node).

We expect there to be a single APMTracer per node. The only point of concurrency in that class is the spans property, which is a ConcurrentHashMap. After that, we rely on the OTel API implementation to properly handle concurrency, which in our case is the APM Java Agent.

Part of #84369. Split out from #87696. Introduce tracing interfaces in advance of adding APM support to Elasticsearch. The only implementation at this point is a no-op class.

pugnascotia · 2022-07-26T16:47:19Z

This PR has been broken up into smaller PRs.

tlrx and others added 30 commits November 15, 2021 08:50

Base module for APM Tracing (elastic#80705)

9325cbf

First set of deps (elastic#80720)

22ecf6f

Integrate tracer with task manager (elastic#80721)

ba34d70

Merge branch 'master' into feature/apm-integration

13eddb9

Add Traceable interface (elastic#80788)

34239d4

Task now implements the Traceable interface. This allows other traceable entities to be developed.

Capture task span context in thread context to parent nested tasks (e…

2f51e5f

…lastic#80758) Pass a task's span context via the ThreadContext to parent nested tasks.

[APM] Add multi-shard search test case (elastic#80792)

a23bf66

Adds a test case that creates and populates a couple of multi-shard indices and executes a search against them.

Merge branch 'master' into feature/apm-integration

69caefd

Remove unused TracingPlugin interface (elastic#80799)

3a304f2

single service + few attributes

8bd8a24

tune a few minor things

110bb00

adding dynamic setting xpack.apm.tracing.enabled (elastic#80796)

4a1a899

Merge branch 'feature/apm-integration' into sylvain

d03a457

Spotless

9f68a26

Merge remote-tracking branch 'origin/master' into feature/apm-integra…

76a9414

…tion

Merge branch 'feature/apm-integration' of github.com:elastic/elastics…

a7e2359

…earch into feature/apm-integration

use otel sem attributes when we can

ed6223c

Merge branch 'master' into feature/apm-integration

79f0da3

Trace recoveries and cluster state updates (elastic#80875)

d95c634

Add xpack.apm.tracing.names.include setting for filtering (elastic#…

0d58db7

…80871) This commit adds a dynamic cluster setting called `xpack.apm.tracing.names.include` which allows the user to filter on the names of the transactions for which tracing is enabled.

Merge remote-tracking branch 'upstream/master' into feature/apm-integ…

04c76c6

…ration

Merge remote-tracking branch 'upstream/master' into feature/apm-integ…

7e0c606

…ration

Fix compilation issue

8132527

Update SHAs

84b558d

Compilation fix

3badd42

Tweaks

3d35bd4

Formatting

2e3aba1

Fix 3rd party errors

36c8943

pugnascotia requested a review from DaveCTurner June 15, 2022 13:47

elasticmachine added Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. Team:Core/Infra Meta label for core/infra team Team:Delivery Meta label for Delivery team labels Jun 15, 2022

pugnascotia changed the title ~~Apm integration with agent~~ Integration ES with APM Jun 15, 2022

pugnascotia changed the title ~~Integration ES with APM~~ Integrate ES with APM Jun 15, 2022

pugnascotia commented Jun 15, 2022

View reviewed changes

pugnascotia added 3 commits June 15, 2022 14:53

Remove debug gradle config

52fd4d3

Fix typo

5e0892b

Remove run script

4912eec

pugnascotia mentioned this pull request Jun 15, 2022

Integrate ES with APM #84631

Closed

rjernst reviewed Jun 21, 2022

View reviewed changes

henningandersen reviewed Jun 22, 2022

View reviewed changes

This was referenced Jun 22, 2022

Refactor tasks to improve APM support #87917

Merged

Introduce tracing interfaces #87921

Merged

breskeby reviewed Jun 30, 2022

View reviewed changes

pugnascotia mentioned this pull request Jul 11, 2022

Provide tracing implementation using OpenTelemetry + APM agent #88443

Merged

elasticsearchmachine changed the base branch from master to main July 22, 2022 23:06

pugnascotia closed this Jul 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integrate ES with APM #87696

Integrate ES with APM #87696

pugnascotia commented Jun 15, 2022

elasticmachine commented Jun 15, 2022

elasticmachine commented Jun 15, 2022

elasticmachine commented Jun 15, 2022

pugnascotia Jun 15, 2022

pugnascotia Jun 15, 2022

rjernst Jun 21, 2022

rjernst left a comment

rjernst Jun 21, 2022

rjernst Jun 21, 2022

henningandersen left a comment

pugnascotia commented Jun 22, 2022

rjernst commented Jun 30, 2022

breskeby Jun 30, 2022

pugnascotia commented Jul 7, 2022

pugnascotia commented Jul 26, 2022

	and writes out a temporary APM config file containin all static configuration
	and writes out a temporary APM config file containing all static configuration

Integrate ES with APM #87696

Integrate ES with APM #87696

Conversation

pugnascotia commented Jun 15, 2022

Notes to reviewers

elasticmachine commented Jun 15, 2022

elasticmachine commented Jun 15, 2022

elasticmachine commented Jun 15, 2022

pugnascotia Jun 15, 2022

Choose a reason for hiding this comment

pugnascotia Jun 15, 2022

Choose a reason for hiding this comment

rjernst Jun 21, 2022

Choose a reason for hiding this comment

rjernst left a comment

Choose a reason for hiding this comment

rjernst Jun 21, 2022

Choose a reason for hiding this comment

rjernst Jun 21, 2022

Choose a reason for hiding this comment

henningandersen left a comment

Choose a reason for hiding this comment

pugnascotia commented Jun 22, 2022

rjernst commented Jun 30, 2022

breskeby Jun 30, 2022

Choose a reason for hiding this comment

pugnascotia commented Jul 7, 2022

pugnascotia commented Jul 26, 2022