Integrate ES with APM #87696

Closed
wants to merge 109 commits into from
Changes from 106 commits
9325cbf
Base module for APM Tracing (#80705)
tlrx Nov 15, 2021
22ecf6f
First set of deps (#80720)
tlrx Nov 15, 2021
ba34d70
Integrate tracer with task manager (#80721)
DaveCTurner Nov 15, 2021
13eddb9
Merge branch 'master' into feature/apm-integration
DaveCTurner Nov 15, 2021
d5a2503
Use OpenTelemetry with HTTP/gRPC exporters in apm-integration (#80762)
tlrx Nov 17, 2021
34239d4
Add Traceable interface (#80788)
ywangd Nov 17, 2021
2f51e5f
Capture task span context in thread context to parent nested tasks (#…
dimitris-athanasiou Nov 17, 2021
a23bf66
[APM] Add multi-shard search test case (#80792)
DaveCTurner Nov 17, 2021
69caefd
Merge branch 'master' into feature/apm-integration
DaveCTurner Nov 18, 2021
3a304f2
Remove unused TracingPlugin interface (#80799)
DaveCTurner Nov 18, 2021
8bd8a24
single service + few attributes
SylvainJuge Nov 18, 2021
110bb00
tune a few minor things
SylvainJuge Nov 18, 2021
4a1a899
adding dynamic setting `xpack.apm.tracing.enabled` (#80796)
idegtiarenko Nov 18, 2021
d03a457
Merge branch 'feature/apm-integration' into sylvain
DaveCTurner Nov 18, 2021
9f68a26
Spotless
DaveCTurner Nov 18, 2021
76a9414
Merge remote-tracking branch 'origin/master' into feature/apm-integra…
ywangd Nov 19, 2021
f3f9835
Add tracing for authorization (#80815)
ywangd Nov 19, 2021
a7e2359
Merge branch 'feature/apm-integration' of github.com:elastic/elastics…
SylvainJuge Nov 19, 2021
ed6223c
use otel sem attributes when we can
SylvainJuge Nov 19, 2021
79f0da3
Merge branch 'master' into feature/apm-integration
DaveCTurner Nov 22, 2021
d95c634
Trace recoveries and cluster state updates (#80875)
DaveCTurner Nov 22, 2021
0d58db7
Add `xpack.apm.tracing.names.include` setting for filtering (#80871)
dimitris-athanasiou Nov 22, 2021
04c76c6
Merge remote-tracking branch 'upstream/master' into feature/apm-integ…
pugnascotia Mar 1, 2022
7e0c606
Merge remote-tracking branch 'upstream/master' into feature/apm-integ…
pugnascotia Mar 3, 2022
8132527
Fix compilation issue
pugnascotia Mar 3, 2022
84b558d
Update SHAs
pugnascotia Mar 3, 2022
3badd42
Compilation fix
pugnascotia Mar 3, 2022
3d35bd4
Tweaks
pugnascotia Mar 3, 2022
2e3aba1
Formatting
pugnascotia Mar 3, 2022
36c8943
Fix 3rd party errors
pugnascotia Mar 3, 2022
eed58a6
Merge remote-tracking branch 'upstream/master' into feature/apm-integ…
pugnascotia Mar 8, 2022
99b948c
WIP - hacks to make distributed tracing work
pugnascotia Mar 11, 2022
2dec258
WIP - trying to get REST tracing working
pugnascotia Mar 15, 2022
4e7f9dc
WIP - more messing around
pugnascotia Mar 16, 2022
11dc5e1
HACK HACK HACK
pugnascotia Mar 16, 2022
e7cca58
OMG I think it's working
pugnascotia Mar 16, 2022
461226b
Seems to be working now :tada:
pugnascotia Mar 17, 2022
b646a69
Hacks to try to use the APM Java agent
pugnascotia Mar 17, 2022
fddc9e8
Merge remote-tracking branch 'upstream/master' into feature/apm-integ…
pugnascotia Mar 18, 2022
64adb49
Formatting
pugnascotia Mar 18, 2022
cd6f1ef
Improve REST tracing
pugnascotia Mar 21, 2022
f04c6ff
Update to latest APM agent
pugnascotia Mar 22, 2022
b772714
Don't log graphviz by default
pugnascotia Mar 23, 2022
2f45a51
Tweaks
pugnascotia Mar 23, 2022
ec612a6
Rework trace header stashing
pugnascotia Mar 29, 2022
6ee831f
Merge branch 'feature/apm-integration' into apm-integration-with-agent
pugnascotia Mar 29, 2022
b633f7e
Fixes
pugnascotia Mar 29, 2022
4810a7f
Managed to get traces to ship if I hack the APM agent
pugnascotia Mar 30, 2022
bda6ea2
Move java agent CLI option into plugin descriptor
pugnascotia Mar 30, 2022
fb87e79
Tweak for adding java opts via modules
pugnascotia Mar 30, 2022
b76910b
Header fixes
pugnascotia Mar 30, 2022
a7266b3
Add run script
pugnascotia Mar 30, 2022
48bebd3
Header fixes
pugnascotia Mar 30, 2022
1436637
Tweaks
pugnascotia Mar 31, 2022
dfb8f8b
Detach tracing when starting an index's background tasks
pugnascotia Mar 31, 2022
d94ec59
Detach tracing when starting an index's background tasks
pugnascotia Mar 31, 2022
d834dd5
Start a doc about tracing
pugnascotia Mar 31, 2022
b611835
Span attribte tweaks
pugnascotia Mar 31, 2022
53058b3
Span attribte tweaks
pugnascotia Mar 31, 2022
0989f17
Add extra docker tag
pugnascotia Mar 31, 2022
1be7e7e
Tweaks
pugnascotia Apr 1, 2022
fd5f3cb
Bump APM agent
pugnascotia Apr 12, 2022
ef6ae5f
Get tracing across nodes working again
pugnascotia Apr 13, 2022
a0978c9
Compilation fixes
pugnascotia Apr 13, 2022
48a6486
Merge remote-tracking branch 'upstream/master' into feature/apm-integ…
pugnascotia Apr 13, 2022
ca896fb
Bump version in run.sh
pugnascotia Apr 13, 2022
7e0c535
Merge branch 'feature/apm-integration' into apm-integration-with-agent
pugnascotia Apr 13, 2022
01d9c42
Fix
pugnascotia Apr 13, 2022
fdbe843
Fixes for using latest agent version
pugnascotia Apr 13, 2022
6a23b24
Fully configure the APM via config file in the module
pugnascotia Apr 14, 2022
694180b
Tidy up
pugnascotia Apr 14, 2022
89f90c2
Mass-refactoring
pugnascotia Apr 18, 2022
d2efa7e
Fixes
pugnascotia Apr 18, 2022
6bb80b6
Formatting
pugnascotia Apr 18, 2022
eefb53a
Beginnings of an end-to-end APM test
pugnascotia Apr 20, 2022
e9695d2
Get the APM integration test working
pugnascotia Apr 21, 2022
7164dd8
Test fixes
pugnascotia Apr 26, 2022
e6d5e4f
Merge remote-tracking branch 'upstream/master' into apm-integration-w…
pugnascotia Apr 26, 2022
e56427b
Add support for opening Scope via the Tracer
pugnascotia Apr 27, 2022
aad7f4c
Make it possible to configure APM agent via settings API
pugnascotia Apr 29, 2022
08da3a3
Fix apm settings to work under assertions
pugnascotia Apr 29, 2022
f63a154
Updates to TRACING.md
pugnascotia Apr 29, 2022
31ff299
Tweaks
pugnascotia Apr 29, 2022
fd6a9a9
More testing
pugnascotia Apr 29, 2022
562e6ed
More testing
pugnascotia Apr 29, 2022
f8431e7
More TaskManager unit tests
pugnascotia May 2, 2022
c55632a
Add unit testing
pugnascotia May 3, 2022
d329b7a
Make qa test work again
pugnascotia May 3, 2022
854c8c0
Merge remote-tracking branch 'origin/apm-integration-with-agent' into…
pugnascotia May 3, 2022
9dcf369
More notes on tracing
pugnascotia May 3, 2022
e0f70de
Merge remote-tracking branch 'upstream/master' into apm-integration-w…
pugnascotia May 6, 2022
3c4e323
Add an exclude filter and filtering unit tests
pugnascotia May 11, 2022
76ebf99
Switch to automaton instead of regexes
pugnascotia May 12, 2022
ccc47c6
Redact sensitive http headers
pugnascotia Jun 2, 2022
0b911af
Merge remote-tracking branch 'upstream/master' into apm-integration-w…
pugnascotia Jun 6, 2022
f0dbe4a
Post-merge fixes
pugnascotia Jun 8, 2022
55772a2
Tweak log4j security policy
pugnascotia Jun 8, 2022
8be329f
Switch to auto-generating an APM config
pugnascotia Jun 14, 2022
feb3b3f
Shorten APM settings prefix
pugnascotia Jun 14, 2022
762d6fe
Bump APM agent to 1.32.0
pugnascotia Jun 14, 2022
9b2685d
Upgrade opentelemetry
pugnascotia Jun 15, 2022
e21346c
Javadoc
pugnascotia Jun 15, 2022
7c3084e
Merge remote-tracking branch 'upstream/master' into apm-integration-w…
pugnascotia Jun 15, 2022
8f98e7e
Update TRACING.md
pugnascotia Jun 15, 2022
2d3b30b
General fixing and polishing
pugnascotia Jun 15, 2022
15a1baf
Merge remote-tracking branch 'upstream/master' into apm-integration-w…
pugnascotia Jun 15, 2022
52fd4d3
Remove debug gradle config
pugnascotia Jun 15, 2022
5e0892b
Fix typo
pugnascotia Jun 15, 2022
4912eec
Remove run script
pugnascotia Jun 15, 2022
141 changes: 141 additions & 0 deletions TRACING.md
@@ -0,0 +1,141 @@
# Tracing in Elasticsearch

Elasticsearch is instrumented using the [OpenTelemetry][otel] API, which allows
us to gather traces and analyze what Elasticsearch is doing.

## How is tracing implemented?

The Elasticsearch server code contains a [`tracing`][tracing] package, which is
an abstraction over the OpenTelemetry API. All locations in the code that
perform instrumentation and tracing must use these abstractions.

Separately, there is the [`apm-integration`](./x-pack/plugins/apm-integration/)
module, which works with the OpenTelemetry API directly to record trace data.
Underneath the OTel API, we use Elastic's [APM agent for Java][agent], which
attaches at runtime to the Elasticsearch JVM and removes the need for
Elasticsearch to hard-code the use of an SDK.

## How is tracing configured?

* The `xpack.apm.enabled` setting must be set to `true`
* You must supply credentials for the APM server. See below.

All APM settings live under `xpack.apm`. All settings related to the Java agent
go under `xpack.apm.agent`. Anything you set under there will be propagated to
the agent.

For agent settings that can be changed dynamically, you can use the cluster
settings REST API. For example, to change the sampling rate:

    curl -XPUT \
      -H "Content-type: application/json" \
      -u "$USERNAME:$PASSWORD" \
      -d '{ "persistent": { "xpack.apm.agent.transaction_sample_rate": "0.75" } }' \
      https://localhost:9200/_cluster/settings
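
The same request can be built from Java code. The following is a minimal sketch using the JDK's `java.net.http` API. The class name `SampleRateRequest` and its methods are illustrative assumptions and are not part of Elasticsearch; only the setting name and the endpoint come from the text above.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of Elasticsearch: builds the cluster settings
// request that changes the agent's sampling rate.
final class SampleRateRequest {

    // JSON body equivalent to the -d argument of the curl command.
    static String body(String sampleRate) {
        return "{ \"persistent\": { \"xpack.apm.agent.transaction_sample_rate\": \"" + sampleRate + "\" } }";
    }

    // Builds (but does not send) a PUT request with HTTP basic authentication.
    static HttpRequest build(String baseUrl, String username, String password, String sampleRate) {
        String credentials = Base64.getEncoder()
            .encodeToString((username + ":" + password).getBytes(StandardCharsets.UTF_8));
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_cluster/settings"))
            .header("Content-Type", "application/json")
            .header("Authorization", "Basic " + credentials)
            .PUT(HttpRequest.BodyPublishers.ofString(body(sampleRate)))
            .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, assuming the client trusts the cluster's TLS certificate.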

### More details about configuration

For context, the APM agent pulls configuration from [multiple
sources][agent-config], with a hierarchy that means, for example, that options
set in the config file cannot be overridden via system properties.

Now, in order to send tracing data to the APM server, ES needs to be configured with
either a `secret_key` or an `api_key`. We could configure these in the agent via
system properties, but then their values would be available to any Java code
that can read system properties.

Instead, when Elasticsearch bootstraps itself, it compiles all APM settings
together, including any `secret_key` or `api_key` values from the ES keystore,
and writes out a temporary APM config file containin all static configuration
Contributor

Suggested change
and writes out a temporary APM config file containin all static configuration
and writes out a temporary APM config file containing all static configuration

(i.e. values that cannot change after the agent starts). This file is deleted
soon after ES starts up. Settings that are not sensitive and can be changed
dynamically are configured via system properties. Calls to the ES settings REST
API are translated into system property writes, which the agent later picks up
and applies.
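
As a rough illustration of the split described above, the sketch below writes static settings to a temporary properties file and mirrors a dynamic setting into a system property. It is not the code from this PR: the class `ApmConfigSketch`, its method names, and the `elastic.apm.` system property prefix are assumptions made for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch of the two configuration paths, not the actual module code.
final class ApmConfigSketch {
    // Assumption: the agent looks for system properties with this prefix.
    static final String SYSTEM_PROPERTY_PREFIX = "elastic.apm.";

    // Static settings (including secrets) go into a temporary file that the
    // agent reads once at startup.
    static Path writeStaticConfig(Map<String, String> staticSettings) {
        Properties props = new Properties();
        props.putAll(staticSettings);
        try {
            Path file = Files.createTempFile("apm-agent-", ".properties");
            try (Writer writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                props.store(writer, "temporary APM agent config (sketch)");
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a properties file back, as the agent would at startup.
    static Properties readConfig(Path file) {
        Properties props = new Properties();
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // The file is removed once it is no longer needed, so secrets do not linger on disk.
    static void deleteConfig(Path file) {
        try {
            Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Non-sensitive dynamic settings become system property writes.
    static void applyDynamicSetting(String key, String value) {
        System.setProperty(SYSTEM_PROPERTY_PREFIX + key, value);
    }
}
```

Keeping secrets out of system properties is the point of the design: only the temporary file ever holds them, and only for a short time.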

## Where is tracing data sent?

You need to have an APM server running somewhere. For example, you can
create a deployment in Elastic Cloud with Elastic's APM integration.

## What do we trace?

We primarily trace "tasks". The tasks framework in Elasticsearch allows work to
be scheduled for execution, cancelled, executed in a different thread pool, and so
on. Tracing a task results in a "span", which represents the execution of the
task in the tracing system. We also instrument REST requests, which are not (at
present) modelled by tasks.

A span can be associated with a parent span, which allows all spans in, for
example, a REST request to be grouped together. Spans can track work across
different Elasticsearch nodes.

Elasticsearch also supports distributed tracing via [W3C Trace Context][w3c]
headers. If clients of Elasticsearch send these headers with their requests,
then that data will be forwarded to the APM server in order to yield a trace
across systems.
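
For reference, the W3C specification carries the trace identity in a `traceparent` header of the form `00-<trace-id>-<parent-id>-<flags>`. The sketch below parses version `00` of that header. The `TraceParent` class is a hypothetical illustration and is not how Elasticsearch or the APM agent handles the header.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for version 00 of the W3C traceparent header.
final class TraceParent {
    // 2 hex digits of version, 32 of trace id, 16 of parent id, 2 of flags.
    private static final Pattern VERSION_00 =
        Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId;
    final String parentId;
    final boolean sampled;

    private TraceParent(String traceId, String parentId, boolean sampled) {
        this.traceId = traceId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    // Returns an empty Optional when the header is missing or malformed.
    static Optional<TraceParent> parse(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher matcher = VERSION_00.matcher(header.trim());
        if (!matcher.matches()) {
            return Optional.empty();
        }
        String traceId = matcher.group(1);
        String parentId = matcher.group(2);
        // All-zero identifiers are invalid according to the specification.
        if (traceId.chars().allMatch(c -> c == '0') || parentId.chars().allMatch(c -> c == '0')) {
            return Optional.empty();
        }
        int flags = Integer.parseInt(matcher.group(3), 16);
        // The lowest bit of the flags field is the "sampled" flag.
        return Optional.of(new TraceParent(traceId, parentId, (flags & 0x01) != 0));
    }
}
```

A client would send, for example, `traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01` with its REST request.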

## Thread contexts and nested spans

When a span is started, Elasticsearch tracks information about that span in the
current [thread context][thread-context]. If a new thread context is created,
then current span information is propagated but renamed, so that (1) it doesn't
interfere when new trace information is set in the context, and (2) the previous
trace information is available to establish a parent / child span relationship.

Sometimes we need to detach new spans from their parent. For example, creating
an index starts some related background tasks, but these shouldn't be associated
with the REST request, otherwise all the background task spans will be
associated with the REST request for as long as Elasticsearch is running.
`ThreadContext` provides the `clearTraceContext()` method for this purpose.
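
The propagate-and-rename behaviour can be pictured with a plain map standing in for the thread context. This is a simplified sketch only: the class `TraceContextSketch` and the key names `traceparent` and `parent_traceparent` are assumptions for illustration, and the real `ThreadContext` is considerably more involved.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of trace data in a thread context, using a plain map.
final class TraceContextSketch {
    static final String TRACE_KEY = "traceparent";          // assumed key name
    static final String PARENT_KEY = "parent_traceparent";  // assumed key name

    // Creating a new context: the current span is carried over under a
    // different key, so it can act as the parent of whatever span starts next.
    static Map<String, String> newContextFrom(Map<String, String> current) {
        Map<String, String> next = new HashMap<>(current);
        String active = next.remove(TRACE_KEY);
        if (active != null) {
            next.put(PARENT_KEY, active);
        }
        return next;
    }

    // Detaching: drop both keys so that new spans start without a parent.
    static Map<String, String> clearTraceContext(Map<String, String> current) {
        Map<String, String> next = new HashMap<>(current);
        next.remove(TRACE_KEY);
        next.remove(PARENT_KEY);
        return next;
    }
}
```

With this model, a background task started from a cleared context has no parent entry, which is the effect wanted for the index background tasks described above.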

## How do I trace something that isn't a task?

First work out if you can turn it into a task. No, really.

If you can't do that, you'll need to ensure that your class can get access to a
`Tracer` instance (this is available to inject, or you'll need to pass it when
your class is created). Then you need to call the appropriate methods on the
tracer when a span should start and end.
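
The overall shape is sketched below. The `SketchTracer` interface and its method names are hypothetical stand-ins, since the exact methods on the real `Tracer` are not shown here; the class `SnapshotCleaner` is likewise invented for the example. The sketch is synchronous for brevity, whereas real Elasticsearch code may need to end the span from a different thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the tracer abstraction; the real method names may differ.
interface SketchTracer {
    void startSpan(String spanId, String name, Map<String, Object> attributes);

    void stopSpan(String spanId);
}

// Test double that records calls instead of talking to an APM agent.
final class RecordingTracer implements SketchTracer {
    final List<String> events = new ArrayList<>();

    @Override
    public void startSpan(String spanId, String name, Map<String, Object> attributes) {
        events.add("start:" + spanId);
    }

    @Override
    public void stopSpan(String spanId) {
        events.add("stop:" + spanId);
    }
}

// A class that is not a task but still wants a span around its work.
final class SnapshotCleaner {
    private final SketchTracer tracer;

    // The tracer is passed in when the class is created.
    SnapshotCleaner(SketchTracer tracer) {
        this.tracer = tracer;
    }

    void clean(String spanId, Runnable work) {
        // Attributes must not contain sensitive or personal information.
        tracer.startSpan(spanId, "snapshot cleanup", Map.of("cleanup.kind", "stale-files"));
        try {
            work.run();
        } finally {
            // Always end the span, even if the work throws.
            tracer.stopSpan(spanId);
        }
    }
}
```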

## What additional attributes should I set?

That's up to you. Be careful not to capture anything that could leak sensitive
or personal information.

## What is "scope" and when should I use it?

Usually you won't need to.

That said, sometimes you may want more details to be captured about a particular
section of code. You can think of "scope" as representing the currently active
tracing context. Using scope allows the APM agent to do the following:

* Enables automatic correlation between the "active span" and logging, where
logs have also been captured.
* Enables capturing any exceptions thrown when the span is active, and linking
those exceptions to the span.
* Allows the sampling profiler to be used as it allows samples to be linked to
the active span (if any), so the agent can automatically get extra spans
without manual instrumentation.

However, a scope must be closed in the same thread in which it was opened, which
cannot be guaranteed when using tasks.

In the OpenTelemetry documentation, spans, scope and context are fairly
straightforward to use, since `Scope` is an `AutoCloseable` and so can be
easily created and cleaned up using try-with-resources blocks. Unfortunately,
Elasticsearch is a complex piece of software, and also extremely asynchronous,
so the typical OpenTelemetry examples do not work.

Nonetheless, it is possible to manually use scope where we need more detail by
explicitly opening a scope via the `Tracer`.
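
To see why a scope is tied to a single thread, consider the following self-contained sketch, which models the "active span" as a thread-local value. It deliberately does not use the OpenTelemetry classes: `ScopeSketch` and its nested `Scope` are hypothetical, and only mimic the try-with-resources pattern described above.

```java
// Hypothetical model of a thread-bound scope; not the OpenTelemetry implementation.
final class ScopeSketch {
    // The "currently active span" is stored per thread.
    private static final ThreadLocal<String> ACTIVE_SPAN = new ThreadLocal<>();

    static String activeSpan() {
        return ACTIVE_SPAN.get();
    }

    // Makes the given span active on the calling thread until the scope is closed.
    static Scope open(String spanName) {
        return new Scope(spanName);
    }

    static final class Scope implements AutoCloseable {
        private final Thread owner = Thread.currentThread();
        private final String previous = ACTIVE_SPAN.get();

        private Scope(String spanName) {
            ACTIVE_SPAN.set(spanName);
        }

        @Override
        public void close() {
            // Closing on another thread would restore the wrong thread's state.
            if (Thread.currentThread() != owner) {
                throw new IllegalStateException("scope closed on a different thread");
            }
            if (previous == null) {
                ACTIVE_SPAN.remove();
            } else {
                ACTIVE_SPAN.set(previous);
            }
        }
    }
}
```

Inside a single thread, usage is a plain `try (ScopeSketch.Scope scope = ScopeSketch.open("my span")) { ... }` block. Once the work hops to another thread pool, as tasks routinely do, there is no single block that can both open and close the scope.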


[otel]: https://opentelemetry.io/
[thread-context]: ./server/src/main/java/org/elasticsearch/common/util/concurrent/ThreadContext.java
[w3c]: https://www.w3.org/TR/trace-context/
[tracing]: ./server/src/main/java/org/elasticsearch/tracing/
[config]: ./x-pack/plugin/apm-integration/src/main/config/elasticapm.properties
[agent-config]: https://www.elastic.co/guide/en/apm/agent/java/master/configuration.html
[agent]: https://www.elastic.co/guide/en/apm/agent/java/current/index.html
4 changes: 4 additions & 0 deletions build-tools-internal/version.properties
@@ -54,3 +54,7 @@ jimfs_guava = 30.1-jre

# test framework
networknt_json_schema_validator = 1.0.48

# tracing
apm_agent = 1.32.0
opentelemetry = 1.14.0
@@ -65,11 +65,14 @@
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
@@ -1356,19 +1359,12 @@ private void createConfiguration() {
StandardOpenOption.CREATE
);

final List<Path> configFiles;
try (Stream<Path> stream = Files.list(getDistroDir().resolve("config"))) {
configFiles = stream.collect(Collectors.toList());
}
logToProcessStdout("Copying additional config files from distro " + configFiles);
for (Path file : configFiles) {
Path dest = configFile.getParent().resolve(file.getFileName());
if (Files.exists(dest) == false) {
Files.copy(file, dest);
}
}
final Path distConfigDir = getDistroDir().resolve("config");
Contributor Author

These changes were ultimately unnecessary, but I've kept them because they make it possible to retain the config file hierarchy when starting a test cluster.

final RecursiveCopyFileVisitor visitor = new RecursiveCopyFileVisitor(distConfigDir);
Files.walkFileTree(distConfigDir, visitor);
logToProcessStdout("Copied additional config files from distro: " + visitor.getCopiedFiles());
} catch (IOException e) {
throw new UncheckedIOException("Could not write config file: " + configFile, e);
throw new UncheckedIOException("Could not write config file: " + e.getMessage(), e);
}

tweakJvmOptions(configFileRoot);
@@ -1686,4 +1682,37 @@ private static class LinkCreationException extends UncheckedIOException {
super(message, cause);
}
}

    private class RecursiveCopyFileVisitor extends SimpleFileVisitor<Path> {
        private final Path sourceDir;
        private final List<Path> copiedFiles;

        RecursiveCopyFileVisitor(Path sourceDir) {
            this.sourceDir = sourceDir;
            this.copiedFiles = new ArrayList<>();
        }

        public List<Path> getCopiedFiles() {
            return copiedFiles;
        }

        @Override
        public FileVisitResult preVisitDirectory(Path sourceDir, BasicFileAttributes attrs) throws IOException {
            final Path relativePath = this.sourceDir.relativize(sourceDir);
            final Path destPath = configFile.getParent().resolve(relativePath);
            if (Files.notExists(destPath)) {
                Files.createDirectory(destPath);
            }
            return FileVisitResult.CONTINUE;
        }

        @Override
        public FileVisitResult visitFile(Path sourcePath, BasicFileAttributes attrs) throws IOException {
            final Path relativePath = sourceDir.relativize(sourcePath);
            final Path destPath = configFile.getParent().resolve(relativePath);
            Files.copy(sourcePath, destPath, StandardCopyOption.REPLACE_EXISTING);
            copiedFiles.add(sourcePath);
            return FileVisitResult.CONTINUE;
        }
    }
}