-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate ES with APM #87696
Integrate ES with APM #87696
Conversation
…c#80762) This pull request adds support for HTPP/gRPC with the OpenTelemetry SDK in the `apm-integration` plugin. It adds the required security permissions to the plugin to make it work - we should try to reduce this list one day. The plumbing to pass Elastic's Cloud APM server credentials is not fully there yet but the current `ApmIT` test can be executed with `-Dtests.apm.endpoint=https://... -Dtests.apm.token=ABC` and traces should be sent to Elastic APM. _Works on my machine™_
Task now implements the Traceable interface. This allows other traceable entities to be developed.
…lastic#80758) Pass a task's span context via the ThreadContext to parent nested tasks.
Adds a test case that creates and populates a couple of multi-shard indices and executes a search against them.
* Can be turned off with xpack.security.authz.tracing: false * AuthZ is tied to relevant task by traceparent (it however does not cover the first authZ) * x-opaque-id is auto configured at rest layer if not already exists. This helps chain all relevant actions together. * ApmIT now has security enabled.
…earch into feature/apm-integration
…80871) This commit adds a dynamic cluster setting called `xpack.apm.tracing.names.include` which allows the user to filter on the names of the transactions for which tracing is enabled.
Pinging @elastic/es-delivery (Team:Delivery) |
Pinging @elastic/es-core-infra (Team:Core/Infra) |
Pinging @elastic/es-distributed (Team:Distributed) |
Files.copy(file, dest); | ||
} | ||
} | ||
final Path distConfigDir = getDistroDir().resolve("config"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes were ultimately unnecessary, but I've kept them because the make it possible to retain the config file hierarchy when starting a test cluster.
@@ -42,7 +42,7 @@ public static List<String> bootstrapJvmOptions(Path plugins) throws IOException | |||
private static List<PluginInfo> getPluginInfo(Path plugins) throws IOException { | |||
final List<PluginInfo> pluginInfo = new ArrayList<>(); | |||
|
|||
final List<Path> pluginDirs = Files.list(plugins).collect(Collectors.toList()); | |||
final List<Path> pluginDirs = listFiles(plugins); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes just ensure that the Stream
from Files.list(Path)
is closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you split this out to a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is quite large. Can it be split out into a few PRs? For example, working through the Tracing interface and initialization before anything actually uses it, then adding the task/etc uses of tracer.
Some overall thoughts after a skim of the code:
- The tracer looks very similar to logging. I wonder if we could configure it statically like logging, so that we could have eg TraceManager.getTracer(...). This would seem closer to the openTelemetry object in the open telemetry api, ie the thing to get a tracer from.
- The tracer in open telemetry just has a single method to get a span. Why not replicate this API?
@@ -42,7 +42,7 @@ public static List<String> bootstrapJvmOptions(Path plugins) throws IOException | |||
private static List<PluginInfo> getPluginInfo(Path plugins) throws IOException { | |||
final List<PluginInfo> pluginInfo = new ArrayList<>(); | |||
|
|||
final List<Path> pluginDirs = Files.list(plugins).collect(Collectors.toList()); | |||
final List<Path> pluginDirs = listFiles(plugins); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you split this out to a separate PR?
try (KeyStoreWrapper keyStoreWrapper = KeyStoreWrapper.load(config)) { | ||
if (keyStoreWrapper != null) { | ||
try { | ||
keyStoreWrapper.decrypt(args.keystorePassword().clone().getChars()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need the keystore here, we should rework ServerArgs so we load the keystore just once (see where we load it in ServerCli to find the password).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ on splitting into multiple PRs, otherwise getting the right folks to review the individual parts will be difficult. I'd think all the libs and the Tracer interface itself could go in one PR and then each usage more or less in its own PR.
I'll work on splitting this up. It grew over time.
@rjernst at the moment we can't configure it statically, because the tracer implementation lives in a module. We could move this into
I'm not sure I follow, can you elaborate. A better comparator would be the |
Part of elastic#84369. Split out from elastic#87696. Rework how some work is executed by creating child tasks for them, so that when traced by APM, it results in more meaningful parent and child tasks in the UI. It also improves how Elasticsearch is modelling the work.
I could see a small part of this being in server.
I was looking at the Tracer class in the opentelemetry java sdk. That class has a single method to return a SpanBuilder. It looks stuff from their SpanBuilder is much of what is on the Tracer in this PR. Why the collapse? ie why is there no SpanBuilder here? Something else that I could not understand from my brief reading of this PR is what thread safety or multi threaded expectations exist in in this new api (mostly on the Tracer class, which appears to be a single instance for the node). |
|
||
Instead, when Elasticsearch bootstraps itself, it compiles all APM settings | ||
together, including any `secret_key` or `api_key` values from the ES keystore, | ||
and writes out a temporary APM config file containin all static configuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and writes out a temporary APM config file containin all static configuration | |
and writes out a temporary APM config file containing all static configuration |
The Part of the problem is that we might start a span on one thread and end it on another, which makes using the OTel (or OTel-like) API harder. It expects you to use thread-locals to propagate context automatically, and so we have to do something different. There's also all the faffing about to get distributed tracing to work across nodes. Encapsulating away the details of the OTel API is better, IMO. If we later found it useful to be able to return a builder from the tracer then we can add that, but right now the idea is that we pass something that is traceable to the tracer, and it takes care of translating that to whatever tracing API we're using.
We expect there to be a single |
This PR has been broken up into smaller PRs. |
Supersedes #84631. Part of #84369.
This PR takes the proof-of-concept work for adding tracing to Elasticsearch and carries it to an initial completion. It adds an
apm-integration
X-Pack plugin, which uses the OpenTelemetry API to create spans. Also if APM is enabled, Elasticsearch is started using the APM java agent as a Java agent. This is responsible for ships spans and other metrics to an APM server.There are also a number of changes that are concerned with expressing some activity in ES more clearly in terms of tasks, since this work maps tasks to spans.
Notes to reviewers
Start with the file
TRACING.md
, it provides useful context for this work. If you want to see it in action, I can point you at an existing APM server, or give you adocker-compose.yml
which will let you run your own stack with distributed tracing.I have run several benchmarks with APM disabled, and could see nothing consistently slower. I still need to run some benchmarks with APM enabled, though my hope is that this isn't a blocker to merging, and that we can iterate post-merge if there are hot areas.
I would particularly like feedback about the config mechanism.
Todo: