From b8e63ff5bb9742aa09d3fc67d601c1e98c6ad333 Mon Sep 17 00:00:00 2001 From: Fabrizio Ferri-Benedetti Date: Fri, 19 Jul 2024 08:04:08 +0200 Subject: [PATCH] Add Performance doc to Java (#4828) Co-authored-by: Tiffany Hrabusa <30397949+tiffany76@users.noreply.github.com> Co-authored-by: opentelemetrybot <107717825+opentelemetrybot@users.noreply.github.com> Co-authored-by: Phillip Carter Co-authored-by: Jay DeLuca Co-authored-by: Trask Stalnaker --- content/en/docs/languages/java/performance.md | 194 ++++++++++++++++++ 1 file changed, 194 insertions(+) create mode 100644 content/en/docs/languages/java/performance.md diff --git a/content/en/docs/languages/java/performance.md b/content/en/docs/languages/java/performance.md new file mode 100644 index 000000000000..d47dae31e231 --- /dev/null +++ b/content/en/docs/languages/java/performance.md @@ -0,0 +1,194 @@ +--- +title: Performance +description: Performance reference for the OpenTelemetry Java agent +weight: 75 +--- + +The OpenTelemetry Java agent instruments your application by running inside the +same Java Virtual Machine (JVM). Like any other software agent, the Java agent +requires system resources like CPU, memory, and network bandwidth. The use of +resources by the agent is called agent overhead or performance overhead. The +OpenTelemetry Java agent has minimal impact on system performance when +instrumenting JVM applications, although the final agent overhead depends on +multiple factors. + +Some factors that might increase agent overhead are environmental, such as the +physical machine architecture, CPU frequency, amount and speed of memory, system +temperature, and resource contention. Other factors include virtualization and +containerization, the operating system and its libraries, the JVM version and +vendor, JVM settings, the algorithmic design of the software being monitored, +and software dependencies. 
+ +Due to the complexity of modern software and the broad diversity in deployment +scenarios, it is impossible to come up with a single agent overhead estimate. To +find the overhead of any instrumentation agent in a given deployment, you have +to conduct experiments and collect measurements directly. Therefore, treat all +statements about performance as general information and guidelines that are +subject to evaluation in a specific system. + +The following sections describe the minimum requirements of the OpenTelemetry +Java agent, as well as potential constraints impacting performance, and +guidelines to optimize and troubleshoot the performance of the agent. + +## Guidelines to reduce agent overhead + +The following best practices and techniques might help reduce overhead caused by +the Java agent. + +### Configure trace sampling + +The volume of spans processed by the instrumentation might impact agent +overhead. You can configure trace sampling to adjust the span volume and reduce +resource usage. See [Sampling](/docs/languages/java/sampling). + +### Turn off specific instrumentations + +You can further reduce agent overhead by turning off instrumentations that +aren't needed or are producing too many spans. To turn off an instrumentation, +use `-Dotel.instrumentation.<name>.enabled=false` or the +`OTEL_INSTRUMENTATION_<NAME>_ENABLED` environment variable, where `<name>` is +the name of the instrumentation. + +For example, the following option turns off the JDBC instrumentation: +`-Dotel.instrumentation.jdbc.enabled=false` + +### Allocate more memory for the application + +Increasing the maximum heap size of the JVM using the `-Xmx` option might +help in alleviating agent overhead issues, as instrumentations can generate a +large number of short-lived objects in memory. + +### Reduce manual instrumentation to what you need + +Too much manual instrumentation might introduce inefficiencies that increase +agent overhead. 
For example, using `@WithSpan` on every method results in a high +span volume, which in turn increases noise in the data and consumes more system +resources. + +### Provision adequate resources + +Make sure to provision enough resources for your instrumentation and for the +Collector. The amount of resources such as memory or disk depends on your +application architecture and needs. For example, a common setup is to run the +instrumented application on the same host as the OpenTelemetry Collector. In +that case, consider rightsizing the resources for the Collector and optimizing +its settings. See [Scaling](/docs/collector/scaling/). + +## Constraints impacting the performance of the Java agent + +In general, the more telemetry you collect from your application, the greater +the impact on agent overhead. For example, tracing methods that aren't +relevant to your application can still produce considerable agent overhead +because tracing such methods is computationally more expensive than running the +method itself. Similarly, high cardinality tags in metrics might increase memory +usage. Debug logging, if turned on, also increases write operations to disk and +memory usage. + +Some instrumentations, for example JDBC or Redis, produce high span volumes that +increase agent overhead. For more information on how to turn off unnecessary +instrumentations, see +[Turn off specific instrumentations](#turn-off-specific-instrumentations). + +> [!NOTE] Experimental features of the Java agent might increase agent overhead +> due to the experimental focus on functionality over performance. Stable +> features are safer in terms of agent overhead. + +## Troubleshooting agent overhead issues + +When troubleshooting agent overhead issues, do the following: + +- Check minimum requirements. See + [Prerequisites](/docs/languages/java/getting-started/#prerequisites). +- Use the latest compatible version of the Java agent. +- Use the latest compatible version of your JVM. 
+ +Consider taking the following actions to decrease agent overhead: + +- If your application is approaching memory limits, consider giving it more + memory. +- If your application is using all the CPU, you might want to scale it + horizontally. +- Try turning off or tuning metrics. +- Tune trace sampling settings to reduce span volume. +- Turn off specific instrumentations. +- Review manual instrumentation for unnecessary span generation. + +## Guidelines for measuring agent overhead + +Measuring agent overhead in your own environment and deployments provides +accurate data about the impact of instrumentation on the performance of your +application or service. The following guidelines describe the general steps for +collecting and comparing reliable agent overhead measurements. + +### Decide what you want to measure + +Different users of your application or service might notice different aspects of +agent overhead. For example, while end users might notice degradation in service +latency, power users with heavy workloads pay more attention to CPU overhead. On +the other hand, users who deploy frequently, for example due to elastic +workloads, care more about startup time. + +Reduce your measurements to factors that are sure to impact user experience, so +your datasets don't contain irrelevant information. Some examples of +measurements include the following: + +- User average, user peak, and machine average CPU usage +- Total memory allocated and maximum heap used +- Garbage collection pause time +- Startup time in milliseconds +- Average and 95th percentile (p95) service latency +- Network read and write average throughput + +### Prepare a suitable test environment + +By measuring agent overhead in a controlled test environment, you can better +identify the factors affecting performance. When preparing a test environment, +complete the following: + +1. Make sure that the configuration of the test environment resembles + production. +2. 
Isolate the application under test from other services that might interfere. +3. Turn off or remove all unnecessary system services on the application host. +4. Ensure that the application has enough system resources to handle the test + workload. + +### Create a battery of realistic tests + +Design the tests that you run against the test environment to resemble typical +workloads as much as possible. For example, if some REST API endpoints of your +service are susceptible to high request volumes, create a test that simulates +heavy network traffic. + +For Java applications, use a warm-up phase prior to starting measurements. The +JVM is a highly dynamic machine that performs a large number of optimizations +through just-in-time compilation (JIT). The warm-up phase helps the application +to finish most of its class loading and gives the JIT compiler time to run the +majority of optimizations. + +Make sure to run a large number of requests and to repeat the test pass many +times. This repetition helps to ensure a representative data sample. Include +error scenarios in your test data. Simulate an error rate similar to that of a +normal workload, typically between 2% and 10%. + +{{% alert title="Note" color="info" %}} + +Tests might increase costs when targeting observability backends and other +commercial services. Plan your tests accordingly or consider using alternative +solutions, such as self-hosted or locally run backends. + +{{% /alert %}} + +### Collect comparable measurements + +To identify which factors might be affecting performance and causing agent +overhead, collect measurements in the same environment after modifying a single +factor or condition. + +### Analyze the agent overhead data + +After collecting data from multiple passes, you can plot results in a chart or +compare averages using statistical tests to check for significant differences. 
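As one possible way to compare averages between a baseline run (no agent) and an instrumented run, the following minimal sketch computes Welch's t statistic and its approximate degrees of freedom. The class and method names (`OverheadComparison`, `welchT`, `welchDf`, `overheadPercent`) are hypothetical and not part of OpenTelemetry. The sketch assumes that each array holds one measurement per test pass, in the same unit, with at least two passes per run. It does not compute a p-value: compare the statistic against a critical value from a t-distribution table, or use a statistics library of your choice.

```java
/**
 * Illustrative sketch (not part of OpenTelemetry): compares measurements
 * from a baseline run and an instrumented run using Welch's t-test.
 */
final class OverheadComparison {

    /** Arithmetic mean of the samples. */
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Unbiased sample variance (divides by n - 1); needs at least 2 samples. */
    static double variance(double[] xs) {
        double m = mean(xs);
        double sum = 0.0;
        for (double x : xs) {
            sum += (x - m) * (x - m);
        }
        return sum / (xs.length - 1);
    }

    /** Welch's t statistic for the difference mean(a) - mean(b). */
    static double welchT(double[] a, double[] b) {
        double standardError =
                Math.sqrt(variance(a) / a.length + variance(b) / b.length);
        return (mean(a) - mean(b)) / standardError;
    }

    /** Welch-Satterthwaite approximation of the degrees of freedom. */
    static double welchDf(double[] a, double[] b) {
        double va = variance(a) / a.length;
        double vb = variance(b) / b.length;
        return (va + vb) * (va + vb)
                / (va * va / (a.length - 1) + vb * vb / (b.length - 1));
    }

    /** Relative difference of the instrumented mean over the baseline mean, in percent. */
    static double overheadPercent(double[] baseline, double[] instrumented) {
        return 100.0 * (mean(instrumented) - mean(baseline)) / mean(baseline);
    }
}
```

For example, calling `OverheadComparison.welchT(instrumented, baseline)` with per-pass p95 latencies yields a positive value when the instrumented run is slower on average. Whether the difference is significant depends on the degrees of freedom and the significance level you choose.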
+ +Consider that different stacks, applications, and environments might result in +different operational characteristics and different agent overhead measurement +results.
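
The warm-up phase and repeated measurements described in "Create a battery of realistic tests" can be sketched as follows. This is a simplified illustration, not a benchmarking framework: the class `WarmupMeasurement` and its methods are hypothetical, the workload is any `Runnable` that you supply (for example, a client call against the service under test), and the percentile uses the nearest-rank method. For method-level microbenchmarks, a dedicated harness such as JMH is likely a better fit.

```java
import java.util.Arrays;

/**
 * Illustrative sketch (not part of OpenTelemetry): runs a workload with a
 * warm-up phase, records per-call latency, and reports a percentile.
 */
final class WarmupMeasurement {

    /**
     * Runs the workload warmupIterations times without recording anything,
     * then measuredIterations times while recording latency in nanoseconds.
     */
    static long[] measure(Runnable workload, int warmupIterations, int measuredIterations) {
        // Warm-up: lets class loading and JIT compilation settle before measuring.
        for (int i = 0; i < warmupIterations; i++) {
            workload.run();
        }
        long[] latencies = new long[measuredIterations];
        for (int i = 0; i < measuredIterations; i++) {
            long start = System.nanoTime();
            workload.run();
            latencies[i] = System.nanoTime() - start;
        }
        return latencies;
    }

    /** Nearest-rank percentile; p is in the range (0, 100]. */
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(0, rank - 1)];
    }
}
```

Run the same measurement with and without the agent attached, keeping every other factor unchanged, and compare the resulting latency arrays.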