performance: high latency when using otel libs inside common endpoints #4741

danstarns · 2024-05-29T11:56:52Z

What happened?

I am writing this issue to express the performance issues I've encountered using your libraries within common Node.js APIs such as HTTP, Express, and GraphQL. I have some links, benchmarks and examples to share with you below.

We are building an observability platform leveraging OTEL JS on the users’ client side. It was reported to us that using our libraries with their APIs resulted in massive latency issues. We did some investigation broken down in full detail here:

GraphQL Debugger Performance rocket-connect/graphql-debugger#322

Our implementation essentially wraps a GraphQL resolver, picking up on values in the tree info argument, capturing attributes, and starting a span. As described in our adjacent issue, we took several necessary steps to ensure it wasn’t an issue with our implementation. Additionally, we created a few benchmarks with and without GraphQL usage to compare performance impacts.

Hypothesis

We see that the code added in the following diff is causing a huge increase in latency to this endpoint.

import opentelemetry from "@opentelemetry/api";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import {
  BasicTracerProvider,
  SimpleSpanProcessor,
} from "@opentelemetry/sdk-trace-base";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";

const provider = new BasicTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: "basic-service",
  }),
});

const exporter = new OTLPTraceExporter({});
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
provider.register();

app.use("/hello", async (req, res) => {
+  const tracer = opentelemetry.trace.getTracer("hello-tracer");
+  const span = tracer.startSpan("hello");
+  span.setAttribute("value", "world");
+  span.end();

  res.json({ message: "Hello World" }).end();
});

Alongside our benchmarks, as detailed below, we saw the following as confirmation of our hypothesis:

Benchmarks

When we ran with our hypothesis, we found a bunch of benchmarks like opentelemetry-java/benchmarks/ and #4171. We found these benchmarks to be not that helpful, and we were unable to compare the two languages or understand how OTEL JS affects performance inside an endpoint.

From this, we created a few types of benchmarks to test our findings and improve our performance.

Isolated OTEL JS Benchmarks

https://github.com/rocket-connect/otel-js-server-benchmarks

These benchmarks demonstrate using OTEL inside a basic endpoint. Our results show that adding just a few lines of OTEL code to your handler will result in a significant reduction in the performance of your API. For example, a basic http endpoint operating at 6.26ms latency more than triples the average time to 22.03ms when OTEL is added, rendering it unusable for any production setting.

GraphQL Debugger Benchmarks

GraphQL Specific fork of isolated benchmarks:

rocket-connect/graphql-debugger/benchmarks

The text was updated successfully, but these errors were encountered:

pichlermarc · 2024-06-04T13:29:27Z

Hi thanks for the detailed report!

I've had a quick look at the benchmark code, and it looks like the SimpleSpanProcessor is used there. It will export every ended span individually causing the exporter to send an HTTP request for each of these spans. The SimpleSpanProcessor is best used for debugging and local development with the ConsoleSpanExporter as it provides quick feedback.

Consider using the BatchSpanProcessor for production use, this one will actually batch spans together and only pass them on to the SpanExporter in a batched manner, which may improve performance by reducing the amount of export requests made significantly.

danstarns · 2024-06-05T12:33:25Z

Thank you for your response @pichlermarc

I added batch versions of the benchmarks:

Isolated OTEL JS Benchmarks

https://github.com/rocket-connect/otel-js-server-benchmarks

GraphQL Debugger Benchmarks

https://github.com/rocket-connect/graphql-debugger/tree/main/benchmarks

Summary

The benchmarks indicate that using the BatchProcessor is significantly more performant than the SimpleProcessor. Thank you again for pointing this out. I have some questions:

Our benchmarks only produce a single span. Given there is nothing to batch, how does changing this processor affect the performance? Does this mean the SimpleProcessor is exporting the spans inline synchronously? Is this even possible?

Are the differences between the two processors documented anywhere?

Could the logic be wrapped in a setTimeout to prevent blocking the immediate loop?

Would it be wise to add a log in the console when users use SimpleProcessor?

pichlermarc · 2024-06-11T12:16:25Z

The benchmarks indicate that using the BatchProcessor is significantly more performant than the SimpleProcessor. Thank you again for pointing this out. I have some questions:

Glad to hear that this improved things. 🙂 There's still some performance overhead, but we're hoping to gradually improve this in the future by optimizing the SDK and Exporters.

Our benchmarks only produce a single span. Given there is nothing to batch, how does changing this processor affect the performance?

It batches spans over multiple requests, once the batch is large enough it will then export the spans.

Does this mean the SimpleProcessor is exporting the spans inline synchronously? Is this even possible?

It does some work synchronously, namely serializing the OTLP message and starting the http request it seems to be enough to block the execution.

Are the differences between the two processors documented anywhere?

It is documented on opentelemetry.io, but I think it would also make sense to document this in the JSDoc for the processor.

Could the logic be wrapped in a setTimeout to prevent blocking the immediate loop?

setTimeout or setImmediate would be an option. However, AFAIK it was a conscious design choice that the SimpleSpanProcessor does not defer the execution of the export to a later time.

Would it be wise to add a log in the console when users use SimpleProcessor?

I'd go with a JSDoc addition first, then re-evaluate if we need the log. 🙂

danstarns · 2024-06-15T20:10:09Z

Hi @pichlermarc.

There's still some performance overhead, but we're hoping to gradually improve this in the future by optimizing the SDK and Exporters.

I agree on the overhead. Even with using the Batch processor, in the screenshot, from our benchmarks, it shows that adding a few lines of otel js, using the batch processor, increases latency from 5.99ms to 8.58ms plus it impacts kps too.

Due to these findings, I made a PR to move the benchmarks I am using into this repo:

Integrated Benchmarks #4795

I have added these benchmarks to the source code of OpenTelemetry as they provide transparent insight into performance plus allow the community to iterate on performance improvements.

setTimeout or setImmediate would be an option. However, AFAIK it was a conscious design choice that the SimpleSpanProcessor does not defer the execution of the export to a later time.

Once we have some benchmarks in place we could experiment with things like this, as introducing a timeout and injecting in the current timestamp should not affect people's applications.

I'd go with a JSDoc addition first, then re-evaluate if we need the log.

Added here:

docs: add SimpleSpanProcessor warning #4798

JamieDanielson · 2024-07-24T16:42:32Z

It looks like at this point this became a documentation issue which has been updated (thank you!). We'll close this out and leave the other PR open for the benchmarks as that is still being looked into.

danstarns added bug Something isn't working triage labels May 29, 2024

danstarns mentioned this issue May 29, 2024

GraphQL Debugger Performance rocket-connect/graphql-debugger#322

Open

This was referenced Jun 4, 2024

feat: add batch processor rocket-connect/otel-js-server-benchmarks#2

Merged

benchmarks: add batch processor rocket-connect/graphql-debugger#324

Merged

danstarns mentioned this issue Jun 9, 2024

feat: change to batch processor rocket-connect/graphql-debugger#326

Merged

This was referenced Jun 15, 2024

Integrated Benchmarks #4795

Closed

docs: add SimpleSpanProcessor warning #4798

Merged

danstarns mentioned this issue Jun 15, 2024

refactor: reduce and simplify usage of getEnv() #4799

Merged

JamieDanielson closed this as completed Jul 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

performance: high latency when using otel libs inside common endpoints #4741

performance: high latency when using otel libs inside common endpoints #4741

danstarns commented May 29, 2024 •

edited

Loading

pichlermarc commented Jun 4, 2024 •

edited

Loading

danstarns commented Jun 5, 2024

pichlermarc commented Jun 11, 2024

danstarns commented Jun 15, 2024

JamieDanielson commented Jul 24, 2024

performance: high latency when using otel libs inside common endpoints #4741

performance: high latency when using otel libs inside common endpoints #4741

Comments

danstarns commented May 29, 2024 • edited Loading

What happened?

Background

Hypothesis

Benchmarks

Isolated OTEL JS Benchmarks

GraphQL Debugger Benchmarks

pichlermarc commented Jun 4, 2024 • edited Loading

danstarns commented Jun 5, 2024

Isolated OTEL JS Benchmarks

GraphQL Debugger Benchmarks

Summary

pichlermarc commented Jun 11, 2024

danstarns commented Jun 15, 2024

JamieDanielson commented Jul 24, 2024

danstarns commented May 29, 2024 •

edited

Loading

pichlermarc commented Jun 4, 2024 •

edited

Loading