Update to otel 0.20.0 #3649

BrynCooke · 2023-08-22T14:53:47Z

Updates Otel to 0.20

Otel 0.20 had a major change in the way that metrics were implemented resulting in us having to do a complete overhaul on our integration.

Problems include:

Some metrics traits were converted to structs, preventing the use of our wrappers.
It is no longer possible to migrate metrics from one prometheus registry to another. This resulted in a significant amount of restructuring and special handling to allow prometheus meter provider to remain across reloads.
Related to the above many tests were broken due to their reliance on prometheus and our lack of testability.

Unfortunately the above meant a significant amount of rework just to retain the equivalent tests that we had before.

Some dev docs have been created to try and give a higher level overview of how things fit together:
https://github.com/apollographql/router/blob/bryn/otel-update/dev-docs/metrics.md

On the plus side perf looks like it hasn't regressed.

Run of local perf here:

Checklist

Complete the checklist (and note appropriate exceptions) before a final PR is raised.

Exceptions

Note any exceptions here

Notes

[^1]. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.
[^2]. Configuration is an important part of many changes. Where applicable please try to document configuration examples.
[^3]. Tick whichever testing boxes are applicable. If you are adding Manual Tests:
- please document the manual testing (extensively) in the Exceptions.
- please raise a separate issue to automate the test and label it (or ask for it to be labeled) as manual test

router-perf · 2023-08-22T14:54:21Z

Geal

I see how the new approach on metrics can be more flexible, but the performance impact is not clear, so it should go through a phase of profiling (not only benchmarking) right now to validate the solution, instead of at the end when we want to merge the PR without conflicting with parallel work.
Also, why is the new metrics system added as part of the otel 0.20 update? I have a hard time recognizing which parts are related to the update, and which ones are required by the new metrics system

apollo-router/src/metrics/mod.rs

apollo-router/src/metrics/aggregation.rs

Sorry, going to clear this until the PR is marked as ready for review so that I can see CI status at a glance.

This upgrade has many changes due to a new metrics API upstream. Metrics have largely been reworked, and in addition, some new metrics macros have been added to enabled us to move towards a better long term metrics story.

BrynCooke · 2023-09-18T15:36:35Z

apollo-router/tests/snapshots/tracing_tests__traced_basic_composition.snap

@@ -200,14 +200,6 @@ expression: get_spans()
                  [
                    "apollo_private.operation_signature",
                    "# -\n{topProducts{name reviews{author{id name}id product{name}}upc}}"
-                  ],


These events no longer go through the tracing layer, so don't get picked up by test span.

BrynCooke · 2023-09-18T15:38:23Z

apollo-router/src/tracer.rs

@@ -54,7 +54,7 @@ mod test {
    use std::sync::Mutex;

    use once_cell::sync::Lazy;


The changes to this file are required because if the tracer drops out of scope then otel will now avoid doing work by making calls to otel context a no-op.

BrynCooke · 2023-09-18T15:39:23Z

apollo-router/src/plugins/telemetry/tracing/jaeger.rs

Jaeger no-longer needs this workaround as it is possible to create the exporter without setting it as the default tracer pipeline.

BrynCooke · 2023-09-18T15:46:24Z

apollo-router/src/plugins/telemetry/reload.rs

@@ -5,7 +5,6 @@ use std::sync::atomic::Ordering;
 use anyhow::anyhow;


In this file we no longer use the reload layer when integrating with the metrics layer. As reloading functionality has shifted to the aggregate meter provider it's not needed.

BrynCooke · 2023-09-18T15:50:37Z

apollo-router/src/plugins/telemetry/mod.rs

@@ -2817,6 +2519,7 @@ mod tests {

    #[tokio::test]
    async fn test_handle_error_throttling() {


Previously this was failing due to the use of the global to hold the dashmap and tests interfering with each other.

BrynCooke · 2023-09-18T15:54:33Z

apollo-router/src/plugins/telemetry/mod.rs

    use crate::plugin::test::MockSubgraphService;
    use crate::plugin::test::MockSupergraphService;
    use crate::plugin::DynPlugin;
-    use crate::plugins::telemetry::handle_error;
+    use crate::plugins::telemetry::handle_error_internal;
    use crate::services::SubgraphRequest;
    use crate::services::SubgraphResponse;
    use crate::services::SupergraphRequest;
    use crate::services::SupergraphResponse;



The tests in this file have been improved/split up.
They didn't work with the upgrade due to the way that they relied on the meter provider and metrics layer to be part of the otel plugin. The otel plugin no longer has a reference to the aggregate meter provider.

BrynCooke · 2023-09-18T15:57:04Z

apollo-router/src/metrics/aggregation.rs

@@ -0,0 +1,431 @@
+use std::any::Any;


This is a new aggregate meter provider. It looks similar to the old one, but is designed to be mutated rather than replaced. This was required because we need to keep the prometheus meter provider around if we want metrics to be persisted across reloads.

BrynCooke · 2023-09-18T16:00:01Z

apollo-router/src/metrics/filter.rs

@@ -0,0 +1,333 @@
+use std::any::Any;


Rework of the existing metrics filtering mechanism. The functionality is very similar and will return a noop or real instrument depending on criteria.

BrynCooke · 2023-09-21T13:31:01Z

@BrynCooke for the macro formatting, what if you put $attr_key:literal instead of $attr_key:expr ?

@bnjjj It almost works, but then there are lots of issues with . and - characters in the keys.

I think this wold be worth fixing, but could be done as a follow up by someone with deeper macro knowledge than I.

It needs to take an ident, but also a string literal when there is call for it.

bnjjj · 2023-09-22T08:55:25Z

@BrynCooke macros are part of the public api and so we won't be able to change it afterwards, if you have to support ident then you just add a new branch in the macro to support it. I think it will work

improve metrics macros Signed-off-by: Benjamin Coenen <[email protected]>

bnjjj · 2023-09-22T12:15:59Z

One last concern I have right now is that we're testing the macro metrics using only the internal structure. We don't have a real test that test the final result of these macros for example in a prometheus format. What do you think @BrynCooke ? Is it something we could add easily in that PR ? Or in a followup PR ?

BrynCooke · 2023-09-22T13:57:51Z

One last concern I have right now is that we're testing the macro metrics using only the internal structure. We don't have a real test that test the final result of these macros for example in a prometheus format. What do you think @BrynCooke ? Is it something we could add easily in that PR ? Or in a followup PR ?

There are still prometheus tests:
https://github.com/apollographql/router/blob/bryn/otel-update/apollo-router/src/plugins/telemetry/mod.rs#L2496-L2524
and
https://github.com/apollographql/router/blob/bryn/otel-update/apollo-router/tests/metrics_tests.rs

They've just been split up from the other tests that added attributes and such.

bnjjj · 2023-09-25T08:09:23Z

@BrynCooke Ok if these tests are enough we should add more assertions to check that when we're creating a counter it creates a counter, same for histogram and so one... Because we didn't spot the bug regarding the metrics macros (using counter macro but under the hood it was creating an histogram)

BrynCooke · 2023-09-25T08:35:19Z

Agree, I'll add tests for the metric type.

BrynCooke · 2023-09-25T10:29:09Z

@bnjjj Added in 48ecaa3

Note that I have now split out the assertion macros

Geal · 2023-09-25T10:21:46Z

apollo-router/src/plugins/telemetry/metrics/apollo.rs

+                .with_endpoint(endpoint.as_str())
+                .with_timeout(batch_processor.max_export_timeout)
+                .with_metadata(metadata)
+                .with_compression(opentelemetry_otlp::Compression::Gzip),


is this something we want to make configurable?

I don't think so for apollo metrics, but could be persuaded if we think there is a good reason to.

For user otlp metrics we should absolutely add a configuration option. Happy to open a new ticket for that.

It doesn't really work with the test level meter provider and is tested via an integration test.

.changesets/maint_bryn_otel_update.md

apollo-router/Cargo.toml

garypen · 2023-09-26T13:50:11Z

apollo-router/src/plugins/telemetry/metrics/mod.rs

+            .unwrap_or_default();
+
+        let mut resource = Resource::from_detectors(
+            Duration::from_secs(0),


I know we are ignoring the timeout in our code, but it seems risky to set that at 0 secs. Maybe make it 5 seconds or at least a positive amount of time?
(same comment on other locations where timeout is set to 0.)

Not sure about this. Setting this to a positive amount of time implies that it may block for that amount of time, and we'll have to move execution to a blocking task.

Resource::default() also uses duration of zero under the hood.

apollo-router/src/plugins/telemetry/metrics/mod.rs

garypen · 2023-09-26T14:01:49Z

apollo-router/src/plugins/telemetry/mod.rs


-        metrics.http_requests_duration.record(
-            &opentelemetry::Context::current(),
+        f64_histogram!(


Is that a rename? (appending _seconds)

It's not a rename. The original code is referencing metrics.http_requests_duration which is declared as: apollo_router_http_request_duration_seconds.

garypen · 2023-09-26T14:02:35Z

apollo-router/src/plugins/telemetry/mod.rs

-        metrics.http_requests_duration.record(
-            &opentelemetry::Context::current(),
+        f64_histogram!(
+            "apollo_router_http_request_duration_seconds",


Same comment about possible rename.

It's not a rename. The original code is referencing metrics.http_requests_duration which is declared as: apollo_router_http_request_duration_seconds.

garypen · 2023-09-26T14:13:04Z

...elemetry/snapshots/apollo_router__plugins__telemetry__tests__it_test_prometheus_metrics.snap

@@ -1,8 +1,17 @@
 ---
 source: apollo-router/src/plugins/telemetry/mod.rs
-expression: prom_metrics
+expression: prometheus_metrics


I don't really understand this change, but won't this impact existing consumers of http_request_duration?

Hmm, I'm now confused also. Checking.

The metric was always called: apollo_router_http_request_duration_seconds_bucket. Pretty unfortunate.

In terms of why the snapshot has changed, it's because we no longer specify lots of extra attributes just for this prometheus test. The test is focused just on did prometheus serve up a result and did it look OK.

Extra attributes are still tested in the new tests that use the testing macros, such as: test_supergraph_metrics_ok

BrynCooke force-pushed the bryn/otel-update branch from 1cfa621 to fbac5d1 Compare September 7, 2023 22:28

This comment has been minimized.

Sign in to view

Geal previously requested changes Sep 14, 2023

View reviewed changes

apollo-router/src/metrics/mod.rs Outdated Show resolved Hide resolved

apollo-router/src/metrics/aggregation.rs Outdated Show resolved Hide resolved

BrynCooke force-pushed the bryn/otel-update branch 10 times, most recently from 07ddbcc to 9fdec83 Compare September 18, 2023 14:25

BrynCooke marked this pull request as ready for review September 18, 2023 14:52

BrynCooke requested review from Geal, a team, garypen and o0Ignition0o September 18, 2023 14:52

Update to opentelemetry 0.20

f6a67b6

This upgrade has many changes due to a new metrics API upstream. Metrics have largely been reworked, and in addition, some new metrics macros have been added to enabled us to move towards a better long term metrics story.

BrynCooke force-pushed the bryn/otel-update branch from 9fdec83 to f6a67b6 Compare September 18, 2023 15:02

BrynCooke commented Sep 18, 2023

View reviewed changes

bryn added 2 commits September 21, 2023 15:12

Fix accidentally renamed metric

aa2fab7

Add tests for metric names

67957bd

BrynCooke requested a review from bnjjj September 22, 2023 07:50

Merge dev

069742c

BrynCooke linked an issue Sep 22, 2023 that may be closed by this pull request

chore: update opentelemetry and associated crates to 0.20.0 #3601

Closed

improve metrics macros (#3880)

040dfb7

improve metrics macros Signed-off-by: Benjamin Coenen <[email protected]>

Make metrics assertions type specific

48ecaa3

Geal approved these changes Sep 25, 2023

View reviewed changes

Fix gauge test

a3fa457

bnjjj approved these changes Sep 25, 2023

View reviewed changes

bryn added 3 commits September 25, 2023 12:56

Make prom tests ignore reload.

0a65766

It doesn't really work with the test level meter provider and is tested via an integration test.

Make linux use xlarge for builds in circle

286f3e7

Merge dev

b900187

garypen reviewed Sep 26, 2023

View reviewed changes

bryn added 5 commits September 26, 2023 15:26

Update changelog

8a65c2e

Fix alpha ordering in cargo toml

fba92d2

Add constant for unknown_service

40acb7b

Merge dev

5f69d6c

Update cargo lock

6516f2c

BrynCooke merged commit bca9d86 into dev Sep 27, 2023
2 checks passed

BrynCooke deleted the bryn/otel-update branch September 27, 2023 09:32

Geal mentioned this pull request Oct 4, 2023

prep release: v1.32.0 #3965

Merged

lorf mentioned this pull request Oct 11, 2023

metric resources are disappeared from prometheus metric labels after upgrade #4011

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update to otel 0.20.0 #3649

Update to otel 0.20.0 #3649

BrynCooke commented Aug 22, 2023 •

edited

Loading

router-perf bot commented Aug 22, 2023 •

edited by BrynCooke

Loading

This comment has been minimized.

Geal left a comment

BrynCooke Sep 18, 2023

BrynCooke Sep 18, 2023

BrynCooke Sep 18, 2023

BrynCooke Sep 18, 2023

BrynCooke Sep 18, 2023

BrynCooke Sep 18, 2023

BrynCooke Sep 18, 2023

BrynCooke Sep 18, 2023

BrynCooke commented Sep 21, 2023 •

edited

Loading

bnjjj commented Sep 22, 2023

bnjjj commented Sep 22, 2023

BrynCooke commented Sep 22, 2023

bnjjj commented Sep 25, 2023

BrynCooke commented Sep 25, 2023

BrynCooke commented Sep 25, 2023 •

edited

Loading

Geal Sep 25, 2023

BrynCooke Sep 25, 2023

BrynCooke Sep 27, 2023

garypen Sep 26, 2023

BrynCooke Sep 26, 2023

garypen Sep 26, 2023

BrynCooke Sep 26, 2023

garypen Sep 26, 2023

BrynCooke Sep 26, 2023

garypen Sep 26, 2023

BrynCooke Sep 26, 2023

BrynCooke Sep 26, 2023

		@@ -54,7 +54,7 @@ mod test {
		use std::sync::Mutex;

		use once_cell::sync::Lazy;

		@@ -5,7 +5,6 @@ use std::sync::atomic::Ordering;
		use anyhow::anyhow;

		@@ -2817,6 +2519,7 @@ mod tests {

		#[tokio::test]
		async fn test_handle_error_throttling() {

Update to otel 0.20.0 #3649

Update to otel 0.20.0 #3649

Conversation

BrynCooke commented Aug 22, 2023 • edited Loading

router-perf bot commented Aug 22, 2023 • edited by BrynCooke Loading

This comment has been minimized.

Geal left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BrynCooke commented Sep 21, 2023 • edited Loading

bnjjj commented Sep 22, 2023

bnjjj commented Sep 22, 2023

BrynCooke commented Sep 22, 2023

bnjjj commented Sep 25, 2023

BrynCooke commented Sep 25, 2023

BrynCooke commented Sep 25, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BrynCooke commented Aug 22, 2023 •

edited

Loading

router-perf bot commented Aug 22, 2023 •

edited by BrynCooke

Loading

BrynCooke commented Sep 21, 2023 •

edited

Loading

BrynCooke commented Sep 25, 2023 •

edited

Loading