
Update to otel 0.20.0 #3649

Merged: 34 commits merged into dev from bryn/otel-update on Sep 27, 2023

Conversation

@BrynCooke (Contributor) commented Aug 22, 2023

Updates Otel to 0.20

Otel 0.20 made a major change to the way metrics are implemented, which forced a complete overhaul of our integration.

Problems include:

  • Some metrics traits were converted to structs, preventing the use of our wrappers (see the sketch below).
  • It is no longer possible to migrate metrics from one Prometheus registry to another. This forced a significant amount of restructuring and special handling to let the Prometheus meter provider persist across reloads.
  • Related to the above, many tests broke due to their reliance on Prometheus and our lack of metrics testability.

Unfortunately, the above meant a significant amount of rework just to retain the equivalent of the tests we had before.
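
To illustrate the first bullet, here is a toy sketch of the pattern that broke (simplified stand-in types, not the actual otel API): a delegating wrapper is easy to write over a trait, but has nothing to implement once the upstream instrument becomes a concrete struct.

```rust
// Toy illustration of the trait-to-struct problem (not the real otel API).
// When the instrument is a trait, a delegating wrapper is straightforward:
trait CounterLike {
    fn add(&self, value: u64);
}

struct FilteredCounter<C: CounterLike> {
    inner: C,
    enabled: bool,
}

impl<C: CounterLike> CounterLike for FilteredCounter<C> {
    fn add(&self, value: u64) {
        // Only forward to the real instrument when enabled.
        if self.enabled {
            self.inner.add(value);
        }
    }
}
// Once upstream turns the counter into a concrete struct, there is no trait
// left to implement, so wrappers like `FilteredCounter` stop composing and
// filtering has to move into the meter/meter-provider layer instead.
```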

Some dev docs have been created to give a higher-level overview of how things fit together:
https://github.com/apollographql/router/blob/bryn/otel-update/dev-docs/metrics.md

On the plus side, performance does not appear to have regressed.

Run of local perf here:
Uploading router.svg…

Checklist

Complete the checklist (and note appropriate exceptions) before a final PR is raised.

  • Changes are compatible[^1]
  • Documentation[^2] completed
  • Performance impact assessed and acceptable
  • Tests added and passing[^3]
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

[^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests:
- please document the manual testing (extensively) in the Exceptions.
- please raise a separate issue to automate the test and label it (or ask for it to be labeled) as a manual test.

@router-perf bot commented Aug 22, 2023

CI performance tests

  • step - Basic stress test that steps up the number of users over time
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • const - Basic stress test that runs with a constant number of users
  • reload - Reload test over a long period of time at a constant rate of users
  • large-request - Stress test with a 1 MB request payload
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • no-graphos - Basic stress test, no GraphOS.


@Geal (Contributor) requested changes Sep 14, 2023

I see how the new approach to metrics can be more flexible, but the performance impact is not clear, so it should go through a phase of profiling (not only benchmarking) right now to validate the solution, instead of at the end, when we want to merge the PR without conflicting with parallel work.
Also, why is the new metrics system added as part of the otel 0.20 update? I have a hard time recognizing which parts are related to the update and which ones are required by the new metrics system.

@BrynCooke BrynCooke dismissed Geal’s stale review September 15, 2023 11:55

Sorry, going to clear this until the PR is marked as ready for review, so that I can see the CI status at a glance.

@BrynCooke force-pushed the bryn/otel-update branch 10 times, most recently from 07ddbcc to 9fdec83 on September 18, 2023 14:25
@BrynCooke BrynCooke marked this pull request as ready for review September 18, 2023 14:52
@BrynCooke BrynCooke requested review from Geal, a team, garypen and o0Ignition0o September 18, 2023 14:52
This upgrade has many changes due to a new metrics API upstream.

Metrics have largely been reworked; in addition, some new metrics macros have been added to enable us to move towards a better long-term metrics story.
@@ -200,14 +200,6 @@ expression: get_spans()
[
"apollo_private.operation_signature",
"# -\n{topProducts{name reviews{author{id name}id product{name}}upc}}"
],
@BrynCooke (Contributor, Author) commented:

These events no longer go through the tracing layer, so they don't get picked up by the test span collection.

@@ -54,7 +54,7 @@ mod test {
use std::sync::Mutex;

use once_cell::sync::Lazy;
@BrynCooke (Contributor, Author) commented:

The changes to this file are required because, if the tracer drops out of scope, otel will now avoid doing work by making calls to the otel context no-ops.
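
A minimal sketch of the lifetime issue, assuming the opentelemetry 0.20 API surface (exact module paths may differ):

```rust
use opentelemetry::sdk::trace::TracerProvider;
use opentelemetry::trace::{Tracer, TracerProvider as _};

fn main() {
    let provider = TracerProvider::builder().build();
    let tracer = provider.tracer("example");

    // `provider` must remain alive here: if it were dropped before the
    // span below is created, otel 0.20 would turn the context calls into
    // no-ops and the test would silently observe nothing.
    tracer.in_span("work", |_cx| {
        // ... code under test ...
    });

    // Dropping the provider is only safe once all spans are finished.
    drop(provider);
}
```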

}
}
}
}

@BrynCooke (Contributor, Author) commented:

Jaeger no longer needs this workaround, as it is now possible to create the exporter without installing it as the default tracer pipeline.

@@ -5,7 +5,6 @@ use std::sync::atomic::Ordering;
use anyhow::anyhow;
@BrynCooke (Contributor, Author) commented:

In this file we no longer use the reload layer when integrating with the metrics layer. Since reload functionality has shifted to the aggregate meter provider, it is no longer needed.

@@ -2817,6 +2519,7 @@ mod tests {

#[tokio::test]
async fn test_handle_error_throttling() {
@BrynCooke (Contributor, Author) commented:

Previously this test was failing because the dashmap was held in a global, so tests running in parallel interfered with each other.
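
The refactor pattern, sketched with hypothetical signatures (the real `handle_error_internal` in the router may differ):

```rust
use std::time::{Duration, Instant};

use dashmap::DashMap;
use once_cell::sync::Lazy;

static LAST_LOGGED: Lazy<DashMap<String, Instant>> = Lazy::new(DashMap::new);

// Public entry point keeps its old shape and forwards the global state.
pub fn handle_error(error: &str) {
    handle_error_internal(&LAST_LOGGED, error)
}

// The internal variant takes the map as a parameter, so each test can pass
// its own map and run in parallel without cross-test interference.
fn handle_error_internal(last_logged: &DashMap<String, Instant>, error: &str) {
    let now = Instant::now();
    let throttled = last_logged
        .get(error)
        .map(|last| now.duration_since(*last) < Duration::from_secs(10))
        .unwrap_or(false);
    if !throttled {
        last_logged.insert(error.to_string(), now);
        eprintln!("{error}");
    }
}
```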

use crate::plugin::test::MockSubgraphService;
use crate::plugin::test::MockSupergraphService;
use crate::plugin::DynPlugin;
use crate::plugins::telemetry::handle_error;
use crate::plugins::telemetry::handle_error_internal;
use crate::services::SubgraphRequest;
use crate::services::SubgraphResponse;
use crate::services::SupergraphRequest;
use crate::services::SupergraphResponse;

@BrynCooke (Contributor, Author) commented:

The tests in this file have been improved and split up.
They didn't survive the upgrade because they relied on the meter provider and metrics layer being part of the otel plugin; the otel plugin no longer holds a reference to the aggregate meter provider.

@@ -0,0 +1,431 @@
use std::any::Any;
@BrynCooke (Contributor, Author) commented:

This is a new aggregate meter provider. It looks similar to the old one, but is designed to be mutated rather than replaced. This was required because we need to keep the Prometheus meter provider around if we want metrics to persist across reloads.
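
A rough sketch of the "mutate, don't replace" idea (hypothetical types, not the router's actual implementation):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-in for a real meter provider (e.g. Prometheus or OTLP).
trait MeterProviderLike: Send + Sync {}

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum ProviderKind {
    Prometheus,
    Otlp,
}

// The aggregate provider is created once and mutated in place, so the
// Prometheus provider (and its accumulated metric state) survives a
// configuration reload instead of being recreated empty.
#[derive(Clone, Default)]
struct AggregateMeterProvider {
    providers: Arc<Mutex<HashMap<ProviderKind, Arc<dyn MeterProviderLike>>>>,
}

impl AggregateMeterProvider {
    // Called during hot reload: only replace the providers whose
    // configuration actually changed; leave the rest untouched.
    fn set(&self, kind: ProviderKind, provider: Arc<dyn MeterProviderLike>) {
        self.providers.lock().unwrap().insert(kind, provider);
    }
}
```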

@@ -0,0 +1,333 @@
use std::any::Any;
@BrynCooke (Contributor, Author) commented:

Rework of the existing metrics filtering mechanism. The functionality is very similar: it returns either a no-op or a real instrument depending on the configured criteria.
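
As a toy sketch of that idea (hypothetical types, not the router's actual code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Callers always receive a Counter; only permitted metrics do real work.
enum Counter {
    Real(Arc<AtomicU64>),
    Noop,
}

impl Counter {
    fn add(&self, value: u64) {
        if let Counter::Real(inner) = self {
            inner.fetch_add(value, Ordering::Relaxed);
        }
        // Noop: silently discard the measurement.
    }
}

struct FilterMeter {
    allowed_prefix: String,
}

impl FilterMeter {
    // Hand back a real or no-op instrument depending on the criteria.
    fn counter(&self, name: &str) -> Counter {
        if name.starts_with(&self.allowed_prefix) {
            Counter::Real(Arc::new(AtomicU64::new(0)))
        } else {
            Counter::Noop
        }
    }
}
```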

@BrynCooke (Contributor, Author) commented Sep 21, 2023

> @BrynCooke for the macro formatting, what if you put `$attr_key:literal` instead of `$attr_key:expr`?

@bnjjj It almost works, but then there are lots of issues with `.` and `-` characters in the keys.

I think this would be worth fixing, but it could be done as a follow-up by someone with deeper macro knowledge than I have.

The macro needs to take an ident, but also a string literal when there is call for it.

@BrynCooke BrynCooke requested a review from bnjjj September 22, 2023 07:50
@bnjjj (Contributor) commented Sep 22, 2023

@BrynCooke Macros are part of the public API, so we won't be able to change them afterwards. If you have to support an ident, you can just add a new branch in the macro to support it. I think that will work.
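
A minimal sketch of that extra-branch approach, using a toy macro rather than the router's actual metrics macros:

```rust
// Accept both key shapes by giving each its own rule.
macro_rules! attr {
    // String-literal key, needed for dotted names like "http.request.method".
    ($key:literal = $value:expr) => {
        println!("{} = {}", $key, $value)
    };
    // Bare-ident key, for simple names.
    ($key:ident = $value:expr) => {
        println!("{} = {}", stringify!($key), $value)
    };
}

fn main() {
    attr!("http.request.method" = "GET");
    attr!(status = 200);
}
```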

@BrynCooke BrynCooke linked an issue Sep 22, 2023 that may be closed by this pull request
improve metrics macros

Signed-off-by: Benjamin Coenen <[email protected]>
@bnjjj (Contributor) commented Sep 22, 2023

One last concern I have right now is that we're testing the macro metrics using only the internal structure. We don't have a real test that checks the final output of these macros, for example in Prometheus format. What do you think @BrynCooke? Is it something we could add easily in this PR, or in a follow-up PR?

@BrynCooke (Contributor, Author) commented:

> One last concern I have right now is that we're testing the macro metrics using only the internal structure. We don't have a real test that checks the final output of these macros, for example in Prometheus format. What do you think @BrynCooke? Is it something we could add easily in this PR, or in a follow-up PR?

There are still Prometheus tests:
https://github.com/apollographql/router/blob/bryn/otel-update/apollo-router/src/plugins/telemetry/mod.rs#L2496-L2524
and
https://github.com/apollographql/router/blob/bryn/otel-update/apollo-router/tests/metrics_tests.rs

They've just been split out from the other tests that added attributes and such.

@bnjjj (Contributor) commented Sep 25, 2023

@BrynCooke OK, if these tests are enough, we should still add more assertions to check that creating a counter really creates a counter, and the same for histograms and so on, because we didn't spot the bug in the metrics macros (using the counter macro, but under the hood it was creating a histogram).

@BrynCooke (Contributor, Author) commented:

Agree, I'll add tests for the metric type.

@BrynCooke (Contributor, Author) commented Sep 25, 2023

@bnjjj Added in 48ecaa3

Note that I have now split out the assertion macros.
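
The shape of the new type assertions, as a self-contained toy sketch (hypothetical test-support types; the real assertion macros live in the router's test code):

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum InstrumentKind {
    Counter,
    Histogram,
}

#[derive(Default)]
struct TestMeter {
    instruments: HashMap<String, InstrumentKind>,
}

impl TestMeter {
    fn counter(&mut self, name: &str) {
        self.instruments.insert(name.to_string(), InstrumentKind::Counter);
    }

    // The assertion that would have caught the bug: a counter macro that
    // secretly registered a histogram fails this check.
    fn assert_kind(&self, name: &str, kind: InstrumentKind) {
        assert_eq!(self.instruments.get(name), Some(&kind));
    }
}

#[test]
fn counter_macro_creates_a_counter() {
    let mut meter = TestMeter::default();
    meter.counter("example.requests");
    meter.assert_kind("example.requests", InstrumentKind::Counter);
}
```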

.with_endpoint(endpoint.as_str())
.with_timeout(batch_processor.max_export_timeout)
.with_metadata(metadata)
.with_compression(opentelemetry_otlp::Compression::Gzip),
A contributor commented:

is this something we want to make configurable?

@BrynCooke (Contributor, Author) replied:

I don't think so for Apollo metrics, but I could be persuaded if we think there is a good reason to.

For user OTLP metrics we should absolutely add a configuration option. Happy to open a new ticket for that; a sketch of what it might look like is below.
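
If we do add it, a hedged sketch of what such an option might look like (hypothetical config type; only `opentelemetry_otlp::Compression::Gzip` is taken from the code above):

```rust
use serde::Deserialize;

// Hypothetical user-facing setting, deserialized from router YAML config.
#[derive(Deserialize, Clone, Copy)]
#[serde(rename_all = "lowercase")]
enum Compression {
    Gzip,
    None,
}

impl Compression {
    // Map the config value onto the exporter's compression setting.
    fn to_otlp(self) -> Option<opentelemetry_otlp::Compression> {
        match self {
            Compression::Gzip => Some(opentelemetry_otlp::Compression::Gzip),
            Compression::None => None,
        }
    }
}
```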


bryn added 3 commits September 25, 2023 12:56
It doesn't really work with the test-level meter provider, and is tested via an integration test instead.
.unwrap_or_default();

let mut resource = Resource::from_detectors(
Duration::from_secs(0),
A contributor commented:

I know we are ignoring the timeout in our code, but it seems risky to set it to 0 seconds. Maybe make it 5 seconds, or at least a positive amount of time?
(Same comment applies to the other locations where the timeout is set to 0.)

@BrynCooke (Contributor, Author) replied:

Not sure about this. Setting this to a positive amount of time implies that it may block for that long, and we'd have to move execution to a blocking task.

`Resource::default()` also uses a duration of zero under the hood.
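
For reference, a sketch of the call under discussion, assuming the opentelemetry 0.20 SDK (module paths may differ): the timeout is handed to each detector, and the environment-based detector resolves without blocking, which is why a zero value is tolerated here.

```rust
use std::time::Duration;

use opentelemetry::sdk::resource::EnvResourceDetector;
use opentelemetry::sdk::Resource;

fn detect_resource() -> Resource {
    // The zero timeout mirrors what Resource::default() does internally;
    // a positive value would imply we are willing to block for that long.
    Resource::from_detectors(
        Duration::from_secs(0),
        vec![Box::new(EnvResourceDetector::new())],
    )
}
```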


metrics.http_requests_duration.record(
&opentelemetry::Context::current(),
f64_histogram!(
A contributor commented:

Is that a rename? (appending _seconds)

@BrynCooke (Contributor, Author) replied:

It's not a rename. The original code references `metrics.http_requests_duration`, which is declared as `apollo_router_http_request_duration_seconds`.

metrics.http_requests_duration.record(
&opentelemetry::Context::current(),
f64_histogram!(
"apollo_router_http_request_duration_seconds",
A contributor commented:

Same comment about possible rename.

@BrynCooke (Contributor, Author) replied:

It's not a rename. The original code references `metrics.http_requests_duration`, which is declared as `apollo_router_http_request_duration_seconds`.

@@ -1,8 +1,17 @@
---
source: apollo-router/src/plugins/telemetry/mod.rs
expression: prom_metrics
expression: prometheus_metrics
A contributor commented:

I don't really understand this change, but won't this impact existing consumers of `http_request_duration`?

@BrynCooke (Contributor, Author) replied:

Hmm, I'm now confused also. Checking.

@BrynCooke (Contributor, Author) followed up:

The metric was always called `apollo_router_http_request_duration_seconds_bucket`. Pretty unfortunate.

As for why the snapshot changed: we no longer specify lots of extra attributes just for this Prometheus test. The test now focuses on whether Prometheus served up a result and whether it looked OK.

Extra attributes are still covered by the new tests that use the testing macros, such as `test_supergraph_metrics_ok`.

@BrynCooke BrynCooke merged commit bca9d86 into dev Sep 27, 2023
2 checks passed
@BrynCooke BrynCooke deleted the bryn/otel-update branch September 27, 2023 09:32
@Geal Geal mentioned this pull request Oct 4, 2023
Successfully merging this pull request may close these issues:

chore: update opentelemetry and associated crates to 0.20.0

4 participants