Update to otel 0.20.0 #3649
Conversation
I see how the new approach to metrics can be more flexible, but the performance impact is not clear, so it should go through a profiling phase (not only benchmarking) now to validate the solution, rather than at the end when we want to merge the PR without conflicting with parallel work.
Also, why is the new metrics system added as part of the otel 0.20 update? I have a hard time telling which parts are related to the update and which ones are required by the new metrics system.
Sorry, going to clear this until the PR is marked as ready for review so that I can see CI status at a glance.
This upgrade has many changes due to a new metrics API upstream. Metrics have largely been reworked, and in addition, some new metrics macros have been added to enable us to move towards a better long-term metrics story.
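As a rough, hypothetical illustration of the direction (a toy sketch, not the macro added in this PR; the real f64_histogram! macro seen later in the diff resolves a meter and records through the otel API), the idea is to collapse instrument bookkeeping into a single call site keyed by metric name:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

use once_cell::sync::Lazy;

// Toy registry standing in for the real meter provider.
static HISTOGRAMS: Lazy<Mutex<HashMap<&'static str, Vec<f64>>>> =
    Lazy::new(|| Mutex::new(HashMap::new()));

// Toy macro mirroring the call shape seen in the diff: a metric name plus a value.
macro_rules! f64_histogram_sketch {
    ($name:literal, $value:expr) => {
        HISTOGRAMS
            .lock()
            .expect("lock poisoned")
            .entry($name)
            .or_default()
            .push($value);
    };
}

fn main() {
    f64_histogram_sketch!("apollo_router_http_request_duration_seconds", 0.042);
}
```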
@@ -200,14 +200,6 @@ expression: get_spans()
[
  "apollo_private.operation_signature",
  "# -\n{topProducts{name reviews{author{id name}id product{name}}upc}}"
],
These events no longer go through the tracing layer, so they don't get picked up by the test span.
@@ -54,7 +54,7 @@ mod test {
use std::sync::Mutex;

use once_cell::sync::Lazy;
The changes to this file are required because, if the tracer drops out of scope, otel now avoids doing work by turning calls to the otel context into no-ops.
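For illustration (the static name and provider construction below are assumptions, not the code in this file), the usual fix is to keep the tracer provider alive for the whole test run so that otel context calls don't become no-ops:

```rust
use once_cell::sync::Lazy;
use opentelemetry::sdk::trace::TracerProvider;

// Keep the provider alive for the lifetime of the test binary; if it were
// dropped, otel would turn subsequent context calls into no-ops.
static TRACER_PROVIDER: Lazy<TracerProvider> = Lazy::new(|| {
    let provider = TracerProvider::builder().build();
    opentelemetry::global::set_tracer_provider(provider.clone());
    provider
});
```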
}
}
}
}
Jaeger no longer needs this workaround, as it is now possible to create the exporter without setting it as the default tracer pipeline.
@@ -5,7 +5,6 @@ use std::sync::atomic::Ordering;
use anyhow::anyhow;
In this file we no longer use the reload layer when integrating with the metrics layer. Since the reloading functionality has shifted to the aggregate meter provider, it's no longer needed.
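For reference, the pattern that is no longer needed looks roughly like this generic tracing-subscriber sketch (not the router's actual code; the level filter stands in for the real metrics layer):

```rust
use tracing_subscriber::{filter, prelude::*, reload, Registry};

fn build_subscriber() -> impl tracing::Subscriber {
    // Wrap a layer in a reload layer so it can be swapped at runtime.
    let (layer, _reload_handle) = reload::Layer::new(filter::LevelFilter::INFO);
    // On a configuration reload, `_reload_handle.reload(new_layer)` would swap it.
    Registry::default().with(layer)
}
```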
@@ -2817,6 +2519,7 @@ mod tests {

#[tokio::test]
async fn test_handle_error_throttling() {
Previously this was failing because a global was used to hold the dashmap and the tests interfered with each other.
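A hedged sketch of the shape of the fix (names and types are illustrative; the real handle_error_internal lives in the telemetry plugin): the throttle map is passed in by the caller instead of living in a global, so each test can own its own state.

```rust
use std::sync::Arc;
use std::time::{Duration, Instant};

use dashmap::DashMap;

// Throttle state owned by the caller (e.g. each test), not a global static,
// so parallel tests no longer interfere with each other.
type LastLogged = Arc<DashMap<String, Instant>>;

fn handle_error_internal(error: &str, last_logged: &LastLogged) {
    let now = Instant::now();
    let mut last = last_logged.entry(error.to_string()).or_insert(now);
    let first_occurrence = *last == now;
    // Log a new error immediately, then at most once per minute afterwards.
    if first_occurrence || now.duration_since(*last) >= Duration::from_secs(60) {
        *last = now;
        eprintln!("OpenTelemetry error: {error}");
    }
}
```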
use crate::plugin::test::MockSubgraphService;
use crate::plugin::test::MockSupergraphService;
use crate::plugin::DynPlugin;
use crate::plugins::telemetry::handle_error;
use crate::plugins::telemetry::handle_error_internal;
use crate::services::SubgraphRequest;
use crate::services::SubgraphResponse;
use crate::services::SupergraphRequest;
use crate::services::SupergraphResponse;
The tests in this file have been improved and split up.
They didn't work with the upgrade because they relied on the meter provider and metrics layer being part of the otel plugin, and the otel plugin no longer has a reference to the aggregate meter provider.
@@ -0,0 +1,431 @@
use std::any::Any;
This is a new aggregate meter provider. It looks similar to the old one, but it is designed to be mutated rather than replaced. This was required because we need to keep the prometheus meter provider around if we want metrics to persist across reloads.
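A hedged sketch of the idea (field names and the reload signature are assumptions, not the actual contents of the new file): the aggregate is mutated on reload, and the prometheus provider it already holds is kept.

```rust
use std::sync::{Arc, Mutex};

use opentelemetry::sdk::metrics::MeterProvider as SdkMeterProvider;

#[derive(Clone, Default)]
struct AggregateMeterProvider {
    inner: Arc<Mutex<Inner>>,
}

#[derive(Default)]
struct Inner {
    // Kept across configuration reloads so prometheus counters are not reset.
    prometheus: Option<SdkMeterProvider>,
    // Rebuilt on every configuration reload (otlp, apollo, ...).
    others: Vec<SdkMeterProvider>,
}

impl AggregateMeterProvider {
    // Mutate the existing aggregate in place rather than replacing it, so the
    // prometheus provider (and the metrics it holds) survives a reload.
    fn reload(&self, prometheus: Option<SdkMeterProvider>, others: Vec<SdkMeterProvider>) {
        let mut inner = self.inner.lock().expect("lock poisoned");
        inner.others = others;
        if prometheus.is_some() {
            inner.prometheus = prometheus;
        }
    }
}
```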
@@ -0,0 +1,333 @@
use std::any::Any;
This is a rework of the existing metrics filtering mechanism. The functionality is very similar: it returns either a no-op or a real instrument, depending on the filtering criteria.
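A self-contained sketch of the technique (toy types, not the router's implementation, which wraps the otel meter provider): the factory returns a real instrument when the name matches the criteria and a no-op otherwise.

```rust
// Hypothetical stand-ins for real/no-op instruments.
trait Counter {
    fn add(&self, value: u64);
}

struct RealCounter(String);
impl Counter for RealCounter {
    fn add(&self, value: u64) {
        println!("{} += {value}", self.0);
    }
}

struct NoopCounter;
impl Counter for NoopCounter {
    fn add(&self, _value: u64) {} // deliberately does nothing
}

// Return a real instrument only for metrics matching the filter criteria;
// everything else gets a no-op that costs almost nothing to record to.
fn counter(name: &str, allowed_prefix: &str) -> Box<dyn Counter> {
    if name.starts_with(allowed_prefix) {
        Box::new(RealCounter(name.to_string()))
    } else {
        Box::new(NoopCounter)
    }
}
```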
@bnjjj It almost works, but then there are lots of issues. I think this would be worth fixing, but it could be done as a follow-up by someone with deeper macro knowledge than I have. It needs to take an ident, but also a string literal when there is call for it.
@BrynCooke macros are part of the public API and so we won't be able to change them afterwards, if you have to support
improve metrics macros Signed-off-by: Benjamin Coenen <[email protected]>
One last concern I have right now is that we're testing the metrics macros using only the internal structure. We don't have a real test that checks the final result of these macros, for example in prometheus format. What do you think @BrynCooke? Is it something we could add easily in that PR, or in a follow-up PR?
There are still prometheus tests: they've just been split up from the other tests that added attributes and such.
@BrynCooke OK, if these tests are enough we should add more assertions to check that when we're creating a counter it actually creates a counter, and the same for histograms and so on, because we didn't spot the bug regarding the metrics macros (using the counter macro but under the hood it was creating a histogram).
Agreed, I'll add tests for the metric type.
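One way such an assertion could look (a hedged sketch that relies only on the Prometheus exposition format, not on any particular router test helper):

```rust
/// Assert that the Prometheus exposition text declares `name` with the
/// expected instrument type, e.g. `# TYPE apollo_router_foo_total counter`.
fn assert_metric_type(prometheus_text: &str, name: &str, expected_type: &str) {
    let type_line = format!("# TYPE {name} {expected_type}");
    assert!(
        prometheus_text.lines().any(|line| line.trim() == type_line),
        "expected `{type_line}` in prometheus output:\n{prometheus_text}"
    );
}
```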
.with_endpoint(endpoint.as_str())
.with_timeout(batch_processor.max_export_timeout)
.with_metadata(metadata)
.with_compression(opentelemetry_otlp::Compression::Gzip),
is this something we want to make configurable?
I don't think so for apollo metrics, but I could be persuaded if we think there is a good reason to.
For user otlp metrics we should absolutely add a configuration option. Happy to open a new ticket for that.
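A hedged sketch of what that option might map to (the config field is hypothetical; the builder call is the one visible in the diff above):

```rust
use opentelemetry_otlp::{Compression, TonicExporterBuilder};

// Hypothetical config field (e.g. `compression: Option<String>`) mapped onto
// the builder call shown in the diff above.
fn apply_compression(
    exporter: TonicExporterBuilder,
    compression: Option<&str>,
) -> TonicExporterBuilder {
    match compression {
        Some("gzip") => exporter.with_compression(Compression::Gzip),
        _ => exporter,
    }
}
```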
It doesn't really work with the test-level meter provider, and is instead tested via an integration test.
.unwrap_or_default();

let mut resource = Resource::from_detectors(
    Duration::from_secs(0),
I know we are ignoring the timeout in our code, but it seems risky to set it to 0 secs. Maybe make it 5 seconds, or at least a positive amount of time?
(Same comment for the other locations where the timeout is set to 0.)
Not sure about this. Setting this to a positive amount of time implies that it may block for that amount of time, and we'd have to move execution to a blocking task.
Resource::default() also uses a duration of zero under the hood.
metrics.http_requests_duration.record(
    &opentelemetry::Context::current(),
    f64_histogram!(
Is that a rename (appending _seconds)?
It's not a rename. The original code is referencing metrics.http_requests_duration, which is declared as apollo_router_http_request_duration_seconds.
metrics.http_requests_duration.record(
    &opentelemetry::Context::current(),
    f64_histogram!(
        "apollo_router_http_request_duration_seconds",
Same comment about possible rename.
It's not a rename. The original code is referencing metrics.http_requests_duration, which is declared as apollo_router_http_request_duration_seconds.
@@ -1,8 +1,17 @@
---
source: apollo-router/src/plugins/telemetry/mod.rs
-expression: prom_metrics
+expression: prometheus_metrics
I don't really understand this change, but won't this impact existing consumers of http_request_duration?
Hmm, I'm now confused also. Checking.
The metric was always called apollo_router_http_request_duration_seconds_bucket. Pretty unfortunate.
As for why the snapshot has changed: we no longer specify lots of extra attributes just for this prometheus test. The test is focused only on whether prometheus served up a result and whether it looked OK.
Extra attributes are still tested in the new tests that use the testing macros, such as test_supergraph_metrics_ok.
Updates Otel to 0.20
Otel 0.20 had a major change in the way that metrics are implemented, resulting in us having to do a complete overhaul of our integration.
Problems include:
Unfortunately the above meant a significant amount of rework just to retain the equivalent tests that we had before.
Some dev docs have been created to try to give a higher-level overview of how things fit together:
https://github.com/apollographql/router/blob/bryn/otel-update/dev-docs/metrics.md
On the plus side, perf looks like it hasn't regressed.
Run of local perf here: