Disambiguate http.*.duration and/or split them into separate metrics #3520
Comments
From what I understand, opentelemetry-dotnet currently has a counter that ends the duration when headers are received. I took a look at what opentelemetry-js does, and it appears to complete the duration on the "end" event, i.e. when the response has been completely read. Options:
This same question applies to tracing of HttpClient: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/http.md#http-client When should the trace start and when should it stop?
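To make the two candidate semantics concrete, here is a minimal sketch (Python, with hypothetical `send`/`read_body` callables standing in for real client hooks) that records both durations for a single call:

```python
import time

def measure(send, read_body):
    """Record both candidate durations for one HTTP call.

    `send` and `read_body` are hypothetical stand-ins for real client
    events: `send` returns once response headers are available, and
    `read_body` returns once the body has been fully read.
    """
    start = time.monotonic()
    response = send()
    ttfb = time.monotonic() - start   # duration ends when headers arrive
    read_body(response)
    ttlb = time.monotonic() - start   # duration ends on the "end" event
    return ttfb, ttlb

# Stubbed I/O: ~10 ms to get headers, another ~20 ms to drain the body.
ttfb, ttlb = measure(
    send=lambda: time.sleep(0.01) or "headers",
    read_body=lambda r: time.sleep(0.02),
)
assert 0 < ttfb < ttlb
```

The same request yields two very different values depending on which event ends the measurement, which is exactly the ambiguity under discussion.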
I wish there were some consensus on this.
random article on the internet says
Those articles are talking about latency. Looking at those, it occurs to me that
Thinking long-term, this is the cleanest possible resolution of the problem.
I think allowing flexibility in how
If there's agreement, we could add a recommendation for
I believe we need both metrics (not necessarily now, and however we end up naming them). Despite the challenges for non-native HTTP client instrumentations, client libraries (such as the Azure SDK, Camel, Spring integrations, .NET Polly), service meshes, and user applications should be able to record both without any problem.

If we allow flexibility in what the metric measures, one option I see is eventually having three metrics: TTFB (latency), TTLB (duration), and client (no user code) call duration. Client call duration should be implemented in all instrumentations to enable a default and consistent experience. I don't see other options if we keep
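One possible reading of the three proposed metrics, sketched in Python with hypothetical timestamp names (an illustration of the idea, not an instrumentation API):

```python
from dataclasses import dataclass

@dataclass
class RequestTimeline:
    """Hypothetical per-request timestamps, in seconds from request start."""
    first_byte: float          # response headers / first body byte received
    last_byte: float           # response body fully read
    resources_released: float  # request ended; client released all resources

def ttfb(t: RequestTimeline) -> float:
    """Time to first byte -- the 'latency' reading."""
    return t.first_byte

def ttlb(t: RequestTimeline) -> float:
    """Time to last byte -- the 'duration' reading."""
    return t.last_byte

def client_call_duration(t: RequestTimeline) -> float:
    """One reading of 'client call duration': until resources are released."""
    return t.resources_released

# A request whose headers arrive quickly but whose body takes 20 s to read:
t = RequestTimeline(first_byte=0.1, last_byte=20.0, resources_released=20.5)
assert ttfb(t) < ttlb(t) < client_call_duration(t)
```

The example timeline mirrors the 100 ms vs. 20 s scenario mentioned later in the thread: which metric a dashboard shows changes the reported value by two orders of magnitude.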
One learning we have from the Azure SDK on the .NET side is that we can't reliably know when users are done reading the content - they frequently don't read all of it and don't dispose of the response. This is extensive and frequent, and we've made some design decisions based on it. I'm probably missing something, but I still don't understand how TTLB can work reliably given this.
Is it breaking based on the semantic convention stability rules?
Nope, not formally, but it's changing the semantics. One day my request duration is 100 ms (TTFB) and tomorrow it's 20 s (TTLB) - as a user, I'd be surprised. UPDATE: I think it is breaking, based on this:
alerts will be broken
@antonfirsov This Monday was a public holiday in the US, so nothing happened. We'll post once we have an update. @antonfirsov @JamesNK I wonder what your thoughts are on #3520 (comment) and #3520 (comment)? We don't know which byte is the last (i.e. if the user reads the stream until some condition/error/etc.) and we can't guarantee the response stream is closed on time. Arguably, applications that don't follow best practices need observability even more than well-behaved ones, and providing them skewed and irrelevant telemetry would be a bummer anyway.
@antonfirsov please review open-telemetry/semantic-conventions#70 (and open-telemetry/semantic-conventions#69) |
Think about the duration or span ending being separate from the response stream being read to the end. Instead, the end is tied to the end of the HTTP request (the point where the client releases all resources for it). An end can happen for many reasons:
The key point here is that an HTTP request will end even if someone stops using it and doesn't clean it up. HTTP clients know when the request has ended (for whatever reason). And if, for some reason, an app is leaving HTTP requests hanging and they are eventually closed because of a server timeout, that is useful information to communicate in metrics.
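The "ends for whatever reason" idea can be sketched as a small recorder (Python, with hypothetical event names) that records the duration at whichever end event fires first:

```python
import time

class RequestDuration:
    """Record one duration at whichever end event fires first.

    The event names are hypothetical; the point is that an HTTP client
    always observes *some* end (body fully read, dispose, error,
    cancellation, or a server/idle timeout), so the metric is recorded
    exactly once, even for abandoned requests.
    """
    def __init__(self):
        self._start = time.monotonic()
        self.duration = None
        self.end_reason = None

    def end(self, reason):
        if self.end_reason is None:   # first end event wins
            self.end_reason = reason
            self.duration = time.monotonic() - self._start

m = RequestDuration()
m.end("server_timeout")   # e.g. an abandoned request killed by the server
m.end("disposed")         # ignored: the request already ended
assert m.end_reason == "server_timeout"
```

Under this model an abandoned request still produces a data point, with a duration that reflects when the request actually ended rather than when the user last touched it.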
So we can only measure the responses that are read to the end/cancelled/failed. Assuming none of these things has happened, we don't know when the user is done. E.g. I can forget to dispose of the response, and I can sometimes get away with it. Should I know that I forgot to close it? Probably, from static analysis, but not at the cost of my telemetry's usefulness. E.g. it's super easy to ignore the response stream for non-successful status codes, and it's also super common to send error details in the response. Another case - I can write:

```csharp
async Task GetAndProcess(HttpRequestMessage request)
{
    using var response = await client.SendAsync(request);
    var content = readFirst100Bytes(response);
    // do unrelated, heavy and long stuff
    await _storageClient.Upload(content);
    _queue.Publish("new content ....");
}
```

I.e. when the response is disposed is irrelevant and says nothing about the time the user would be interested in. And asking people to close the response as soon as they're done with it is too much.
No, I mentioned timeouts:
For example:
If you stop using an HTTP request, then servers will kill it. Because, as you said, it's super easy for a client not to clean up a request. If servers didn't kill badly behaving HTTP clients, the internet would quickly DDOS itself because clients are awful (I write and ship HTTP servers).
@JamesNK timeout is fine, can you please comment on disposal? |
To clarify: in the sample above, the user did everything right, but the duration of their span (and metrics) is wrong because they didn't read the stream to the end and didn't close the response right away after being done with it. I.e.:

```csharp
async Task GetAndProcess(HttpRequestMessage request)
{
    using var response = await client.SendAsync(request);
    // this is the duration we can reliably measure
    var content = readFirst100Bytes(response);
    // this is when the user is done with the response, but we have no way of knowing
    await _storageClient.Upload(content); ...
    // this is a duration we'll measure as HTTP client call duration, which makes no sense to me
}
```
The code runs, but did the user do everything right? Leaving HTTP requests open can cause many problems:
Reporting that the request ended when the response headers arrived is false information. It potentially tells someone that the request completed in an orderly and correct way. On the surface, the HTTP request may look healthy, but in reality the request is still open. As explained above, that can cause many problems. HTTP request metrics should tell the truth and provide accurate information (it didn't end when you thought it ended) so someone can look at the data, see what is wrong with their app, and fix it.
The code is correct, but not optimal. Moreover, it can be the only reasonable thing to do when you conditionally need to continue reading the stream and checking the condition is a long operation.

I see what you're saying: by measuring the duration until the response is disposed (or fully completed), we're telling the user when resources are released, which is useful information. Users may then decide whether to optimize or to tolerate the perf impact.

What I'm trying to say is that tracing instrumentations so far have been measuring a different thing - the call duration - which is another piece of useful information: how long it took my application code to get a response at all, before I started running some other, potentially unrelated, code.

So, let's try to agree that both pieces of information are useful. Time to full completion is an additive thing, and we need to define new semantics for it.
And one more point: only by comparing the different times (1) when the user got a response, 2) when the user is done with it, 3) when the request is fully completed and all resources are released) can users understand where and what they can optimize. With just 3), they won't have a clue, and the duration will heavily depend on how the application code is written. Refactoring can change this time a lot, and it won't be possible to compare similar calls to the same service when they are made by different pieces of code.
@antonfirsov are you still interested in additional HTTP metrics, or do the clarifications to the duration metrics resolve this? thx! |
The clarification unblocked the .NET 8.0 metrics work, which is now done. We can discuss new metrics on demand in the future.
if all good, can you close this? (I don't have permissions in this repo) |
What are you trying to achieve?

The .NET team is working on the implementation of `http.client.duration` for `HttpClient`, and we were expecting this metric to have a clear definition regarding the duration span: whether it's the duration to the first byte of the body vs. the duration to the last byte of the body. Unfortunately, this is not the case, which leads to confusion and disagreement about how the metric should be interpreted.

From the discussion in #3519 it seems like both interpretations are valuable, and the preferred interpretation depends on the application. This makes me think that there should be two separate metrics instead of an ambiguous `http.client.duration`.

Additional context.

Statements from the discussion:

So the Java implementation already made the choice to go with the "time to first byte" interpretation, and some frameworks may have difficulty implementing the "time to last byte" version. It would be very unfortunate if different frameworks/libraries interpreted the metric in different ways.

The same problem also applies to `http.server.duration`, which has already been implemented as "time to last byte" in .NET and may have been (?) interpreted differently in other existing implementations.

Recommendation.

- `http.client.duration` and `http.server.duration` should be disambiguated in the specs. Because of the statements quoted, I assume the preferred interpretation would be "time to first byte"?
- Introduce `http.client.duration-to-last-byte` and `http.server.duration-to-last-byte`.
- Rename `http.client.duration` to `http.client.duration-to-first-byte`, though this would break existing implementations and potentially delay the .NET delivery.