Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection Activities in SocketsHttpHandler #101814

Closed
wants to merge 10 commits into from

Conversation

antonfirsov
Copy link
Member

Resolves #93832 by introducing an activity for every HTTP/1.1 and HTTP/2.2 connection, emitting ActivityEvent at DNS/TCP/TLS completion and linking the connection activity to the request activity when the request is assigned to a connection from the pool.

Marking as a draft for now, since there are several open questions:

  1. Do we also need a DiagnosticListener like in DiagnosticsHandler or are we OK just publishing the activities?
  2. Emitting an ActivityEvent doesn't publish notifications on ActivityListeners. However, when a request is completed, the linked connection activity should contain all the events by that time. Would this be "interactive enough" for Aspire?
  3. The proposed implementation does not emit DNS & TCP events when ConnectCallback is set. To overcome this, we would need to grab Activity.Current in downlevel libraries, create sub-activities instead of events, or expose an API so users can emit these events manually in their callbacks. Do we expect Aspire users to use ConnectCallback frequently enough to make it worth the trouble?
  4. The proposed solution doesn't emit events for HTTP/3. Is this limitation problematic? This would need to be solved in System.Net.Quic (again either by emitting events on Activity.Current or by creating sub-activities instead of events).
  5. The lifetime of the activity doesn't match the span of the http.client.connection.duration metric, since the latter starts measuring time only after a successful TLS handshake. Do we need to harmonize the two?
  6. What should be the naming convention for the events emitted? Are there any rules?

cc @samsp-msft @noahfalk @MihaZupan @lmolkova

@samsp-msft
Copy link
Member

samsp-msft commented May 2, 2024

Do we also need a DiagnosticListener like in DiagnosticsHandler or are we OK just publishing the activities?

I don't think so - although related, we don't need to do a diagnostic source for this scenarios

Emitting an ActivityEvent doesn't publish notifications on ActivityListeners. However, when a request is completed, the linked connection activity should contain all the events by that time. Would this be "interactive enough" for Aspire?

Yes. The Aspire dashboard will be able to resolve links when we implement dotnet/aspire#2577. This will be a forcing function to make that happen.

The proposed implementation does not emit DNS & TCP events when ConnectCallback is set. To overcome this, we would need to grab Activity.Current in downlevel libraries, create sub-activities instead of events, or expose an API so users can emit these events manually in their callbacks. Do we expect Aspire users to use ConnectCallback frequently enough to make it worth the trouble?

According to @ReubenBond, we don't use it in Aspire

The proposed solution doesn't emit events for HTTP/3. Is this limitation problematic? This would need to be solved in System.Net.Quic (again either by emitting events on Activity.Current or by creating sub-activities instead of events).

Http/3 usage is still uncommon, and more complicated due to a QUIC connection being more virtual than physical. We should flush out the scenarios with 1.1/2 and then look at HTTP/3 later.

The lifetime of the activity doesn't match the span of the http.client.connection.duration metric, since the latter starts measuring time only after a successful TLS handshake. Do we need to harmonize the two?

It would be better if we can be consistent, but not if that causes a bunch of extra trouble for that metric.

What should be the naming convention for the events emitted? Are there any rules?
We should follow similar patterns to semantic conventions for name/value pairs and activity names.

@lmolkova probably has recommendations.

@@ -104,6 +108,17 @@ public void MarkConnectionAsNotIdle()
_connectionMetrics?.IdleStateChanged(idle: false);
}

public void LinkRequestActivity()
{
Activity? requestActivity = Activity.Current;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we check _activity is null before looking up Activity.Current to make it a bit cheaper when disabled?

return;
}

requestActivity.AddLink(new ActivityLink(_activity.Context));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is tooling/infra prepared to deal with there being millions of links to a given connection?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Millions of links to is not a problem, millions of links from could be!

public const string RequestActivityStopName = RequestActivityName + ".Stop";

public const string ConnectionNamespace = "System.Net.Http.Connections";
public const string ConnectionActivityName = ConnectionNamespace + ".HttpConnectionOut";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a convention to have an Out postfix on these names?

Copy link
Member

@noahfalk noahfalk May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a .NET convention to do so, but not an OTel convention. In general we probably want to sync with @lmolkova on naming and conventions here. If we can determine an OTel convention that would be preferable.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK this one is OperationName and I think has no OTel equivalent.
For the DisplayName, we'd need to define a new span and span name.

Does this one last for the connection lifetime? If so I'd use something like

connection {server.address}:{server.port}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to figure out what is more clear in a trace:
System.Net.Http.Connections.HttpConnectionOut
or
System.Net.Http.Connections.HttpConnection ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK this one is OperationName and I think has no OTel equivalent.
For the DisplayName, we'd need to define a new span and span name.

OperationName is the default value used for DisplayName unless it is overriden somewhere. I don't think we get any value picking one string for OperationName and a different string for DisplayName. So we probably want OperationName = DisplayName = connection {server.address}:{server.port}

I am trying to figure out what is more clear in a trace

What gets transmitted by OTLP and rendered in the UI should be DisplayName and @lmolkova is suggesting that be "connection {server.address}:{server.port}". I think that would be clearer than either of those namespaced names no?

Copy link

@lmolkova lmolkova May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OperationName could still be important for the DiagnosticSource (DS -> EventSource) as the event name and then it's nice to have a const string for it, but you all know much more about it than I do.

Copy link
Member Author

@antonfirsov antonfirsov May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OperationName could still be important for the DiagnosticSource

This PR is not adding DiagnosticSource / DiagnosticListener, only ActivitySource + activities, because this is what #93832 is about in my understanding. Is there a reason for also adding a DiagnosticListener like we have it in DiagnosticHandler? I know little about how is this being consumed.

So we probably want OperationName = DisplayName = connection {server.address}:{server.port}

Given that OperationName has no OTel equivalent according to #101814 (comment), IMO it would be better if the OperationName followed the existing .NET convention for consistency. We can use a different DisplayName if we want. In that case wouldn't we also want to change the DisplayName for the request activity, or do we consider it a breaking change we better avoid?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that OperationName has no OTel equivalent according to #101814 (comment), IMO it would be better if the OperationName followed the existing .NET convention for consistency.

👍

In that case wouldn't we also want to change the DisplayName or the request activity, or do we consider it a breaking change we better avoid?

I would consider it part of onboarding onto HTTP semantic conventions which define what span name looks like.
My 2c on if changing DisplayName is breaking: existing otel http instrumentations set span name according to the conventions and it could only be breaking to something else like vanilla ActivityListener, but they'd still have the same OperationName as before

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OperationName could still be important for the DiagnosticSource (DS -> EventSource) as the event name and then it's nice to have a const string for it, but you all know much more about it than I do.

You can filter on Activity name as part of DiagnosticSourceEventSource and ActivityListener.Start/Stop, but not as part of ActivityListener.Sample or OTel. Given that OTel is a primary scenario we want to support I'd want to ensure that ActivitySource alone is sufficiently fine-grained to do the filtering we'd like to do. It won't harm anything to pick some constant value for OperationName that differs from DisplayName, but ideally nothing should be relying on us doing that and we'd move away from showing OperationName in our tools as a distinct piece of data.

@samsp-msft
Copy link
Member

@antonfirsov - would it be possible to show the output from a basic scenario. The easiest way to do that is to use the consoleExporter for Tracing.

Add a nuget ref for OpenTelemetry.Exporter.Console

And then use

        builder.Services.AddOpenTelemetry()
            .WithTracing(tracing =>
            {
              tracing.AddSource("System.Net.Http.Connections");
              tracing.AddSource("System.Net.Http");
              tracing.AddConsoleExporter();
             });

@noahfalk
Copy link
Member

noahfalk commented May 7, 2024

Emitting an ActivityEvent doesn't publish notifications on ActivityListeners. However, when a request is completed, the linked connection activity should contain all the events by that time. Would this be "interactive enough" for Aspire?

Yes. The Aspire dashboard will be able to resolve links when we implement dotnet/aspire#2577. This will be a forcing function to make that happen.

Beware the timing of when the Activity gets sent may make the UX awkward when trying to view the telemetry data live (or close to live). The OTel SDK and exporter do not transmit any information about an Activity until it has stopped as far as I am aware. If a request completes but the connection that serviced the request doesn't close until 10 minutes later that means there will be a 10 minute period where the event timings haven't been transmitted yet. If you want the timings to be available more quickly in the UI the options I can think of are:

  • Update the OTLP spec to allow sending information about incomplete Activities, then update OTel.NET exporter to do that. This is probably a lengthy path as it involves lots of discussion, consensus building, and coordination.
  • Instead of Activity events, use structured log messages.

@@ -74,10 +74,12 @@ public static async ValueTask<SslStream> EstablishSslConnectionAsync(SslClientAu
sslStream.AuthenticateAsClient(sslOptions);
}
}
activity?.AddEvent(new ActivityEvent("Tls.Authenticated"));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to use spans and avoid using span events at all costs:

  1. A child span for TLS handshake would have a duration and timestamp
  2. It won't be verbose since connections are long-lived (much longer than requests)
  3. It can live without HTTP instrumentation on
  4. It can be turned off (if coming from a different source) while HTTP is on
  5. Span events are likely to be deprecated in the future in favor of independent Event API similar to logging
  6. We'll have standard means to record errors/duration.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to use spans and avoid using span events at all costs:

That seems fine to me. I mentioned in another comment that there was a timing issue where one long activity for the entire connection wouldn't transmit the events until the connection closed. I suggested changing the events to log messages as one potential solution, but changing them to spans also solves that problem.

It won't be verbose since connections are long-lived (much longer than requests)

I assume at the limit we might have one connection per request, but I hope such scenarios are uncommon in any high throughput service. Worst case if connection activity overhead was too high the app developer can always sample them or turn them off.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the advantages that come from implementing this with sub activities. Another big advantage is that it would also work if ConnectCallback is present, since we would implement them in downstream libraries. However, there are some challanges:

  • It increases the scope of the problem, since we would need to touch multiple System.Net.* libraries. We may need to pull in more reviewers from the networking team.
  • If we want to add attributes to those sub-activities, it increases the scope even further since we would need to make decisions on the characteristics of those attributes (eg. naming, possible values for attributes like error.type, etc). As I also noted in Connection Activities in SocketsHttpHandler #101814 (comment), my recommendation would be to track this in a separate issue.
  • If there is an Activity.Current to be used as a parent activity, and System.Net.Http.Connections.HttpConnectionOut is not enabled, the DNS/TCP/TLS activity would attach to the Http activity as parent which is somewhat unfortunate. We would need to document that users should avoid this.

Regardless, I agree sub-activities are better. If we have consensus, I will change this PR to add activities in downstream libs. @MihaZupan any concerns?

changing the events to log messages

Just to not exclude this option completely yet: @noahfalk do you mean a writing to DiagnosticListener by this or some other technique?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is an Activity.Current to be used as a parent activity, and System.Net.Http.Connections.HttpConnectionOut is not enabled, the DNS/TCP/TLS activity would attach to the Http activity as parent which is somewhat unfortunate. We would need to document that users should avoid this.

This is why when we originally talked I said not to use Activity.Current and be explicit about passing in the activity if you can.
If the DNS and/or TLS activities are invoked and the parent is not a connection (As the source isn't enabled) , then they can look at the activity and see if its a connection or not and therefore not parent themselves to it. You will still see them on the overall timeline.

The other hack would be to put all these activities in the same source, so they can only be enabled/disabled as a group. But as I think they would be useful for other scenarios beyond http connections, that really does seem like a hack.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about something that would ultimately be ILogger but later recalled that I don't think ILogger is easy to generate from the level of this code in the stack. Technically there are ways around that but I wouldn't worry about pursuing it any further. I think sub-activities are both easier and better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But as I think they would be useful for other scenarios beyond http connections

This is why I would prefer to create the sub-activities in downlevel libs. System.Net.Sockets|NameResolution|Security would all have their own ActivitySource-s.

passing in the activity if you can.

It's not possible to pass it to downlevel libraries without significant (and IMO bloated) API additions, which we likely don't want.

switch (_kind)
{
case HttpConnectionKind.Http:
case HttpConnectionKind.Https:
case HttpConnectionKind.ProxyConnect:
stream = await ConnectToTcpHostAsync(_originAuthority.IdnHost, _originAuthority.Port, request, async, cancellationToken).ConfigureAwait(false);
stream = await ConnectToTcpHostAsync(_originAuthority.IdnHost, _originAuthority.Port, request, async, activity, cancellationToken).ConfigureAwait(false);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a chance that we'd need to instrument connections outside of HTTP client stack?
If so, would we regret not adding these ones to Socket and friends?

@@ -666,13 +680,19 @@ private async ValueTask<Stream> ConnectToTcpHostAsync(string host, int port, Htt
{
if (async)
{
await socket.ConnectAsync(endPoint, cancellationToken).ConfigureAwait(false);
IPAddress[] addresses = await Dns.GetHostAddressesAsync(host, cancellationToken).ConfigureAwait(false);
activity?.AddEvent(new ActivityEvent("Dns.Resolved"));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here on events. Spans would provide much more value and flexibility than events.

IPAddress[] addresses = await Dns.GetHostAddressesAsync(host, cancellationToken).ConfigureAwait(false);
activity?.AddEvent(new ActivityEvent("Dns.Resolved"));
await socket.ConnectAsync(addresses, port, cancellationToken).ConfigureAwait(false);
activity?.AddEvent(new ActivityEvent("Transport.Connected"));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metric that shows how long it takes to establish a connection and a span that shows an establishment attempt is extremely valuable, while event only allow to visually see it on a trace.

Query that would calculate duration between parent activity start time to event timestamp would be not a great experience.

{
Stream? stream = null;
IPEndPoint? remoteEndPoint = null;
Activity? activity = CreateActivity()?.Start();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make sure we don't forget to pick relevant attributes for this one and hopefully TLS, DNS and connect spans.

@@ -689,9 +709,10 @@ private async ValueTask<Stream> ConnectToTcpHostAsync(string host, int port, Htt
}
catch (Exception ex)
{
activity?.Stop();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'll need to record error.type

Copy link
Member Author

@antonfirsov antonfirsov May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR doesn't add attributes and it would be weird to record only error.type. We are tracking the addition of attributes to the request activity in #93019. We may consider adding attributes to the connection span as well, but I would do it in a separate PR together with or after resolving #93019. @samsp-msft do you think think this would be valuable? If yes, my recommendation is to open a separate tracking issue.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good!

Copy link
Contributor

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@karelz karelz added this to the 9.0.0 milestone Jun 24, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Jul 25, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider adding activities for potentially long running network operations
6 participants