Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add initial common infrastructure for wiring up distributed tracing #6916

Merged
merged 17 commits into from
Jul 16, 2019

Conversation

pakrym
Copy link
Contributor

@pakrym pakrym commented Jul 12, 2019

Fixes: #6796

Our goal is to provide a way to emit distributed traces from Azure client libraries. The distributed trace usually contains a set of spans. Span represents a unit of work in a distributed system, it has a start time, duration, name and a set of attached attributes.

We are using Activity (https://docs.microsoft.com/en-us/dotnet/api/system.diagnostics.activity?view=netcore-2.2) class to represent spans in .net. Activity.Current is async-local and flows with the async context so child Activities can be attached to parents.

In addition to work in this PR there would be an adapter that would turn Activities we create into OpenTelemetry (https://opentelemetry.io/) spans.

Add enough types to get basic 2 level diagnostic trace going.

2 levels are:

  1. Operation, like ConfigurationClient.Get
  2. HttpRequest

I also updated the ConfigurationClient to create tracing scopes in client methods.

Basic pattern is:

using ClientDiagnosticScope scope = _pipeline.Diagnostics.CreateScope("ConfigurationClient.Get");
// Add you string attributes
scope.AddAttribute("key", key);
// Add attributes that need to be computed lazily to avoid allocations when 
// tracing is disabled
scope.AddAttribute("value", integerValue, i => i.ToString());

scope.Start();

try
{
    // Create and make a request
}
catch (Exception e)
{
    scope.Failed(e);
}

Names of diagnostic sources and events will change in the future.

Sample trace that is produced using a zipkin exporter:

image

@pakrym pakrym requested a review from lmolkova July 15, 2019 19:01

public void AddAttribute<T>(string name, T value)
{
if (_activity != null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd avoid doing this. Activity tags are always strings. ToString() can be called by caller and it's better to make it clear from API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows us to avoid formatting to string when there is no diagnostic source attached.

Copy link
Member

@lmolkova lmolkova Jul 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a special Activity.Recorded flag now - it should be false if there is no listener and if you badly need tags - you should add them only if activity is recorded (i.e. sampled in). OpenTelemetry should take care of setting it to true when needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be a point in creating activity at all if it's not being recorded?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, for propagation: you don't record it, but service before you and after you did - they can correlate telemetry (but causation is broken)

{
if (_activity != null && _source.IsEnabled(_name))
{
_source.StartActivity(_activity, null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Http requests we should pass payload. It is more efficient to pass it (and parse it) then create multiple attributes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not for http requests, this activity is for client level operations.


if (_source != null)
{
_source.StopActivity(_activity, null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, we should have payload

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't always have payload at this level and we won't be able to have an adapter library per client SDK anyway.

return;
}

_source?.Write(_activity.OperationName + ".Exception", e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most likely it will be ignored by tracing system as it is likely handled and just noise. Unhandled exceptions are taken care of by web framework, So I think we don't need it for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we care about setting a status flag on spans?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do, can we do it better than Exception instance? If not, that's ok, but we need a plan

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, but using an exception should be an option too.

@@ -20,6 +20,7 @@
<ItemGroup>
<PackageReference Include="Microsoft.Bcl.AsyncInterfaces" />
<PackageReference Include="System.Buffers" />
<PackageReference Include="System.Diagnostics.DiagnosticSource" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, we need to use DiagSource 4.6.0 preview, it uses W3C trace-context and we should never use old Id format in Azure SDKs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the format supposed to be controlled by the app?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -113,8 +123,13 @@ public virtual async Task<Response<ConfigurationSetting>> AddAsync(Configuration
/// <param name="cancellationToken">A <see cref="CancellationToken"/> controlling the request lifetime.</param>
public virtual Response<ConfigurationSetting> Add(ConfigurationSetting setting, CancellationToken cancellationToken = default)
{
using (Request request = CreateAddRequest(setting))
using DiagnosticScope scope = _pipeline.Diagnostics.CreateScope("ConfigurationClient.Add");
scope.AddAttribute("key", setting?.Key);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically a configuration option name.

@@ -126,6 +141,11 @@ public virtual Response<ConfigurationSetting> Add(ConfigurationSetting setting,
throw response.CreateRequestFailedException();
}
}
catch (Exception e)
{
scope.Failed(e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to think more on how to track status of Activity. Exceptions are not perfect (they are likely handled). Will we parse it (with type +something like WebException.Status ) in opentelemetry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to parse it?

Copy link
Member

@lmolkova lmolkova Jul 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean from the exception type we can understand if it is let's say timeout or HttpRequestException.
In HttpRequestException, it is useful to know the failure details (like these https://docs.microsoft.com/en-us/dotnet/api/system.net.webexceptionstatus?view=netframework-4.8)

One thing that bothers me with exception is this:

  • request starts, tracing system doesn't care about Start
  • exception happens - tracing system now have to store it and map it to what? It didn't event create Span yet. So it'll set something on the Activity tags or will be forced to create Span on start and access current span to attach exception to.
  • request ends - ok, now tracing system tracks Span and can set status

Other approaches could be:

  • set status on the Activity.Tags - but it has to be a string
  • set status in the Stop payload - we need to agree on the contract (prop name, value type)

Let's discuss it tomorrow. Activity kinda has to have status and if we agree that is the long term plan - then I like the Exception approach now.

<PackageReference Update="System.Threading.Tasks.Extensions" Version="4.5.2" />
<PackageReference Update="System.ValueTuple" Version="4.5.0" />
<PackageReference Update="WindowsAzure.ServiceBus" Version="5.1.0" />
<PackageReference Update="WindowsAzure.Storage" Version="9.3.3" />
<PackageReference Update="xunit.runner.visualstudio" Version="2.4.1" />
<PackageReference Update="xunit" Version="2.4.1" />
</ItemGroup>
<!-- Track 2 specific versions -->
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a conditional section for Track 2 libraries because I'd like to update some package versions that might be used in track 1 clients too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that DiagnosticsSource and Unsafe are the only 2 packages that are shared? I'm OK with keeping these separate at least for now while we are consuming preview versions of those packages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, those were the ones I was worried about.

}
}
}
else
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think supporting Request-Id doesn't make sense. Azure Services will never implement Request-Id header so propagating it won't help anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what should we do in case of non-W3C activity?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two options

  1. Insist on W3C activity (in preview7 there will be an instance method Activity.SetIdFormat()). Then if a user had activity in the old format, the one created by Azure SDK will not correlate to the old one. But anything that happens in and after Azure SDK (e.g. EventHub and message processing or Storage logs and EventGrid notifications followed by Function) will be correlated.

  2. Tolerate old format but not inject headers. Then everything before and including Azure SDK will correlate, anything after will become a new trace.

Looking into this, there is no good solution for format mismatch. But I think this is not Azure SDK responsibility to solve this issue - tracing system should control the format and enforce W3C on the incoming boundary.

In the bright future, there will be nothing else than W3C and we won't have this problem.

So I suggest do 2.

@pakrym
Copy link
Contributor Author

pakrym commented Jul 16, 2019

I'm going to merge this as it's mostly fine so I can have fewer changes in flight and start working on storage codegen.

I'll address further comments as part of a new PR.

@pakrym pakrym merged commit 369ea8e into Azure:master Jul 16, 2019
@pakrym pakrym deleted the pakrym/dt-basics branch July 16, 2019 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prototype tracing implementation
4 participants