Added distributed tracing support: activity and diagnostic listener. #6853

Merged: 6 commits, Aug 30, 2021

Conversation

Ilchert
Contributor

@Ilchert Ilchert commented Dec 9, 2020

Added distributed tracing support: added grain filters for activity propagation.
#4992

@alexvaluyskiy

Have you looked at ActivitySource? It is the new approach Microsoft recommends for distributed tracing.
It uses the same Activity class, but the Activity should be started via ActivitySource instead of DiagnosticSource.
You could look at this example:
https://github.com/open-telemetry/opentelemetry-dotnet/blob/master/src/OpenTelemetry.Instrumentation.StackExchangeRedis/Implementation/RedisProfilerEntryToActivityConverter.cs

@Ilchert
Contributor Author

Ilchert commented Dec 10, 2020

@alexvaluyskiy, I have seen ActivitySource, but it is not available in .NET Core 3. Also, if I understand correctly, ActivitySource just simplifies working with activities and DiagnosticSource, and my current code does the same.

@alexvaluyskiy

It works in .NET Core 3.1; you just need to add a reference to the latest version of 'System.Diagnostics.DiagnosticSource'.

@alexvaluyskiy

ActivitySource allows you to add tracing to OpenTelemetry without any additional code.

@Ilchert
Contributor Author

Ilchert commented Dec 11, 2020

@alexvaluyskiy, it is interesting: according to MSDN it is available since .NET 5, but there is also a NuGet package. However, Orleans uses version 4.7 of the DiagnosticSource NuGet package and I can't increase the version, because it depends on other 5.0 libs (System.Runtime.CompilerServices.Unsafe in full .NET). So we have to upgrade all references to current versions and then update the pull request to use ActivitySource.
But ultimately, this PR does the same as ActivitySource: it creates an activity and pushes info into DiagnosticSource.

@Ilchert
Contributor Author

Ilchert commented Dec 11, 2020

Also, the main goal of this PR for me is propagating the trace ID over the Orleans transport to support trace headers inside HttpClient.

@ReubenBond
Member

ReubenBond commented Dec 16, 2020

Thank you, this looks good to me and we can accept it with the change mentioned (DiagnosticSource/DiagnosticListener subclass).

We can upgrade all references to .NET 5 libs as long as they are still compatible with .NET Standard 2.0. It will take some work (must edit Directory.Build.props in the root directory to upgrade the packages).

This could also be a separate package, on OrleansContrib, for example.

@Ilchert
Contributor Author

Ilchert commented Dec 16, 2020

> Thank you, this looks good to me and we can accept it with the change mentioned (DiagnosticSource/DiagnosticListener subclass).

@ReubenBond please confirm that you prefer a subclass to a static field.

@ReubenBond
Member

I'm fine with a static field if the value will never need to participate in DI. A subclass may be cleaner.

@Ilchert
Contributor Author

Ilchert commented Dec 16, 2020

Thanks, I will come back in a few days.

@Ilchert
Contributor Author

Ilchert commented Dec 17, 2020

@ReubenBond done.

@alexvaluyskiy

@cijothomas could you please look at this implementation?

@cijothomas

> @cijothomas could you please look at this implementation?

Ack. I'm on vacation now; I will be back by the first week of January and take a look.

@cijothomas

It is strongly advised to use the ActivitySource API to start the Activity, instead of doing new Activity(). ActivitySource was introduced to make Activity more aligned with OpenTelemetry efforts, and has capabilities like built-in sampling.
It's available as a NuGet package for all .NET versions and, being a System.* package, is fully backward compatible.
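
For readers following along, here is a minimal sketch of the difference being described. It assumes the System.Diagnostics.DiagnosticSource package at version 5.0 or later; the source name "Example.Orleans.Tracing" and the operation name "GrainCall" are hypothetical and do not come from this PR.

```csharp
#nullable enable
using System;
using System.Diagnostics;

// Hypothetical source name, for illustration only.
using var source = new ActivitySource("Example.Orleans.Tracing");

// With no listener registered, nothing is sampled and StartActivity returns null.
Activity? unsampled = source.StartActivity("GrainCall");

// A listener opts in to the source and decides how much data to collect.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Example.Orleans.Tracing",
    Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllData,
};
ActivitySource.AddActivityListener(listener);

// The same call now produces a started Activity.
using Activity? activity = source.StartActivity("GrainCall", ActivityKind.Client);
activity?.SetTag("example.tag", "value");

string withoutListener = unsampled is null ? "null" : "started";
Console.WriteLine($"Without listener: {withoutListener}");
Console.WriteLine($"With listener: {activity?.OperationName}, kind {activity?.Kind}");
```

Because StartActivity can return null, callers use the null-conditional operator; this is how sampling stays cheap when nobody is listening.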

@cijothomas

Additionally, I would recommend using the OpenTelemetry concept of Propagators to propagate context, instead of propagating manually.
(But this would require adding a dependency on the OpenTelemetry library. This is something we hope to solve in the .NET 6 timeline, where .NET DiagnosticSource would expose a propagator API (dotnet/runtime#46054).)

@Ilchert
Contributor Author

Ilchert commented Jan 29, 2021

@ReubenBond could you please summarize: should I cancel this PR and wait for the 5.0 libs, reference OpenTelemetry, or is it OK for the current Orleans version?

@ReubenBond
Member

@Ilchert I was under the impression that @cijothomas' comment needed addressing.
On the other hand, I also believe this can be turned into an OrleansContrib project, since it doesn't have any requirements on changing the core.

@oising
Contributor

oising commented Feb 3, 2021

hey @Ilchert -- this might help you kickstart the modifications to use what @cijothomas is suggesting:

https://rehansaeed.com/deep-dive-into-open-telemetry-for-net/
https://rehansaeed.com/open-telemetry-for-asp-net-core/

@Ilchert
Contributor Author

Ilchert commented Feb 3, 2021

@ReubenBond thanks, I will update this PR when Orleans updates its dependencies to the 5.x versions. I believe that a framework for distributed applications must support distributed tracing :)

@Ilchert Ilchert closed this Feb 3, 2021
@cijothomas

This is the link to the official instrumentation doc: https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/src/OpenTelemetry.Api#instrumenting-a-libraryapplication-with-net-activity-api
Open issues in the OpenTelemetry-dotnet repo if you have questions or concerns.

@Ilchert
Contributor Author

Ilchert commented Feb 3, 2021

@cijothomas thanks!

@ReubenBond
Member

> I believe that a framework for distributed applications must support distributed tracing :)

I agree, it's very important.

@Ilchert Ilchert reopened this Feb 16, 2021
@oising
Contributor

oising commented Feb 17, 2021

@Ilchert @cijothomas -- I would love to use this functionality with Orleans 3.4.x -- if I understand the conversation above, if I reference the OpenTelemetry nuget package(s), I should be able to get this working with minimal changes, right?

@Ilchert
Contributor Author

Ilchert commented Feb 17, 2021

@oising, you can just upgrade the DiagnosticSource package to 5.0. OpenTelemetry does not have a stable package.

@cijothomas

> @oising, you can just upgrade the DiagnosticSource package to 5.0. OpenTelemetry does not have a stable package.

It does have a stable release now (released last week): https://www.nuget.org/packages/opentelemetry.api

@cijothomas

> @Ilchert @cijothomas -- I would love to use this functionality with Orleans 3.4.x -- if I understand the conversation above, if I reference the OpenTelemetry nuget package(s), I should be able to get this working with minimal changes, right?

Mostly you'd just need the System.Diagnostics.DiagnosticSource reference.
If you need to leverage the OpenTelemetry context propagation mechanism, then you'd need to add a reference to the OpenTelemetry.Api NuGet package as well.

https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/src/OpenTelemetry.Api#instrumenting-a-libraryapplication-with-net-activity-api

@ReubenBond ReubenBond added this to the 4.0.0 milestone Apr 28, 2021

/// <returns>The builder.</returns>
public static ISiloBuilder AddActivityPropagation(this ISiloBuilder builder)
{
    if (Activity.DefaultIdFormat != ActivityIdFormat.W3C)
Contributor

@amccool amccool Jun 11, 2021

I am running into systems that have DefaultIdFormat set to hierarchical, which causes this to throw. It appears that https://devblogs.microsoft.com/aspnet/improvements-in-net-core-3-0-for-troubleshooting-and-monitoring-distributed-apps/ and other posts such as https://jimmybogard.com/building-end-to-end-diagnostics-and-tracing-a-primer-trace-context/ recommend setting:
Activity.DefaultIdFormat = ActivityIdFormat.W3C;

Obviously the ClientBuilderExtension would need to change as well.

The systems in question run Windows Server 2016 with the .NET Core 3.1 runtime. .NET 5.0 is not installed.

Contributor Author

Hello, @amccool, we have this check because the internal implementation of the context uses some W3C-specific features, like scope, tags, etc. So, to use this extension you must change the default format to W3C.
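
As an illustration of the opt-in mentioned here, the following sketch switches the process-wide default to W3C at startup and checks the resulting ID. It is an assumption about how an application might satisfy the check, not code from this PR; the operation name "StartupCheck" is made up, and `new Activity()` is used only to keep the format check short.

```csharp
#nullable enable
using System;
using System.Diagnostics;

// Opt in to W3C IDs process-wide, ideally before any activity is started.
Activity.DefaultIdFormat = ActivityIdFormat.W3C;
// Use the default format even when a parent activity has a hierarchical ID.
Activity.ForceDefaultIdFormat = true;

// Hypothetical operation name; started directly just to inspect the ID format.
var activity = new Activity("StartupCheck");
activity.Start();

Console.WriteLine(activity.IdFormat);
Console.WriteLine(activity.Id);

activity.Stop();
```

A W3C ID has the shape `00-<trace id>-<span id>-<flags>`, which is what the traceparent header carries.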

@magole magole Jul 23, 2021

Please reconsider limiting this to W3C only. Many systems rely on hierarchical IDs, and the format is set for the entire app. Hierarchical IDs help with logs stored in DBs, since prefix matches can use an index. For example: startswith "|f728de91a402224da40046d1234e83c5.3b48459462d8e54e."

The default DistributedContextPropagator will handle both hierarchical and W3C ActivityId formats.

@oising
Contributor

oising commented Jun 29, 2021

Hmm, I've been playing with this extensively the last couple of days and I realized that grains called by other grains are not shown as dependents. That is to say, all dependent grain calls fall under the root call in the request, rather than as children of the calling grain's span. I don't think this is expected behavior.

Example - here I'm using the out-of-the-box OTEL instrumentation for aspnetcore and httpclient, and fronting a graphql server that talks to our cluster. You can see that the dependencies chain correctly for the OOTB calls, up to the first grain call, but once you go beyond the initial hop into Orleans, all calls are direct dependents of the root:

"getgateway" is a span representing the graphql query execution.

image

@ReubenBond - I would hold off on this, personally.

(ignore the slow startup times -- it does late connecting to services, and is running on my overloaded laptop lol)

UPDATE: Never mind... it turns out the use of GraphQL dataloaders and the way field resolvers are lazily evaluated is causing the chain to be broken.

@SebastianStehle
Contributor

I really like this and it is a super cool feature. But I guess to also support stuff like serializers and other filters you would have to implement this deeper into the system.

But can it not be merged and then improved later?

@davidfowl
Member

cc @shirhatti

@ReubenBond
Member

ReubenBond commented Aug 30, 2021

> But can it not be merged and then improved later?

Yes, it can. I think this PR is taking the right approach, too.
Many thanks to everyone who contributed to this either with code or with comments.

@ReubenBond ReubenBond merged commit 001be5b into dotnet:main Aug 30, 2021
Comment on lines +96 to +104
// Create activity from context directly
var traceParent = RequestContext.Get(TraceParentHeaderName) as string;
var traceState = RequestContext.Get(TraceStateHeaderName) as string;
var parentContext = new ActivityContext();

if (traceParent is not null)
{
    parentContext = ActivityContext.Parse(traceParent, traceState);
}

I would replace this logic and rely on the configured DistributedContextPropagator instead. I'm not familiar with Orleans, but in ASP.NET Core we use the propagator registered in DI and, if it is not present, we fall back to the globally configured DistributedContextPropagator.Current.
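
A rough sketch of what this suggestion could look like on the incoming side, assuming System.Diagnostics.DiagnosticSource 6.0 or later. A plain dictionary stands in for Orleans' RequestContext, and the source name, operation name and `ReadHeader` helper are hypothetical; the traceparent is just a sample value in W3C format.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Stand-in for the headers of an incoming message (an assumption, not Orleans' RequestContext).
var headers = new Dictionary<string, string>
{
    ["traceparent"] = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
};

// Hypothetical getter: tells the propagator how to read one field from the carrier.
static void ReadHeader(object? carrier, string fieldName, out string? fieldValue, out IEnumerable<string>? fieldValues)
{
    fieldValues = null;
    fieldValue = null;
    if (carrier is Dictionary<string, string> map && map.TryGetValue(fieldName, out var value))
    {
        fieldValue = value;
    }
}

// Let the configured propagator decide which fields carry the trace context.
DistributedContextPropagator.Current.ExtractTraceIdAndState(
    headers, ReadHeader, out string? traceParent, out string? traceState);

// If parsing fails, parentContext stays default and the activity starts without a remote parent.
ActivityContext.TryParse(traceParent, traceState, out ActivityContext parentContext);

using var source = new ActivitySource("Example.Incoming");
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Example.Incoming",
    Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllData,
};
ActivitySource.AddActivityListener(listener);

using Activity? activity = source.StartActivity("IncomingCall", ActivityKind.Server, parentContext);
Console.WriteLine(activity?.TraceId);
Console.WriteLine(activity?.ParentSpanId);
```

The filter no longer hard-codes header names; swapping DistributedContextPropagator.Current changes the wire format without touching this code.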

Also, was it intentional not to propagate Baggage? Or was it an omission?

Contributor

DistributedContextPropagator is .NET 6.0 only. I'm not sure Orleans 4.0 should be constrained to that, unless there are plans to backport it to 5.0.

Member

You may try to reference the latest DiagnosticSource package, https://www.nuget.org/packages/System.Diagnostics.DiagnosticSource/6.0.0-preview.7.21377.19, which should support the propagators and should work in .NET 5.0 too.

Contributor Author

@Ilchert Ilchert Aug 31, 2021

@tarekgh we had the same discussion about the v5 packages. I can't just upgrade packages, especially to preview versions. I think we should refactor this code after Orleans upgrades all NuGet references to 6.0.

Contributor

@tarekgh Oh, that would be good if it works :)

Comment on lines +80 to +86
if (currentActivity is not null &&
    currentActivity.IdFormat == ActivityIdFormat.W3C)
{
    RequestContext.Set(TraceParentHeaderName, currentActivity.Id);
    if (currentActivity.TraceStateString is not null)
        RequestContext.Set(TraceStateHeaderName, currentActivity.TraceStateString);
}

As I mentioned on the incoming filter, the outgoing filter should also use a propagator.
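
For the outgoing side, a comparable sketch might look like the following, again assuming System.Diagnostics.DiagnosticSource 6.0 or later and W3C as the default ID format. The dictionary carrier, the `WriteHeader` helper, the source name and the baggage entry are all illustrative assumptions rather than Orleans APIs.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics;

using var source = new ActivitySource("Example.Outgoing");
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Example.Outgoing",
    Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllData,
};
ActivitySource.AddActivityListener(listener);

using Activity? activity = source.StartActivity("OutgoingCall", ActivityKind.Client);
// Hypothetical baggage entry; the propagator picks the field name it is sent under.
activity?.AddBaggage("tenant", "example");

// Stand-in for the headers of an outgoing message.
var headers = new Dictionary<string, string>();

// Hypothetical setter: tells the propagator how to write one field to the carrier.
static void WriteHeader(object? carrier, string fieldName, string fieldValue)
{
    if (carrier is Dictionary<string, string> map)
    {
        map[fieldName] = fieldValue;
    }
}

// The propagator writes the trace context and baggage fields it considers appropriate.
DistributedContextPropagator.Current.Inject(activity, headers, WriteHeader);

foreach (var (name, value) in headers)
{
    Console.WriteLine($"{name}: {value}");
}
```

Unlike the manual version, this also carries baggage, which addresses the question raised on the incoming filter.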

@shirhatti

👀 @tarekgh

@ReubenBond
Member

We have a PR open to implement this, courtesy of @suraciii - comments welcome

@amccool
Contributor

amccool commented Dec 19, 2021

It would be good to have the option to filter out system grains, similar to the dashboard.

@suraciii
Contributor

suraciii commented Dec 19, 2021

@amccool I agree; I wonder if it can be implemented by a sampler.
Also, an option to always start new root traces for certain grain methods would be helpful, especially for IRemindable.ReceiveReminder; otherwise periodic calls of long-running grains may produce super long traces.
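
One way the sampler idea could be sketched with the built-in ActivityListener is shown below. The "System." prefix and both activity names are invented for illustration; how Orleans actually names activities for system grains is not stated in this thread.

```csharp
#nullable enable
using System;
using System.Diagnostics;

// Hypothetical naming convention for activities created by system grains.
const string SystemPrefix = "System.";

using var source = new ActivitySource("Example.Sampling");
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Example.Sampling",
    // Drop activities that match the prefix, collect data for everything else.
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        options.Name.StartsWith(SystemPrefix, StringComparison.Ordinal)
            ? ActivitySamplingResult.None
            : ActivitySamplingResult.AllData,
};
ActivitySource.AddActivityListener(listener);

using Activity? systemCall = source.StartActivity("System.MembershipProbe");
using Activity? userCall = source.StartActivity("UserGrain.SayHello");

Console.WriteLine(systemCall is null ? "system call: dropped" : "system call: traced");
Console.WriteLine(userCall is null ? "user call: dropped" : "user call: traced");
```

This only covers filtering; forcing a new root trace for specific methods would need separate handling.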

@suraciii
Contributor

The PR to add DistributedContextPropagator: #7443

Another choice is to manage activities alongside src/Orleans.Core/Diagnostics/MessagingTrace.cs, just like what aspnetcore did in https://github.com/dotnet/aspnetcore/blob/release/6.0/src/Hosting/Hosting/src/Internal/HostingApplicationDiagnostics.cs, so that activities, diagnostic events and metrics can be managed in one location.

@ReubenBond
Member

I've opened a follow-up PR which changes some of the details here. Please take a look if you're using this and/or are interested: #7647

@github-actions github-actions bot locked and limited conversation to collaborators Dec 2, 2023