Adding correlation IDs to reconcile loops #1460

Merged (1 commit) on Aug 2, 2021

@arschles arschles commented Jun 18, 2021

Signed-off-by: Aaron Schlesinger [email protected]

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds shared correlation IDs to context.Contexts passed to reconcile loops, so that the API requests made to Azure get logged and tracked properly. That feature is especially useful for debugging.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1310

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Add 'x-ms-correlation-id' headers to all Azure API calls via distributed traces.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress, do-not-merge/release-note-label-needed, area/provider/azure, and sig/cluster-lifecycle labels Jun 18, 2021
@k8s-ci-robot k8s-ci-robot requested review from cpanato and devigned June 18, 2021 18:37
@k8s-ci-robot
Contributor

Welcome @arschles!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-azure 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-azure has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 18, 2021
@k8s-ci-robot
Contributor

Hi @arschles. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@arschles
Contributor Author

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature, cncf-cla: yes, and size/L labels Jun 18, 2021
@CecileRobertMichon
Contributor

/ok-to-test

Welcome @arschles!

@k8s-ci-robot k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Jun 18, 2021
Contributor

@devigned devigned left a comment

WDYT about changing the way that tele.Tracer.Start() works so that any time Start is called and the context does not contain a correlationID, one is generated and added to the value bag? This could be done on a best-effort basis, so the error could be ignored, or a default value substituted if a new UUID cannot be created.

Also, pkg/trace/corr.go requires a file header comment. You will see them littered through the other code files. If you run make test, it will generate files, lint, and run the local tests.

@arschles
Contributor Author

@devigned I think that's a great idea! I'm working on both the tele.Tracer.Start() change and fixing build issues now.

@k8s-ci-robot k8s-ci-robot added the size/XL and release-note-none labels and removed the size/L and do-not-merge/release-note-label-needed labels Jun 22, 2021
@arschles arschles marked this pull request as ready for review June 22, 2021 21:46
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 22, 2021
@arschles arschles changed the title from "Adding correlation IDs to reconcile loops (WIP)" to "Adding correlation IDs to reconcile loops" Jun 22, 2021
@arschles
Contributor Author

/retest

// }
// defer newSpan.End()
// doSomething(ctx)
func StartSpan(
Contributor

WDYT of something like this:

In ./util/tele.go, we wrap the otel.Tracer with our own struct that implements trace.Tracer. We then write our own Start(...) func which would then delegate back to otel.Tracer().Start(...). That way the interface can stay the same.

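package tele

import (
    "context"

    "go.opentelemetry.io/otel/trace"
)

// tracer wraps an OpenTelemetry tracer so that Start can add a correlationID.
type tracer struct {
    trace.Tracer
}

func (t *tracer) Start(ctx context.Context, opName string, opt ...trace.SpanStartOption) (context.Context, trace.Span) {
    // add correlationID to ctx and continue as normal
    // return the inner implementation of Start
    return t.Tracer.Start(ctx, opName, opt...)
}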

Then either in the same package or elsewhere, add context helpers to set / fetch the correlationID.

package corr  // or something else...

// FromCtx or some better name you think of
func FromCtx(ctx context.Context) (context.Context, string) {
    // if ctx has no correlationID, create a new context.WithValue containing one and return both;
    // otherwise return ctx and the existing correlationID
}

Contributor Author

Sure, happy to do that! A few questions before I get started, just to make sure I'm on the same page:

  • I'm assuming that you'd want to still call tele.Tracer().Start(...) in all the reconcilers, correct?
  • Are you ok with creating our own tracer instance (the wrapper you mentioned) and then passing that -- instead of the default one -- to otel.SetTracerProvider in main.go?
  • Do we need FromCtx to be exported at all? It seems like it would only be used inside the same package as the custom tracer, but not sure if you're thinking something more

Contributor

I'm assuming that you'd want to still call tele.Tracer().Start(...) in all the reconcilers, correct?

If possible, that would be great.

Are you ok with creating our own tracer instance (the wrapper you mentioned) and then passing that -- instead of the default one -- to otel.SetTracerProvider in main.go?

💯

Do we need FromCtx to be exported at all? It seems like it would only be used inside the same package as the custom tracer, but not sure if you're thinking something more

I was adding that because I saw you were returning the correlationID in your StartSpan func. This led me to think you wanted a way to access it from the context. I think one might need to access the correlationID from the context to be able to add that as a HTTP header for the Azure SDK for Go clients.

Contributor Author

perfect - I'll make these changes.

regarding FromCtx - just in case someone wants to access the correlation ID, I'll make FromCtx exported like you said.

Contributor Author

@devigned just an update, so you know I haven't forgotten about this! I'm looking through the implementation & docs trying to figure out a way to create a new TracerProvider that we can pass to the SetTracerProvider function in main.go. That would need to create a new Tracer which wraps the implementation you referenced above. I think that's going to require a lot of plumbing, which we could make unnecessary if we called a custom function to do either of the following:

  • get a new tracer that wraps the underlying tracer implementation, like what you showed here
  • create a new span, which is what's already in this PR

I'll keep digging, but let me know about going with either of the above options in case there isn't a good way to wrap everything we need

Contributor Author

Another option would be to create a custom TracerProvider that wraps the Jaeger one. Let me know your thoughts on that too.

Contributor

Sorry for the delay, @arschles. I was thinking something more like this: master...devigned:corr

The above diff should add the correlationID to the context without needing to manipulate the signature of the Tracer. wdyt?

Contributor Author

Hi @devigned I see what you mean now! Thanks, I'll implement that.

@k8s-ci-robot k8s-ci-robot added the needs-rebase label and removed the needs-rebase and size/XL labels Jul 9, 2021
@arschles
Contributor Author

@devigned got it, I've removed most of the formatting changes made earlier. A few remain, in which I moved the creation of cancellable contexts below the creation of the spans, so that the new contexts are derived from the contexts returned by the new spans. I do think it's valuable, but not critical, to do that so the cancellable context has the new correlation IDs in it. Let me know what you think of that.

Also, to address your and @CecileRobertMichon's point about percolating these correlation IDs up to the Azure API: I had originally planned on doing that in a follow-up PR when this PR was large, to draw a clear line between the change to add the IDs and the change to send them up to Azure. Now that this is smaller, would you prefer I do that all in this PR?

I'm happy to do it either way, whatever makes it easier for you all.

@CecileRobertMichon
Contributor

I had originally planned on doing that in a follow-up PR when this PR was large, to draw a clear line between the change to add the IDs and the change to send them up to Azure. now that this is smaller, would you prefer I do that all in this PR?

separate PR is fine with me, up to you

Contributor

@devigned devigned left a comment

/lgtm
/assign @CecileRobertMichon

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 30, 2021
@arschles
Contributor Author

Thanks @CecileRobertMichon - I'll submit a follow-up PR with those changes

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 30, 2021
@arschles arschles force-pushed the corr-id branch 2 times, most recently from 8fb96e3 to ac73cf6 Compare July 30, 2021 19:45
@devigned
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 30, 2021
defer span.End()
ctx, cancel := context.WithTimeout(ctx, reconciler.DefaultedLoopTimeout(ampr.ReconcileTimeout))
Contributor

why the change in order here but not in every controller? azuremachine and azurecluster still have the reverse order

Contributor Author

I moved this context creation down to below the creation of the span so that the timeout context includes the correlation ID value. IIRC the others had this change but I reverted them by accident. Would it make more sense to save all these changes for a separate PR?

Contributor

either way, as long as the changes are consistent (make the changes here to all the controllers or none of them)

Contributor Author

ok, going to take this change out then because it's not strictly necessary

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 30, 2021
Comment on lines +140 to +141
),
)
Contributor Author

@CecileRobertMichon are you ok with this and the other similar format fixes?

@devigned
Contributor

/retest

@CecileRobertMichon
Contributor

/lgtm
/approve
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 2, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 2, 2021
@arschles
Contributor Author

arschles commented Aug 2, 2021

/test pull-cluster-api-provider-azure-e2e

Labels
  • approved (Indicates a PR has been approved by an approver from all required OWNERS files.)
  • area/provider/azure (Issues or PRs related to azure provider)
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • kind/feature (Categorizes issue or PR as related to a new feature.)
  • lgtm ("Looks good to me", indicates that a PR is ready to be merged.)
  • ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
  • release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
  • sig/cluster-lifecycle (Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.)
  • size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)
Development

Successfully merging this pull request may close these issues.

Azure API requests through the SDK should use a shared correlation ID for a given reconcile loop
4 participants