Percolate Azure correlation IDs to REST API calls #1574

Merged
merged 1 commit into from
Aug 27, 2021

@arschles arschles commented Jul 30, 2021

/kind feature

What this PR does / why we need it:
This is a follow-on to #1460. In that PR, we added functionality to set an x-ms-correlation-id key on every context.Context returned by tele.Tracer().Start() when it starts a new trace. That change ensured that a correlation ID is created at the root of every reconciliation operation, but those correlation IDs did not escape the running process. This change ensures that, as the correlation IDs percolate to the REST API/autorest layer, they are picked up and sent over the wire to the Azure API.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1310

Special notes for your reviewer:

This is a follow-on PR to #1460, according to this comment. It should not be merged before that one. I am submitting this as a draft pull request until #1460 is merged.

@CecileRobertMichon @devigned there are 2 things that we've discussed that should be done after #1460:

  1. Percolate the correlation ID up to the Azure API (this PR)
  2. Ensure correlation IDs show up in logs

I'll submit the second in a separate PR

TODOs:

  • squashed commits
  • includes documentation (@devigned @CecileRobertMichon are there any places that the correlation ID needs to be documented?)
  • adds unit tests

Release note:

Sending x-ms-correlation-request-id values to the Azure API to correlate HTTP requests with trace spans

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/provider/azure Issues or PRs related to azure provider labels Jul 30, 2021
@k8s-ci-robot k8s-ci-robot requested review from devigned and shysank July 30, 2021 22:22
@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 30, 2021
@k8s-ci-robot

Hi @arschles. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 30, 2021
// Wrap the original Sender on the autorest.Client c.
// The wrapped Sender should set the x-ms-correlation-id on the given
// request, then pass the new request to the underlying Sender.
c.Sender = autorest.DecorateSender(c.Sender, msCorrelationIDSendDecorator)

@arschles arschles Jul 30, 2021

@devigned related to our (offline) discussion today - you pointed out that autorest client creation code is centralized around this function. This change just wraps (wraps! 😆) the raw client's sender in code that extracts the correlation ID from the request context, puts it into the header, and then calls through to the underlying sender. I think this does the trick. See below for a related comment on tests.

@@ -235,3 +240,51 @@ func TestGetDefaultUbuntuImage(t *testing.T) {
})
}
}

func TestMSCorrelationIDSendDecorator(t *testing.T) {
@arschles

@CecileRobertMichon @devigned would you prefer that I include a test here for SetAutoRestClientDefaults rather than just msCorrelationIDSendDecorator? I didn't want to blow up the scope too much, but I'm happy to go a bit bigger in tests here if you'd prefer.

Contributor

This is fine with me. Ideally we would add both, but for this specific functionality a specific test is good enough.

@arschles arschles force-pushed the corr-id-percolate branch from 5795c49 to f5d003c Compare August 2, 2021 18:43
@arschles arschles marked this pull request as ready for review August 2, 2021 20:59
@arschles arschles force-pushed the corr-id-percolate branch from f5d003c to 71d2e59 Compare August 2, 2021 21:02
@arschles arschles changed the title [WIP] Percolate Azure correlation IDs to REST API calls Percolate Azure correlation IDs to REST API calls Aug 2, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 2, 2021
@arschles arschles force-pushed the corr-id-percolate branch from a6a5671 to a07f5b1 Compare August 2, 2021 21:12
@CecileRobertMichon

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 2, 2021

@CecileRobertMichon CecileRobertMichon left a comment

/assign @mboersma

@CecileRobertMichon

@arschles sorry for the delay getting back to you on this. The PR looks good to me, are you planning to make the required changes to all controllers in the PR? I've asked @mboersma for a review as well.

The tests are failing because of lint:

util/tele/corr_id.go:30:63: Comment should end in a period (godot)
// context.Contexts, HTTP headers, and other similar locations

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 17, 2021
@arschles

> @arschles sorry for the delay getting back to you on this.

@CecileRobertMichon not a problem.

> The PR looks good to me, are you planning to make the required changes to all controllers in the PR?

Great to hear. The changes in those controllers were unnecessary, so I've removed them in 9a04ee4 (I'll squash later so you can see that in isolation for now). My apologies for forgetting to take those changes out.

> I've asked @mboersma for a review as well.

Great, and howdy @mboersma!

> The tests are failing because of lint:

Fixed in 86ab113

@devigned

/retest

@devigned devigned left a comment

The PR lgtm. @arschles, let's pair for a minute to verify the correlation ID is being propagated and collected in ARM.

@mboersma mboersma left a comment

Looks good.

I'll try this in tilt locally to see if I can spot the Correlation ID being propagated.

@CecileRobertMichon

@mboersma @devigned is this on hold or ready to go?

@mboersma

mboersma commented Aug 24, 2021

The observability "stack" continues to work with this change, but I can't tell by looking at the traces (in Jaeger or in App Insights) whether x-ms-correlation-id is being set for SDK requests, because we don't capture all the HTTP headers in the traces. There aren't ARM deployments to check AFAICT, so I'm not sure how to validate this.

Deferring to @devigned...

@arschles

Thanks @mboersma! @devigned is there a different way you'd like me to go about testing this?

@devigned

We need to validate that we can find a correlation ID in ARM logs. @arschles let's verify and then I'm good.

@arschles

@devigned k, I'll contact you offline and we can set up a time to do this verification 😄

@devigned devigned left a comment

Overall looks solid. However, upon verifying the correlation ID in ARM logs, I was unable to find the correlated HTTP requests. After changing the HTTP header key, I was able to find the correlated HTTP request.

I think it would be important to also add the correlation ID to the span to make it easier for users to report correlated Azure requests via a trace attribute. This will make it much easier for a support rep to debug what is happening on the Azure side of things.

util/tele/corr_id.go (outdated, resolved)
@@ -36,6 +36,7 @@ func (t tracer) Start(
opts ...trace.SpanOption,
) (context.Context, trace.Span) {
ctx, _ = ctxWithCorrID(ctx)
opts = append(opts, trace.WithSpanKind(trace.SpanKindClient))

Please consider adding the correlation ID to the span as well. That way, the traces will also have an ARM correlation ID available.

@devigned devigned left a comment

/lgtm

pending tests for approval

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 27, 2021
@devigned

@CecileRobertMichon do you have any comments? If not, I've verified the functionality and lgtm.

@CecileRobertMichon

/lgtm
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 27, 2021
@k8s-ci-robot k8s-ci-robot merged commit feaeb16 into kubernetes-sigs:main Aug 27, 2021
@k8s-ci-robot k8s-ci-robot added this to the v0.5 milestone Aug 27, 2021
@arschles arschles deleted the corr-id-percolate branch August 27, 2021 20:44
@arschles

@CecileRobertMichon @devigned @mboersma thank you all for your help on this PR!

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Azure API requests through the SDK should use a shared correlation ID for a given reconcile loop
5 participants