Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(controller): multiple TrafficRoutingReconciler #1472

Merged
merged 20 commits into from
Oct 26, 2021

Conversation

fblgit
Copy link
Contributor

@fblgit fblgit commented Sep 4, 2021

Current Scenario

You can add multiple trafficRouting on rollouts spec. Unfortunately, only the first is used (alphabetically sorted). example:

    canary:
      canaryService: rollouts-demo-canary
      stableService: rollouts-demo-stable
      trafficRouting:
        smi: {}
        nginx:
          stableIngress: rollouts-demo-stable

In that case, only nginx will be used and smi will be ignored.

Feature multiple TrafficRoutingReconciler

With the same spec:

    canary:
      canaryService: rollouts-demo-canary
      stableService: rollouts-demo-stable
      trafficRouting:
        smi: {}
        nginx:
          stableIngress: rollouts-demo-stable

In this case, both trafficRouting definitions are used and they are kept in sync as well.

INFO[2021-09-04T14:01:19Z] Found 2 TrafficRouting Reconciliers           namespace=default rollout=rollouts-demo
INFO[2021-09-04T14:01:19Z] Reconciling TrafficRouting with type 'Nginx'  namespace=default rollout=rollouts-demo
INFO[2021-09-04T14:01:19Z] syncing service                               namespace=default rollout=rollouts-demo service=rollouts-demo-canary
INFO[2021-09-04T14:01:19Z] updating canary Ingress                       desiredWeight=5 ingress=rollouts-demo-rollouts-demo-stable-canary namespace=default rollout=rollouts-demo
INFO[2021-09-04T14:01:19Z] Event(v1.ObjectReference{Kind:"Rollout", Namespace:"default", Name:"rollouts-demo", UID:"02f3c5eb-18c1-41bb-ae05-bc80150c06b3", APIVersion:"argoproj.io/v1alpha1", ResourceVersion:"58651", FieldPath:""}): type: 'Normal' reason: 'PatchingCanaryIngress' Updating Ingress `rollouts-demo-rollouts-demo-stable-canary` to desiredWeight '5'
INFO[2021-09-04T14:01:19Z] Updating Ingress `rollouts-demo-rollouts-demo-stable-canary` to desiredWeight '5'  event_reason=PatchingCanaryIngress namespace=default rollout=rollouts-demo
INFO[2021-09-04T14:01:19Z] Desired weight (stepIdx: 0) 5 verified        namespace=default rollout=rollouts-demo
INFO[2021-09-04T14:01:19Z] Reconciling TrafficRouting with type 'SMI'    namespace=default rollout=rollouts-demo
INFO[2021-09-04T14:01:19Z] TrafficSplit `rollouts-demo` modified         event_reason=TrafficSplitModified namespace=default rollout=rollouts-demo
INFO[2021-09-04T14:01:19Z] Desired weight (stepIdx: 0) 5 verified        namespace=default rollout=rollouts-demo
INFO[2021-09-04T14:01:19Z] Event(v1.ObjectReference{Kind:"Rollout", Namespace:"default", Name:"rollouts-demo", UID:"02f3c5eb-18c1-41bb-ae05-bc80150c06b3", APIVersion:"argoproj.io/v1alpha1", ResourceVersion:"58651", FieldPath:""}): type: 'Normal' reason: 'TrafficSplitModified' TrafficSplit `rollouts-demo` modified
INFO[2021-09-04T14:01:19Z] Rollout step 1/7 completed (setWeight: 5)     event_reason=RolloutStepCompleted namespace=default rollout=rollouts-demo

This is performed with the minimum impact by just mutating the structure into a list that is processed one by one sequentially.

Scenarios

A) existing fleet that can't transition from nginx to VirtualGateways or similars
B) adding sidecars to large throughput nginx controllers is not viable or can have a negative impact
C) adoption of full N-S and W-E canary with zero-downtime on large fleets
D) nginx-service-mesh (nginxinc) ++ ingress-nginx (kubernetes) doesn't work so well together
E) nginx service mesh N/S W/E without using nginxinc ingress.
F) no host header annotations to circumvent routing issues when adding nginx-ingress to a mesh

Despite facing both A+B+C+D+E+F Scenarios, there are a few thoughts behind this:

  • Why have to choose between SMI or Nginx?
  • Why change 3000 ingresses into VirtualGateways or similar?
  • Why adding sidecars to the ingress if the annotations should work fine ?
  • Why can accomplish either N/S or W/E traffic shift with minimal cost/impact ?
  • Why multiple providers wouldn't be able to work together?

This will be the first to allow the use of multiple traffic-shift mechanisms among the other existing solutions that enables features such as canary around :)

This PR enables the requested feature #514 that was gonna be part of 1.1 but ultimately was removed.

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

@xavipanda
Copy link

We face similar scenarios, we built a custom controller for this to sync these two..

Hi @jessesuen
This was part of the milestone for argo-rollouts till not long ago, and now this contribution is at the door.
Do you have some time to review this please ? we would love to get rid of that controller and see rollouts taking care of both SMI and NGINX.

Appreciate your efforts! Thank you

@harikrongali
Copy link
Contributor

@xavipanda
we will take a look at it and let you know

@harikrongali harikrongali added this to the v1.1 milestone Sep 13, 2021
@codecov
Copy link

codecov bot commented Sep 13, 2021

Codecov Report

Merging #1472 (a0bfa56) into master (26433be) will decrease coverage by 0.19%.
The diff coverage is 61.86%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1472      +/-   ##
==========================================
- Coverage   81.77%   81.57%   -0.20%     
==========================================
  Files         112      112              
  Lines       15071    15142      +71     
==========================================
+ Hits        12324    12352      +28     
- Misses       2103     2134      +31     
- Partials      644      656      +12     
Impacted Files Coverage Δ
rollout/controller.go 74.66% <ø> (-0.63%) ⬇️
rollout/trafficrouting.go 68.61% <61.86%> (-9.57%) ⬇️
rollout/canary.go 74.49% <0.00%> (-2.03%) ⬇️
metricproviders/datadog/datadog.go 77.62% <0.00%> (-1.21%) ⬇️
rollout/trafficrouting/istio/istio.go 80.56% <0.00%> (-0.40%) ⬇️
pkg/kubectl-argo-rollouts/cmd/get/get_rollout.go 98.49% <0.00%> (+0.01%) ⬆️
pkg/kubectl-argo-rollouts/cmd/status/status.go 93.33% <0.00%> (+0.09%) ⬆️
utils/istio/istio.go 95.16% <0.00%> (+0.16%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 26433be...a0bfa56. Read the comment docs.

@harikrongali
Copy link
Contributor

@fblgit can you fix lint errors?

@harikrongali
Copy link
Contributor

@khhirani can you review the PR?

Copy link
Contributor

@khhirani khhirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fblgit Can you please create new unit tests to verify that your change works? Right now your PR isn't meeting our code coverage requirement.

Please add some new E2E tests as well.

@fblgit
Copy link
Contributor Author

fblgit commented Sep 14, 2021

sure, let me work on it :) thanks

@fblgit
Copy link
Contributor Author

fblgit commented Sep 19, 2021

Hi @khhirani

e2e-tests:

  • Added a new test scenario, SMI + Ingress: make test-e2e E2E_TEST_OPTIONS="-testify.m ^TestSMIIngressCanaryStep$"

Note: Added a few func's to get Ingress data. Let me know if u want a decoupled e2e just for the nginx, should be quickly doable.

lint:
Fixed sync.Mutex issue, now looking ok.

# make lint
go mod tidy
go mod vendor
golangci-lint run --fix
# 

unit-tests:

  • added a unit-test for 3 reconcilers (alb + nginx + smi)
  • added a unit-test for 2 reconcilers (nginx + smi)
  • existing tests (networkReconcilers and trafficReconcilers) has been adapted to the new []type

coverage:

  • should look better now, please run the workflow when u have some time 👍

Hope next time I can nail the PR it at the first attempt, sorry for the trouble. Thank you

@fblgit fblgit requested a review from khhirani September 19, 2021 18:07
@harikrongali
Copy link
Contributor

@fblgit Thanks for the contribution. Can you fix the indentation issue?
Also can you add documentation as will with different scenarios and use cases. (provide some sample examples so that users can try).

@fblgit
Copy link
Contributor Author

fblgit commented Sep 20, 2021

@harikrongali
Happy to contribute to this great project!

Regarding your comments:

indent:

  • it's a Github diff visual issue, please open the file in your IDE or View File in GH and you won't see any indent issue. check L97, there is a for that is required due to the change of []type

documentation:

  • I did my best on this, but keep in mind I'm not an English native speaker so It would be very encouraging that someone looks at it and help on this element.
  • clarified this will be available from v1.1 (as shown in the milestone for this PR)

example:

  • Added a whole scenario, SMI+Ingress within the getting-started guide (SMI + Nginx)

Again thanks for your time, let me know if anything else is needed.

@fblgit fblgit requested a review from harikrongali September 20, 2021 16:42
@fblgit
Copy link
Contributor Author

fblgit commented Sep 20, 2021

quite strange...

 FAIL: TestFunctionalSuite/TestRolloutAbortRetryPromote (6.41s)

But when I run this one locally:

=== RUN   TestFunctionalSuite/TestRolloutAbortRetryPromote
time="2021-09-20T16:49:43Z" level=info msg="Deleting e2e-test-name=TestRolloutAbortRetryPromote"
W0920 16:49:43.915482  166675 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W0920 16:49:43.917669  166675 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
time="2021-09-20T16:49:43Z" level=info msg="Event watcher started"
time="2021-09-20T16:49:44Z" level=info msg="rollout.argoproj.io/abort-retry-promote created\n" rollout=abort-retry-promote
time="2021-09-20T16:49:44Z" level=info msg="Waiting for condition: status=Healthy" rollout=abort-retry-promote
time="2021-09-20T16:49:46Z" level=info msg="Condition 'status=Healthy' met after 2s" rollout=abort-retry-promote
time="2021-09-20T16:49:46Z" level=info msg="Updated rollout pod spec: 2021-09-20T16:49:46.211363209Z" rollout=abort-retry-promote
time="2021-09-20T16:49:46Z" level=info msg="Waiting for condition: status=Paused" rollout=abort-retry-promote
time="2021-09-20T16:49:47Z" level=info msg="Condition 'status=Paused' met after 1s" rollout=abort-retry-promote
time="2021-09-20T16:49:47Z" level=info msg="Pod expectation 'revision:1 pod count == 1' met" rollout=abort-retry-promote
time="2021-09-20T16:49:47Z" level=info msg="Pod expectation 'revision:2 pod count == 1' met" rollout=abort-retry-promote
time="2021-09-20T16:49:47Z" level=info msg="Aborted rollout" rollout=abort-retry-promote
time="2021-09-20T16:49:47Z" level=info msg="Waiting for condition: status=Degraded" rollout=abort-retry-promote
time="2021-09-20T16:49:47Z" level=info msg="Condition 'status=Degraded' met after 0s" rollout=abort-retry-promote
time="2021-09-20T16:49:47Z" level=info msg="Retried rollout" rollout=abort-retry-promote
time="2021-09-20T16:49:47Z" level=info msg="Waiting for condition: status=Paused" rollout=abort-retry-promote
time="2021-09-20T16:49:48Z" level=info msg="Condition 'status=Paused' met after 0s" rollout=abort-retry-promote
time="2021-09-20T16:49:48Z" level=info msg="Pod expectation 'revision:1 pod count == 1' met" rollout=abort-retry-promote
time="2021-09-20T16:49:48Z" level=info msg="Pod expectation 'revision:2 pod count == 1' met" rollout=abort-retry-promote
time="2021-09-20T16:49:48Z" level=info msg="Waiting for condition: status=Healthy" rollout=abort-retry-promote
time="2021-09-20T16:49:51Z" level=info msg="Condition 'status=Healthy' met after 2s" rollout=abort-retry-promote
time="2021-09-20T16:49:51Z" level=info msg="Pod expectation 'revision:1 pod count == 0' met" rollout=abort-retry-promote
time="2021-09-20T16:49:51Z" level=info msg="Pod expectation 'revision:2 pod count == 1' met" rollout=abort-retry-promote
time="2021-09-20T16:49:51Z" level=info msg="Deleting e2e-test-name=TestRolloutAbortRetryPromote"
time="2021-09-20T16:49:51Z" level=info msg="Event watcher stopped" rollout=abort-retry-promote
W0920 16:49:51.153409  166675 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W0920 16:49:51.155028  166675 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
time="2021-09-20T16:49:51Z" level=info msg="Deleting e2e-test-name"
W0920 16:49:51.194590  166675 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W0920 16:49:51.196061  166675 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
--- PASS: TestFunctionalSuite (7.69s)
    --- PASS: TestFunctionalSuite/TestRolloutAbortRetryPromote (7.29s)

I believe the problem is unrelated to the PR itself, can someone confirm? thank you

Copy link
Member

@huikang huikang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

docs/features/traffic-management/mixed.md Outdated Show resolved Hide resolved
Signed-off-by: fblgit <[email protected]>
@fblgit
Copy link
Contributor Author

fblgit commented Sep 23, 2021

@khhirani / @harikrongali
Please can you tell me the merge order for this milestone ?
I'm fine to rebase it, resolve conflicts, retest, etc at once in order to avoid further conflicts. I will align this one with your provided timeline and merging order.

@jessesuen
Copy link
Member

Hi @fblgit ! Thank you for this contribution! Regarding “merge order”, theres no real order per-se. It's affected by release timing, or if/when the PR is ready. This past week, I had been prioritizing and only merging things which were required for the v1.1 release candidate, since we are running behind on v1.1. I actually didn't realize @harikrongali had targeted this for v1.1 otherwise I would have paid closer attention. Sorry about that.

Generally speaking, we try to avoid producing unnecessary work for external contributors if it can be avoided, and instead place the burden on project maintainers. This includes holding off on merging big sweeping changes (until impacted smaller PRs get in first), so that outside contributors don't have to deal with the conflicts. Other times we'll contribute to the PR itself to fix conflicts ourselves.

In any case, this PR now on my radar and based on feedback, appears close to complete. So if you rebase now, I think you won't have to deal with conflicts again.

@fblgit
Copy link
Contributor Author

fblgit commented Sep 23, 2021

Noted and thank you, will be ready before Monday.

Regards

@harikrongali
Copy link
Contributor

@fblgit It is from my end. Sorry for that, I wanted to get in as we have delayed release. But as there are review comments on changes, it didn't cut the 1.1 release. I should have updated it last week. We wanted to get in the changes as soon as possible after the review process is done to avoid merge conflicts.

@harikrongali harikrongali modified the milestones: v1.1, v1.2 Sep 23, 2021
@fblgit
Copy link
Contributor Author

fblgit commented Sep 24, 2021

@jessesuen ready to merge. tests and e2e looks good.

@fblgit fblgit force-pushed the rollouts_multitrafficreconciler branch from f87f8ab to a0bfa56 Compare October 6, 2021 15:33
@sonarcloud
Copy link

sonarcloud bot commented Oct 6, 2021

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
6.4% 6.4% Duplication

@harikrongali harikrongali added the ready-for-review Ready for final review label Oct 22, 2021
Copy link
Member

@jessesuen jessesuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One minor improvement.

Comment on lines +97 to +100
if reconcilers == nil {
// Not using traffic routing
c.newStatus.Canary.Weights = nil
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is dead code because if len(reconcilers) == 0 will cover the case where reconcilers == nil.

@jessesuen jessesuen merged commit a398ff9 into argoproj:master Oct 26, 2021
@jessesuen
Copy link
Member

Great improvement! Can fix the minor issue later.

@fblgit fblgit deleted the rollouts_multitrafficreconciler branch October 28, 2021 22:23
noam-codefresh pushed a commit to codefresh-io/argo-rollouts that referenced this pull request Feb 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-review Ready for final review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants