[donotmerge] debug aks e2e #1488

Closed
wants to merge 21 commits into from

Conversation

alexeldeib
Contributor

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:


@k8s-ci-robot
Contributor

@alexeldeib: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Jul 2, 2021
@k8s-ci-robot k8s-ci-robot requested review from cpanato and juan-lee July 2, 2021 22:17
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 2, 2021
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 2, 2021
@alexeldeib
Contributor Author

In that one the cluster reached provisioned, but both MPs seem to have gotten stuck. Added some debug logs and trying again.

What's weird is that they both seem to have been created in Azure: toward the end you can see the logs from the agentpools service where we diff the current/desired agentpool, and both pools are up to date. But we don't seem to reach the end of the reconcile, and yet I don't see any errors. Could be failing to call VMSS but somehow not logging.

@alexeldeib
Contributor Author

/hold

just for debug

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 2, 2021
@alexeldeib
Contributor Author

alexeldeib commented Jul 3, 2021

hmm. in that run, cluster failed to provision, but AMCP is fully populated/ready. checking the timestamps...

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/1488/pull-cluster-api-provider-azure-e2e-exp/1411107169369067520/artifacts/clusters/bootstrap/resources/capz-e2e-ni5r7m/AzureManagedControlPlane/capz-e2e-ni5r7m.yaml

looks like AMCP was "reconciling kubeconfig" within 5 min

@alexeldeib
Contributor Author

alexeldeib commented Jul 3, 2021

and then we randomly start failing on identity not present, AFTER amcp successfully reconciles?

failed to create scope: failed to retrieve AzureClusterIdentity external object "capz-e2e-ni5r7m"/"multi-tenancy-identity": AzureClusterIdentity.infrastructure.cluster.x-k8s.io "multi-tenancy-identity" not found
sigs.k8s.io/cluster-api-provider-azure/azure/scope.NewManagedControlPlaneCredentialsProvider
	/workspace/azure/scope/identity.go:118
sigs.k8s.io/cluster-api-provider-azure/azure/scope.NewManagedControlPlaneScope
	/workspace/azure/scope/managedcontrolplane.go:69
sigs.k8s.io/cluster-api-provider-azure/exp/controllers.(*AzureManagedControlPlaneReconciler).Reconcile
	/workspace/exp/controllers/azuremanagedcontrolplane_controller.go:157
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:298
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:253
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:214
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1371
failed to init credentials provider
sigs.k8s.io/cluster-api-provider-azure/azure/scope.NewManagedControlPlaneScope
	/workspace/azure/scope/managedcontrolplane.go:71
sigs.k8s.io/cluster-api-provider-azure/exp/controllers.(*AzureManagedControlPlaneReconciler).Reconcile
	/workspace/exp/controllers/azuremanagedcontrolplane_controller.go:157
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:298
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:253
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:214

interestingly, I do not see the AzureClusterIdentity in the output artifacts: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/1488/pull-cluster-api-provider-azure-e2e-exp/1411107169369067520/artifacts/clusters/bootstrap/resources/capz-e2e-ni5r7m/

@alexeldeib
Contributor Author

/test pull-cluster-api-provider-azure-e2e-exp

getting some more data

@alexeldeib
Contributor Author

alexeldeib commented Jul 3, 2021

so for the failures waiting for a control plane machine where it looks "hung", appears we get stuck looping here:

now to find why

update: from a successful run, this is the period during which no VMSS are created, so we're looping trying to find one. I don't see how that makes sense, though: if we are able to run pods, the VMs are clearly there. Something is weird with the amount of time between different events.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jul 3, 2021
@alexeldeib
Contributor Author

/test pull-cluster-api-provider-azure-e2e-exp

@alexeldeib
Contributor Author

in the last run, it looks like pods did come up, but we get no results listing scalesets and eventually time out :/

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/1488/pull-cluster-api-provider-azure-e2e-exp/1411150593765937152/artifacts/clusters/bootstrap/controllers/capz-controller-manager/capz-controller-manager-69d5ccc44d-2mgcg/manager.log

probably want to dump kubectl get nodes / equivalent, also adding nmi logs might be good. I don't understand why we get stuck in that loop for so long

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 8, 2021
@alexeldeib alexeldeib changed the title debug e2e [donotmerge] debug aks e2e Jul 8, 2021
@alexeldeib
Contributor Author

So it seems like we're really stuck in that loop, even though the nodes from both pools have already joined. So I'm trying to understand why we keep listing VMSS and getting zero results from Azure.

@alexeldeib
Contributor Author

/test pull-cluster-api-provider-azure-e2e-exp

@alexeldeib
Contributor Author

/test pull-cluster-api-provider-azure-e2e-exp

@alexeldeib
Contributor Author

/test pull-cluster-api-provider-azure-e2e-exp

@alexeldeib
Contributor Author

/test pull-cluster-api-provider-azure-e2e

This reverts commit 5498946.
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 13, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from alexeldeib after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 14, 2021
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 14, 2021
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 14, 2021
@k8s-ci-robot
Contributor

@alexeldeib: PR needs rebase.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 25, 2021
@CecileRobertMichon
Contributor

/close

@k8s-ci-robot
Contributor

@CecileRobertMichon: Closed this PR.

In response to this:

/close

Labels
  • area/provider/azure: Issues or PRs related to azure provider
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
  • do-not-merge/release-note-label-needed: Indicates that a PR should not merge because it's missing one of the release note labels.
  • needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • sig/cluster-lifecycle: Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
  • size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.
3 participants