
docs: add troubleshooting examples debugging missing nodes #831

Conversation

@jackfrancis (Contributor)

What this PR does / why we need it:

This PR adds some supporting troubleshooting documentation to get a user started debugging why a cluster node did not come online.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:


@k8s-ci-robot (Contributor)

@jackfrancis: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 27, 2020
@k8s-ci-robot (Contributor)

Hi @jackfrancis. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 27, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign fabriziopandini
You can assign the PR to them by writing /assign @fabriziopandini in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Jul 27, 2020
@CecileRobertMichon (Contributor)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 27, 2020
$ export CLUSTER_RESOURCE_GROUP=my-cluster-rg
$ export VM_PREFIX=my-cluster-md-0-
$ export KUBECONFIG=/Users/me/.kube/my-cluster.kubeconfig
$ for vm in $(az vm list -g $CLUSTER_RESOURCE_GROUP | jq -r --arg VM_PREFIX "${VM_PREFIX}" '.[] | select(.name | startswith($VM_PREFIX)).name'); do kubectl get node $vm >/dev/null 2>&1 || echo "node $vm did not join the cluster"; done
Contributor:

this only works for VMs, not VMSS (i.e. not MachinePool), right?

Contributor (Author):

Right. I'll add an equivalent example for VMSS.
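A VMSS-based equivalent might look roughly like the following sketch (the resource group and scale set names are hypothetical; it assumes the instance's `osProfile.computerName` is what registers as the node name):

```shell
# Hypothetical names -- substitute your own cluster's values.
export CLUSTER_RESOURCE_GROUP=my-cluster-rg
export VMSS_NAME=my-cluster-mp-0

# List every instance's computer name in the scale set, then check
# whether each one registered as a node in the cluster.
for instance in $(az vmss list-instances -g "$CLUSTER_RESOURCE_GROUP" -n "$VMSS_NAME" \
    --query "[].osProfile.computerName" -o tsv); do
  kubectl get node "$instance" >/dev/null 2>&1 || echo "node $instance did not join the cluster"
done
```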

Contributor:

wouldn't kubectl get machines / kubectl get machinepool | grep -v Ready have the same effect, without needing the az CLI and calling out to Azure? In your case, did the machine show as Ready?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, grep -v Running *
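As a sketch, that infra-free check might look like this (column layout is assumed to match typical `kubectl get` output, where phase appears as one of the columns):

```shell
# Show only machines that are not in the Running phase.
# --no-headers keeps the header row from surviving the grep.
kubectl get machines --all-namespaces --no-headers | grep -v Running

# Same idea for MachinePools:
kubectl get machinepools --all-namespaces --no-headers | grep -v Running
```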

Contributor (Author):

I honestly didn't look at the machine resource at all. This was how my brain worked:

  1. create cluster w/ desired node count
  2. wait for nodes to come online, after a while noticed that there was one missing node
  3. how many actual VMs are in my resource group? it's 20
  4. O.K., so which one didn't register as a node?

FWIW

Contributor:

Right, I think both are valid; you went to the RG as a first instinct because you're familiar with Azure. For users who aren't as comfortable with Azure-specific stuff, it'd be nice to document how all of this can be done without needing to care about the underlying infrastructure... kubectl get azuremachine should show you the VM status, while kubectl get machine should show you the status of the machine from a Kubernetes perspective.
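A minimal sketch of those two views (no Azure credentials needed, only cluster access):

```shell
# Kubernetes-side view: Machine phase as Cluster API sees it.
kubectl get machines

# Azure-side view: AzureMachine state as CAPZ sees it; print just the
# first column (resource name) for a quick inventory.
kubectl get azuremachines --no-headers | awk '{print $1}'
```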

Contributor:

I think both perspectives are valid here. In reality you will probably need to look at both the CAPZ CRDs and the underlying infra. At least that's how I have approached it so far: I look at the CRDs for a quick understanding of what CAPZ thinks is going on, then I look at the Azure side to see what is actually happening. It would be worth adding a small section on how the VMs in Azure relate to the CRDs of CAPZ.
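One way to correlate the two is via spec.providerID, assuming the usual format the provider writes (azure:///subscriptions/.../virtualMachines/<vm-name>); the custom-columns spec below is a sketch:

```shell
# Print each Machine alongside the Azure resource it maps to.
kubectl get machines -o custom-columns='MACHINE:.metadata.name,PROVIDER_ID:.spec.providerID'

# The VM name is the last path segment of the providerID, so a quick
# extraction looks like:
echo "azure:///subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/my-vm-0" \
  | awk -F/ '{print $NF}'
# -> my-vm-0
```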

@CecileRobertMichon (Contributor)

/assign @mboersma @jsturtevant for additional review

@jsturtevant (Contributor)

Love the idea of giving use-case-driven debugging tips.

My thought is that if you use the CAPZ CRDs you could get rid of the export CLUSTER_RESOURCE_GROUP=my-cluster-rg and export VM_PREFIX=my-cluster-md-0- steps, which could then be automated into a tool similar to what is in https://github.com/kubernetes-sigs/cluster-api-provider-azure/tree/master/hack/debugging.

I use the ssh and map tools all the time when debugging and would find something similar useful too. I had this problem of nodes coming online just yesterday 😄

@CecileRobertMichon (Contributor)

@jackfrancis once #901 merges, consider changing these instructions to look at boot diagnostics from the portal or the Azure CLI (https://docs.microsoft.com/en-us/cli/azure/vm/boot-diagnostics?view=azure-cli-latest#az-vm-boot-diagnostics-get-boot-log)
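Once boot diagnostics are enabled, fetching the serial log from the CLI looks roughly like this (the VM and resource group names are hypothetical):

```shell
# Pull the boot log for a VM that never registered as a node.
az vm boot-diagnostics get-boot-log \
  --name my-cluster-md-0-xxxxx \
  --resource-group my-cluster-rg

# Grepping the log for cloud-init or kubelet failures narrows things down:
az vm boot-diagnostics get-boot-log -n my-cluster-md-0-xxxxx -g my-cluster-rg \
  | grep -iE 'error|fail'
```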

@CecileRobertMichon (Contributor)

@jackfrancis are you still planning on getting this one in?

@k8s-ci-robot (Contributor)

@jackfrancis: The following test failed, say /retest to rerun all failed tests:

Test name: pull-cluster-api-provider-azure-e2e-windows
Commit: b819443
Rerun command: /test pull-cluster-api-provider-azure-e2e-windows

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@devigned (Contributor)

@jackfrancis gentle nudge on this. We are happy to take over from current state if you would prefer.

@CecileRobertMichon (Contributor)

Rewrote this with new options in #1232

/close

@k8s-ci-robot (Contributor)

@CecileRobertMichon: Closed this PR.


@jackfrancis jackfrancis deleted the docs-troubleshooting-missing-nodes branch December 9, 2022 23:04
Labels
  • area/provider/azure Issues or PRs related to azure provider
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels.
  • ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
  • sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
  • size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

5 participants