
docs: add troubleshooting examples debugging missing nodes #831

Conversation

@jackfrancis (Contributor)

What this PR does / why we need it:

This PR adds some supporting troubleshooting documentation to get a user started debugging why a cluster node did not come online.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:


@k8s-ci-robot (Contributor)

@jackfrancis: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 27, 2020
@k8s-ci-robot (Contributor)

Hi @jackfrancis. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 27, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign fabriziopandini
You can assign the PR to them by writing /assign @fabriziopandini in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Jul 27, 2020
@CecileRobertMichon (Contributor)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 27, 2020
$ export CLUSTER_RESOURCE_GROUP=my-cluster-rg
$ export VM_PREFIX=my-cluster-md-0-
$ export KUBECONFIG=/Users/me/.kube/my-cluster.kubeconfig
$ for vm in $(az vm list -g $CLUSTER_RESOURCE_GROUP | jq -r --arg VM_PREFIX "${VM_PREFIX}" '.[] | select(.name | startswith($VM_PREFIX)).name'); do kubectl get node $vm >/dev/null 2>&1 || echo "node $vm did not join the cluster"; done
Contributor:

this only works for VMs, not VMSS (i.e. not MachinePool), right?

Contributor (Author):

Right. I'll add an equivalent example for VMSS.
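A VMSS-based equivalent might look roughly like the following sketch (the resource group and scale set names are hypothetical; it assumes the instance's `osProfile.computerName` is what registers as the node name):

```shell
# Hypothetical names -- substitute your own cluster's values.
export CLUSTER_RESOURCE_GROUP=my-cluster-rg
export VMSS_NAME=my-cluster-mp-0

# List every instance's computer name in the scale set, then check
# whether each one registered as a node in the cluster.
for instance in $(az vmss list-instances -g "$CLUSTER_RESOURCE_GROUP" -n "$VMSS_NAME" \
    --query "[].osProfile.computerName" -o tsv); do
  kubectl get node "$instance" >/dev/null 2>&1 || echo "node $instance did not join the cluster"
done
```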

Contributor:

wouldn't kubectl get machines / kubectl get machinepool | grep -v Ready have the same effect, without needing the az CLI and calling out to Azure? In your case, did the machine show as Ready?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, grep -v Running *
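As a sketch, that infra-free check might look like this (column layout is assumed to match typical `kubectl get` output, where phase appears as one of the columns):

```shell
# Show only machines that are not in the Running phase.
# --no-headers keeps the header row from surviving the grep.
kubectl get machines --all-namespaces --no-headers | grep -v Running

# Same idea for MachinePools:
kubectl get machinepools --all-namespaces --no-headers | grep -v Running
```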

Contributor (Author):

I honestly didn't look at the machine resource at all. This was how my brain worked:

  1. create cluster w/ desired node count
  2. wait for nodes to come online, after a while noticed that there was one missing node
  3. how many actual VMs are in my resource group? it's 20
  4. O.K., so which one didn't register as a node?

FWIW

Contributor:

Right, I think both are valid; you went to the RG as a first instinct because you're familiar with Azure. For users who aren't as comfortable with Azure-specific stuff, it'd be nice to document how all of this can be done without needing to care about the underlying infrastructure... kubectl get azuremachine should show you the VM status, while kubectl get machine should show you the status of the machine from a Kubernetes perspective.
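A minimal sketch of those two views (no Azure credentials needed, only cluster access):

```shell
# Kubernetes-side view: Machine phase as Cluster API sees it.
kubectl get machines

# Azure-side view: AzureMachine state as CAPZ sees it; print just the
# first column (resource name) for a quick inventory.
kubectl get azuremachines --no-headers | awk '{print $1}'
```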

Contributor:

I think both perspectives are valid here. In reality you will probably need to look at both the CAPZ CRDs and the underlying infra. At least that's how I have approached it so far: I look at the CRDs for a quick understanding of what CAPZ thinks is going on, then I look at the Azure side to see what is actually happening. It would be worth adding a small section on how the VMs in Azure relate to the CRDs of CAPZ.
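One way to correlate the two is via spec.providerID, assuming the usual format the provider writes (azure:///subscriptions/.../virtualMachines/<vm-name>); the custom-columns spec below is a sketch:

```shell
# Print each Machine alongside the Azure resource it maps to.
kubectl get machines -o custom-columns='MACHINE:.metadata.name,PROVIDER_ID:.spec.providerID'

# The VM name is the last path segment of the providerID, so a quick
# extraction looks like:
echo "azure:///subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/my-vm-0" \
  | awk -F/ '{print $NF}'
# -> my-vm-0
```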

@CecileRobertMichon (Contributor)

/assign @mboersma @jsturtevant for additional review

@jsturtevant (Contributor)

Love the idea of giving use-case-driven debugging tips.

My thought is that if you use the CAPZ CRDs you could get rid of the export CLUSTER_RESOURCE_GROUP=my-cluster-rg and export VM_PREFIX=my-cluster-md-0- steps, which could then be automated into a tool similar to what is in https://github.com/kubernetes-sigs/cluster-api-provider-azure/tree/master/hack/debugging.

I use the ssh and map tools all the time when debugging and would find something similar useful too. I had this problem of nodes coming online just yesterday 😄

@CecileRobertMichon (Contributor)

@jackfrancis once #901 merges, consider changing these instructions to look at boot diagnostics from the portal or the Azure CLI (https://docs.microsoft.com/en-us/cli/azure/vm/boot-diagnostics?view=azure-cli-latest#az-vm-boot-diagnostics-get-boot-log)
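Once boot diagnostics are enabled, fetching the serial log from the CLI looks roughly like this (the VM and resource group names are hypothetical):

```shell
# Pull the boot log for a VM that never registered as a node.
az vm boot-diagnostics get-boot-log \
  --name my-cluster-md-0-xxxxx \
  --resource-group my-cluster-rg

# Grepping the log for cloud-init or kubelet failures narrows things down:
az vm boot-diagnostics get-boot-log -n my-cluster-md-0-xxxxx -g my-cluster-rg \
  | grep -iE 'error|fail'
```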

@CecileRobertMichon (Contributor)

@jackfrancis are you still planning on getting this one in?

@k8s-ci-robot (Contributor)

@jackfrancis: The following test failed, say /retest to rerun all failed tests:

Test name: pull-cluster-api-provider-azure-e2e-windows
Commit: b819443
Rerun command: /test pull-cluster-api-provider-azure-e2e-windows

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@devigned (Contributor)

@jackfrancis gentle nudge on this. We are happy to take over from current state if you would prefer.

@CecileRobertMichon (Contributor)

Rewrote this with new options in #1232

/close

@k8s-ci-robot (Contributor)

@CecileRobertMichon: Closed this PR.


@jackfrancis jackfrancis deleted the docs-troubleshooting-missing-nodes branch December 9, 2022 23:04
Labels
  • area/provider/azure Issues or PRs related to azure provider
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels.
  • ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
  • sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
  • size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

5 participants