docs: add troubleshooting examples debugging missing nodes #831
Conversation
@jackfrancis: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @jackfrancis. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/ok-to-test
docs/troubleshooting.md
Outdated
$ export CLUSTER_RESOURCE_GROUP=my-cluster-rg
$ export VM_PREFIX=my-cluster-md-0-
$ export KUBECONFIG=/Users/me/.kube/my-cluster.kubeconfig
$ for vm in $(az vm list -g $CLUSTER_RESOURCE_GROUP | jq -r --arg VM_PREFIX "${VM_PREFIX}" '.[] | select(.name | startswith($VM_PREFIX)).name'); do kubectl get node $vm >/dev/null 2>&1 && continue || echo node $vm did not join the cluster; done
this only works for VMs, not VMSS (i.e. not MachinePool), right?
Right. I'll add an equivalent example for VMSS.
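For reference, a VMSS equivalent could be sketched along these lines. The resource group and scale set names below are hypothetical placeholders, and the `--query` JMESPath assumes each instance's `osProfile.computerName` is what registers as the Kubernetes node name, which is the usual case:

```shell
# Hypothetical resource group / MachinePool scale set names -- substitute your own.
export CLUSTER_RESOURCE_GROUP=my-cluster-rg
export VMSS_NAME=my-cluster-mp-0

# List the computer name of every VMSS instance and flag any that never
# registered as a Kubernetes node.
for instance in $(az vmss list-instances -g "$CLUSTER_RESOURCE_GROUP" -n "$VMSS_NAME" \
    --query '[].osProfile.computerName' -o tsv); do
  kubectl get node "$instance" >/dev/null 2>&1 \
    || echo "node $instance did not join the cluster"
done
```

This mirrors the VM loop above but lets `az` do the name filtering, so `jq` isn't needed.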
wouldn't `kubectl get machines` / `kubectl get machinepools | grep -v Ready` have the same effect without needing the az CLI and calling out to Azure? In your case, did the machine show as Ready?
Sorry, `grep -v Running`
I honestly didn't look at the `machine` resource at all. This was how my brain worked:
- create cluster w/ desired node count
- wait for nodes to come online; after a while, noticed there was one missing node
- how many actual VMs are in my resource group? It's 20
- O.K., so which one didn't register as a node?
FWIW
Right, I think both are valid; you went to the RG as a first instinct because you're familiar with Azure. For users who aren't as comfortable with Azure-specific stuff, it'd be nice to document how all this can be done without needing to care about the underlying infrastructure... `kubectl get azuremachine` should show you the VM status, while `kubectl get machine` should show you the status of the machine from a k8s perspective.
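The kubectl-only flow suggested here might look like the sketch below. It assumes the standard CAPI/CAPZ resource kinds and that the machine phase appears in the default table output, as in recent releases:

```shell
# CAPZ's view of the Azure VMs backing each machine:
kubectl get azuremachines

# Cluster API's view of each machine; the PHASE column shows Running
# once the node has joined, so filtering it out leaves the stragglers:
kubectl get machines --no-headers | grep -v Running
```

Note that `grep` exits non-zero when every machine is Running, which is the happy path here.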
I think both perspectives are valid here. In reality you will probably need to look at both the CAPZ CRDs and the underlying infra. At least that's how I have approached it so far: I look at the CRDs for a quick understanding of what CAPZ thinks is going on, then I look at the Azure side to see what is happening. It would be worth adding a small section showing how the VMs in Azure relate to the CRDs of CAPZ.
/assign @mboersma @jsturtevant for additional review
Love the idea of giving use-case-driven debugging tips. My thoughts are if you use the CAPZ CRDs you could get rid of the az CLI dependency. I use the ssh and map tool all the time when debugging and would find something similar useful too. I had this problem of nodes coming online just yesterday 😄
@jackfrancis once #901 merges, consider changing these instructions to look at boot diagnostics from the portal or the Azure CLI (https://docs.microsoft.com/en-us/cli/azure/vm/boot-diagnostics?view=azure-cli-latest#az-vm-boot-diagnostics-get-boot-log)
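Using the CLI command linked above, pulling the boot log for the VM that never joined might look like this sketch; the VM name is a hypothetical placeholder:

```shell
export CLUSTER_RESOURCE_GROUP=my-cluster-rg
export VM_NAME=my-cluster-md-0-abc123   # hypothetical name of the missing VM

# Serial console output; cloud-init and kubelet bootstrap failures
# usually surface near the end of the log.
az vm boot-diagnostics get-boot-log \
  -g "$CLUSTER_RESOURCE_GROUP" -n "$VM_NAME" | tail -n 50
```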
@jackfrancis are you still planning on getting this one in?
@jackfrancis: The following test failed, say `/retest` to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@jackfrancis gentle nudge on this. We are happy to take over from current state if you would prefer.
Rewrote this with new options in #1232 /close
@CecileRobertMichon: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What this PR does / why we need it:
This PR adds some supporting troubleshooting documentation to get a user started debugging why a cluster node did not come online.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #
Special notes for your reviewer:
Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
TODOs:
Release note: