Commit

Update troubleshooting docs
Cecile Robert-Michon committed Mar 19, 2021
1 parent b5110b1 commit 7d5b441
Showing 1 changed file with 127 additions and 25 deletions.
152 changes: 127 additions & 25 deletions docs/troubleshooting.md
@@ -1,57 +1,157 @@
# Troubleshooting Guide

Common issue users might have when using Cluster API Provider for Azure.
Common issues users might run into when using Cluster API Provider for Azure. This list is a work in progress. Feel free to open a PR to add to it if you find that useful information is missing.

## Debugging cluster creation
You will need to review the logs for the components of the control plane nodes and for the workload clusters. Start your investigation with reviewing the logs of the control plane then move onto the workload cluster that is created.
## Examples of troubleshooting real-world issues

## Review logs of control plane
While cluster buildout is running, you can follow the controller logs in a separate window like this:
### No Azure resources are getting created

This is likely due to missing or invalid Azure credentials.

Check the CAPZ controller logs on the management cluster:

```bash
kubectl get po -o wide --all-namespaces -w # Watch pod creation until azure-provider-controller-manager-0 is available
kubectl logs deploy/capz-controller-manager -n capz-system manager
```

kubectl logs -n capz-system azure-provider-controller-manager-0 manager -f # Follow the controller logs
If you see an error similar to this:

```
azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/123/providers/Microsoft.Compute/skus?%24filter=location+eq+%27eastus2%27&api-version=2019-04-01: StatusCode=401 -- Original Error: adal: Refresh request failed. Status Code = '401'. Response body: {\"error\":\"invalid_client\",\"error_description\":\"AADSTS7000215: Invalid client secret is provided.
```

An error such as the following in the manager could point to a mismatch between a current CAPI and an old CAPZ version:
Make sure the provided Service Principal client ID and client secret are correct and that the password has not expired.
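One way to confirm the credentials outside of CAPZ is to attempt a Service Principal login with the Azure CLI. The following is a minimal sketch, not part of CAPZ itself: the function name is hypothetical, and it assumes the Azure CLI is installed and that `AZURE_TENANT_ID`, `AZURE_CLIENT_ID` and `AZURE_CLIENT_SECRET` hold the same values that were given to CAPZ.

```bash
# Sketch: check whether Azure AD accepts the Service Principal credentials.
# Assumptions: the Azure CLI (az) is installed, and AZURE_TENANT_ID,
# AZURE_CLIENT_ID and AZURE_CLIENT_SECRET are set to the values used by CAPZ.
# Note: a successful login may change the identity used by your local az session.
check_service_principal() {
  if az login --service-principal \
      -u "$AZURE_CLIENT_ID" \
      -p "$AZURE_CLIENT_SECRET" \
      --tenant "$AZURE_TENANT_ID" >/dev/null; then
    echo "credentials accepted"
  else
    echo "login failed: check the client ID, the secret, and its expiry date" >&2
    return 1
  fi
}
```

Run `check_service_principal` after exporting the three variables. A failure here suggests the problem lies with the credentials rather than with the cluster configuration.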

### The AzureCluster infrastructure is provisioned but no virtual machines are coming up

Your Azure subscription might have no quota for the requested VM size in the specified Azure location.

Check the CAPZ controller logs on the management cluster:

```bash
kubectl logs deploy/capz-controller-manager -n capz-system manager
```

If you see an error similar to this:
```
E0320 23:33:33.288073 1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to create AzureMachine VM: failed to create nic capz-cluster-control-plane-7z8ng-nic for machine capz-cluster-control-plane-7z8ng: unable to determine NAT rule for control plane network interface: strconv.Atoi: parsing \"capz-cluster-control-plane-7z8ng\": invalid syntax" "controller"="azuremachine" "request"={"Namespace":"default","Name":"capz-cluster-control-plane-7z8ng"}
"error"="failed to reconcile AzureMachine: failed to create virtual machine: failed to create VM capz-md-0-qkg6m in resource group capz-fkl3tp: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=\u003cnil\u003e Code=\"OperationNotAllowed\" Message=\"Operation could not be completed as it results in exceeding approved standardDSv3Family Cores quota.
```

### Remoting to workload clusters
After the workload cluster is finished deploying you will have a kubeconfig in `./kubeconfig`.
Follow [these steps](https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/error-resource-quota) to resolve the quota error. Alternatively, you can specify another Azure location and/or VM size during cluster creation.
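When the controller logs are long, it can help to scan them for the error signatures shown in this guide. The sketch below is only an illustration: the function name and the category labels are hypothetical, and the patterns are taken from the sample errors above, so other failures will be reported as `unknown`.

```bash
# Sketch: map known CAPZ controller error signatures to a likely cause.
# Reads log text from stdin and prints one of: credentials, quota, unknown.
# The function name and labels are hypothetical; the patterns come from the
# sample errors in this guide and are not an exhaustive list.
classify_capz_error() {
  logs=$(cat)
  case "$logs" in
    *AADSTS7000215*|*invalid_client*) echo "credentials" ;;
    *OperationNotAllowed*quota*)      echo "quota" ;;
    *)                                echo "unknown" ;;
  esac
}
```

For example: `kubectl logs deploy/capz-controller-manager -n capz-system manager | classify_capz_error`.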

### A virtual machine is running but the k8s node did not join the cluster

Check the AzureMachine (or AzureMachinePool if using a MachinePool) status:
```bash
kubectl get azuremachines -o wide
```

If you see an output like this:

```
NAME READY STATE
default-template-md-0-w78jt false Updating
```

This indicates that the bootstrap script has not yet succeeded. Check the AzureMachine `status.conditions` field for more information.

[Take a look at the cloud-init logs](#checking-cloud-init-logs-ubuntu) for further debugging.
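To read the `status.conditions` field without scrolling through the full object, a JSONPath query can print one condition per line. This is a minimal sketch: the function name is hypothetical, the namespace defaults to `default` as an assumption, and the exact set of condition fields populated may vary between CAPZ versions.

```bash
# Sketch: print each condition of an AzureMachine as type, status, reason, message.
# $1 is the AzureMachine name; $2 is an optional namespace (assumed "default").
show_azuremachine_conditions() {
  kubectl get azuremachine "$1" -n "${2:-default}" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}
```

For example: `show_azuremachine_conditions default-template-md-0-w78jt`.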

### One or more control plane replicas are missing

Take a look at the KubeadmControlPlane controller logs and look for any potential errors:

```bash
kubectl logs deploy/capi-kubeadm-control-plane-controller-manager -n capi-kubeadm-control-plane-system manager
```

In addition, make sure all pods on the workload cluster are healthy, including pods in the `kube-system` namespace.

### Nodes are in NotReady state

Make sure you have installed a CNI on the workload cluster and that all the pods on the workload cluster are in the Running state.
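A quick way to check both points is to list the nodes together with any pods that are not running. The following is a sketch under assumptions: the function name is hypothetical, and the workload cluster kubeconfig is assumed to be at `./kubeconfig` unless a path is passed.

```bash
# Sketch: list nodes, then pods whose phase is neither Running nor Succeeded.
# $1 is the path to the workload cluster kubeconfig (assumed ./kubeconfig).
show_unhealthy_pods() {
  kubeconfig="${1:-./kubeconfig}"
  kubectl --kubeconfig "$kubeconfig" get nodes -o wide
  kubectl --kubeconfig "$kubeconfig" get pods --all-namespaces \
    --field-selector=status.phase!=Running,status.phase!=Succeeded
}
```

Note that a pod in the Running phase can still have containers that are not ready, so an empty pod list here does not by itself prove the cluster is healthy.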

### Load Balancer service fails to come up

Check the cloud-controller-manager logs on the workload cluster.

If running the Azure cloud provider in-tree:

```
kubectl logs kube-controller-manager-<control-plane-node-name> -n kube-system
```

Using the ssh information provided during cluster creation (environment variable `AZURE_SSH_PUBLIC_KEY_B64`), you can debug most issues by SSHing into the VMs that have been created:
If running the Azure cloud provider out-of-tree:

```
# connect to first control node - capi is default linux user created by deployment
API_SERVER=$(kubectl get azurecluster capz-cluster -o jsonpath='{.status.network.apiServerIp.dnsName}')
kubectl logs cloud-controller-manager -n kube-system
```


## Watching Kubernetes resources

To watch progression of all Cluster API resources on the management cluster you can run:

```bash
kubectl get cluster-api
```

## Looking at controller logs

To check the CAPZ controller logs on the management cluster, run:

```bash
kubectl logs deploy/capz-controller-manager -n capz-system manager
```

### Checking cloud-init logs (Ubuntu)

Cloud-init logs can provide more information on any issues that happened when running the bootstrap script.

#### Option 1: Using the Azure Portal

In the Azure portal, open the virtual machine blade. The boot diagnostics option is under the Support and Troubleshooting section.

For more information, see [here](https://docs.microsoft.com/en-us/azure/virtual-machines/boot-diagnostics#boot-diagnostics-view).

#### Option 2: Using the Azure CLI

```bash
az vm boot-diagnostics get-boot-log --name MyVirtualMachine --resource-group MyResourceGroup
```

For more information, see [here](https://docs.microsoft.com/en-us/cli/azure/vm/boot-diagnostics?view=azure-cli-latest).

#### Option 3: With SSH

Using the SSH information provided during cluster creation (environment variable `AZURE_SSH_PUBLIC_KEY_B64`):


##### connect to first control node - capi is default linux user created by deployment
```
API_SERVER=$(kubectl get azurecluster capz-cluster -o jsonpath='{.spec.controlPlaneEndpoint.host}')
ssh capi@${API_SERVER}
```

# list nodes
##### list nodes
```
kubectl get azuremachines
NAME READY STATE
capz-cluster-control-plane-2jprg true Succeeded
capz-cluster-control-plane-ck5wv true Succeeded
capz-cluster-control-plane-w4tv6 true Succeeded
capz-cluster-md-0-s52wb true Succeeded
capz-cluster-md-0-s52wb false Failed
capz-cluster-md-0-w8xxw true Succeeded
```

# pick node name from output above:
##### pick node name from output above:
```
node=$(kubectl get azuremachine capz-cluster-md-0-s52wb -o jsonpath='{.status.addresses[0].address}')
ssh -J capi@${API_SERVER} capi@${node}
```

> There are some [provided scripts](/hack/debugging/Readme.md) that can help automate a few common tasks.
Reviewing the following logs on the workload cluster can help with troubleshooting:

- `less /var/lib/waagent/custom-script/download/0/stdout`
- `journalctl -u cloud-final`
- `less /var/log/cloud-init-output.log`
- `journalctl -u kubelet`
##### look at cloud-init logs
`less /var/log/cloud-init-output.log`
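Since the cloud-init output can be long, filtering it for likely failure lines may speed things up. This is a minimal sketch to run on the VM: the function name and the keyword list are assumptions, and a match is only a hint about where to start reading.

```bash
# Sketch: show numbered lines of a cloud-init log that mention errors or failures.
# $1 is the log path (defaults to /var/log/cloud-init-output.log).
# The keyword list is an assumption; adjust it for your bootstrap script.
scan_cloud_init_log() {
  grep -inE 'error|fail|traceback' "${1:-/var/log/cloud-init-output.log}"
}
```

Run `scan_cloud_init_log` with no arguments to scan the default log, then open the file with `less` at the reported line numbers for context.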

## Automated log collection

@@ -60,3 +160,5 @@ As part of [CI](../scripts/ci-e2e.sh) there is a [log collection script](hack/..
```bash
./hack/log/log-dump.sh
```

There are also some [provided scripts](/hack/debugging/Readme.md) that can help automate a few common tasks.
