docs: update csi-node troubleshooting (#1130)

juicedata · Sep 30, 2024 · 792180b
1 parent 6a6216e
Showing 2 changed files with 28 additions and 8 deletions.
diff --git a/docs/en/administration/troubleshooting-cases.md b/docs/en/administration/troubleshooting-cases.md
@@ -19,9 +19,18 @@ Above error message shows that the CSI Driver named `csi.juicefs.com` isn't foun
 If you used `mount pod` mode, follow these steps to troubleshoot:
 
 * Run `kubectl get csidrivers.storage.k8s.io` and check whether `csi.juicefs.com` is actually missing. If it is, the CSI Driver isn't installed at all; head to [Installation](../getting_started.md).
-* If `csi.juicefs.com` already exists in the above `csidrivers` list, that means CSI Driver is installed, the problem is with CSI Node.
-* [Check if CSI Node is working correctly](./troubleshooting.md#check-csi-node).
-* There should be a CSI Node pod on the exact Kubernetes node where the application pod is running, if [scheduling strategy](../guide/resource-optimization.md#csi-node-node-selector) has been configured for the CSI Node DaemonSet, or the node itself is [tainted](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration), CSI Node may be missing on such worker nodes.
+* If `csi.juicefs.com` already exists in the above `csidrivers` list, the CSI Driver is installed and the problem is with CSI Node. Check its status:
+  * Before troubleshooting, see [check CSI Node](./troubleshooting.md#check-csi-node) for a list of helpful commands;
+  * A CSI Node pod is expected on the node where the application pod is running. If a [scheduling strategy](../guide/resource-optimization.md#csi-node-node-selector) has been configured for the CSI Node DaemonSet, or the node itself is [tainted](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration), CSI Node may be missing on some worker nodes, causing the "driver not found" issue;
+  * If CSI Node is actually running, look for errors in its logs:
+
+  ```shell
+  # The juicefs-plugin container handles the actual CSI Driver work; if it cannot access the Kubernetes API, the mount pod cannot be created
+  kubectl logs -n kube-system juicefs-csi-node-xxx juicefs-plugin --tail 100
+
+  # The node-driver-registrar container is in charge of registering the csidriver; any registration error will show up in its logs
+  kubectl logs -n kube-system juicefs-csi-node-xxx node-driver-registrar --tail 100
+  ```
 
 If you used `sidecar` mode, check whether the namespace in which the application pod runs has the `juicefs.com/enable-injection=true` label:
 

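The node-matching check described in the added bullet (a CSI Node pod must exist on the node that hosts the application pod) can be sketched as a small shell helper. This is only an illustration, not part of the commit: the function name `csi_node_for_pod` is hypothetical, and it assumes the CSI Driver is installed in `kube-system` and that CSI Node pods carry the label `app=juicefs-csi-node`; adjust both if your installation differs.

```shell
# Print the CSI Node pods scheduled on the same node as an application pod.
# Assumptions (not taken from this commit): CSI Node pods live in kube-system
# and are labeled app=juicefs-csi-node.
csi_node_for_pod() {
  app_pod="$1"
  app_ns="${2:-default}"

  # Find the node that hosts the application pod
  node="$(kubectl get pod "$app_pod" -n "$app_ns" -o jsonpath='{.spec.nodeName}')" || return 1
  if [ -z "$node" ]; then
    echo "pod $app_ns/$app_pod is not scheduled to any node" >&2
    return 1
  fi
  echo "application pod runs on node: $node"

  # List CSI Node pods on that node; an empty list suggests the DaemonSet
  # did not schedule a pod there (node selector or taint)
  kubectl get pods -n kube-system -l app=juicefs-csi-node \
    --field-selector "spec.nodeName=$node" -o wide
}

# Usage (hypothetical pod name and namespace):
# csi_node_for_pod my-app-pod default
```

If the second command lists no pod, the scheduling strategy or node taints are the likely cause; if it lists a Running pod, continue with the container logs shown in the diff.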
diff --git a/docs/zh_cn/administration/troubleshooting-cases.md b/docs/zh_cn/administration/troubleshooting-cases.md
@@ -14,14 +14,25 @@ sidebar_position: 7
 kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name csi.juicefs.com not found in the list of registered CSI drivers
 ```
 
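-The error message above indicates that the driver named `csi.juicefs.com` was not found. First confirm whether you are using mount pod mode or sidecar mode.
+The error message above indicates that the driver named `csi.juicefs.com` was not found. First confirm whether you are using Mount Pod mode or Sidecar mode.
 
 If you are using mount pod mode, follow these steps to troubleshoot:
 
-* Run `kubectl get csidrivers.storage.k8s.io`. If the output indeed does not contain `csi.juicefs.com`, the CSI Driver is not installed; review [Install JuiceFS CSI Driver](../getting_started.md) again.
-* If `csi.juicefs.com` is present in the `csidrivers` list above, the CSI Driver is installed and the problem lies with CSI Node.
-* [Check whether CSI Node is working properly](./troubleshooting.md#check-csi-node).
-* Check whether CSI Node is running properly on the node that hosts the application Pod. If a [scheduling strategy](../guide/resource-optimization.md#csi-node-node-selector) has been configured for the CSI Node DaemonSet, or the node itself is [tainted](https://kubernetes.io/zh-cn/docs/concepts/scheduling-eviction/taint-and-toleration), the CSI Node container may be missing.
+* Run `kubectl get csidrivers.storage.k8s.io`. If the output indeed does not contain `csi.juicefs.com`, the CSI Driver is not installed; carefully review [Install JuiceFS CSI Driver](../getting_started.md);
+* If `csi.juicefs.com` is present in the `csidrivers` list above, the CSI Driver is installed and the problem lies with CSI Node. Check whether CSI Node is working properly:
+  * Before you start, skim [Check CSI Node](./troubleshooting.md#check-csi-node); its code samples contain some handy commands for reference;
+  * Focus on the node that hosts the application Pod and check whether CSI Node is running properly on it. If a [scheduling strategy](../guide/resource-optimization.md#csi-node-node-selector) has been configured for the CSI Node DaemonSet, or the node itself is [tainted](https://kubernetes.io/zh-cn/docs/concepts/scheduling-eviction/taint-and-toleration), the CSI Node container may be missing, which causes this error;
+  * If CSI Node on the problematic node is running properly (in the Running state), verify that none of its containers show obvious errors in their logs, for example: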
+
+  ```shell
+  # The juicefs-plugin container does the actual work of the CSI Driver; if it fails to access the Kubernetes API, the Mount Pod cannot be created
+  kubectl logs -n kube-system juicefs-csi-node-xxx juicefs-plugin --tail 100
+
+  # The node-driver-registrar container registers the csidriver; if registration fails, this container reports an error
+  kubectl logs -n kube-system juicefs-csi-node-xxx node-driver-registrar --tail 100
+  ```
+
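+  * If none of the checks above is conclusive, assume Kubernetes itself is at fault. Try restarting kubelet or rebooting the system; if the problem persists, seek help from your Kubernetes administrator or service provider.
 
 If you are using sidecar mode, confirm that the corresponding namespace carries the label required by JuiceFS sidecar (`juicefs.com/enable-injection=true`):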