"etcdserver: request timed out" #717
I would look at resource usage on the host; this is probably a CPU exhaustion or disk I/O issue?
Around the time this happened the node was at ~60% CPU utilization and disk I/O was peaking at 108 MB/s and 648 write operations/s. The disk claims to support 122.8 MB/s sustained throughput (256 GB GCP SSD). So it does seem like it was potentially hitting the limit?
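(For context: if I'm reading GCP's persistent disk docs right, pd-ssd throughput scales with size at roughly 0.48 MB/s per GB, so a 256 GB pd-ssd tops out around 256 × 0.48 ≈ 122.9 MB/s of sustained writes, which matches the figure above; bursts beyond that get throttled rather than erroring.)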
That might be it. We had some problems with I/O throttling breaking various prowjobs (kind or otherwise) on prow.k8s.io because they were far exceeding the disk limits; I believe prow.k8s.io switched to appropriately sized pd-ssd on the nodes to avoid the throttling (IIRC 250 GB pd-ssd?).
There should be a graph with I/O throttling in the VM details IIRC, though it averages, so you can miss it if you're looking at broad timespans.
You are right, it is heavily throttled, with throttled spikes at 100 MB/s. We just switched to 256 GB SSDs (to match yours) because of semi-related issues where our Docker instances were getting killed. The problem is we build a bunch of Docker images at the start of all our tests, which probably uses tons of disk. So if we can optimize that to do less work, this problem may resolve itself.
That may help. Another option is to try to keep disk-heavy jobs from being co-scheduled; unfortunately Kubernetes / Prow are not terribly helpful there. Pod anti-affinity can do it (or, at least in the past, a cheaper option was identical …). As far as I know, for disk-heavy workloads the standard pattern is to have dedicated nodes with taints and labels for those specific workloads and to schedule a limited number of them (say, one per node) to those nodes, with other workloads on other nodes. That may not be the most practical for Prow… 😞
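As a rough sketch of that dedicated-node pattern (the node name and the dedicated=disk-heavy taint/label are made up here, not anything Prow or kind defines):

```sh
# Reserve a node for disk-heavy jobs: only pods that tolerate the taint land there.
kubectl taint nodes prow-disk-node-1 dedicated=disk-heavy:NoSchedule
kubectl label nodes prow-disk-node-1 dedicated=disk-heavy

# The disk-heavy job's pod spec then needs a matching toleration plus a
# nodeSelector (dedicated: disk-heavy) so it both tolerates the taint and
# only lands on those nodes; everything else stays off them automatically.
```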
We are getting this one a lot too; I suspect the same problem. It looks like kubeadm is health-checking and timing out.
We are seeing this a fair amount, so I added some retries to the setup in istio/istio#15637 (and set loglevel=debug now for further debugging). Any other ideas to mitigate this?
😞 Symptomatically, the API server is not coming up healthily; the most common cause is that it was killed or evicted, which could be due to this. kind / kubeadm can't do much there: it's expected that if the host doesn't have enough capacity to bring up the API server healthily within the time frame (controlled by kubeadm), it won't come up. IIRC they've tuned that timeout for Raspberry Pis. If you run create with --retain, you can keep the nodes around and export the logs afterwards to dig into what happened.
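For reference, this is presumably the retain-and-dump flow being referred to; a minimal sketch, assuming a recent-enough kind (check kind --help for the exact flags in your version):

```sh
# Keep the node containers around even when cluster creation fails, so they can be inspected...
kind create cluster --retain

# ...then dump the node logs (kubelet, container runtime, control plane) to a local directory.
kind export logs ./kind-logs
```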
I would add that we don't see either of these on prow.k8s.io and I've not seen them locally*, so I would still guess a resource management issue on the underlying hosts, i.e. the prow cluster nodes are overloaded / throttled 😬 Are the nodes still being I/O throttled?

Additional notes:
* There used to be an issue where, if the host had low disk, you'd see the more recent error you posted, but we avoid that now by disabling disk-based eviction.
Ok, so I dug into it a bit more. With the retry PR it doesn't seem to help: either kind comes up, or it fails all 3 times; it never seems to fail once and then succeed.

I ran another test now that the cluster is pretty much empty. The node had no pods running on it until the test was scheduled on it. The first step of the test is setting up the kind cluster: https://gubernator.k8s.io/build/istio-prow/pr-logs/pull/istio_istio/15642/integ-galley-k8s-presubmit-tests-master/3435/

Looking at the node metrics, the IO write-bytes throttle metric is peaking at 15 MB/s. So even just the …

I did the retaining + log dump, hopefully that can help. An example failure is at https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15637/integ-telemetry-k8s-presubmit-tests-master/2039; artifacts will have all the logs.

From these logs the most obvious-looking error:
There is also this:
We had some other tests where we got kind running, but then one of our containers running in kind failed to start a server with "no space left on device". I am pretty sure this is not literal disk space, but the inotify limit? It repeats that no-space-left error about 10x and then finally exits, which leads me to believe that is the root cause. So it seems like maybe increasing the inotify watch limit would help?
That sounds very suspicious.
I'm not sure either.
That's normal; there's a period during startup where the CNI config is not written out yet. This error will show up on many, many clusters while the CNI daemonset is still coming up and writing that out. As long as it eventually gets the CNI config, this is normal and fine 😅
Yes! Now this error I have seen before, but not much on prow.k8s.io. This is indeed an issue with running out of inotify watches.
Yes, and we should be able to write a daemonset to do this.
This should go in our known-issues guide, and prow.k8s.io should probably bump its inotify watches as well; IIRC they're relatively low on the cluster nodes by default.
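For anyone checking a node by hand, a minimal sketch (assuming root shell access on the node; 524288 is the same arbitrary-ish value used in the daemonset below):

```sh
# Current inotify watch limit (defaults are often low, e.g. 8192 on some distros).
cat /proc/sys/fs/inotify/max_user_watches

# Bump it for the running kernel; to persist it, drop the setting into
# /etc/sysctl.d/ or run it from a privileged daemonset as discussed below.
sudo sysctl -w fs.inotify.max_user_watches=524288
```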
Great, thanks for all the help! Now that we have the logs being dumped, hopefully we can track down future issues faster too. @Katharine, can you help us get the limit raised? I don't have access to the cluster.
Adapted to stable APIs from Azure/AKS#772 (comment):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: tune-sysctls
  namespace: kube-system
  labels:
    app: tune-sysctls
spec:
  selector:
    matchLabels:
      name: tune-sysctls
  template:
    metadata:
      labels:
        name: tune-sysctls
    spec:
      hostNetwork: true
      hostPID: true
      hostIPC: true
      initContainers:
      - name: setsysctls
        command:
        - sh
        - -c
        - sysctl -w fs.inotify.max_user_watches=524288;
        image: alpine:3.6
        imagePullPolicy: IfNotPresent
        resources: {}
        securityContext:
          privileged: true
        volumeMounts:
        - name: sys
          mountPath: /sys
      containers:
      - name: sleepforever
        resources:
          requests:
            cpu: 0.01
        image: alpine:3.6
        command: ["tail"]
        args: ["-f", "/dev/null"]
      volumes:
      - name: sys
        hostPath:
          path: /sys

(Note: 524288 is an arbitrary-ish number of watches; the default on some distros is ~8k IIRC.)
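If it's useful, applying and sanity-checking the daemonset is just the usual kubectl flow (the file name here is arbitrary):

```sh
# Hypothetical file name; this is just the daemonset above saved locally.
kubectl apply -f tune-sysctls-daemonset.yaml

# One pod per node should come up; once its init container has run, that
# node's fs.inotify.max_user_watches has been raised.
kubectl -n kube-system get daemonset tune-sysctls
kubectl -n kube-system get pods -l name=tune-sysctls -o wide
```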
Filed kubernetes/test-infra#13515 to check this in somewhere.
I've applied the above across the istio test cluster. We should check it in to istio/test-infra so it accurately reflects reality.
We've deployed the daemonset today; the current version for prow is at https://github.com/kubernetes/test-infra/blob/f96fa91682a29c57838d4df17d9ef8d4ecf7260f/prow/cluster/tune-sysctls_daemonset.yaml (more or less the same, minor tweaks).
With the fix things are going smoothly, thanks for the help. Feel free to close this.
Excellent! 😄
@BenTheElder: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What happened:
We started using KinD in our prow integration tests for Istio, and occasionally are seeing errors like
Error from server: error when creating "/logs/artifacts/galley-test-a99cf400cb3343eeac7/_suite_context/istio-deployment-577548603/istio-config-only.yaml": etcdserver: request timed out
What you expected to happen:
etcd doesn't time out.
How to reproduce it (as minimally and precisely as possible):
This is the hard part: I am not sure how to reproduce this consistently. I do, however, have a bunch of logs from when it occurred, attached below.
I realize this is probably not a very actionable bug report; my main question is what info we need to collect to root-cause this.
Environment:
- kind version: v0.3.0
- Kubernetes version (kubectl version): I think 1.14? That is what comes with 0.3.0, right?
- Docker version (docker info): 18.06.1-ce
- OS (/etc/os-release): Ubuntu 16.04

Logs: