Investigate "killing connection/stream because serving request timed out" during perf test #621
Comments
@sonyafenge did we also see this error in runs with fewer than 5k nodes? For example, 100 and 500 nodes?
errors from etcd:
Encountered something similar to https://blog.csdn.net/qq_36783142/article/details/103443750, where it's attributed to issues under load.
@vinaykul are you seeing this panic in your 5k node run?
Also looking into disk throttling, based on the discussion in kubernetes-sigs/kind#717.
GCE disk I/O limits are proportional to the disk size. @sonyafenge will run this test again with the SSD size increased from 500G to 1000G to confirm the effect of disk I/O throttling.
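One quick way to check for this on the etcd host while the test is running (a sketch; assumes the sysstat package is installed):

```
# Extended disk stats, refreshed every 2 seconds.
# Sustained %util near 100 and rising await on the etcd data device would
# point to disk I/O throttling rather than CPU or memory pressure.
iostat -x 2
```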
Please open the attached netdata snapshot from: https://registry.my-netdata.io/ Please change the file type to ".snapshot" before opening the attached file:
@sonyafenge what version of etcd is our Arktos etcd based on?
From @vinaykul's 3-apiserver tests, also seeing similar etcd issues but no panic in the apiserver:
Arktos is running on a customized etcd 3.4.4.
Some of the tuning in https://etcd.io/docs/v3.4.0/tuning/ looks interesting, for example:
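As a reference, a minimal sketch of the kind of flags that guide covers (the values shown are the etcd v3.4 defaults, for illustration only, not tuned recommendations for this cluster):

```
# Time parameters and snapshot tuning covered by the etcd tuning guide:
etcd --heartbeat-interval=100 \
     --election-timeout=1000 \
     --snapshot-count=100000
```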
Here's another discussion about tuning: https://docs.portworx.com/portworx-install-with-kubernetes/operate-and-maintain-on-kubernetes/etcd/, specifically:
Here's a custom optimization of etcd: https://www.alibabacloud.com/blog/performance-optimization-of-etcd-in-web-scale-data-scenario_594750
For the next runs, can we try the following parameter for etcd?
Also run this on the host where etcd runs: sudo ionice -c2 -n0 -p `pgrep etcd`
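For clarity, a sketch of that ionice invocation with the flag meanings spelled out (assumes etcd runs directly on the host):

```
# -c2 = best-effort I/O scheduling class, -n0 = highest priority within that class
sudo ionice -c2 -n0 -p $(pgrep etcd)

# Verify the change took effect:
ionice -p $(pgrep etcd)
```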
More tuning here:
We should enable etcd monitoring: https://etcd.io/docs/v3.4.0/op-guide/monitoring/
9/3/2020 [issuedebug][arktos][ETCDperf] Started a 5k-node run and the cluster crashed with this error, and also with another error: https://github.com/futurewei-cloud/arktos/issues/682
Logs can be found under GCP project workload-controller-manager:
Please open the attached netdata snapshot from: https://registry.my-netdata.io/
Still working on the root cause.
@sonyafenge let's revert the snapshot period change for the next run. From the metrics, it looks like our etcd host didn't have high CPU and memory load. The default snapshot interval of 10k may be okay, or we could probably even push it out a little, say 15k or 20k. Let's use ETCD_SNAPSHOT_COUNT = 20000 for the next run as an experiment.
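One way to confirm which snapshot count a given run actually used (a sketch; it assumes the start scripts ultimately pass the value to etcd's standard --snapshot-count flag):

```
# Inspect the running etcd's command line for the snapshot count:
ps -o args= -C etcd | tr ' ' '\n' | grep -- '--snapshot-count'
```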
The next step for this work is to enable etcd monitoring and stress-test etcd; we need to understand etcd's behavior under stress. "Bug/crash-driven optimization" can only get us so far. For monitoring, this could be used: https://etcd.io/docs/v3.4.0/op-guide/monitoring/
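For the stress-testing part, a minimal sketch using etcd's own tooling; the endpoint, TLS settings, and load sizes below are placeholders:

```
# Quick built-in performance check against a live endpoint:
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 check perf

# Heavier write load with etcd's benchmark tool (built from the etcd repo):
benchmark --endpoints=http://127.0.0.1:2379 \
  --conns=100 --clients=1000 \
  put --key-size=8 --sequential-keys --total=100000 --val-size=256
```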
Started two 5K runs, both with workload-controller-manager disabled. Still seeing killing connection/stream #621. Here are the run details:
I'm not seeing a machine called sonya-uswest21
Both runs have logs like this right before the run:

Run 2 has this:
@sonyafenge Did we mean to run with ETCD_SNAPSHOT_COUNT=20k instead?
Sorry, that's a copy error; it should be sonya-uswest2.
Only run 1 (etcd version 3.4.4-arktos.1) changed ETCD_SNAPSHOT_COUNT=20000.
Run 2 still uses the default value, which is 10000.
@sonyafenge steps to get etcd metrics using Prometheus. This can be done from the master node; come to think of it, it may be better to put this on a non-master node to reduce the load on the master.

```
export RELEASE="2.2.1"
wget https://github.com/prometheus/prometheus/releases/download/v${RELEASE}/prometheus-${RELEASE}.linux-amd64.tar.gz
tar xvf prometheus-${RELEASE}.linux-amd64.tar.gz
cd prometheus-2.2.1.linux-amd64/
```

```
root@ip-172-31-27-32:/home/ubuntu# cat /tmp/test-etcd.yaml
global:
  scrape_interval: 10s
scrape_configs:
  - job_name: test-etcd
    static_configs:
      - targets: ['172.31.27.32:2379','127.0.0.1:2379']
```

```
./prometheus --config.file="/tmp/test-etcd.yaml" --web.listen-address=":9090"
```

To access the metrics, make sure port 9090 is open on the host, then point a browser at http://<host-ip>:9090. Replace the host IP with the actual public IP of the host.
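Once Prometheus is scraping, two latency queries worth watching for this issue (a sketch against the Prometheus HTTP API; the host and port correspond to the setup above):

```
PROM=http://127.0.0.1:9090

# p99 WAL fsync latency; this should stay in the low milliseconds:
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'

# p99 backend commit latency; spikes here line up with disk throttling:
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))'
```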
More than 5% gRPC request failures detected in etcd for 5 minutes.
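That alert corresponds roughly to the failure-ratio expression below, adapted from etcd's published alerting rules (the label selectors are an assumption and may need adjusting for this deployment):

```
# Fraction of non-OK gRPC responses from etcd over the last 5 minutes;
# the alert fires when this stays above 0.05.
curl -sG http://127.0.0.1:9090/api/v1/query --data-urlencode \
  'query=sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) / sum(rate(grpc_server_handled_total[5m]))'
```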
Some observations about etcd compaction (also a great etcd doc here): the apiserver sets the compaction interval to 5m; however, from the etcd log it seems compaction is being done way more frequently, and compaction is running together with snapshotting.
Our snapshot count is at 10k, while the etcd doc suggests it should be 100,000. Specifically:
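For context, these are the two knobs being discussed, shown as they would appear on the command lines (a sketch; the Arktos start scripts may wire them up differently):

```
# kube-apiserver drives etcd compaction; 5m is the upstream default
# (other required flags omitted for brevity):
kube-apiserver --etcd-compaction-interval=5m

# etcd takes a snapshot after this many applied entries; the v3.4 default is 100000:
etcd --snapshot-count=100000
```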
Suggestions for the next run:
Each cluster will need to be tuned individually. However, in general it is preferable to have smaller, quicker compaction operations to avoid lock retention. In this etcd issue, etcd-io/etcd#11021, it is hinted that for large-scale clusters it should be smaller.
Here’s the sequence of events I’m going with:
So my question is about the endpoints_controller now. Is it normal that we have this constant error of syncing endpoints? Sometimes it happened 10+ times per minute.
No longer an issue.
8/24/2020 [perf-tests][arktos][node5000] node5000-1api-1wc-1etcd:
Load run still failed and crashed with a panic in the apiserver:
runtime.go:73] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)
This error is usually surrounded by: