
Investigate "killing connection/stream because serving request timed out" during perf test #621

Closed
pdgetrf opened this issue Aug 25, 2020 · 48 comments

@pdgetrf
Collaborator

pdgetrf commented Aug 25, 2020

8/24/2020 [perf-tests][arktos][node5000] node5000-1api-1wc-1etcd:

The load run still failed and crashed with a panic in the apiserver:
runtime.go:73] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)

This panic is usually accompanied by entries such as:

E0822 14:36:23.920503       1 wrap.go:32] apiserver panic'd on PUT /apis/coordination.k8s.io/v1beta1/tenants/system/namespaces/kube-node-lease/leases/hollow-node-p8rvh?timeout=10s
@pdgetrf pdgetrf added this to the 830 milestone Aug 25, 2020
@pdgetrf pdgetrf self-assigned this Aug 25, 2020
@pdgetrf
Collaborator Author

pdgetrf commented Aug 25, 2020

@sonyafenge did we also see this error in runs with fewer than 5k nodes, for example 100 and 500 nodes?

@pdgetrf
Collaborator Author

pdgetrf commented Aug 25, 2020

errors from etcd:

2020-08-22 14:41:23.345016 W | etcdserver: read-only range request "key:\"/registry/services/specs/\" range_end:\"/registry/services/specs0\" limit:500 " with result "error:context canceled" took too long (2.093356818s) to execute
2020-08-22 14:41:23.345087 W | etcdserver: read-only range request "key:\"/registry/actions/\" range_end:\"/registry/actions0\" limit:500 " with result "error:context canceled" took too long (2.100863077s) to execute
2020-08-22 14:41:23.345235 W | etcdserver: read-only range request "key:\"/registry/pods/\" range_end:\"/registry/pods0\" limit:500 " with result "error:context canceled" took too long (2.071970109s) to execute
2020-08-22 14:41:23.345477 W | etcdserver: read-only range request "key:\"/registry/pods/\" range_end:\"/registry/pods0\" limit:500 " with result "error:context canceled" took too long (2.078145492s) to execute
2020-08-22 14:41:23.348268 W | etcdserver: read-only range request "key:\"/registry/secrets/vydm4f-testns/default-token-rnxjn\" " with result "error:context canceled" took too long (2.04715108s) to execute
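
A quick way to check whether these slow-request warnings line up with the apiserver timeout panics is to bucket both logs by minute (log paths and formats below are assumptions based on the snippets above; adjust to the actual files):

# apiserver timeout panics per minute (assumes klog lines like "E0822 14:36:23.920503 ...")
grep "serving request timed out" kube-apiserver.log | awk '{print substr($2, 1, 5)}' | sort | uniq -c

# etcd slow-request warnings per minute (lines like "2020-08-22 14:41:23.345016 W | etcdserver: ...")
grep "took too long" etcd.log | awk '{print $1, substr($2, 1, 5)}' | sort | uniq -c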

@pdgetrf
Collaborator Author

pdgetrf commented Aug 25, 2020

A similar report: https://blog.csdn.net/qq_36783142/article/details/103443750, where the error is attributed to etcd struggling under load.

@pdgetrf
Collaborator Author

pdgetrf commented Aug 25, 2020

@vinaykul are you seeing this panic in your 5k node run?

@pdgetrf
Collaborator Author

pdgetrf commented Aug 25, 2020

also looking into disk throttling, based on the discussion in kubernetes-sigs/kind#717

@pdgetrf
Collaborator Author

pdgetrf commented Aug 26, 2020

disk I/O graph:

[screenshot: disk I/O graph]

and zoomed in toward the end of the run:

[screenshot: disk I/O graph, zoomed in near the end of the run]

the marked section looks like the disk is being throttled. checking the logs for activity during that window.

@pdgetrf
Collaborator Author

pdgetrf commented Aug 26, 2020

the disk graph seems to correlate with the time when panics started to appear in the apiserver log:

[screenshot: disk I/O graph aligned with the apiserver panic timestamps]

@pdgetrf
Collaborator Author

pdgetrf commented Aug 26, 2020

and disk I/O matches the time when etcd started to give up in etcd.log:

[screenshot: disk I/O graph aligned with the etcd.log errors]

@pdgetrf
Collaborator Author

pdgetrf commented Aug 26, 2020

and here's the overall system load:

[screenshot: overall system load]

@pdgetrf
Collaborator Author

pdgetrf commented Aug 26, 2020

GCE disk I/O limit is proportional to the disk size.

[screenshot: GCE disk performance limits by disk size]

@sonyafenge will run this test again with SSD size increased from 500G to 1000G to confirm the effect of disk I/O throttling.
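
As a back-of-envelope sanity check (a sketch only, assuming pd-ssd scales at roughly 30 IOPS and ~0.48 MB/s of throughput per GB before per-VM caps; check the current GCE docs for exact numbers), doubling the disk roughly doubles the ceiling:

# estimated pd-ssd limits for the two disk sizes (assumed per-GB rates; per-VM caps may apply)
for size_gb in 500 1000; do
  awk -v s="$size_gb" 'BEGIN { printf "%4dGB pd-ssd: ~%d IOPS, ~%.0f MB/s\n", s, s*30, s*0.48 }'
done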

@sonyafenge
Collaborator

Please open the attached netdata snapshot from: https://registry.my-netdata.io/

Please change the file extension to ".snapshot" before opening the attached file:
netdata-19b8d07b911d-20200828-053044-540.txt

@pdgetrf
Collaborator Author

pdgetrf commented Aug 28, 2020

disk I/O before the crash. the plateau is a sign of the disk being throttled.

[screenshot: disk I/O plateau before the crash]

@pdgetrf
Collaborator Author

pdgetrf commented Aug 28, 2020

@sonyafenge what version of etcd is our Arktos etcd based on?

@pdgetrf
Collaborator Author

pdgetrf commented Aug 30, 2020

from @vinaykul's 3-apiserver tests, we are also seeing similar etcd issues but no panic in the apiserver:

2020-08-29 16:07:23.334423 W | rafthttp: health check for peer c1becd0abd5925dc could not connect: dial tcp: lookup vinay-k8s-10k-kubemark-master-8c8 on 169.254.169.254:53: no such host
raft2020/08/29 16:07:24 INFO: 83fa9bf147631769 is starting a new election at term 4
raft2020/08/29 16:07:24 INFO: 83fa9bf147631769 became candidate at term 5
raft2020/08/29 16:07:24 INFO: 83fa9bf147631769 received MsgVoteResp from 83fa9bf147631769 at term 5
raft2020/08/29 16:07:24 INFO: 83fa9bf147631769 [logterm: 2, index: 535] sent MsgVote request to c1becd0abd5925dc at term 5
2020-08-29 16:07:24.857063 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000007517s) to execute
WARNING: 2020/08/29 16:07:24 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-29 16:07:25.686850 W | etcdserver: timed out waiting for read index response (local node might have slow network)
2020-08-29 16:07:25.686983 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "error:etcdserver: request timed out" took too long (7.000262842s) to execute
raft2020/08/29 16:07:26 INFO: 83fa9bf147631769 is starting a new election at term 5
raft2020/08/29 16:07:26 INFO: 83fa9bf147631769 became candidate at term 6
raft2020/08/29 16:07:26 INFO: 83fa9bf147631769 received MsgVoteResp from 83fa9bf147631769 at term 6
raft2020/08/29 16:07:26 INFO: 83fa9bf147631769 [logterm: 2, index: 535] sent MsgVote request to c1becd0abd5925dc at term 6
2020-08-29 16:07:26.861178 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context deadline exceeded" took too long (2.000107129s) to execute
WARNING: 2020/08/29 16:07:26 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
raft2020/08/29 16:07:27 INFO: 83fa9bf147631769 is starting a new election at term 6
raft2020/08/29 16:07:27 INFO: 83fa9bf147631769 became candidate at term 7
raft2020/08/29 16:07:27 INFO: 83fa9bf147631769 received MsgVoteResp from 83fa9bf147631769 at term 7
raft2020/08/29 16:07:27 INFO: 83fa9bf147631769 [logterm: 2, index: 535] sent MsgVote request to c1becd0abd5925dc at term 7
2020-08-29 16:07:28.334606 W | rafthttp: health check for peer c1becd0abd5925dc could not connect: dial tcp: lookup vinay-k8s-10k-kubemark-master-8c8 on 169.254.169.254:53: no such host
2020-08-29 16:07:28.334664 W | rafthttp: health check for peer c1becd0abd5925dc could not connect: dial tcp: lookup vinay-k8s-10k-kubemark-master-8c8 on 169.254.169.254:53: no such host
2020-08-29 16:07:28.686306 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "error:context canceled" took too long (996.538674ms) to execute
WARNING: 2020/08/29 16:07:28 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-29 16:07:28.865603 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000091191s) to execute
WARNING: 2020/08/29 16:07:28 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-29 16:07:29.306984 W | etcdserver: read-only range request "key:\"/registry/jobs/\" range_end:\"/registry/jobs0\" limit:500 " with result "error:context canceled" took too long (8.818748064s) to execute
WARNING: 2020/08/29 16:07:29 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
raft2020/08/29 16:07:29 INFO: 83fa9bf147631769 is starting a new election at term 7
raft2020/08/29 16:07:29 INFO: 83fa9bf147631769 became candidate at term 8
raft2020/08/29 16:07:29 INFO: 83fa9bf147631769 received MsgVoteResp from 83fa9bf147631769 at term 8
raft2020/08/29 16:07:29 INFO: 83fa9bf147631769 [logterm: 2, index: 535] sent MsgVote request to c1becd0abd5925dc at term 8
2020-08-29 16:07:29.973398 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " with result "error:context canceled" took too long (9.999617436s) to execute
WARNING: 2020/08/29 16:07:29 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-29 16:07:30.870122 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000036686s) to execute

@Sindica
Collaborator

Sindica commented Aug 31, 2020

@sonyafenge what version of etcd is our Arktos etcd based on?

Arktos is running on a customized etcd 3.4.4.

@pdgetrf
Collaborator Author

pdgetrf commented Aug 31, 2020

some of the tuning suggestions in https://etcd.io/docs/v3.4.0/tuning/ look interesting, for example:

[screenshot: excerpt from the etcd tuning doc]

@pdgetrf
Collaborator Author

pdgetrf commented Aug 31, 2020

here's another discussion about tuning:

https://docs.portworx.com/portworx-install-with-kubernetes/operate-and-maintain-on-kubernetes/etcd/

specifically:

[screenshot: excerpt from the Portworx etcd tuning guide]

@pdgetrf
Collaborator Author

pdgetrf commented Aug 31, 2020

@zmn223 zmn223 modified the milestones: 830, 930 Aug 31, 2020
@pdgetrf
Collaborator Author

pdgetrf commented Sep 1, 2020

@sonyafenge

for the next runs, can we try the following etcd parameter:

--snapshot-count=5000

Also run this on the host where etcd runs

sudo ionice -c2 -n0 -p `pgrep etcd`
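
For reference, -c2 -n0 puts etcd in the best-effort I/O class at its highest priority; it won't help once the GCE volume itself is throttled, but it keeps etcd ahead of other local I/O. A quick check that the setting took effect (assuming a single etcd process on the host):

sudo ionice -p "$(pgrep -o etcd)"    # expected output: "best-effort: prio 0"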

@pdgetrf
Collaborator Author

pdgetrf commented Sep 1, 2020

@pdgetrf
Collaborator Author

pdgetrf commented Sep 4, 2020

we should enable etcd monitoring: https://etcd.io/docs/v3.4.0/op-guide/monitoring/

@sonyafenge
Collaborator

9/3/2020 [issuedebug][arktos][ETCDperf]

started a 5k-node run and the cluster crashed with this error and also another error: https://github.com/futurewei-cloud/arktos/issues/682
configuration change:

  1. ETCD_SNAPSHOT_COUNT to 5000
  2. Run "sudo ionice -c2 -n0 -p pgrep etcd" on kubemark master
sonyali@sonya-uswest2:~/go/src/k8s.io/arktos$ git diff
diff --git a/cluster/gce/manifests/etcd.manifest b/cluster/gce/manifests/etcd.manifest
index c9049da..ef000ba 100644
--- a/cluster/gce/manifests/etcd.manifest
+++ b/cluster/gce/manifests/etcd.manifest
@@ -45,7 +45,7 @@
         "value": "{{ etcd_protocol }}://{{ hostname }}:{{ server_port }}"
       },
       { "name": "ETCD_SNAPSHOT_COUNT",
-        "value": "10000"
+        "value": "5000"
       }
         ],
     "livenessProbe": {

etcd-3a1w1e-kubemark-master /var/log # sudo ionice -c2 -n0 -p `pgrep etcd`

logs can be found under GCP project: workload-controller-manager:

sonya-uswest2:  /home/sonyali/logs/perf-test/gce-5000/arktos/0903-etcd-3a1w1e 

@sonyafenge
Collaborator

Please open the attached netdata snapshot from: https://registry.my-netdata.io/
netdata-61e075073cf4-20200904-020203-480.txt

@pdgetrf
Collaborator Author

pdgetrf commented Sep 4, 2020

crash started roughly at 2:05am
[screenshot: metrics around the 2:05am crash]

around the time of the crash, etcd logged:

2020-09-04 17:05:54.801274 W | etcdserver: read-only range request "key:\"/registry/pods/\" range_end:\"/registry/pods0\" limit:500 " with result "range_response_count:500 size:624221" took too long (9.604877046s) to execute

that is, almost 10 seconds to execute
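
For comparison, the same paginated read can be timed directly against etcd with etcdctl (a sketch; the endpoint below is an assumption and depends on how etcd is exposed on the kubemark master):

time ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 \
  get /registry/pods/ --prefix --limit=500 --keys-only > /dev/null
# add --cacert/--cert/--key if client TLS is enabled on this etcd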

@pdgetrf
Collaborator Author

pdgetrf commented Sep 4, 2020

possibly the straw that broke the camel's back:

"WARNING: 2020/09/04 17:05:45 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"

after this error, response times started to rocket from milliseconds to seconds:

[screenshot: request latency jumping from milliseconds to seconds]

@yxiong2020

Still working on the root cause

@pdgetrf
Collaborator Author

pdgetrf commented Sep 4, 2020

etcd went down, and it seems to have gone down during a snapshotting attempt.

[screenshot: etcd log showing a snapshot in progress when it went down]

and during this time the apiserver was having issues:

[screenshot: apiserver errors during the same window]

@pdgetrf
Collaborator Author

pdgetrf commented Sep 4, 2020

@sonyafenge let's revert the snapshot count change for the next run. from the metrics it looks like our etcd host didn't have high CPU or memory load. the default snapshot count of 10k may be okay, or we could probably even push it out a little, say to 15k or 20k. Let's use ETCD_SNAPSHOT_COUNT=20000 for the next run as an experiment.

@pdgetrf
Collaborator Author

pdgetrf commented Sep 8, 2020

the next step for this work is to enable etcd monitoring and to stress-test etcd.

we need to understand etcd's behavior under stress; "bug/crash-driven optimization" can only get us so far.

for monitoring, this could be used: https://etcd.io/docs/v3.4.0/op-guide/monitoring/

@sonyafenge
Collaborator

sonyafenge commented Sep 10, 2020

Started two 5K runs, both with workload-controller-manager disabled. Still seeing "killing connection/stream" #621.
Both runs have a new panic in kube-controller-manager.log: "invalid memory address or nil pointer dereference" in daemon_controller.go #698

Here are the run details:

  1. etcd Version: 3.4.4-arktos.1

Commit 7983fde on 9/8, with ETCD_SNAPSHOT_COUNT=20000 and workload-controller-manager disabled

1 apiserver, 0 workload, 1 etcd

etcd size after the crash: 2.2GB

logs can be found under GCP project: workload-controller-manager: sonya-uswest2: /home/sonyali/logs/perf-test/gce-5000/arktos/0908-debug-1a1w1e

  2. etcd Version: 3.4.4:

Commit “multi-tenancy CRD e2e tests (#662)” on 9/8, with workload-controller-manager disabled

1 apiserver, 0 workload, 1 etcd

etcd size after the crash: 1.5GB

logs can be found under GCP project: workload-controller-manager: sonya-useast1: /home/sonyali/logs/perf-test/gce-5000/arktos/0908-etcd344-1a1w1e

@pdgetrf
Collaborator Author

pdgetrf commented Sep 11, 2020

I'm not seeing a machine called sonya-uswest21

@pdgetrf
Collaborator Author

pdgetrf commented Sep 11, 2020

both runs have logs like this right before the crash:

apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}

@pdgetrf
Collaborator Author

pdgetrf commented Sep 11, 2020

run 2 has this

"2020-09-09 04:16:28.799325 I | pkg/flags: recognized and used environment variable ETCD_SNAPSHOT_COUNT=10000"

@sonyafenge Did we mean to run with ETCD_SNAPSHOT_COUNT=20k instead?

@sonyafenge
Collaborator

I'm not seeing a machine called sonya-uswest21

sorry, that's a copy error, it should be sonya-uswest2

@sonyafenge
Collaborator

run 2 has this

"2020-09-09 04:16:28.799325 I | pkg/flags: recognized and used environment variable ETCD_SNAPSHOT_COUNT=10000"

@sonyafenge Did we mean to run with ETCD_SNAPSHOT_COUNT=20k instead?

only run 1 (etcd Version: 3.4.4-arktos.1) changed ETCD_SNAPSHOT_COUNT to 20000:

Commit 7983fde on 9/8, with ETCD_SNAPSHOT_COUNT=20000 and workload-controller-manager disabled

run 2 still used the default value, which is 10000

@pdgetrf
Collaborator Author

pdgetrf commented Sep 14, 2020

@pdgetrf
Collaborator Author

pdgetrf commented Sep 14, 2020

@sonyafenge steps to get etcd metrics using prometheus

this can be done from the master node. come to think of it, it may be better to run this on a non-master node to reduce the load on the master.

  1. get prometheus
export RELEASE="2.2.1"
wget https://github.com/prometheus/prometheus/releases/download/v${RELEASE}/prometheus-${RELEASE}.linux-amd64.tar.gz
tar xvf prometheus-${RELEASE}.linux-amd64.tar.gz
cd prometheus-2.2.1.linux-amd64/
  2. write config file
root@ip-172-31-27-32:/home/ubuntu# cat /tmp/test-etcd.yaml
global:
  scrape_interval: 10s
scrape_configs:
  - job_name: test-etcd
    static_configs:
    - targets: ['172.31.27.32:2379','127.0.0.1:2379']
  3. run prometheus
./prometheus --config.file="/tmp/test-etcd.yaml" --web.listen-address=":9090"

to access the metrics, make sure port 9090 is open on the host, then point a browser at

http://[host ip]:9090/

replacing [host ip] with the actual public IP of the host
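
Once Prometheus is scraping, a few ad-hoc queries against its HTTP API give a quick read on etcd disk health (metric names are the standard etcd 3.4 ones; the etcd docs suggest p99 WAL fsync should stay under roughly 10ms on a healthy disk):

PROM=http://localhost:9090
# p99 WAL fsync latency over the last 5 minutes
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'
# p99 backend commit latency over the last 5 minutes
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))'
# leader changes in the last hour (non-zero means elections happened)
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=increase(etcd_server_leader_changes_seen_total[1h])'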

@pdgetrf
Collaborator Author

pdgetrf commented Sep 16, 2020

[screenshot: etcd Prometheus metrics for the failed run]

etcd metrics for the failed run

@pdgetrf
Collaborator Author

pdgetrf commented Sep 16, 2020

More than 5% GRPC request failure detected in Etcd for 5 minutes

[screenshot: Prometheus graph of the etcd gRPC request failure rate]

https://awesome-prometheus-alerts.grep.to/rules.html
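
The rule behind that alert is roughly the following expression (fraction of non-OK gRPC responses over one minute, held for 5 minutes); it can also be evaluated ad hoc against the Prometheus API:

curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) by (grpc_method) / sum(rate(grpc_server_handled_total[1m])) by (grpc_method) > 0.05'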

@pdgetrf
Collaborator Author

pdgetrf commented Sep 16, 2020

other metrics from the run:

[screenshot: additional etcd metrics from the run]

@pdgetrf
Collaborator Author

pdgetrf commented Sep 21, 2020

some observations about etcd compaction (also a great etcd doc here)

the apiserver is setting the compaction interval to 5m:

[screenshot: apiserver compaction interval set to 5m]

however, from the etcd log, compaction seems to be happening far more frequently:

[screenshot: etcd log showing frequent compactions]

and compaction is running together with snapshotting:

[screenshot: etcd log showing compaction overlapping with snapshotting]
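
One way to quantify this from the log (the message wording is etcd's mvcc "finished scheduled compaction at <rev> (took ...)" line; the path is an assumption):

grep "finished scheduled compaction" etcd.log | awk '{print $1, substr($2, 1, 5)}' | sort | uniq -c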

@pdgetrf
Collaborator Author

pdgetrf commented Sep 21, 2020

our snapshot count is at 10k:

[screenshot: current --snapshot-count of 10,000]

while the etcd docs suggest it should be 100,000:

[screenshot: etcd documentation on --snapshot-count]

specifically:

[screenshot: the relevant --snapshot-count recommendation]

@pdgetrf
Collaborator Author

pdgetrf commented Sep 21, 2020

@sonyafenge

suggestions for the next run (a quick verification sketch follows after the list):

  • apiserver flag: change --etcd-compaction-interval to 1h from 5m
--etcd-compaction-interval=1h
  • etcd flag: increase the snapshot count --snapshot-count from 10k to 100k
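
Quick checks that both settings actually took effect on the next run (log path is an assumption; the apiserver flag only shows up on the command line when it is explicitly set):

grep -o "ETCD_SNAPSHOT_COUNT=[0-9]*" etcd.log | head -1            # expect 100000
ps -ef | grep -o -- "--etcd-compaction-interval=[^ ]*" | head -1   # expect 1h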

@yb01
Collaborator

yb01 commented Sep 21, 2020

each cluster will need to be tuned individually. however, in general it is preferred to have smaller, quicker compaction operations to avoid long lock retention. This etcd issue, etcd-io/etcd#11021, hints that for a large-scale cluster the interval should be smaller.

@pdgetrf
Collaborator Author

pdgetrf commented Sep 24, 2020

Here’s the sequence of events I’m going with:

  • Apiserver was killed by Kubelet due to probing failure around 7:22.

  • Before and around 7:22, the apiserver was showing lots of slow “list endpoints” errors; the “total time” of a list grew from 1 minute to 2 minutes. So who was busy accessing endpoints during this time? It seems to be the endpoints_controller.

  • Around the same time, etcd was showing some abrupt metric changes in Prometheus (for example, the resource version stopped increasing) but no obvious error in the log.

So my question is about the endpoints_controller now. Is it normal that we have this constant stream of endpoint sync errors? It sometimes happened 10+ times per minute.
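
To put a number on that, the per-minute rate can be pulled straight from the controller-manager log (the path and message pattern are assumptions; match whatever the repeated line looks like):

grep "endpoints_controller.go" kube-controller-manager.log | grep -i error | awk '{print substr($2, 1, 5)}' | sort | uniq -c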

@pdgetrf
Collaborator Author

pdgetrf commented Sep 24, 2020

I’m seeing tons of logs like this in endpoints_controller.go

[screenshot: repeated endpoints_controller.go error logs]

This started from the very beginning, lasted from 02:38 to around 07:21 (right before the crash), and then turned into this:
[screenshot: the same errors escalating right before the crash]

@pdgetrf
Collaborator Author

pdgetrf commented Sep 25, 2020

@Sindica
Collaborator

Sindica commented Oct 5, 2021

No longer an issue
