
Investigate "killing connection/stream because serving request timed out" during perf test #621

Closed
pdgetrf opened this issue Aug 25, 2020 · 48 comments

@pdgetrf
Collaborator

pdgetrf commented Aug 25, 2020

8/24/2020 [perf-tests][arktos][node5000] node5000-1api-1wc-1etcd:

The load run still failed and crashed with a panic in the apiserver:
runtime.go:73] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)

This panic is usually accompanied by entries such as:

E0822 14:36:23.920503       1 wrap.go:32] apiserver panic'd on PUT /apis/coordination.k8s.io/v1beta1/tenants/system/namespaces/kube-node-lease/leases/hollow-node-p8rvh?timeout=10s
@pdgetrf pdgetrf added this to the 830 milestone Aug 25, 2020
@pdgetrf pdgetrf self-assigned this Aug 25, 2020
@pdgetrf
Collaborator Author

pdgetrf commented Aug 25, 2020

@sonyafenge did we also see this error in runs with fewer than 5k nodes, for example 100 and 500 nodes?

@pdgetrf
Collaborator Author

pdgetrf commented Aug 25, 2020

errors from etcd:

2020-08-22 14:41:23.345016 W | etcdserver: read-only range request "key:\"/registry/services/specs/\" range_end:\"/registry/services/specs0\" limit:500 " with result "error:context canceled" took too long (2.093356818s) to execute
2020-08-22 14:41:23.345087 W | etcdserver: read-only range request "key:\"/registry/actions/\" range_end:\"/registry/actions0\" limit:500 " with result "error:context canceled" took too long (2.100863077s) to execute
2020-08-22 14:41:23.345235 W | etcdserver: read-only range request "key:\"/registry/pods/\" range_end:\"/registry/pods0\" limit:500 " with result "error:context canceled" took too long (2.071970109s) to execute
2020-08-22 14:41:23.345477 W | etcdserver: read-only range request "key:\"/registry/pods/\" range_end:\"/registry/pods0\" limit:500 " with result "error:context canceled" took too long (2.078145492s) to execute
2020-08-22 14:41:23.348268 W | etcdserver: read-only range request "key:\"/registry/secrets/vydm4f-testns/default-token-rnxjn\" " with result "error:context canceled" took too long (2.04715108s) to execute
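
A quick way to check whether these slow-request warnings line up with the apiserver timeout panics is to bucket both logs by minute (log paths and formats below are assumptions based on the snippets above; adjust to the actual files):

# apiserver timeout panics per minute (assumes klog lines like "E0822 14:36:23.920503 ...")
grep "serving request timed out" kube-apiserver.log | awk '{print substr($2, 1, 5)}' | sort | uniq -c

# etcd slow-request warnings per minute (lines like "2020-08-22 14:41:23.345016 W | etcdserver: ...")
grep "took too long" etcd.log | awk '{print $1, substr($2, 1, 5)}' | sort | uniq -c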

@pdgetrf
Collaborator Author

pdgetrf commented Aug 25, 2020

A similar report: https://blog.csdn.net/qq_36783142/article/details/103443750, where the error is attributed to etcd struggling under load.

@pdgetrf
Collaborator Author

pdgetrf commented Aug 25, 2020

@vinaykul are you seeing this panic in your 5k node run?

@pdgetrf
Collaborator Author

pdgetrf commented Aug 25, 2020

also looking into disk throttling, based on the discussion in kubernetes-sigs/kind#717

@pdgetrf
Collaborator Author

pdgetrf commented Aug 26, 2020

disk I/O graph:

[screenshot: disk I/O graph]

and zoomed in toward the end of the run:

[screenshot: disk I/O graph, zoomed in near the end of the run]

the marked section looks like the disk is being throttled. checking the logs for activity during that window.

@pdgetrf
Collaborator Author

pdgetrf commented Aug 26, 2020

the disk graph seems to correlate with the time when panics started to appear in the apiserver log:

[screenshot: disk I/O graph aligned with the apiserver panic timestamps]

@pdgetrf
Collaborator Author

pdgetrf commented Aug 26, 2020

and disk I/O matches the time when etcd started to give up in etcd.log:

[screenshot: disk I/O graph aligned with the etcd.log errors]

@pdgetrf
Collaborator Author

pdgetrf commented Aug 26, 2020

and here's the overall system load:

[screenshot: overall system load]

@pdgetrf
Collaborator Author

pdgetrf commented Aug 26, 2020

GCE disk I/O limit is proportional to the disk size.

[screenshot: GCE disk performance limits by disk size]

@sonyafenge will run this test again with SSD size increased from 500G to 1000G to confirm the effect of disk I/O throttling.
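
As a back-of-envelope sanity check (a sketch only, assuming pd-ssd scales at roughly 30 IOPS and ~0.48 MB/s of throughput per GB before per-VM caps; check the current GCE docs for exact numbers), doubling the disk roughly doubles the ceiling:

# estimated pd-ssd limits for the two disk sizes (assumed per-GB rates; per-VM caps may apply)
for size_gb in 500 1000; do
  awk -v s="$size_gb" 'BEGIN { printf "%4dGB pd-ssd: ~%d IOPS, ~%.0f MB/s\n", s, s*30, s*0.48 }'
done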

@sonyafenge
Collaborator

Please open the attached netdata snapshot from: https://registry.my-netdata.io/

Please change the file extension to ".snapshot" before opening the attached file:
netdata-19b8d07b911d-20200828-053044-540.txt

@pdgetrf
Collaborator Author

pdgetrf commented Aug 28, 2020

disk I/O before the crash. the plateau is a sign of the disk being throttled.

[screenshot: disk I/O plateau before the crash]

@pdgetrf
Collaborator Author

pdgetrf commented Aug 28, 2020

@sonyafenge what version of etcd is our Arktos etcd based on?

@pdgetrf
Collaborator Author

pdgetrf commented Aug 30, 2020

from @vinaykul's 3-apiserver tests, we are also seeing similar etcd issues but no panic in the apiserver:

2020-08-29 16:07:23.334423 W | rafthttp: health check for peer c1becd0abd5925dc could not connect: dial tcp: lookup vinay-k8s-10k-kubemark-master-8c8 on 169.254.169.254:53: no such host
raft2020/08/29 16:07:24 INFO: 83fa9bf147631769 is starting a new election at term 4
raft2020/08/29 16:07:24 INFO: 83fa9bf147631769 became candidate at term 5
raft2020/08/29 16:07:24 INFO: 83fa9bf147631769 received MsgVoteResp from 83fa9bf147631769 at term 5
raft2020/08/29 16:07:24 INFO: 83fa9bf147631769 [logterm: 2, index: 535] sent MsgVote request to c1becd0abd5925dc at term 5
2020-08-29 16:07:24.857063 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000007517s) to execute
WARNING: 2020/08/29 16:07:24 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-29 16:07:25.686850 W | etcdserver: timed out waiting for read index response (local node might have slow network)
2020-08-29 16:07:25.686983 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "error:etcdserver: request timed out" took too long (7.000262842s) to execute
raft2020/08/29 16:07:26 INFO: 83fa9bf147631769 is starting a new election at term 5
raft2020/08/29 16:07:26 INFO: 83fa9bf147631769 became candidate at term 6
raft2020/08/29 16:07:26 INFO: 83fa9bf147631769 received MsgVoteResp from 83fa9bf147631769 at term 6
raft2020/08/29 16:07:26 INFO: 83fa9bf147631769 [logterm: 2, index: 535] sent MsgVote request to c1becd0abd5925dc at term 6
2020-08-29 16:07:26.861178 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context deadline exceeded" took too long (2.000107129s) to execute
WARNING: 2020/08/29 16:07:26 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
raft2020/08/29 16:07:27 INFO: 83fa9bf147631769 is starting a new election at term 6
raft2020/08/29 16:07:27 INFO: 83fa9bf147631769 became candidate at term 7
raft2020/08/29 16:07:27 INFO: 83fa9bf147631769 received MsgVoteResp from 83fa9bf147631769 at term 7
raft2020/08/29 16:07:27 INFO: 83fa9bf147631769 [logterm: 2, index: 535] sent MsgVote request to c1becd0abd5925dc at term 7
2020-08-29 16:07:28.334606 W | rafthttp: health check for peer c1becd0abd5925dc could not connect: dial tcp: lookup vinay-k8s-10k-kubemark-master-8c8 on 169.254.169.254:53: no such host
2020-08-29 16:07:28.334664 W | rafthttp: health check for peer c1becd0abd5925dc could not connect: dial tcp: lookup vinay-k8s-10k-kubemark-master-8c8 on 169.254.169.254:53: no such host
2020-08-29 16:07:28.686306 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "error:context canceled" took too long (996.538674ms) to execute
WARNING: 2020/08/29 16:07:28 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-29 16:07:28.865603 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000091191s) to execute
WARNING: 2020/08/29 16:07:28 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-29 16:07:29.306984 W | etcdserver: read-only range request "key:\"/registry/jobs/\" range_end:\"/registry/jobs0\" limit:500 " with result "error:context canceled" took too long (8.818748064s) to execute
WARNING: 2020/08/29 16:07:29 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
raft2020/08/29 16:07:29 INFO: 83fa9bf147631769 is starting a new election at term 7
raft2020/08/29 16:07:29 INFO: 83fa9bf147631769 became candidate at term 8
raft2020/08/29 16:07:29 INFO: 83fa9bf147631769 received MsgVoteResp from 83fa9bf147631769 at term 8
raft2020/08/29 16:07:29 INFO: 83fa9bf147631769 [logterm: 2, index: 535] sent MsgVote request to c1becd0abd5925dc at term 8
2020-08-29 16:07:29.973398 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " with result "error:context canceled" took too long (9.999617436s) to execute
WARNING: 2020/08/29 16:07:29 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-29 16:07:30.870122 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000036686s) to execute

@Sindica
Collaborator

Sindica commented Aug 31, 2020

@sonyafenge what version of etcd is our Arktos etcd based on?

Arktos is running on a customized etcd 3.4.4.

@pdgetrf
Collaborator Author

pdgetrf commented Aug 31, 2020

some of the tuning suggestions in https://etcd.io/docs/v3.4.0/tuning/ look interesting, for example:

[screenshot: excerpt from the etcd tuning doc]

@pdgetrf
Collaborator Author

pdgetrf commented Aug 31, 2020

here's another discussion about tuning:

https://docs.portworx.com/portworx-install-with-kubernetes/operate-and-maintain-on-kubernetes/etcd/

specifically:

[screenshot: excerpt from the Portworx etcd tuning guide]

@pdgetrf
Collaborator Author

pdgetrf commented Aug 31, 2020

@zmn223 zmn223 modified the milestones: 830, 930 Aug 31, 2020
@pdgetrf
Collaborator Author

pdgetrf commented Sep 1, 2020

@sonyafenge

for the next runs, can we try the following etcd parameter:

--snapshot-count=5000

Also run this on the host where etcd runs

sudo ionice -c2 -n0 -p `pgrep etcd`
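
For reference, -c2 -n0 puts etcd in the best-effort I/O class at its highest priority; it won't help once the GCE volume itself is throttled, but it keeps etcd ahead of other local I/O. A quick check that the setting took effect (assuming a single etcd process on the host):

sudo ionice -p "$(pgrep -o etcd)"    # expected output: "best-effort: prio 0"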

@pdgetrf
Collaborator Author

pdgetrf commented Sep 1, 2020

@pdgetrf
Collaborator Author

pdgetrf commented Sep 4, 2020

we should enable etcd monitoring: https://etcd.io/docs/v3.4.0/op-guide/monitoring/

@sonyafenge
Collaborator

9/3/2020 [issuedebug][arktos][ETCDperf]

started a 5k-node run and the cluster crashed with this error and also another error: https://github.com/futurewei-cloud/arktos/issues/682
configuration change:

  1. ETCD_SNAPSHOT_COUNT to 5000
  2. Run "sudo ionice -c2 -n0 -p pgrep etcd" on kubemark master
sonyali@sonya-uswest2:~/go/src/k8s.io/arktos$ git diff
diff --git a/cluster/gce/manifests/etcd.manifest b/cluster/gce/manifests/etcd.manifest
index c9049da..ef000ba 100644
--- a/cluster/gce/manifests/etcd.manifest
+++ b/cluster/gce/manifests/etcd.manifest
@@ -45,7 +45,7 @@
         "value": "{{ etcd_protocol }}://{{ hostname }}:{{ server_port }}"
       },
       { "name": "ETCD_SNAPSHOT_COUNT",
-        "value": "10000"
+        "value": "5000"
       }
         ],
     "livenessProbe": {

etcd-3a1w1e-kubemark-master /var/log # sudo ionice -c2 -n0 -p `pgrep etcd`

logs can be found under GCP project: workload-controller-manager:

sonya-uswest2:  /home/sonyali/logs/perf-test/gce-5000/arktos/0903-etcd-3a1w1e 

@sonyafenge
Collaborator

Please open the attached netdata snapshot from: https://registry.my-netdata.io/
netdata-61e075073cf4-20200904-020203-480.txt

@pdgetrf
Collaborator Author

pdgetrf commented Sep 4, 2020

crash started roughly at 2:05am
[screenshot: metrics around the 2:05am crash]

around the time of the crash, etcd logged:

2020-09-04 17:05:54.801274 W | etcdserver: read-only range request "key:\"/registry/pods/\" range_end:\"/registry/pods0\" limit:500 " with result "range_response_count:500 size:624221" took too long (9.604877046s) to execute

that is, almost 10 seconds to execute
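
For comparison, the same paginated read can be timed directly against etcd with etcdctl (a sketch; the endpoint below is an assumption and depends on how etcd is exposed on the kubemark master):

time ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 \
  get /registry/pods/ --prefix --limit=500 --keys-only > /dev/null
# add --cacert/--cert/--key if client TLS is enabled on this etcd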

@pdgetrf
Collaborator Author

pdgetrf commented Sep 4, 2020

possibly the straw that broke the camel's back:

"WARNING: 2020/09/04 17:05:45 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"

after this error, response times started to rocket from milliseconds to seconds:

[screenshot: request latency jumping from milliseconds to seconds]

@yxiong2020

Still working on the root cause

@pdgetrf
Collaborator Author

pdgetrf commented Sep 4, 2020

etcd went down, and it seems to have gone down during a snapshotting attempt.

[screenshot: etcd log showing a snapshot in progress when it went down]

and during this time the apiserver was having issues:

[screenshot: apiserver errors during the same window]

@pdgetrf
Collaborator Author

pdgetrf commented Sep 4, 2020

@sonyafenge let's revert the snapshot count change for the next run. from the metrics it looks like our etcd host didn't have high CPU or memory load. the default snapshot count of 10k may be okay, or we could probably even push it out a little, say to 15k or 20k. Let's use ETCD_SNAPSHOT_COUNT=20000 for the next run as an experiment.

@pdgetrf
Collaborator Author

pdgetrf commented Sep 8, 2020

the next step for this work is to enable etcd monitoring and to stress-test etcd.

we need to understand etcd's behavior under stress; "bug/crash-driven optimization" can only get us so far.

for monitoring, this could be used: https://etcd.io/docs/v3.4.0/op-guide/monitoring/

@sonyafenge
Collaborator

sonyafenge commented Sep 10, 2020

Started two 5K runs, both with workload-controller-manager disabled. Still seeing "killing connection/stream" #621.
Both runs have a new panic in kube-controller-manager.log: "invalid memory address or nil pointer dereference" in daemon_controller.go #698

Here are the run details:

  1. etcd Version: 3.4.4-arktos.1

Commit 7983fde on 9/8, with ETCD_SNAPSHOT_COUNT=20000 and workload-controller-manager disabled

1 apiserver, 0 workload, 1 etcd

etcd size after the crash: 2.2GB

logs can be found under GCP project: workload-controller-manager: sonya-uswest2: /home/sonyali/logs/perf-test/gce-5000/arktos/0908-debug-1a1w1e

  2. etcd Version: 3.4.4:

Commit “multi-tenancy CRD e2e tests (#662)” on 9/8, with workload-controller-manager disabled

1 apiserver, 0 workload, 1 etcd

etcd size after the crash: 1.5GB

logs can be found under GCP project: workload-controller-manager: sonya-useast1: /home/sonyali/logs/perf-test/gce-5000/arktos/0908-etcd344-1a1w1e

@pdgetrf
Collaborator Author

pdgetrf commented Sep 11, 2020

I'm not seeing a machine called sonya-uswest21

@pdgetrf
Collaborator Author

pdgetrf commented Sep 11, 2020

both runs have logs like this right before the crash:

apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}

@pdgetrf
Collaborator Author

pdgetrf commented Sep 11, 2020

run 2 has this

"2020-09-09 04:16:28.799325 I | pkg/flags: recognized and used environment variable ETCD_SNAPSHOT_COUNT=10000"

@sonyafenge Did we mean to run with ETCD_SNAPSHOT_COUNT=20k instead?

@sonyafenge
Collaborator

I'm not seeing a machine called sonya-uswest21

sorry, that's a copy error, it should be sonya-uswest2

@sonyafenge
Collaborator

run 2 has this

"2020-09-09 04:16:28.799325 I | pkg/flags: recognized and used environment variable ETCD_SNAPSHOT_COUNT=10000"

@sonyafenge Did we mean to run with ETCD_SNAPSHOT_COUNT=20k instead?

only run 1 (etcd Version: 3.4.4-arktos.1) changed ETCD_SNAPSHOT_COUNT to 20000:

Commit 7983fde on 9/8, with ETCD_SNAPSHOT_COUNT=20000 and workload-controller-manager disabled

run 2 still used the default value, which is 10000

@pdgetrf
Collaborator Author

pdgetrf commented Sep 14, 2020

@pdgetrf
Collaborator Author

pdgetrf commented Sep 14, 2020

@sonyafenge steps to get etcd metrics using prometheus

this can be done from the master node. come to think of it, it may be better to run this on a non-master node to reduce the load on the master.

  1. get prometheus
export RELEASE="2.2.1"
wget https://github.com/prometheus/prometheus/releases/download/v${RELEASE}/prometheus-${RELEASE}.linux-amd64.tar.gz
tar xvf prometheus-${RELEASE}.linux-amd64.tar.gz
cd prometheus-2.2.1.linux-amd64/
  2. write config file
root@ip-172-31-27-32:/home/ubuntu# cat /tmp/test-etcd.yaml
global:
  scrape_interval: 10s
scrape_configs:
  - job_name: test-etcd
    static_configs:
    - targets: ['172.31.27.32:2379','127.0.0.1:2379']
  3. run prometheus
./prometheus --config.file="/tmp/test-etcd.yaml" --web.listen-address=":9090"

to access the metrics, make sure port 9090 is open on the host, then point a browser at

http://[host ip]:9090/

replacing [host ip] with the actual public IP of the host
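
Once Prometheus is scraping, a few ad-hoc queries against its HTTP API give a quick read on etcd disk health (metric names are the standard etcd 3.4 ones; the etcd docs suggest p99 WAL fsync should stay under roughly 10ms on a healthy disk):

PROM=http://localhost:9090
# p99 WAL fsync latency over the last 5 minutes
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'
# p99 backend commit latency over the last 5 minutes
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))'
# leader changes in the last hour (non-zero means elections happened)
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=increase(etcd_server_leader_changes_seen_total[1h])'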

@pdgetrf
Collaborator Author

pdgetrf commented Sep 16, 2020

[screenshot: etcd Prometheus metrics for the failed run]

etcd metrics for the failed run

@pdgetrf
Collaborator Author

pdgetrf commented Sep 16, 2020

More than 5% GRPC request failure detected in Etcd for 5 minutes

[screenshot: Prometheus graph of the etcd gRPC request failure rate]

https://awesome-prometheus-alerts.grep.to/rules.html
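
The rule behind that alert is roughly the following expression (fraction of non-OK gRPC responses over one minute, held for 5 minutes); it can also be evaluated ad hoc against the Prometheus API:

curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) by (grpc_method) / sum(rate(grpc_server_handled_total[1m])) by (grpc_method) > 0.05'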

@pdgetrf
Collaborator Author

pdgetrf commented Sep 16, 2020

other metrics from the run:

[screenshot: additional etcd metrics from the run]

@pdgetrf
Collaborator Author

pdgetrf commented Sep 21, 2020

some observations about etcd compaction (also a great etcd doc here)

the apiserver is setting the compaction interval to 5m:

[screenshot: apiserver compaction interval set to 5m]

however, from the etcd log, compaction seems to be happening far more frequently:

[screenshot: etcd log showing frequent compactions]

and compaction is running together with snapshotting:

[screenshot: etcd log showing compaction overlapping with snapshotting]
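
One way to quantify this from the log (the message wording is etcd's mvcc "finished scheduled compaction at <rev> (took ...)" line; the path is an assumption):

grep "finished scheduled compaction" etcd.log | awk '{print $1, substr($2, 1, 5)}' | sort | uniq -c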

@pdgetrf
Collaborator Author

pdgetrf commented Sep 21, 2020

our snapshot count is at 10k:

[screenshot: current --snapshot-count of 10,000]

while the etcd docs suggest it should be 100,000:

[screenshot: etcd documentation on --snapshot-count]

specifically:

[screenshot: the relevant --snapshot-count recommendation]

@pdgetrf
Collaborator Author

pdgetrf commented Sep 21, 2020

@sonyafenge

suggestions for the next run (a quick verification sketch follows after the list):

  • apiserver flag: change --etcd-compaction-interval to 1h from 5m
--etcd-compaction-interval=1h
  • etcd flag: increase the snapshot count --snapshot-count from 10k to 100k
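
Quick checks that both settings actually took effect on the next run (log path is an assumption; the apiserver flag only shows up on the command line when it is explicitly set):

grep -o "ETCD_SNAPSHOT_COUNT=[0-9]*" etcd.log | head -1            # expect 100000
ps -ef | grep -o -- "--etcd-compaction-interval=[^ ]*" | head -1   # expect 1h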

@yb01
Collaborator

yb01 commented Sep 21, 2020

each cluster will need to be tuned individually. however, in general it is preferred to have smaller, quicker compaction operations to avoid long lock retention. This etcd issue, etcd-io/etcd#11021, hints that for a large-scale cluster the interval should be smaller.

@pdgetrf
Collaborator Author

pdgetrf commented Sep 24, 2020

Here’s the sequence of events I’m going with:

  • Apiserver was killed by Kubelet due to probing failure around 7:22.

  • Before and around 7:22, the apiserver was showing lots of slow “list endpoints” errors; the “total time” of a list grew from 1 minute to 2 minutes. So who was busy accessing endpoints during this time? It seems to be the endpoints_controller.

  • Around the same time, etcd was showing some abrupt metric changes in Prometheus (for example, the resource version stopped increasing) but no obvious error in the log.

So my question is about the endpoints_controller now. Is it normal that we have this constant stream of endpoint sync errors? It sometimes happened 10+ times per minute.
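
To put a number on that, the per-minute rate can be pulled straight from the controller-manager log (the path and message pattern are assumptions; match whatever the repeated line looks like):

grep "endpoints_controller.go" kube-controller-manager.log | grep -i error | awk '{print substr($2, 1, 5)}' | sort | uniq -c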

@pdgetrf
Collaborator Author

pdgetrf commented Sep 24, 2020

I’m seeing tons of logs like this in endpoints_controller.go

[screenshot: repeated endpoints_controller.go error logs]

This started from the very beginning, lasted from 02:38 to around 07:21 (right before the crash), and then turned into this:
[screenshot: the same errors escalating right before the crash]

@pdgetrf
Collaborator Author

pdgetrf commented Sep 25, 2020

@Sindica
Collaborator

Sindica commented Oct 5, 2021

No longer an issue
