Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't restore snapshot with etcd snapshot size than 2GB #4771

Closed
DungT9x opened this issue Sep 16, 2023 · 5 comments
Closed

Can't restore snapshot with etcd snapshot size than 2GB #4771

DungT9x opened this issue Sep 16, 2023 · 5 comments

Comments

@DungT9x
Copy link

DungT9x commented Sep 16, 2023

Environmental Info:
RKE2 Version: v1.26.4

Node(s) CPU architecture, OS, and Version: Ubuntu 22.04

Cluster Configuration: Rke2 HA which used keepalived

Describe the bug:

Can't restore snapshot with snapshot size than 2GB

Steps To Reproduce:

  • Installed RKE2:
  • Restore from a snapshot file:
    rke2 server
    --cluster-reset
    --cluster-reset-restore-path

Expected behavior:

How to custom quota-backend-bytes value when use rke2 server --cluster-reset command.

Actual behavior:

Can't restore with 3Gb size of snapshot file when starting temporary etcd to reconcile with datastore only support 2GB (quota-backend-bytes":2147483648 )

Additional context / logs:

INFO[0000] Applying Pod Security Admission Configuration 
INFO[0000] Static pod cleanup in progress               
INFO[0000] Logging temporary containerd to /var/lib/rancher/rke2/agent/containerd/containerd.log 
INFO[0000] Running temporary containerd /var/lib/rancher/rke2/bin/containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/rke2/agent/containerd 
INFO[0010] Static pod cleanup completed successfully    
WARN[0010] remove /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json: no such file or directory 
WARN[0010] remove /var/lib/rancher/rke2/agent/etc/rke2-api-server-agent-load-balancer.json: no such file or directory 
INFO[0010] Starting rke2 v1.26.4+rke2r1 (e65f1c66df111d0d38b61da860653e761f6155b1) 
INFO[0010] Managed etcd cluster bootstrap already complete and initialized 
INFO[0010] Starting temporary etcd to reconcile with datastore 
{"level":"info","ts":"2023-09-16T18:37:53.195+0700","caller":"embed/etcd.go:124","msg":"configuring peer listeners","listen-peer-urls":["http://127.0.0.1:2400"]}
{"level":"info","ts":"2023-09-16T18:37:53.196+0700","caller":"embed/etcd.go:132","msg":"configuring client listeners","listen-client-urls":["http://127.0.0.1:2399"]}
{"level":"info","ts":"2023-09-16T18:37:53.196+0700","caller":"embed/etcd.go:306","msg":"starting an etcd server","etcd-version":"3.5.7","git-sha":"Not provided (use ./build instead of go build)","go-version":"go1.19.8 X:boringcrypto","go-os":"linux","go-arch":"amd64","max-cpu-set":8,"max-cpu-available":8,"member-initialized":true,"name":"master-01-3e6003b9","data-dir":"/var/lib/rancher/rke2/server/db/etcd-tmp","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/rancher/rke2/server/db/etcd-tmp/member","force-new-cluster":true,"heartbeat-interval":"500ms","election-timeout":"5s","initial-election-tick-advance":true,"snapshot-count":10000,"max-wals":0,"max-snapshots":0,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://127.0.0.1:2400"],"listen-peer-urls":["http://127.0.0.1:2400"],"advertise-client-urls":["http://127.0.0.1:2399"],"listen-client-urls":["http://127.0.0.1:2399"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-backend-bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{"level":"info","ts":"2023-09-16T18:38:03.197+0700","caller":"etcdserver/backend.go:85","msg":"db file is flocked by another process, or taking too long","path":"/var/lib/rancher/rke2/server/db/etcd-tmp/member/snap/db","took":"10.000993739s"}
{"level":"info","ts":"2023-09-16T18:38:07.166+0700","caller":"etcdserver/server.go:509","msg":"recovered v2 store from snapshot","snapshot-index":696280,"snapshot-size":"11 kB"}
{"level":"info","ts":"2023-09-16T18:38:07.166+0700","caller":"etcdserver/server.go:522","msg":"recovered v3 backend from snapshot","backend-size-bytes":3406200832,"backend-size":"3.4 GB","backend-size-in-use-bytes":3401625600,"backend-size-in-use":"3.4 GB"}
{"level":"info","ts":"2023-09-16T18:38:07.226+0700","caller":"etcdserver/raft.go:572","msg":"discarding uncommitted WAL entries","entry-index":696326,"commit-index-from-wal":696325,"number-of-discarded-entries":1}
{"level":"info","ts":"2023-09-16T18:38:07.228+0700","caller":"etcdserver/raft.go:602","msg":"forcing restart member","cluster-id":"2b8bf02800bf1826","local-member-id":"b6eb87d0003cc46c","commit-index":696327}
{"level":"info","ts":"2023-09-16T18:38:07.228+0700","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"b6eb87d0003cc46c switched to configuration voters=(10298182013373156034 13180778061856818284 15472063055163028652)"}
{"level":"info","ts":"2023-09-16T18:38:07.228+0700","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"b6eb87d0003cc46c became follower at term 10"}
{"level":"info","ts":"2023-09-16T18:38:07.228+0700","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"newRaft b6eb87d0003cc46c [peers: [8eea7dfaa36c8ac2,b6eb87d0003cc46c,d6b7cf5fc96ff4ac], term: 10, commit: 696327, applied: 696280, lastindex: 696327, lastterm: 10]"}
{"level":"info","ts":"2023-09-16T18:38:07.228+0700","caller":"api/capability.go:75","msg":"enabled capabilities for version","cluster-version":"3.5"}
{"level":"info","ts":"2023-09-16T18:38:07.228+0700","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"2b8bf02800bf1826","local-member-id":"b6eb87d0003cc46c","recovered-remote-peer-id":"8eea7dfaa36c8ac2","recovered-remote-peer-urls":["https://192.168.10.65:2380"]}
{"level":"info","ts":"2023-09-16T18:38:07.228+0700","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"2b8bf02800bf1826","local-member-id":"b6eb87d0003cc46c","recovered-remote-peer-id":"b6eb87d0003cc46c","recovered-remote-peer-urls":["https://192.168.10.25:2380"]}
{"level":"info","ts":"2023-09-16T18:38:07.229+0700","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"2b8bf02800bf1826","local-member-id":"b6eb87d0003cc46c","recovered-remote-peer-id":"d6b7cf5fc96ff4ac","recovered-remote-peer-urls":["https://192.168.10.66:2380"]}
{"level":"info","ts":"2023-09-16T18:38:07.229+0700","caller":"membership/cluster.go:287","msg":"set cluster version from store","cluster-version":"3.5"}
{"level":"warn","ts":"2023-09-16T18:38:07.229+0700","caller":"auth/store.go:1234","msg":"simple token is not cryptographically signed"}
{"level":"info","ts":"2023-09-16T18:38:07.230+0700","caller":"mvcc/kvstore.go:323","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":25225390}
{"level":"warn","ts":"2023-09-16T18:38:08.192+0700","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0009fe1c0/127.0.0.1:2399","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2399: connect: connection refused\""}
{"level":"info","ts":"2023-09-16T18:38:08.192+0700","logger":"etcd-client","caller":"[email protected]/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
{"level":"info","ts":"2023-09-16T18:38:08.375+0700","caller":"mvcc/kvstore.go:393","msg":"kvstore restored","current-rev":25226398}
{"level":"info","ts":"2023-09-16T18:38:08.377+0700","caller":"etcdserver/quota.go:94","msg":"enabled backend quota with default value","quota-name":"v3-applier","quota-size-bytes":2147483648,"quota-size":"2.1 GB"}
{"level":"info","ts":"2023-09-16T18:38:08.378+0700","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"8eea7dfaa36c8ac2"}
{"level":"info","ts":"2023-09-16T18:38:08.378+0700","caller":"rafthttp/pipeline.go:72","msg":"started HTTP pipelining with remote peer","local-member-id":"b6eb87d0003cc46c","remote-peer-id":"8eea7dfaa36c8ac2"}
{"level":"info","ts":"2023-09-16T18:38:08.380+0700","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"b6eb87d0003cc46c","remote-peer-id":"8eea7dfaa36c8ac2"}
{"level":"info","ts":"2023-09-16T18:38:08.380+0700","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"b6eb87d0003cc46c","remote-peer-id":"8eea7dfaa36c8ac2"}
{"level":"info","ts":"2023-09-16T18:38:08.381+0700","caller":"rafthttp/peer.go:137","msg":"started remote peer","remote-peer-id":"8eea7dfaa36c8ac2"}
{"level":"info","ts":"2023-09-16T18:38:08.381+0700","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"b6eb87d0003cc46c","remote-peer-id":"8eea7dfaa36c8ac2"}
{"level":"info","ts":"2023-09-16T18:38:08.381+0700","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"b6eb87d0003cc46c","remote-peer-id":"8eea7dfaa36c8ac2","remote-peer-urls":["https://192.168.10.65:2380"]}
{"level":"info","ts":"2023-09-16T18:38:08.381+0700","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"d6b7cf5fc96ff4ac"}
{"level":"info","ts":"2023-09-16T18:38:08.381+0700","caller":"rafthttp/pipeline.go:72","msg":"started HTTP pipelining with remote peer","local-member-id":"b6eb87d0003cc46c","remote-peer-id":"d6b7cf5fc96ff4ac"}
{"level":"info","ts":"2023-09-16T18:38:08.381+0700","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"b6eb87d0003cc46c","remote-peer-id":"8eea7dfaa36c8ac2"}
{"level":"info","ts":"2023-09-16T18:38:08.381+0700","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"b6eb87d0003cc46c","remote-peer-id":"d6b7cf5fc96ff4ac"}
{"level":"info","ts":"2023-09-16T18:38:08.382+0700","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"b6eb87d0003cc46c","remote-peer-id":"d6b7cf5fc96ff4ac"}
{"level":"info","ts":"2023-09-16T18:38:08.382+0700","caller":"rafthttp/peer.go:137","msg":"started remote peer","remote-peer-id":"d6b7cf5fc96ff4ac"}
{"level":"info","ts":"2023-09-16T18:38:08.382+0700","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"b6eb87d0003cc46c","remote-peer-id":"d6b7cf5fc96ff4ac"}
{"level":"info","ts":"2023-09-16T18:38:08.382+0700","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"b6eb87d0003cc46c","remote-peer-id":"d6b7cf5fc96ff4ac","remote-peer-urls":["https://192.168.10.66:2380"]}
{"level":"info","ts":"2023-09-16T18:38:08.382+0700","caller":"etcdserver/corrupt.go:95","msg":"starting initial corruption check","local-member-id":"b6eb87d0003cc46c","timeout":"15s"}
{"level":"info","ts":"2023-09-16T18:38:08.382+0700","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"b6eb87d0003cc46c","remote-peer-id":"d6b7cf5fc96ff4ac"}
{"level":"warn","ts":"2023-09-16T18:38:09.014+0700","caller":"etcdserver/corrupt.go:398","msg":"failed hash kv request","local-member-id":"b6eb87d0003cc46c","requested-revision":25226398,"remote-peer-endpoint":"https://192.168.10.65:2380","error":"Get \"https://192.168.10.65:2380/members/hashkv\": dial tcp 192.168.10.65:2380: connect: connection refused"}
{"level":"warn","ts":"2023-09-16T18:38:09.017+0700","caller":"etcdserver/corrupt.go:398","msg":"failed hash kv request","local-member-id":"b6eb87d0003cc46c","requested-revision":25226398,"remote-peer-endpoint":"https://192.168.10.66:2380","error":"Get \"https://192.168.10.66:2380/members/hashkv\": x509: certificate signed by unknown authority"}
{"level":"info","ts":"2023-09-16T18:38:09.017+0700","caller":"etcdserver/corrupt.go:165","msg":"initial corruption checking passed; no corruption","local-member-id":"b6eb87d0003cc46c"}
{"level":"info","ts":"2023-09-16T18:38:09.017+0700","caller":"etcdserver/server.go:845","msg":"starting etcd server","local-member-id":"b6eb87d0003cc46c","local-server-version":"3.5.7","cluster-id":"2b8bf02800bf1826","cluster-version":"3.5"}
{"level":"info","ts":"2023-09-16T18:38:09.017+0700","caller":"etcdserver/server.go:754","msg":"starting initial election tick advance","election-ticks":10}
{"level":"warn","ts":"2023-09-16T18:38:09.018+0700","caller":"etcdserver/util.go:123","msg":"failed to apply request","took":"904.969µs","request":"header:<ID:14153840137077633173 username:\"etcd-client\" auth_revision:1 > txn:<compare:<target:MOD key:\"/registry/events/kube-system/node-local-dns-qm8l6.1785575db9f97a17\" mod_revision:25226391 > success:<request_put:<key:\"/registry/events/kube-system/node-local-dns-qm8l6.1785575db9f97a17\" value_size:697 lease:4930468100222856942 >> failure:<>>","response":"size:20","error":"etcdserver: no space"}
{"level":"warn","ts":"2023-09-16T18:38:09.018+0700","caller":"etcdserver/server.go:2269","msg":"message exceeded backend quota; raising alarm","quota-size-bytes":0,"quota-size":"0 B","error":"etcdserver: no space"}
@brandond
Copy link
Member

@DungT9x
Copy link
Author

DungT9x commented Sep 18, 2023

Hi @brandond
Can you give a specific release of rke2 which has been fixed this issue?
Tks you.

@brandond
Copy link
Member

I would recommend upgrading to the latest 1.26 patch release

@DungT9x
Copy link
Author

DungT9x commented Sep 18, 2023

Hi @brandond I've been apply newer version. It's passed quota value but my snapshot size is so big.
How to increase timeout with too long info.

INFO[0010] Starting temporary etcd to reconcile with datastore 
{"level":"info","ts":"2023-09-18T10:27:30.850+0700","caller":"embed/etcd.go:124","msg":"configuring peer listeners","listen-peer-urls":["http://127.0.0.1:2400"]}
{"level":"info","ts":"2023-09-18T10:27:30.850+0700","caller":"embed/etcd.go:132","msg":"configuring client listeners","listen-client-urls":["http://127.0.0.1:2399"]}
{"level":"info","ts":"2023-09-18T10:27:30.850+0700","caller":"embed/etcd.go:306","msg":"starting an etcd server","etcd-version":"3.5.7","git-sha":"Not provided (use ./build instead of go build)","go-version":"go1.19.9 X:boringcrypto","go-os":"linux","go-arch":"amd64","max-cpu-set":8,"max-cpu-available":8,"member-initialized":true,"name":"master-02-85c3e590","data-dir":"/var/lib/rancher/rke2/server/db/etcd-tmp","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/rancher/rke2/server/db/etcd-tmp/member","force-new-cluster":true,"heartbeat-interval":"1s","election-timeout":"40s","initial-election-tick-advance":true,"snapshot-count":5000,"max-wals":0,"max-snapshots":0,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://127.0.0.1:2400"],"listen-peer-urls":["http://127.0.0.1:2400"],"advertise-client-urls":["http://127.0.0.1:2399"],"listen-client-urls":["http://127.0.0.1:2399"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-backend-bytes":5368709120,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{"level":"info","ts":"2023-09-18T10:27:40.851+0700","caller":"etcdserver/backend.go:85","msg":"db file is flocked by another process, or taking too long","path":"/var/lib/rancher/rke2/server/db/etcd-tmp/member/snap/db","took":"10.00034707s"}

INFO[0210] Failed to test temporary data store connection: failed to defragment etcd database: context deadline exceeded 
{"level":"info","ts":"2023-09-18T10:30:59.951+0700","caller":"backend/backend.go:549","msg":"finished defragmenting directory","path":"/var/lib/rancher/rke2/server/db/etcd-tmp/member/snap/db","current-db-size-bytes-diff":0,"current-db-size-bytes":3404046336,"current-db-size":"3.4 GB","current-db-size-in-use-bytes-diff":0,"current-db-size-in-use-bytes":3404005376,"current-db-size-in-use":"3.4 GB","took":"35.118877721s"}
{"level":"info","ts":"2023-09-18T10:30:59.951+0700","caller":"v3rpc/maintenance.go:96","msg":"finished defragment"}
{"level":"warn","ts":"2023-09-18T10:30:59.951+0700","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2023-09-18T10:30:24.832+0700","time spent":"35.119163717s","remote":"127.0.0.1:12940","response type":"/etcdserverpb.Maintenance/Defragment","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"warn","ts":"2023-09-18T10:30:59.951+0700","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2023-09-18T10:30:55.992+0700","time spent":"3.959466011s","remote":"127.0.0.1:60900","response type":"/etcdserverpb.Maintenance/Status","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"warn","ts":"2023-09-18T10:30:59.952+0700","caller":"etcdserver/util.go:170","msg":"apply request took too long","took":"7.329241724s","expected-duration":"100ms","prefix":"","request":"header:<ID:9998706569705083665 > lease_revoke:<id:74ac8a982fbfd68a>","response":"size:30"}

Tks you.
Dũng

@brandond
Copy link
Member

brandond commented Sep 18, 2023

{"level":"info","ts":"2023-09-18T10:27:40.851+0700","caller":"etcdserver/backend.go:85","msg":"db file is flocked by another process, or taking too long","path":"/var/lib/rancher/rke2/server/db/etcd-tmp/member/snap/db","took":"10.00034707s"}

INFO[0210] Failed to test temporary data store connection: failed to defragment etcd database: context deadline exceeded 
{"level":"info","ts":"2023-09-18T10:30:59.951+0700","caller":"backend/backend.go:549","msg":"finished defragmenting directory","path":"/var/lib/rancher/rke2/server/db/etcd-tmp/member/snap/db","current-db-size-bytes-diff":0,"current-db-size-bytes":3404046336,"current-db-size":"3.4 GB","current-db-size-in-use-bytes-diff":0,"current-db-size-in-use-bytes":3404005376,"current-db-size-in-use":"3.4 GB","took":"35.118877721s"}

It shouldn't take more than 30 seconds to complete the defrag, nor have I seen the lock error you're seeing. What sort of storage are you running this on? Do you have the ability to put it on something faster?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants