ETCD db size is extremely unbalanced among members #7116
Comments
After checking the cluster details, I noticed that etcd v3 needs a defragment operation because it uses boltdb.
@armstrongli etcd should not ever get into this state. can you easily reproduce this? do you still have that 11GB data around?
@xiang90 I'm trying to reproduce it. It has happened twice, so I'm raising it here. What I did on the cluster was upgrade it from 2.2.1 to 3.0.15 in one kubernetes cluster.
@armstrongli The defrag just fixes the symptom. The size should not be unbalanced (~10 MB difference is possible) in the first place. It would be great if you can reproduce it. Does the leader's db size become 11GB right after the migration? Or does it grow to 11GB eventually? If the latter, how long does it take to grow to 11GB? Hours? Days?
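For reference, a minimal sketch of issuing a defrag to each member through the clientv3 maintenance API. The endpoint addresses are placeholders, and the import path assumes etcd v3.4 (older releases used github.com/coreos/etcd/clientv3):

```go
// Sketch: defragment each member via the etcd clientv3 maintenance API.
// Endpoint addresses are placeholders; substitute your own cluster's.
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	endpoints := []string{"https://master-1:2379", "https://master-2:2379", "https://master-3:2379"}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Defragment is issued per endpoint; it rewrites the boltdb file and
	// reclaims space left behind by compacted revisions.
	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
		_, err := cli.Defragment(ctx, ep)
		cancel()
		if err != nil {
			log.Printf("defragment %s failed: %v", ep, err)
			continue
		}
		log.Printf("defragmented %s", ep)
	}
}
```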
@xiang90
@armstrongli Is it still growing now after your defrag?
@xiang90 Not yet. The cluster ran for weeks until it exceeded the database space quota; it has only been running for 2 days since the defrag.
@armstrongli 11GB/x weeks where x < 5 ~ 2GB/week ~ 300MB/day. You should be able to notice it even after 2 days if it continues to grow, I guess.
@armstrongli Any updates on this?
@xiang90 The data size doesn't change a lot after running my cluster for 1 week. Closing this issue for now.
@armstrongli OK. I guess there is something wrong with your initial setup or migration. Let us know if you run into this again.
@heyitsanthony @gyuho @fanminshi Other people have reported similar behavior. We should probably look into this problem.
@armstrongli Do you remember if you took snapshots from that etcd server periodically?
@xiang90 sorry for this late response.
@armstrongli Have you seen it recently? Any chance you can reproduce it?
no. i haven't found the root cause. we are using it for kubernetes clusters in our company.
So you are able to somehow reproduce it?
I redid what I had done, but the unbalanced data didn't get reproduced. :(
The reproduction steps are a little different from what happened when the data became unbalanced: I used the backup data instead of the hot data (which contains all the tombstone/historical data).
@armstrongli As long as you can reproduce it, it will give us a better chance to get it fixed.
@xiang90 I noticed that one of our clusters encountered data unbalance again after running for months, and it caused some very weird issues on the ETCD cluster.
The status of kubernetes shows that the data is different among the kubernetes master nodes. From master-1 & master-3, the data is the same. Master-2 is totally different from the others.
We didn't do anything on ETCD; we just let it run for months.
Looks like the raft algorithm in ETCD encounters some corner cases it can't handle.
is it a fresh v3 cluster? or migrated from v2?
Migrated from v2
@armstrongli I would imagine it was not synced from the very beginning. Are you sure they are synced?
Yes, they were synced and had been running healthily for months. And we hit this issue today.
can you provide me the logs of all the members from around when it happened?
Looks like this member didn't purge its history. Or there could be some error in the storage layer that caused it to keep the wrong history logs.
The weirdest thing is that the leader election kept running for a long time. Master-2 should step down, since the cluster has a leader, and should follow the leader's steps.
my bad about leader election.
@xiang90 per my understanding, a leader election should happen when a member can't get heartbeats from the leader; it converts into a candidate and starts a leader election. If the member gets the cluster leader info and follows it, then it should catch up with the cluster and serve traffic. Right now the status of the member looks like it has caught up with the cluster and has been serving traffic for some time.
Hi, if you can investigate the issue and find the root cause, that would be great! But you probably have to look into the codebase more and understand how exactly etcd works before you start the investigation; it does not work the way you assume. The best option here is probably to provide enough information for the etcd dev team to help. We need the full log from around when the bad thing happened (24 hours before and 24 hours after). Do not truncate the log or skip log entries. The raw log would be best.
How do you know they are synced? Do you have any monitoring data?
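One hedged way to sanity-check whether members look in sync is to compare each endpoint's reported status (leader, raft index, revision, db size). A minimal sketch against the clientv3 maintenance API, with placeholder endpoints:

```go
// Sketch: compare per-member status to check whether members look in sync.
// Endpoint addresses are placeholders.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	endpoints := []string{"https://master-1:2379", "https://master-2:2379", "https://master-3:2379"}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		st, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			log.Printf("status %s failed: %v", ep, err)
			continue
		}
		// Members that agree on leader, raft index, and revision are most
		// likely in sync; a wildly different db size is suspicious.
		fmt.Printf("%s leader=%x raftIndex=%d revision=%d dbSize=%d\n",
			ep, st.Leader, st.RaftIndex, st.Header.Revision, st.DbSize)
	}
}
```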
@xiang90
@armstrongli I read through the log. I did not find anything in the log that could cause data inconsistency. I suspect that the data had already become inconsistent before 5.18. I guess you noticed it since
My confusion:
The
Related issue: @xiang90 I have successfully reproduced the data-unbalance state of the etcd cluster. I have one test kubernetes cluster and have kept it running for months. The difference between k8s and the benchmark is that
Then I ran defrag on the cluster, and you can see the differences between
after defragging all the members, they have the same data size now.
The snapshot sizes vary over a large range, too.
can you provide me the steps to reproduce it? so that we can also reproduce it in our environment?
I compared 2 snapshots, one before defrag and one after defrag (after the defrag and running for ~3 days). There are a lot of
It's obvious that there are a lot of
I thought about the storage model of bolt DB: it defines buckets, fills the buckets with values, and expands the file size when it reaches the limit. My guess (I haven't reviewed the code details): after compacting and deleting the old revisions, bolt DB doesn't reuse the blank pages (blocks) anymore and keeps allocating more space to place the data, especially since the data is not solid as stone; it's porous as bread.
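If that guess were true, the free-page statistics of the db file should show it. A minimal sketch, assuming a copy of the member's db file and the standalone go.etcd.io/bbolt package, that prints how much of the file is free pages rather than live data (the path is a placeholder):

```go
// Sketch: open a copy of an etcd member's db file read-only and print the
// boltdb free-page statistics.
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Typically <data-dir>/member/snap/db; operate on a copy, not the live file.
	db, err := bolt.Open("/tmp/etcd-db-copy", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	s := db.Stats()
	fmt.Printf("free pages: %d, pending pages: %d, free bytes: %d, freelist bytes: %d\n",
		s.FreePageN, s.PendingPageN, s.FreeAlloc, s.FreelistInuse)
}
```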
@armstrongli We want to understand how your db size got unbalanced among members. The large db size is another issue; it is relevant, but not what we are talking about here. Again, we need to get a reproduction from you to confirm exactly what happened in your environment.
@armstrongli kindly ping.
@xiang90 I have been running benchmarks on our cluster, and I can't reproduce it that way; it only shows up after just keeping kubernetes running for months. What's more, the db size can exceed the 8GB quota setting.
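When the backend quota is exceeded, etcd raises a NOSPACE alarm on the affected member. A minimal sketch, with a placeholder endpoint, that lists active alarms through the clientv3 maintenance API:

```go
// Sketch: list active alarms (e.g. NOSPACE after --quota-backend-bytes is hit).
// The endpoint is a placeholder.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://master-1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	resp, err := cli.AlarmList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	if len(resp.Alarms) == 0 {
		fmt.Println("no active alarms")
		return
	}
	for _, a := range resp.Alarms {
		// Alarm is an enum; NOSPACE means the member hit its backend quota.
		fmt.Printf("member %x: alarm %v\n", a.MemberID, a.Alarm)
	}
}
```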
well. maybe k8s was doing something strange. or something else happened to the machine. we do not know. and there is no way to help you debug it. I do not really believe this is an etcd issue. if it is, it is probably a dup of #8009.
@xiang90 I've had the same problem lately after I migrated from v2 and updated the kubernetes version from 1.4.7 to 1.6.4. There are many anomalies: the etcd data grew very fast and became unbalanced among members, and the data was rolled back to an earlier version after I added the startup parameter (--quota-backend-bytes=5368709120) and restarted the etcd service. Because this problem can lead to periodic problems in the kubernetes cluster, the impact is very large.
I am facing a similar issue. Is there any permanent fix for this?
As I mentioned, this issue is probably a dup of #8009. I closed this one since it is messy and has no reproduction steps or useful information. Please follow up on #8009. Or, if you can reproduce your issue and believe it is different from #8009, please create a new one. Please do not continue to comment on this closed issue.
I have a 4-member cluster, and after running for days, one member's DB size is far larger than the others'.
Member 87e6a178ab5a0f63 has a DB file of 11 GB, but the other members have only 57 MB. There are a lot of such logs in member 87e6a178ab5a0f63.
The cluster shares the same dataset, so the data among the members should be almost the same size, e.g. 57~65 MB.
I'm expecting a response :)