one of the three nodes of etcd cluster has outdated data #8214
Comments
after rebooting the etcd on 205.18.1.181, the …
Sorry, I can't give out all of the logs because of security policy at my company. I checked the logs and found that they are almost consistent between 205.18.1.181 and the other nodes. Logs on 205.18.1.181:
Logs on 205.18.1.109:
@abel-von do you take snapshots regularly?
@xiang90 thanks for the reply, we didn't take any snapshots. It was running normally for days, then we restarted 205.18.1.181 and 205.18.1.109, which are two of the three etcd nodes. After restarting, the problem came up. Will the …
@abel-von can this be reproduced with 3.2.2 or 3.1.9? I can't seem to find a commit that matches ec53528...
@abel-von kindly ping.
sorry for the confusion caused by the git SHA, we made etcd a submodule of an outside project, so the git commit is not etcd's but the outside project's.
@abel-von is there any way to easily reproduce this on 3.2.3 or 3.1.10?
@heyitsanthony we changed our etcd version to 3.1.9, and now the issue is reproduced more frequently. We used etcd-dump-db to dump the db file and found that the larger db retains many revisions of a key, while in the smaller db file there is only one revision of each key. On one of the smaller-data-size nodes the compaction is continuous:
On the node with the larger data size, the compaction suddenly stopped at 2017-07-17 01:38:05:
We didn't find any error log on this node during the time it stopped doing compaction, only some warnings showing that the io delay is high.
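For illustration, here is a small sketch of how one could probe a single member for old revisions of a key, using the Go clientv3 API of that era; the key, revision, and endpoint are placeholders, not values from this cluster. A member that has applied the compactions rejects a read at an old revision with a "required revision has been compacted" error, while the member whose compaction stalled keeps answering:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Point the client at a single member so the serialized read below is
	// answered from that member's local backend only (endpoint is a placeholder).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://205.18.1.181:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Hypothetical key and old revision; a member that has compacted past
	// this revision returns an "mvcc: required revision has been compacted"
	// error here instead of data.
	resp, err := cli.Get(ctx, "/some/key", clientv3.WithRev(1000), clientv3.WithSerializable())
	if err != nil {
		log.Fatalf("old revision not readable: %v", err)
	}
	fmt.Printf("old revision still present, %d kvs\n", len(resp.Kvs))
}
```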
we checked the snapshot logs of the nodes and found a small difference between the larger-data node and the smaller-data nodes. On a smaller-data node:
On the larger-data node:
I think the applied log indexes can still be considered to be advancing at the same pace, as far as I know the …
we used the command …
The second node is the one with the larger db file; its revision is much smaller than the other two. Does this mean that a lot of log entries were not applied to the kvstore? The output of endpoint status is:
As we can see, the second node has 2.1GB of data, which is larger than the default max db size, and now the cluster cannot serve any update requests.
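As a rough sketch of the same check done programmatically (assuming the Go clientv3 maintenance API and its import path of that era; only the two member addresses named in this thread are listed, so the third is left as a note), the per-member revision, db size, and raft index reported by endpoint status can be collected like this, which makes it easier to spot a member whose revision lags while its raft index keeps up:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Two of the members from the thread; the third member's address is not
	// given in the issue, so add it when running against a real cluster.
	endpoints := []string{
		"http://205.18.1.181:2379",
		"http://205.18.1.109:2379",
	}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Status is answered by each member individually, so a diverged member
	// reports its own revision and db size rather than the leader's.
	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		st, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			log.Printf("%s: %v", ep, err)
			continue
		}
		fmt.Printf("%s revision=%d dbSize=%d raftIndex=%d\n",
			ep, st.Header.Revision, st.DbSize, st.RaftIndex)
	}
}
```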
This seems to be an issue in the processing of applying a log entry to the kvstore.
That's a serialized read; it's not expected to be up to date since it doesn't go through quorum, but it should match the historical data. If …
The node is behind, but it shouldn't be skipping any entries.
The larger node needs to be defragmented; it's raising an alarm on the cluster because it's exceeding its space quota. After defragmenting the member, the alarm can be disabled with …
If the apply fails then etcd should panic, since the entry has been committed to raft; there's no turning back if there's an error in the apply path.
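A minimal sketch of that recovery path with the Go clientv3 maintenance API (the endpoint is a placeholder and `defragAndDisarm` is a made-up helper name for illustration); the CLI equivalents are `etcdctl defrag` and `etcdctl alarm disarm`:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// defragAndDisarm defragments the oversized member and then clears whatever
// alarms are currently raised (NOSPACE in the scenario above).
func defragAndDisarm(cli *clientv3.Client, ep string) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	// Defragment rewrites the backend of the targeted member only.
	if _, err := cli.Defragment(ctx, ep); err != nil {
		return err
	}

	// List the active alarms and disarm each of them.
	resp, err := cli.AlarmList(ctx)
	if err != nil {
		return err
	}
	for _, m := range resp.Alarms {
		if _, err := cli.AlarmDisarm(ctx, (*clientv3.AlarmMember)(m)); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ep := "http://205.18.1.181:2379" // placeholder: the member whose db exceeded the quota
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{ep}, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	if err := defragAndDisarm(cli, ep); err != nil {
		log.Fatal(err)
	}
}
```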
We are not sure if it actually falls behind. k8s uses txns with conditions a lot. If the state is indeed inconsistent on that node, the revision might also not increase due to txn failures.
@heyitsanthony, @xiang90, thanks for the reply. We also checked the … and k8s is automatically requesting db compaction; it's just that one node of etcd didn't do it. I have added a log in the function …
and we got a log like the one below:
as we can see, the compaction failed because the revision in the request from kube-apiserver is larger than the current revision on this node; that's why the compaction fails. I think we still have to check why the revision on one node is much smaller than on the others. As in my comment before, we can see that the applied index on the three nodes is advancing at almost the same pace, so I don't think the node with the larger file falls behind. I will add more logs to see why the revision on one node is far smaller than on the others.
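To make the failure mode concrete, a hedged Go sketch (clientv3, placeholder key and endpoint) of the kind of compaction request kube-apiserver issues: the revision to compact to is taken from a response header, and a member whose own revision is still below that number cannot apply the compaction, which matches the "required revision is a future revision" failure described above:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://205.18.1.109:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Any read returns the store revision in its header; kube-apiserver
	// derives its compaction target from revisions it has observed.
	get, err := cli.Get(ctx, "/placeholder-key")
	if err != nil {
		log.Fatal(err)
	}
	rev := get.Header.Revision

	// Request a compaction up to that revision. A member whose local
	// revision is lower than rev reports the "required revision is a
	// future revision" failure seen in the added log.
	if _, err := cli.Compact(ctx, rev); err != nil {
		log.Fatalf("compaction to %d failed: %v", rev, err)
	}
	fmt.Printf("compacted up to revision %d\n", rev)
}
```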
@abel-von OK, it's definitely inconsistent. Without the wal + db I'm not sure there's much we can do to find the root cause. Removing the faulty member via … Was the member only rebooted with 3.1.9, or was it a freshly initialized member? There was a period during the 3.2 development cycle where there could be inconsistencies, but it was caught before 3.2.0. Also, there's no commit matching 9b9baa968f317fc2d5870e6aa97185df126bb640.
@heyitsanthony, yes, we are now using 3.1.9. The commit 9b9baa968f317fc2d5870e6aa97185df126bb640 is from the first cluster where we got this problem, but after that our version changed to 3.1.9. Could you please send me the commit ids that solve the inconsistency problem?
@xiang90 I have checked the code of kube-apiserver, and as you said, all updates in k8s are CAS operations using etcdv3's Txn: it compares the revision of a key and updates the key if the compare returns true. I am wondering, if one txn fails on a node, will all txns after that fail on that node because the revision is not consistent? And as more txns fail, the revision becomes more inconsistent. That makes the system very unstable, doesn't it? Will a node that falls behind a little fail a txn? If it fails once, does it fail forever?
@heyitsanthony, thanks, but I don't want to just repair the failed cluster. Actually, this problem is showing up in a lot of testing clusters now. I have to find the root cause and fix the problem.
@abel-von the member inconsistency can be reproduced with a fresh cluster running 3.1.9?
yes, it can be reproduced. I think if one of the nodes handles log entries at a slightly different pace than the leader, and you keep updating some particular keys with CAS comparing the ModRevision on a short period (like the 10 seconds k8s uses to update its node status), then the issue can be reproduced.
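A hedged sketch of that reproduction idea in Go (the key, value, interval, and endpoint are placeholders): repeatedly CAS-update a key by comparing its ModRevision, which is the same pattern kube-apiserver uses for node status:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// casUpdate mimics the apiserver-style compare-and-swap: update the key only
// if its ModRevision has not changed since we read it.
func casUpdate(cli *clientv3.Client, key, val string) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	get, err := cli.Get(ctx, key)
	if err != nil {
		return false, err
	}
	var rev int64
	if len(get.Kvs) > 0 {
		rev = get.Kvs[0].ModRevision
	}

	// The txn succeeds only if no other writer (and no inconsistency) changed
	// the key's ModRevision in between; a ModRevision of 0 matches a key that
	// does not exist yet.
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision(key), "=", rev)).
		Then(clientv3.OpPut(key, val)).
		Commit()
	if err != nil {
		return false, err
	}
	return resp.Succeeded, nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://205.18.1.181:2379"}, // placeholder
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Update a placeholder key every 10 seconds, like the k8s node status.
	for {
		ok, err := casUpdate(cli, "/nodes/node1/status", time.Now().String())
		log.Printf("cas ok=%v err=%v", ok, err)
		time.Sleep(10 * time.Second)
	}
}
```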
@abel-von …
@gyuho please just forget the commit id 9b9baa968f317fc2d5870e6aa97185df126bb640; that's from the first time we got this issue. After that we changed the etcd version to 3.1.9, which is the official release, and the problem is still there with 3.1.9.
@gyuho @heyitsanthony @xiang90 I'm confused by this …
@abel-von that's the bug fixed by d173b09; I could reproduce it using ad22aaa cherry-picked to the 3.1 branch. I'm not sure how it would cause a backend inconsistency (in this case there's a race causing inconsistency for reads) since all writes are serialized in the raft path. Possibly a race with compaction processing.
@xiang90 @heyitsanthony @gyuho We have found the root cause of this issue. The etcd cluster is serving not only the kube-apiserver but also other components, and those other components are using the v2 API. The v2 and v3 data in etcd are separated from each other. There is a … As this migration is done obscurely, and after the first migration the kube-apiserver updates data more frequently than the others do, this issue is hard to reproduce in the same cluster. I am wondering if there is any mechanism to prevent this stupid operation from being successfully done?
@abel-von what's the process for the migration? the member is taken offline, a migration script runs, the server is brought back online, and it rejoins the cluster?
@heyitsanthony, yes, since the backend data mismatch is not detected by raft, it can rejoin the cluster successfully.
Still, it is hard for us, etcd maintainers, to tell how the corruption happened. It is probably worth the effort to think about how to detect user faults in a better way.
@xiang90 sure, but detecting corruption within a minute or so of bringing the member online already goes a long way for debugging this sort of misconfiguration. The next step would be checking the hash before joining as a raft peer, which would have the advantage of avoiding booting into an inconsistent state and serving garbage.
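For the record, later client releases expose a per-member KV hash that makes this kind of check scriptable by hand. A rough sketch follows; note that `HashKV` comes from clientv3 3.3+, so it is an assumption relative to the 3.1.9 cluster in this thread, and the endpoints are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Placeholder member list (only the two addresses named in the thread).
	endpoints := []string{
		"http://205.18.1.181:2379",
		"http://205.18.1.109:2379",
	}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// HashKV hashes each member's key-value store. A revision of 0 hashes
	// each member's keyspace at its own latest revision; for a stricter
	// check, pass the same explicit revision to every member.
	var first uint32
	for i, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		resp, err := cli.HashKV(ctx, ep, 0)
		cancel()
		if err != nil {
			log.Fatalf("%s: %v", ep, err)
		}
		fmt.Printf("%s hash=%d\n", ep, resp.Hash)
		if i == 0 {
			first = resp.Hash
		} else if resp.Hash != first {
			fmt.Println("hash mismatch: possible backend inconsistency")
		}
	}
}
```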
Can we create a new issue to capture the data-consistency detection work? The title of this issue is terrifying. Cannot sleep for a while due to this "bug" :P.
we're seeing a similar issue occur at some point after an etcd v2->v3 migration, one that does not correct itself after a restart of the etcd members. One member continues to answer with incorrect data for a particular query, despite all members claiming to be at the same raft index.
@liggitt That is a different issue. This one is a user error. I really need to close it since this is confusing.
I see, will add details to #8305
Bug reporting
hi, I have encountered a problem in my testing cluster. We deployed a kubernetes cluster, and when we run the "get pods" command on two different nodes, it displays different results.
node1:
node2:
As we can see, the first result has many pods in an error state, but in the second result they are all running, and we checked through the docker command that the pods are actually running.
Then I checked the connection between kube-apiserver and etcd and found that, on node2, one of the etcd servers (205.18.1.181) is not connected to kube-apiserver, while on node1 (which displays the wrong result), all three etcd nodes are connected.
we checked the endpoint status of etcd, and the result is as below:
As we can see, on 205.18.1.181 the data has a size of 807 MB, while the other two nodes have 18MB of data. I then called "snapshot status" to check the db file:
on 205.18.1.181:
on 205.18.1.109:
we have checked one key on both nodes with
get /xxxx --consistency="s"
and we found that the data on 205.18.1.181 is wrong and outdated. I think it is some bug in boltdb or something else that produces this problem, but I can't figure out where exactly the bug is. So how can I dig into this and find the root cause of this problem?
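For completeness, a Go equivalent of that serialized read is sketched below (the key path is kept as the placeholder /xxxx and the endpoint targets the suspect member); a serialized Get is answered from the contacted member's local store without a quorum round, which is why it exposes the stale data on 205.18.1.181:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Target only the suspect member so the serialized read cannot be
	// served by a healthy one.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://205.18.1.181:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// WithSerializable is the client-side analogue of --consistency="s":
	// the read skips the quorum check and reflects this member's backend.
	resp, err := cli.Get(ctx, "/xxxx", clientv3.WithSerializable())
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s -> %s (mod revision %d)\n", kv.Key, kv.Value, kv.ModRevision)
	}
}
```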
by the way, the etcd version is: