one of the three nodes of etcd cluster has outdated data #8214
Comments
after rebooting the etcd on 205.18.1.181, the …
Sorry, I can't give out all of the logs because of security policy at my company. I checked the logs and found that they are almost consistent between 205.18.1.181 and the other nodes. Logs on 205.18.1.181:
Logs on 205.18.1.109:
@abel-von do you take snapshots regularly?
@xiang90 thanks for the reply, we didn't take any snapshots. It was running normally for days, then we restarted 205.18.1.181 and 205.18.1.109, which are two of the three etcd nodes. After restarting, the problem came up. Will the …
@abel-von can this be reproduced with 3.2.2 or 3.1.9? I can't seem to find a commit that matches ec53528...
@abel-von kindly ping.
sorry for the confusion caused by the git SHA, we made etcd a submodule of an outside project, so the git commit is not etcd's but the outside project's.
@abel-von is there any way to easily reproduce this on 3.2.3 or 3.1.10?
@heyitsanthony we changed our etcd version to 3.1.9, and now the issue is reproduced more frequently. We used etcd-dump-db to dump the db file and found that the larger db retains many revisions of a key, while in the smaller db file there is only one revision of each key. On one of the smaller-data-size nodes the compaction is continuous:
On the node with the larger data size, the compaction suddenly stopped at 2017-07-17 01:38:05:
We didn't find any error log on this node during the time it stopped doing compaction, only some warnings showing that the io delay is high.
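For illustration, here is a small sketch of how one could probe a single member for old revisions of a key, using the Go clientv3 API of that era; the key, revision, and endpoint are placeholders, not values from this cluster. A member that has applied the compactions rejects a read at an old revision with a "required revision has been compacted" error, while the member whose compaction stalled keeps answering:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Point the client at a single member so the serialized read below is
	// answered from that member's local backend only (endpoint is a placeholder).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://205.18.1.181:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Hypothetical key and old revision; a member that has compacted past
	// this revision returns an "mvcc: required revision has been compacted"
	// error here instead of data.
	resp, err := cli.Get(ctx, "/some/key", clientv3.WithRev(1000), clientv3.WithSerializable())
	if err != nil {
		log.Fatalf("old revision not readable: %v", err)
	}
	fmt.Printf("old revision still present, %d kvs\n", len(resp.Kvs))
}
```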
we checked the snapshot logs of the nodes and found a small difference between the larger-data node and the smaller-data nodes. On a smaller-data node:
On the larger-data node:
I think the applied log indexes can still be considered to be advancing at the same pace, as far as I know the …
we used the command …
The second node is the one with the larger db file; its revision is much smaller than the other two. Does this mean that a lot of log entries were not applied to the kvstore? The output of endpoint status is:
As we can see, the second node has 2.1GB of data, which is larger than the default max db size, and now the cluster cannot serve any update requests.
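As a rough sketch of the same check done programmatically (assuming the Go clientv3 maintenance API and its import path of that era; only the two member addresses named in this thread are listed, so the third is left as a note), the per-member revision, db size, and raft index reported by endpoint status can be collected like this, which makes it easier to spot a member whose revision lags while its raft index keeps up:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Two of the members from the thread; the third member's address is not
	// given in the issue, so add it when running against a real cluster.
	endpoints := []string{
		"http://205.18.1.181:2379",
		"http://205.18.1.109:2379",
	}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Status is answered by each member individually, so a diverged member
	// reports its own revision and db size rather than the leader's.
	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		st, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			log.Printf("%s: %v", ep, err)
			continue
		}
		fmt.Printf("%s revision=%d dbSize=%d raftIndex=%d\n",
			ep, st.Header.Revision, st.DbSize, st.RaftIndex)
	}
}
```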
This seems to be an issue in the processing of applying a log entry to the kvstore.
That's a serialized read; it's not expected to be up to date since it doesn't go through quorum, but it should match the historical data. If …
The node is behind, but it shouldn't be skipping any entries.
The larger node needs to be defragmented; it's raising an alarm on the cluster because it's exceeding its space quota. After defragmenting the member, the alarm can be disabled with …
If the apply fails then etcd should panic, since the entry has been committed to raft; there's no turning back if there's an error in the apply path.
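A minimal sketch of that recovery path with the Go clientv3 maintenance API (the endpoint is a placeholder and `defragAndDisarm` is a made-up helper name for illustration); the CLI equivalents are `etcdctl defrag` and `etcdctl alarm disarm`:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// defragAndDisarm defragments the oversized member and then clears whatever
// alarms are currently raised (NOSPACE in the scenario above).
func defragAndDisarm(cli *clientv3.Client, ep string) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	// Defragment rewrites the backend of the targeted member only.
	if _, err := cli.Defragment(ctx, ep); err != nil {
		return err
	}

	// List the active alarms and disarm each of them.
	resp, err := cli.AlarmList(ctx)
	if err != nil {
		return err
	}
	for _, m := range resp.Alarms {
		if _, err := cli.AlarmDisarm(ctx, (*clientv3.AlarmMember)(m)); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ep := "http://205.18.1.181:2379" // placeholder: the member whose db exceeded the quota
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{ep}, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	if err := defragAndDisarm(cli, ep); err != nil {
		log.Fatal(err)
	}
}
```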
We are not sure if it actually falls behind. k8s uses txns with conditions a lot. If the state is indeed inconsistent on that node, the revision might also not increase due to txn failures.
@heyitsanthony, @xiang90, thanks for the reply. We also checked the … and k8s is automatically requesting db compaction; it's just that one node of etcd didn't do it. I have added a log in the function …
and we got a log like the one below:
as we can see, the compaction failed because the revision in the request from kube-apiserver is larger than the current revision on this node; that's why the compaction fails. I think we still have to check why the revision on one node is much smaller than on the others. As in my comment before, we can see that the applied index on the three nodes is advancing at almost the same pace, so I don't think the node with the larger file falls behind. I will add more logs to see why the revision on one node is far smaller than on the others.
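To make the failure mode concrete, a hedged Go sketch (clientv3, placeholder key and endpoint) of the kind of compaction request kube-apiserver issues: the revision to compact to is taken from a response header, and a member whose own revision is still below that number cannot apply the compaction, which matches the "required revision is a future revision" failure described above:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://205.18.1.109:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Any read returns the store revision in its header; kube-apiserver
	// derives its compaction target from revisions it has observed.
	get, err := cli.Get(ctx, "/placeholder-key")
	if err != nil {
		log.Fatal(err)
	}
	rev := get.Header.Revision

	// Request a compaction up to that revision. A member whose local
	// revision is lower than rev reports the "required revision is a
	// future revision" failure seen in the added log.
	if _, err := cli.Compact(ctx, rev); err != nil {
		log.Fatalf("compaction to %d failed: %v", rev, err)
	}
	fmt.Printf("compacted up to revision %d\n", rev)
}
```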
@abel-von OK, it's definitely inconsistent. Without the wal + db I'm not sure there's much we can do to find the root cause. Removing the faulty member via … Was the member only rebooted with 3.1.9, or was it a freshly initialized member? There was a period during the 3.2 development cycle where there could be inconsistencies, but it was caught before 3.2.0. Also, there's no commit matching 9b9baa968f317fc2d5870e6aa97185df126bb640.
@heyitsanthony, yes, we are now using 3.1.9. The commit 9b9baa968f317fc2d5870e6aa97185df126bb640 is from the first cluster where we got this problem, but after that our version changed to 3.1.9. Could you please send me the commit ids that solve the inconsistency problem?
@xiang90 I have checked the code of kube-apiserver, and as you said, all updates in k8s are CAS operations using etcdv3's Txn: it compares the revision of a key and updates the key if the compare returns true. I am wondering, if one txn fails on a node, will all txns after that fail on that node because the revision is not consistent? And as more txns fail, the revision becomes more inconsistent. That makes the system very unstable, doesn't it? Will a node that falls behind a little fail a txn? If it fails once, does it fail forever?
@heyitsanthony, thanks, but I don't want to just repair the failed cluster. Actually, this problem is showing up in a lot of testing clusters now. I have to find the root cause and fix the problem.
@abel-von the member inconsistency can be reproduced with a fresh cluster running 3.1.9?
yes, it can be reproduced. I think if one of the nodes handles log entries at a slightly different pace than the leader, and you keep updating some particular keys with CAS comparing the ModRevision on a short period (like the 10 seconds k8s uses to update its node status), then the issue can be reproduced.
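A hedged sketch of that reproduction idea in Go (the key, value, interval, and endpoint are placeholders): repeatedly CAS-update a key by comparing its ModRevision, which is the same pattern kube-apiserver uses for node status:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// casUpdate mimics the apiserver-style compare-and-swap: update the key only
// if its ModRevision has not changed since we read it.
func casUpdate(cli *clientv3.Client, key, val string) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	get, err := cli.Get(ctx, key)
	if err != nil {
		return false, err
	}
	var rev int64
	if len(get.Kvs) > 0 {
		rev = get.Kvs[0].ModRevision
	}

	// The txn succeeds only if no other writer (and no inconsistency) changed
	// the key's ModRevision in between; a ModRevision of 0 matches a key that
	// does not exist yet.
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision(key), "=", rev)).
		Then(clientv3.OpPut(key, val)).
		Commit()
	if err != nil {
		return false, err
	}
	return resp.Succeeded, nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://205.18.1.181:2379"}, // placeholder
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Update a placeholder key every 10 seconds, like the k8s node status.
	for {
		ok, err := casUpdate(cli, "/nodes/node1/status", time.Now().String())
		log.Printf("cas ok=%v err=%v", ok, err)
		time.Sleep(10 * time.Second)
	}
}
```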
@abel-von …
@gyuho please just forget the commit id 9b9baa968f317fc2d5870e6aa97185df126bb640; that's from the first time we got this issue. After that we changed the etcd version to 3.1.9, which is the official release, and the problem is still there with 3.1.9.
@gyuho @heyitsanthony @xiang90 I'm confused by this …
@abel-von that's the bug fixed by d173b09; I could reproduce it using ad22aaa cherry-picked to the 3.1 branch. I'm not sure how it would cause a backend inconsistency (in this case there's a race causing inconsistency for reads) since all writes are serialized in the raft path. Possibly a race with compaction processing.
@xiang90 @heyitsanthony @gyuho We have found the root cause of this issue. The etcd cluster is serving not only the kube-apiserver but also other components, and those other components are using the v2 API. The v2 and v3 data in etcd are separated from each other. There is a … As this migration is done obscurely, and after the first migration the kube-apiserver updates data more frequently than the others do, this issue is hard to reproduce in the same cluster. I am wondering if there is any mechanism to prevent this stupid operation from being successfully done?
@abel-von what's the process for the migration? the member is taken offline, a migration script runs, the server is brought back online, and it rejoins the cluster?
@heyitsanthony, yes, since the backend data mismatch is not detected by raft, it can rejoin the cluster successfully.
Still, it is hard for us, etcd maintainers, to tell how the corruption happened. It is probably worth the effort to think about how to detect user faults in a better way.
@xiang90 sure, but detecting corruption within a minute or so of bringing the member online already goes a long way for debugging this sort of misconfiguration. The next step would be checking the hash before joining as a raft peer, which would have the advantage of avoiding booting into an inconsistent state and serving garbage.
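For the record, later client releases expose a per-member KV hash that makes this kind of check scriptable by hand. A rough sketch follows; note that `HashKV` comes from clientv3 3.3+, so it is an assumption relative to the 3.1.9 cluster in this thread, and the endpoints are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Placeholder member list (only the two addresses named in the thread).
	endpoints := []string{
		"http://205.18.1.181:2379",
		"http://205.18.1.109:2379",
	}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// HashKV hashes each member's key-value store. A revision of 0 hashes
	// each member's keyspace at its own latest revision; for a stricter
	// check, pass the same explicit revision to every member.
	var first uint32
	for i, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		resp, err := cli.HashKV(ctx, ep, 0)
		cancel()
		if err != nil {
			log.Fatalf("%s: %v", ep, err)
		}
		fmt.Printf("%s hash=%d\n", ep, resp.Hash)
		if i == 0 {
			first = resp.Hash
		} else if resp.Hash != first {
			fmt.Println("hash mismatch: possible backend inconsistency")
		}
	}
}
```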
Can we create a new issue to capture the data-consistency detection work? The title of this issue is terrifying. Cannot sleep for a while due to this "bug" :P.
we're seeing a similar issue occur at some point after an etcd v2->v3 migration, one that does not correct itself after a restart of the etcd members. One member continues to answer with incorrect data for a particular query, despite all members claiming to be at the same raft index.
@liggitt That is a different issue. This one is a user error. I really need to close it since this is confusing.
I see, will add details to #8305
Bug reporting
hi, I have encountered a problem in my testing cluster. We deployed a kubernetes cluster, and when we run the "get pods" command on two different nodes, it displays different results.
node1:
node2:
As we can see, the first result has many pods in an error state, but in the second result they are all running, and we checked through the docker command that the pods are actually running.
Then I checked the connection between kube-apiserver and etcd and found that, on node2, one of the etcd servers (205.18.1.181) is not connected to kube-apiserver, while on node1 (which displays the wrong result), all three etcd nodes are connected.
we checked the endpoint status of etcd, and the result is as below:
As we can see, on 205.18.1.181 the data has a size of 807 MB, while the other two nodes have 18MB of data. I then called "snapshot status" to check the db file:
on 205.18.1.181:
on 205.18.1.109:
we have checked one key on both nodes with
get /xxxx --consistency="s"
and we found that the data on 205.18.1.181 is wrong and outdated. I think it is some bug in boltdb or something else that produces this problem, but I can't figure out where exactly the bug is. So how can I dig into this and find the root cause of this problem?
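For completeness, a Go equivalent of that serialized read is sketched below (the key path is kept as the placeholder /xxxx and the endpoint targets the suspect member); a serialized Get is answered from the contacted member's local store without a quorum round, which is why it exposes the stale data on 205.18.1.181:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Target only the suspect member so the serialized read cannot be
	// served by a healthy one.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://205.18.1.181:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// WithSerializable is the client-side analogue of --consistency="s":
	// the read skips the quorum check and reflects this member's backend.
	resp, err := cli.Get(ctx, "/xxxx", clientv3.WithSerializable())
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s -> %s (mod revision %d)\n", kv.Key, kv.Value, kv.ModRevision)
	}
}
```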
by the way, the etcd version is: