Shard stuck in findSafeCommitPoint / Commit list must not empty #57091
Pinging @elastic/es-distributed (:Distributed/Engine)
Can you provide the full logs somewhere? The log entries you showed only cover the failed peer recovery (i.e. recovering a replica failed), where your cluster was yellow, not red. Can you also provide some details about the filesystem that you're using? It sounds like some files just got wiped on disk (perhaps by an external process?). Further, it would be interesting to get a directory listing from
We've identified the bug here, which looks to affect indices that were created in an ES version before 6.2. These indices can under certain conditions fail to be opened up in ES 7.7.0. We're working on a fix.
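A quick way to see which indices might be affected is to look at their index.version.created setting; here is a minimal curl sketch, assuming a node reachable on localhost:9200 (the host and the use of filter_path are not from this thread). Values below 6020099 correspond to indices created before 6.2.0.

```sh
# List index.version.created for every index (illustrative sketch, not from the thread).
curl -s 'http://localhost:9200/_all/_settings?filter_path=*.settings.index.version.created&pretty'
```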
If an upgraded node is restarted multiple times without flushing a new index commit, then we will wrongly exclude all commits from the starting commits. This bug is reproducible with these minimal steps: (1) create an empty index on 6.1.4 with translog retention disabled, (2) upgrade the cluster to 7.7.0, (3) restart the upgraded cluster. The problem is that the new translog policy can trim the translog without having a new index commit, while the existing commit still refers to the previous translog generation. Closes #57091
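A minimal curl sketch of those reproduction steps, assuming a 6.1.4 node on localhost:9200 and that translog retention is disabled by setting the retention size and age to -1; the index name and these exact setting values are placeholders, not taken from the thread:

```sh
# (1) On 6.1.4: create an empty index with translog retention disabled.
#     "repro-index" and the -1 values are illustrative assumptions.
curl -X PUT 'http://localhost:9200/repro-index' \
  -H 'Content-Type: application/json' \
  -d '{
        "settings": {
          "index.translog.retention.size": "-1",
          "index.translog.retention.age": "-1"
        }
      }'

# (2) Upgrade the cluster to 7.7.0.
# (3) Restart the upgraded cluster (possibly more than once) without flushing;
#     the shard then fails to open with "Commit list must not empty".
```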
Elasticsearch version (bin/elasticsearch --version): Version: 7.7.0, Build: default/deb/81a1e9eda8e6183f5237786246f6dced26a10eaf/2020-05-12T02:01:37.602180Z, JVM: 14

Plugins installed: []

JVM version (java -version): (Elasticsearch bundled) openjdk 14 2020-03-17

OS version (uname -a if on a Unix-like system): Debian 9.12, Linux elastic-n1 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
At some point a few indices just didn't want to recover anymore, resulting in data loss. I wouldn't want that to happen.
A total of 24 shards across 4 indices failed. All of them were system indices (.ml-state, .ml-notifications, .ml-anomalies-shared, .logstash) and most were really small (~1 MB; .ml-anomalies-shared was ~250 MB).
Steps to reproduce:
Not entirely sure how or when exactly it happened. It might have been caused by earlier corruption or cluster problems.
I had to delete the affected indices to get reporting working. I did take copies of the files before deletion if anybody wants to take a look at them.
The .logstash index (restored from snapshot) reports version.created: 6010199.
Timeline
19:10:01 elastic-n4 automated updates for bind9 cause Debian to toggle the interface down & up. This results in DHCP handing out a new address [which is really a facepalm]
19:10:34 elastic-n4 (master) lost communications with rest of the cluster
19:10:36 elastic-n1 elected as new master
19:14:47 - 19:16:38 elastic-n1 busy-loops (30/sec) logging monitoring exporter failures because the elastic-n2 queue is full
19:15:22 - 19:16:39 elastic-n3 twice reports losing the master and re-electing elastic-n1
19:25:45 elastic-n4 starts spamming elastic-n1 with join requests to wrong IP
20:14:41 restarted elastic-n4; communications finally recover
20:25:11 .ml-state recovery from n1 to n4 fails (Commit list must not empty); index yellow
21:36:56 restarted elastic-n1
21:54:39 .ml-state recovery fails on n1 (Commit list must not empty); index red
I did a /_flush/synced at some point but don't remember when exactly; likely before the elastic-n4 restart.
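For reference, a synced flush is a single POST to the (deprecated but still available in 7.7) endpoint; a sketch, assuming the default host and port:

```sh
# Issue a synced flush across all indices.
curl -X POST 'http://localhost:9200/_flush/synced?pretty'
```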
Provide logs (if relevant):
https://github.com/elastic/elasticsearch/blob/v7.7.0/server/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java#L182
https://github.com/elastic/elasticsearch/blob/v7.7.0/server/src/main/java/org/elasticsearch/index/store/Store.java#L1524