Replicas cannot get initialised as they never catch up. #8911
Comments
10000 files in one shard? That's too high... Have you tuned any ES merge policy/scheduler settings?
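(For reference, a minimal sketch of how one might check whether any merge-related settings were set explicitly, over the REST API; the host and the index name "myindex" are placeholders, and an empty result means the defaults are in effect.)

```python
# Sketch: list any explicitly set merge policy/scheduler settings for an index.
# Host, port and index name ("myindex") are placeholders for this example.
import requests

resp = requests.get("http://localhost:9200/myindex/_settings",
                    params={"flat_settings": "true"})
resp.raise_for_status()

for index_name, body in resp.json().items():
    flat = body.get("settings", {})
    merge_keys = {k: v for k, v in flat.items() if ".merge." in k}
    # Nothing printed here means the index runs with the default merge settings.
    print(index_name, merge_keys)
```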
We don't have any specific settings there; it's all the default configuration. Normal indexes have around 20 committed segments per shard. This only happened for some shards where a replica was being initialised. The primary has normal segment sizes; only the replica creates many smaller segments, as it never catches up and never reaches the step where the transaction log is replayed (maybe because of its size?). It's certainly a corner case somewhere and might only happen under heavier load?
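(A quick way to compare segment counts between the primary and the replica is the index segments API; a rough sketch, again with a placeholder host and index name:)

```python
# Sketch: per-shard segment counts for one index, split by primary/replica.
# Uses the _segments API; host and index name are placeholders.
import requests

resp = requests.get("http://localhost:9200/myindex/_segments")
resp.raise_for_status()

for index_name, index_data in resp.json().get("indices", {}).items():
    for shard_id, copies in index_data.get("shards", {}).items():
        for copy in copies:
            role = "primary" if copy["routing"]["primary"] else "replica"
            segments = copy.get("segments", {})
            committed = sum(1 for s in segments.values() if s.get("committed"))
            print(f"{index_name} shard {shard_id} [{role}]: "
                  f"{len(segments)} segments ({committed} committed)")
```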
Could this be related to #9394?
@bluelu it might. Did you see log messages with …?
We don't have the log files anymore, so I can't tell; the problem never appeared again. I will close the issue since I think it's related to that.
This has happened on two occasions now, but we still don't know exactly what triggers it. We didn't see this in 1.0.*.
We index about 350 million entries (plus updates) per day across 10 servers (with 10 shards) on SSD disks. That's about 75 GB per server, i.e. roughly 35 million entries per shard/server.
The cluster turned completely green except for two shards (in separate indexes) which were stuck in RECOVERING. We checked the replicas and they had over 10000 files in their index directories (mainly small ones). At that time there was basically no traffic in the cluster, so it could not have been a bandwidth issue.
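(The recovery API can show which stage a stuck replica is in and how far the file copy and translog replay have progressed; a sketch, assuming that API is available in the version in use and using a placeholder host:)

```python
# Sketch: show stage and progress for every shard that is actively recovering.
# Assumes the _recovery API is available on the cluster being inspected.
import json
import requests

resp = requests.get("http://localhost:9200/_recovery",
                    params={"active_only": "true"})
resp.raise_for_status()

for index_name, info in resp.json().items():
    for shard in info.get("shards", []):
        print(f"{index_name} shard {shard['id']} "
              f"type={shard.get('type')} stage={shard.get('stage')}")
        # The exact layout of the progress sub-objects differs between
        # versions, so just dump the file-copy and translog sections as-is.
        print(json.dumps({"index": shard.get("index"),
                          "translog": shard.get("translog")}, indent=2))
```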
We had to turn off our indexing to get it to move to the next phase: after a few minutes (not immediately), the replicas started recovering the transaction log. We checked the transaction log on the primary node; it was over 35 GB and streamed very slowly to the replica. We stopped there, turned off allocation, removed the directories and restarted the nodes. The cluster then quickly turned green and we turned indexing back on.
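(The "turned off allocation" step corresponds to a transient cluster settings update; a sketch of that step, using the cluster.routing.allocation.enable setting and a placeholder host:)

```python
# Sketch: disable shard allocation before stopping nodes and removing the
# stale replica directories, then re-enable it afterwards.
# Host/port are placeholders; the setting is cluster.routing.allocation.enable.
import requests

requests.put("http://localhost:9200/_cluster/settings",
             json={"transient": {"cluster.routing.allocation.enable": "none"}})

# ... stop the affected nodes, delete the stale replica shard directories,
#     restart the nodes ...

requests.put("http://localhost:9200/_cluster/settings",
             json={"transient": {"cluster.routing.allocation.enable": "all"}})
```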
Since we had restarted the cluster before, there could have been a lot of traffic from copying all the shards over again, as we had deleted the old checksum files because of the checksum bug. We will upgrade our indexes to the new ES version to fix this.
I don't know when our cluster started recovering those 2 nodes, but it could have started right away while there was more traffic. Still, afterwards there should have been no bandwidth issues.
Still, shouldn't it recover after some time, without us having to stop indexing?