
Replicas cannot get initialised as they never catch up. #8911

Closed
bluelu opened this issue Dec 11, 2014 · 5 comments

Comments

@bluelu commented Dec 11, 2014

This has happened to us on 2 occasions now, but we still don't know exactly what triggers it. We didn't see this on 1.0.*.

We index about 350 million entries (plus updates) per day over 10 servers (10 shards) on SSD disks. That's about 75 GB per server, i.e. about 35 million entries per shard/server.

The cluster turned completely green except for two shards (in separate indexes) which were stuck in RECOVERING. We checked the replicas and they had over 10000 files in their index directories (mainly small ones). At that time there was basically no traffic in the cluster, so it could not have been a bandwidth issue.

We had to turn off our indexing to get it to move to the next phase: after a few minutes (not immediately), the replicas started recovering the transaction log. We checked the transaction log on the primary node; it was over 35 GB and streamed very slowly to the replica. We stopped there, turned off allocation, removed the directories and restarted the nodes. The cluster then quickly turned green and we turned indexing back on.
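
For reference, the recovery progress and the allocation toggle mentioned above can be driven through the REST API; a rough sketch (assuming a node reachable on localhost:9200 and the Python requests library, both purely illustrative):

```python
import requests

ES = "http://localhost:9200"  # assumed local node address

# Show which shards are recovering and how far along each stage is.
print(requests.get(ES + "/_cat/recovery?v").text)

# Temporarily disable shard allocation before removing the stale replica
# directories and restarting the nodes; re-enable with "all" afterwards.
requests.put(
    ES + "/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.enable": "none"}},
)
```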

Since we had restarted the cluster before, there could have been a lot of traffic from copying all the shards over again, as we had deleted the old checksum files because of the checksum bug. We will upgrade our indexes to the new ES version to fix this.

I don't know when our cluster started recovering those 2 nodes, but it could have started right away while there was more traffic. Still, there should not have been any bandwidth issues afterwards.

Still, shouldn't it recover after some time, without us having to stop indexing?

@mikemccand (Contributor) commented:
10000 files in one shard? That's too high ... do you have any tuning of the ES merge policy/scheduler settings?
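
Any non-default merge tuning would show up in the index settings; a quick check, assuming a local node and a placeholder index name (myindex is hypothetical):

```python
import requests

ES = "http://localhost:9200"   # assumed local node address
INDEX = "myindex"              # placeholder index name

# Non-default tuning would appear under index.merge.policy.* or
# index.merge.scheduler.* in the index settings.
print(requests.get(ES + "/" + INDEX + "/_settings").json())
```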

@bluelu (Author) commented Dec 12, 2014

We don't have any specific settings there; it's all the default configuration. Normal indexes have around 20 committed segments per shard.

This only happened for some shards where a replica was being initialised. The primary has normal segment sizes; only the replica creates many smaller segments, as it never catches up and never reaches the step where the transaction log is replayed (maybe because of its size?).

It's certainly also a corner case somewhere and might only happen under heavier load?
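
To make the segment-count observation concrete, the per-shard segment lists of primary and replica can be compared via the indices segments API; a rough sketch, again assuming a local node and a placeholder index name:

```python
import requests

ES = "http://localhost:9200"   # assumed local node address
INDEX = "myindex"              # placeholder index name

# The segments API lists every segment per shard copy, so a replica that
# keeps accumulating tiny segments stands out next to its primary.
resp = requests.get(ES + "/" + INDEX + "/_segments").json()
for shard_id, copies in resp["indices"][INDEX]["shards"].items():
    for copy in copies:
        role = "primary" if copy["routing"]["primary"] else "replica"
        print("shard", shard_id, role, "segment count:", len(copy["segments"]))
```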

@bluelu (Author) commented Feb 6, 2015

Could this be related to #9394?
If yes, please close.

@bleskes (Contributor) commented Feb 6, 2015

@bluelu it might. Did you see log messages with "now throttling indexing"?
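
A quick way to scan the node logs for that message (the log directory below is a hypothetical default; adjust to your setup):

```python
from pathlib import Path

LOG_DIR = Path("/var/log/elasticsearch")  # hypothetical log location

# "now throttling indexing" is logged when ES throttles indexing on a
# shard because merges have fallen behind.
for log_file in LOG_DIR.glob("*.log"):
    for line in log_file.read_text(errors="ignore").splitlines():
        if "now throttling indexing" in line:
            print(log_file.name, ":", line)
```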

@bluelu (Author) commented Feb 6, 2015

We don't have the log files anymore, so I can't tell; the problem hasn't appeared again either. I will close the issue since I think it's related to that.

@bluelu closed this as completed Feb 6, 2015