
Hard limit on total number of shards in a cluster #20705

Closed

alexbrasetvik opened this issue Sep 30, 2016 · 23 comments
Labels: :Data Management/Indices APIs, >enhancement, good first issue

Comments

@alexbrasetvik
Contributor

We're seeing hundreds of cases where too many shards cause problems, compared to relatively few problems caused by having too few.

It would be great to have a default hard limit, even if it can be increased (through the cluster settings API). Hopefully that will raise awareness of this issue, in an "I can bump this now, but need to fix it" kind of way.

@javanna
Member

javanna commented Sep 30, 2016

@alexbrasetvik I think this was done yesterday on #20682 ;)

@jasontedor
Member

@javanna That's different: that's per index, but the request here is per cluster.

@alexbrasetvik
Contributor Author

Just synced up with @s1monw, who asked me to create this issue while we talked about the per-index limit. This one is indeed per cluster, as a total number of shards, whether it's a few indices with a lot of shards or many single-shard indices.

@javanna
Member

javanna commented Sep 30, 2016

Sounds good, thanks for clarifying.

@clintongormley
Contributor

max_shards_per_node

This setting would be checked on user actions like create index, restore snapshot, open index. If the total number of shards in the cluster is greater than max_shards_per_node * number_of_nodes, then the user action can be rejected. This implementation allows the max value to be exceeded if, e.g., a node fails, since that lowers the effective cluster-wide maximum.

We would default to a high number during 5.x (e.g. 1000), giving sysadmins the ability to set it to whatever makes sense for their cluster, and we can look at lowering this value for 6.0.
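For illustration, a minimal sketch of the arithmetic behind this proposal in Java; the class and method names are hypothetical and this is not the actual Elasticsearch implementation:

```java
// Hypothetical sketch of the proposed cluster-wide shard limit check.
// Names are illustrative; not taken from the Elasticsearch code base.
final class ClusterShardLimitCheck {

    /**
     * Called from user actions such as create index, restore snapshot, and open index.
     * Rejects the action if it would push the cluster past maxShardsPerNode * numberOfNodes.
     * Because the check only runs on these actions, the limit can still be exceeded
     * afterwards if a node leaves the cluster.
     */
    static void checkShardLimit(long shardsToAdd, long currentShards,
                                int numberOfNodes, int maxShardsPerNode) {
        long clusterLimit = (long) maxShardsPerNode * numberOfNodes;
        if (currentShards + shardsToAdd > clusterLimit) {
            throw new IllegalArgumentException("this action would add [" + shardsToAdd
                + "] shards, exceeding the cluster-wide limit of [" + clusterLimit + "]");
        }
    }
}
```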

@gmoskovicz
Contributor

We would default to a high number during 5.x (e.g. 1000), giving sysadmins the ability to set it to whatever makes sense for their cluster, and we can look at lowering this value for 6.0.

I would say that around 500-600 shards per node is a good limit.

@jasontedor
Member

@s1monw Raising this one to you.

@cdahlqvist

cdahlqvist commented Jun 23, 2017

Should the limit of shards per node not be linked to the amount of heap space a node has, e.g. a limit of 20 shards per GB of heap the node has allocated?
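As a rough sketch of what a heap-proportional limit could look like (the 20-shards-per-GB figure is simply the one suggested above, and the names are hypothetical):

```java
// Hypothetical sketch of a heap-proportional per-node shard limit,
// using the 20-shards-per-GB-of-heap figure suggested above.
final class HeapBasedShardLimit {

    static long shardLimitForNode(long heapBytes, int shardsPerGbOfHeap) {
        long heapGb = Math.max(1, heapBytes / (1024L * 1024L * 1024L));
        return heapGb * shardsPerGbOfHeap;
    }

    public static void main(String[] args) {
        // A node with a 30 GB heap would be allowed 600 shards under this scheme.
        System.out.println(shardLimitForNode(30L * 1024 * 1024 * 1024, 20));
    }
}
```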

@gmoskovicz
Contributor

@cdahlqvist I like that idea!

@ron-totango

Could someone explain the motivation for the shard limit per node? Is it related to the node type, i.e. the amount of memory it has, disk space, or anything else?
We have 40K shards (using per-day indices) and we're hitting issues with large cluster states that we don't know how to resolve...

@DaveCTurner
Contributor

@ron-totango that's a question that's better suited to the support forums over at https://discuss.elastic.co - 40k shards sounds like too many, and the forums should be able to help you reduce it to something more reasonable.

@ron-totango

Thanks @DaveCTurner. I already tried asking at https://discuss.elastic.co/t/configuring-a-cluster-for-a-large-number-of-indexes/115731 but didn't get any meaningful reply :-(

@majormoses
Contributor

When we are talking about the limit of shards per node (averaged across the cluster), are we counting only primary shards, or do replicas count as well?

clintongormley added the :Distributed Indexing/Distributed label and removed the :Cluster label on Feb 13, 2018
bleskes added the good first issue label on Mar 27, 2018
ywelsch added the :Distributed Coordination/Allocation label on Mar 27, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

ywelsch removed the :Distributed Indexing/Distributed label on Mar 27, 2018
@DaveCTurner
Contributor

This is currently labelled :Distributed/Allocation but I think it's not a great idea to solve this in the allocator by refusing to allocate more than a certain number of shards per node. It seems like a better idea to check this on actions that create the shards-to-be-allocated:

This setting would be checked on user actions like create index, restore snapshot, open index.

I think, given the above comment, that this'd be better labelled :Core/Index APIs, so I'm doing so.

DaveCTurner added the :Data Management/Indices APIs label on Apr 30, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

DaveCTurner removed the :Distributed Coordination/Allocation label on Apr 30, 2018
dakrone added the team-discuss label and removed the help wanted and adoptme labels on Jul 11, 2018
@dakrone
Member

dakrone commented Jul 25, 2018

We discussed this during the core/infra sync and agreed that a limit is good, and that doing it at the validation layer (rather than at the allocation decider level) is a good idea. We agreed on Clint's proposal of making the limit a factor of the number of nodes. Marking this as adoptme and removing the discussion label now.

dakrone added the help wanted and adoptme labels and removed the team-discuss label on Jul 25, 2018
@cdahlqvist

@dakrone What is the reason this will be based on the number of nodes rather than the available heap size? I would expect a 3-node 2GB Elastic Cloud cluster to need a much lower limit than a 3-node 64GB Elastic Cloud cluster.

@jasontedor
Member

@cdahlqvist That's a concern about what the default per node should be, not whether or not it should be based on the number of nodes. We will likely start simple with a blanket per node default and can consider over time making the default ergonomic to the heap size.

@cdekker

cdekker commented Jul 27, 2018

Will this include the number of replicas?

gwbrown self-assigned this on Aug 7, 2018
jasontedor removed the help wanted and adoptme labels on Aug 8, 2018
gwbrown added a commit to gwbrown/elasticsearch that referenced this issue Aug 14, 2018
Adds a safety limit on the number of shards in a cluster, based on
the number of nodes in the cluster. The limit is checked on operations
that add (or activate) shards, such as index creation, snapshot
restoration, and opening closed indices, and can be changed via the
cluster settings API.

Closes elastic#20705
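For reference, raising the limit through the cluster settings API (which the commit message mentions) could look roughly like this with the low-level Java REST client. The host, port, and the chosen value are assumptions, and cluster.max_shards_per_node is, to my understanding, the setting name introduced by this change:

```java
// Rough sketch: raising the per-node shard limit via the cluster settings API,
// using the Elasticsearch low-level Java REST client. Host, port, and the new
// value (2000) are assumptions; adjust them for your cluster.
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class RaiseShardLimit {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("PUT", "/_cluster/settings");
            request.setJsonEntity("{ \"persistent\": { \"cluster.max_shards_per_node\": 2000 } }");
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```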
@gwbrown
Contributor

gwbrown commented Oct 24, 2018

@cdekker The implementation merged in #34021 counts replicas towards the limit, as replicas consume resources in much the same way as primary shards.
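A quick worked example of how replicas count toward the limit (illustrative only; the class and method names are hypothetical):

```java
// Illustrative only: every shard copy, primary or replica, counts toward the limit.
final class ShardCountingExample {

    static int shardsTowardLimit(int primaries, int replicasPerPrimary) {
        return primaries * (1 + replicasPerPrimary);
    }

    public static void main(String[] args) {
        // An index with 5 primaries and 1 replica contributes 10 shards;
        // with a limit of 1000 shards per node, a 3-node cluster allows 3000 in total.
        System.out.println(shardsTowardLimit(5, 1)); // 10
    }
}
```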

@vigyasharma
Contributor

An overall high shard count in the cluster also loads up master node operations. Are there plans for an overall cluster-wide limit irrespective of the number of nodes, or a limit on the number of nodes in the cluster?

@Bukhtawar
Contributor

Bukhtawar commented Apr 6, 2019

Should the master log a warning or prevent index creation and mapping additions if the heap on the master is too low to support the cluster limit? IMO the limit should also factor in the heap on the data nodes, as pointed out by @cdahlqvist.
