Only send diff of cluster state instead of full cluster state #6295
@bluelu which version are you using? I haven't seen a compressed cluster state that is 500mb; we compress the cluster state in our infra when we publish it. Changing to send deltas would require quite a big change. It's an option, of course, but it's a really big change, since we rely on the full cluster state being continuously published. I have helped several users with large cluster states and improved things internally. The size was not the issue; it was things like inefficient processing of large cluster states.
We are currently using elasticsearch-1.0.2. We have more than 500 nodes in the cluster, so each update (e.g. when a shard gets rebalanced) causes the state to be sent to all nodes, which is close to 500 MB in total. This takes a few seconds to complete and uses up all the bandwidth. I can also send you the complete cluster state in private if that helps.
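The reported numbers can be sanity-checked with a back-of-envelope calculation. The per-node state size and link speed below are assumptions for illustration, not measurements from this cluster:

```python
nodes = 500
state_size_mb = 1.0           # assumed compressed cluster state per publish, per node
total_mb = nodes * state_size_mb
link_gbps = 1.0               # assumed 1 gigabit uplink on the master's switch

# Time for the master to push one publish round to every node.
seconds = (total_mb * 8) / (link_gbps * 1000)
print(f"{total_mb:.0f} MB per publish, ~{seconds:.0f} s to drain a {link_gbps:g} Gb/s link")
```

Even with modest assumptions, a single publish round ties up the master's link for several seconds, consistent with the spikes described above.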
Do you have any update on whether this will be integrated or not? We also plan on using query warmers in the future, and then those updates will also need to be dispatched to all the nodes. Creating a text diff of the state (diff algorithm) and then applying it on the non-master nodes, with a fallback to the full cluster state if the previous and current versions differ, might not be the cleanest solution, but it certainly shouldn't be all too difficult, as there is already code where the master waits for the nodes to confirm the updated cluster state. This would dramatically reduce traffic and potentially also the heap usage on the master node in larger clusters.
@bluelu no work has happened on this yet. A diff is one of the options, though generating the diff is one of the tricky parts here (we could do it more easily once we work on the object model of the cluster state, btw). Indeed, we could send a diff, and if the receiving node can't apply it due to changes, it can then request the full cluster state to be sent (with careful checks not to create a storm of full updates that are not needed). It would be interesting to see how things work in more recent versions. Regarding your cluster state size, note that we do 2 things when publishing the cluster state: we serialize it using our internal serialization mechanism, so comparing it to the JSON representation is not a fair comparison, as it's considerably smaller, and then we compress it. Also, on recent versions there is better logic for applying cluster states on the receiving nodes. Also, in the upcoming 1.4, with the new zen discovery that is slated for it, there is much better support for many nodes. This will help as well. The diff is something that would be interesting to explore in future versions; I am mainly trying to assess the urgency of it here, as in, is it really a problem in your use case?
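A minimal sketch of the diff-publish-with-fallback idea discussed here, assuming a version-tagged diff and a way to request the full state from the master. All names and structures are illustrative, not Elasticsearch internals:

```python
class FollowerNode:
    """Toy model of a non-master node receiving cluster state publications."""

    def __init__(self):
        self.version = 0
        self.state = {}

    def receive(self, publication, fetch_full_state):
        if (publication["kind"] == "diff"
                and publication["from_version"] == self.version):
            # Fast path: we are on the version the diff was built against,
            # so apply the incremental changes in place.
            self.state.update(publication["changes"])
        else:
            # Version mismatch: fall back to requesting the complete
            # cluster state from the master.
            self.state = fetch_full_state()
        self.version = publication["to_version"]


def diff_publication(from_version, to_version, changes):
    return {"kind": "diff", "from_version": from_version,
            "to_version": to_version, "changes": changes}
```

A node that missed a publication falls off the fast path once, fetches the full state, and is back in sync for the next diff; the "storm of full updates" concern above is about many nodes hitting that slow path at the same time.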
We throttled the bandwidth on our master node so that it doesn't take all the bandwidth on the switch. It's not optimal, but it seems to work for now. Before, we had spikes of 1 gigabit over multiple seconds blocking all other connections on that switch (1 stream to each node). Now it just takes a little longer with less bandwidth. Apart from the slow restart (which we reduced by not starting all nodes at once; we will check the behaviour on 1.3), it's the last of the bigger issues we have seen so far on larger clusters. For us, it's a problem, but it "works" with our workaround. We would love to see this in >1.5 if possible. We will upgrade to the newest 1.3.* in 2 weeks and then to 1.4.* (when it's stable for sure) and report back then.
@bluelu thanks for the feedback! I have another question: when you saw the 1 gigabit saturation, was that when the cluster was forming (since there are a lot of cluster state updates happening then, which we reduced significantly in the upcoming 1.4)? If you issue a simple reroute after the cluster has formed, what do you see then?
@kimchy The cluster is forming without any issues and we didn't check the traffic then. It doesn't take much time at all until all nodes are added. But it then takes several more minutes until the cluster state jumps from 0 unassigned shards to, let's say, 5000 shards. After the first update, it gets faster. The more accurate the unassigned shards number is, the faster it seems to update. It's a little scary the first time, as one might expect that all data is lost ;-). The traffic spikes occur when nodes are being rebalanced, we update the mapping, or do some other operation which requires the state to be sent. We actually found the bandwidth issue because we had custom rebalancing code which shuffled indexes back and forth between 2 nodes, and we wondered why everything felt so slow. This should be equivalent to a reroute command? Here are more details about the slow startup (master and non-master log):
FYI, we upgraded to 1.4.1 today. We are still seeing a lot of load on the network interface when the master node sends the cluster state at startup (about one update every minute). I also observed that it takes some time for the cluster state to reach the nodes in that case (about 20 seconds). I don't know if this will cause an issue in the future? Hopefully not. Node: Master node:
Refactor how settings filters are handled. Instead of specifying settings filters as a filtering class, settings filters are now specified as a list of settings that need to be filtered out. Regex syntax is supported. This is a breaking change and will require a small change in plugins that are using settingsFilters. This change is needed in order to simplify the cluster state diff implementation. Contributes to elastic#6295
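The regex-based filtering the commit describes can be illustrated with a small sketch; the function name and exact matching behaviour are assumptions for illustration, not the actual Elasticsearch implementation:

```python
import re

def filter_settings(settings, filter_patterns):
    """Drop every setting whose key matches one of the filter patterns."""
    compiled = [re.compile(p) for p in filter_patterns]
    return {key: value for key, value in settings.items()
            if not any(rx.fullmatch(key) for rx in compiled)}

settings = {
    "cloud.key": "SECRET",
    "cloud.account": "SECRET",
    "index.number_of_shards": "5",
}
print(filter_settings(settings, [r"cloud\..*"]))
# {'index.number_of_shards': '5'}
```

Declaring the filters as a list of patterns, rather than as a filtering class, keeps them serializable data, which matters once pieces of the cluster state have to be diffed and shipped between nodes.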
First iteration of cluster state diffs that adds support for diffs to the most frequently changing elements - cluster state, meta data and routing table. Closes elastic#6295
Adds support for calculating and sending diffs instead of full cluster state of the most frequently changing elements - cluster state, meta data and routing table. Closes elastic#6295
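For map-like parts of the state (index metadata, routing entries keyed by index), a diff can be represented as deletions plus upserted entries. A toy version, with names and structure assumed purely for illustration:

```python
def diff_maps(old, new):
    """Compute a diff between two dict 'versions' of a map-like structure."""
    return {
        "deletes": [k for k in old if k not in new],
        "upserts": {k: v for k, v in new.items() if old.get(k) != v},
    }

def apply_diff(old, diff):
    """Reconstruct the new version from the old one plus the diff."""
    result = {k: v for k, v in old.items() if k not in diff["deletes"]}
    result.update(diff["upserts"])
    return result

old = {"index-a": "started", "index-b": "relocating"}
new = {"index-a": "started", "index-c": "initializing"}
assert apply_diff(old, diff_maps(old, new)) == new  # diff round-trips
```

Since only the changed entries travel, the payload scales with what changed rather than with the total number of indices, which is exactly the win for frequently changing elements like the routing table.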
We had a boatload of failures related to this, so I branched off.
I was wondering if there are plans to address the cluster state size in an upcoming 2.x release. We currently need to have thousands of small indexes to best support our data flow. Right now we have almost 3000 indices, and a GET request for the cluster state takes around 40 seconds, which effectively makes most admin tools (HQ, Head) non-functional.
@DrGonzo424 this change (which was merged into 2.0.0) only improves node-to-node communication, which works over the transport protocol. HQ and Head use the REST API and will not be able to take advantage of cluster state diffs.
Thanks for the quick reply; it is great to know that node-to-node communication will benefit from cluster state diffs, as we are currently on 2.2. Do you know if any of the admin tools do a better job of paging or caching the cluster state to improve usability? Right now it makes me a bit uncomfortable that our administrative tools are not scaling with our cluster. Perhaps there are some better admin tools out there.
@DrGonzo424 I think https://discuss.elastic.co/ would be a much better place to ask this question. My favorite admin tool is the _cat API; it works fast even on large cluster states and does everything I need, but it might be too minimalistic for your purposes.
If you have many nodes and many indices, even a small cluster state update will trigger 500 MB of data to be sent from the master to all nodes. A few small rebalancing operations will kill the cluster.
This is a big issue.
Would be good if only the diff could be sent and then merged on the receiving node. If a node isn't at the previous version, fall back to the current behaviour and send the full state.