Add nuance around stretched clusters #77360
Conversation
Today the multi-zone-cluster design docs say to keep all the nodes in a single datacenter. This doesn't really reflect what we do in practice: each zone in AWS/GCP/Azure/etc is a separate datacenter with decent connectivity to the other zones in the same region. This commit adjusts the docs to allow for this.
Pinging @elastic/es-docs (Team:Docs)
Pinging @elastic/es-distributed (Team:Distributed)
I think it's good, added some comments but mainly to add some field context. So feel free to ignore the literal suggestion and just adapt accordingly if the comment resonates.
Pinging @deepybee as he's a better wordsmith than I am :-)
-It is not unusual for nodes to share some common infrastructure, such as a power
-supply or network router. If so, you should plan for the failure of this
+It is not unusual for nodes to share some common infrastructure, such as network
+interconnects or a power supply. If so, you should plan for the failure of this
Suggested change:
-interconnects or a power supply. If so, you should plan for the failure of this
+interconnects, power supply or, in the case of virtualization, physical hosts. If so, you should plan for the failure of this
I'm not sure about this. I mean it's correct but it does make the sentence much more complicated. Is it worth the extra words? Do we need to clarify that nodes on the same physical host share infrastructure like power and network? Seems kinda obvious to me but this is a genuine question, I'm not the one on the front line for this kind of thing.
partition heals. If you want your data to be available in multiple data centres,
deploy a separate cluster in each data centre and use
<<modules-cross-cluster-search,{ccs}>> or <<xpack-ccr,{ccr}>> to link the
{es} expects its node-to-node connections to be reliable and have low latency
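(An aside for anyone skimming this thread: in practice, linking separate per-datacenter clusters with {ccs} comes down to registering the remote cluster's seed nodes and then searching with a cluster-qualified index name. A minimal sketch using the cluster settings API, where the alias `dc2`, the hostname, and the index name are all placeholders:)

```
# Register a remote cluster under the alias "dc2" (alias and host are illustrative)
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "dc2": {
          "seeds": ["node-1.dc2.example.com:9300"]
        }
      }
    }
  }
}

# Search an index that lives on the remote cluster
GET /dc2:my-index/_search
```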
Suggested change:
-{es} expects its node-to-node connections to be reliable and have low latency
+{es} expects its node-to-node connections to be reliable, have low latency
{es} expects its node-to-node connections to be reliable and have low latency
and good bandwidth. Many of the tasks that {es} performs require multiple
I fully understand why you say 'good bandwidth'; at the same time customers have varying notions of good here. For some a dedicated, non-shared 1Gbit link is deemed good, others have 10, 25, 40 or 100Gbit with dual NICs in a LAG, and depending on their use-case either could be right. The trouble is when their notion of 'good' drifts apart from what they actually need.
I guess we can get away with 'enough', as in enough bandwidth.
"Enough bandwidth" feels awkward to me, how about "adequate bandwidth"? See b0fae80.
into a noticeable performance penalty. {es} will automatically recover from a
network partition as quickly as it can but your cluster may be partly
unavailable during a partition and will need to spend time and resources to
resynchronize any missing data and rebalance itself once a partition heals.
Re: bandwidth above, recovery/reallocation is typically the thing that consumes the bandwidth, and a lack of bandwidth might go unnoticed until the customer makes cluster changes, upgrades, or has a node failure. Perhaps mentioning something with respect to time to recovery makes sense.
Good point, thanks. Added a sentence at the end of this paragraph about recovery time in b0fae80.
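(Related field note: the bandwidth that recovery may consume is throttled by the `indices.recovery.max_bytes_per_sec` setting, which defaults to 40mb per second, so on a stretched cluster the time to resynchronize is bounded by the lower of that cap and the inter-zone link. A sketch of adjusting it, with an illustrative value:)

```
# indices.recovery.max_bytes_per_sec is a dynamic setting; "100mb" is just an example
PUT _cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}
```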
If you have divided your cluster into zones then typically the network
connections within each zone are of higher quality than the connections between
the zones. You must make sure that the network connections between zones are of
sufficiently high quality. You will see the best results by locating all your
Suggested change:
-sufficiently high quality. You will see the best results by locating all your
+sufficiently high quality. You will see the highest performance by locating all your
I'd rather be slightly more vague here: it's not just about performance, reliability is also a big deal.
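(For readers outside the review: "dividing your cluster into zones" maps to shard allocation awareness. A minimal sketch, where the attribute name `zone` and the value `zone-a` are placeholders: tag each node in elasticsearch.yml, then tell the allocator to spread shard copies across the tag values.)

```
# In elasticsearch.yml on each node (static setting; name and value are illustrative):
#   node.attr.zone: zone-a

# Then, via the cluster settings API, spread shard copies across zones:
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone"
  }
}
```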
minimum network performance required to run a healthy {es} cluster. In theory a
cluster will work correctly even if the round-trip latency between nodes is
several hundred milliseconds. In practice if your network is that slow then the
cluster performance will be very poor. In addition, slow networks are often
Suggested change:
-cluster performance will be very poor. In addition, slow networks are often
+cluster performance will likely be at unacceptable levels. In addition, slow networks are often
I started with something like that but then I figured we'd have to dive into what "unacceptable" means and how you'd determine what is or isn't acceptable. I saw someone running a very stretched cluster over satellite links once. Its performance was terrible in an absolute sense, and yet it was still acceptable to them. There's certainly a place for that sort of discussion but it's not here.
Thanks for this update @DaveCTurner. Great writing as usual. I left some non-blocking feedback that you can ignore if wanted.
cluster may be partly unavailable during a partition and will need to spend
time and resources to resynchronize any missing data and rebalance itself once
the partition heals. Recovering from a failure may involve copying a large
It seems like we're talking around reallocation and shard recovery here. Is there a reason we don't just directly mention and xref those two concepts?
I didn't think we had any docs on those topics, at least not concept-level ones that would be suitable for linking from here. If you have some in mind then sure we can add links.
I think you're right. We have some setting reference:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html
- https://www.elastic.co/guide/en/elasticsearch/reference/master/recovery.html
However, I don't think those are great links to use here. I've opened #77515 to track this gap and add those docs.
This looks fine to me in the meantime.
Co-authored-by: James Rodewig <[email protected]>
Merging this in a spirit of progress over perfection; we can continue to iterate (adding links and following up on thoughts from Michael and Dan and so on) as needed.
PR #77360 clarifies that a cluster's nodes don't need to be in the same data center. This adds a similar clarification to the ES introduction docs. Co-authored-by: David Turner <[email protected]>