Add nuance around stretched clusters #77360

Merged

Conversation

DaveCTurner (Contributor)

Today the multi-zone-cluster design docs say to keep all the nodes in a
single datacenter. This doesn't really reflect what we do in practice:
each zone in AWS/GCP/Azure/etc is a separate datacenter with decent
connectivity to the other zones in the same region. This commit adjusts
the docs to allow for this.
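
For context: the "zones" here map onto Elasticsearch's shard allocation awareness. A minimal sketch of the kind of configuration the adjusted docs have in mind, using a placeholder zone name rather than anything taken from this PR:

# elasticsearch.yml on each node: tag the node with the zone it runs in
# ("us-east-1a" is a placeholder for your own zone names)
node.attr.zone: us-east-1a

# Spread copies of each shard across the "zone" attribute so that a single
# zone outage cannot take out every copy of a shard
cluster.routing.allocation.awareness.attributes: zone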

@DaveCTurner DaveCTurner added >docs General docs changes :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. v8.0.0 v7.16.0 v7.15.1 v7.14.2 labels Sep 7, 2021
@elasticmachine elasticmachine added Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. Team:Docs Meta label for docs team labels Sep 7, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-docs (Team:Docs)

@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@mjmbischoff (Contributor) left a comment

I think it's good; I added some comments, mainly to add some field context. So feel free to ignore the literal suggestions and just adapt accordingly if a comment resonates.

Pinging @deepybee as he's a better wordsmith than I am :-)

It is not unusual for nodes to share some common infrastructure, such as a power
supply or network router. If so, you should plan for the failure of this
It is not unusual for nodes to share some common infrastructure, such as network
interconnects or a power supply. If so, you should plan for the failure of this
@mjmbischoff (Contributor) commented Sep 7, 2021

Suggested change
interconnects or a power supply. If so, you should plan for the failure of this
interconnects, power supply or, in the case of virtualization, physical hosts. If so, you should plan for the failure of this

DaveCTurner (Contributor, Author)

I'm not sure about this. I mean it's correct but it does make the sentence much more complicated. Is it worth the extra words? Do we need to clarify that nodes on the same physical host share infrastructure like power and network? Seems kinda obvious to me but this is a genuine question, I'm not the one on the front line for this kind of thing.

partition heals. If you want your data to be available in multiple data centres,
deploy a separate cluster in each data centre and use
<<modules-cross-cluster-search,{ccs}>> or <<xpack-ccr,{ccr}>> to link the
{es} expects its node-to-node connections to be reliable and have low latency
Contributor

Suggested change
{es} expects its node-to-node connections to be reliable and have low latency
{es} expects its node-to-node connections to be reliable, have low latency
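
As a reference point for the quoted guidance about deploying a separate cluster in each data centre and linking them with {ccs} or {ccr}: remote clusters can be registered dynamically or in elasticsearch.yml. A minimal sketch, where the alias "dc2" and the host name are placeholders:

# elasticsearch.yml: register the cluster in the other data centre as a
# remote cluster named "dc2"; the seed is a transport address (port 9300)
cluster:
  remote:
    dc2:
      seeds:
        - dc2-node-1.example.com:9300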

deploy a separate cluster in each data centre and use
<<modules-cross-cluster-search,{ccs}>> or <<xpack-ccr,{ccr}>> to link the
{es} expects its node-to-node connections to be reliable and have low latency
and good bandwidth. Many of the tasks that {es} performs require multiple
Contributor

I fully understand why you say 'good bandwidth'; at the same time, customers have varying notions of 'good' here. For some, a dedicated, non-shared 1Gbit link is deemed good; others have 10, 25, 40 or 100Gbit with dual NICs in a LAG, and depending on their use case either could be right. The trouble starts when their notion of 'good' drifts apart from what they actually need.

I guess we can get away with 'enough', as in enough bandwidth.

DaveCTurner (Contributor, Author)

"Enough bandwidth" feels awkward to me, how about "adequate bandwidth"? See b0fae80.

into a noticeable performance penalty. {es} will automatically recover from a
network partition as quickly as it can but your cluster may be partly
unavailable during a partition and will need to spend time and resources to
resynchronize any missing data and rebalance itself once a partition heals.
Contributor

Re: bandwidth above, recovery/reallocation is typically the thing that consumes the bandwidth, and a lack of bandwidth might go unnoticed until the customer makes cluster changes, upgrades, or has a node failure. Perhaps mentioning something with respect to time to recovery makes sense.

DaveCTurner (Contributor, Author)

Good point, thanks. Added a sentence at the end of this paragraph about recovery time in b0fae80.
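
As a point of reference on the recovery traffic mentioned here: the bandwidth a recovery can use is capped by the cluster's recovery settings, so the available inter-zone bandwidth and these limits interact when estimating time to recovery. A minimal sketch showing the default values, purely for illustration:

# elasticsearch.yml (also settable dynamically): cap recovery traffic per node;
# raising it only helps if the network between zones can actually sustain it
indices.recovery.max_bytes_per_sec: 40mb

# limit how many concurrent shard recoveries a single node takes part in
cluster.routing.allocation.node_concurrent_recoveries: 2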

If you have divided your cluster into zones then typically the network
connections within each zone are of higher quality than the connections between
the zones. You must make sure that the network connections between zones are of
sufficiently high quality. You will see the best results by locating all your
Contributor

Suggested change
sufficiently high quality. You will see the best results by locating all your
sufficiently high quality. You will see the highest performance by locating all your

DaveCTurner (Contributor, Author)

I'd rather be slightly more vague here: it's not just about performance; reliability is also a big deal.
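
Related to locating shard copies across zones: if the aim is to guarantee that all copies of a shard never end up in a single zone, allocation awareness can be made "forced". A minimal sketch, with "zone-a" and "zone-b" as placeholder zone names:

# elasticsearch.yml: with forced awareness, if one of the listed zones is
# lost, Elasticsearch leaves the affected replicas unassigned rather than
# piling every copy into the surviving zone
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: zone-a,zone-b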

minimum network performance required to run a healthy {es} cluster. In theory a
cluster will work correctly even if the round-trip latency between nodes is
several hundred milliseconds. In practice if your network is that slow then the
cluster performance will be very poor. In addition, slow networks are often
Contributor

Suggested change
cluster performance will be very poor. In addition, slow networks are often
cluster performance will likely be at unacceptable levels. In addition, slow networks are often

DaveCTurner (Contributor, Author)

I started with something like that but then I figured we'd have to dive into what "unacceptable" means and how you'd determine what is or isn't acceptable. I saw someone running a very stretched cluster over satellite links once. Its performance was terrible in an absolute sense, and yet it was still acceptable to them. There's certainly a place for that sort of discussion but it's not here.

@jrodewig (Contributor) left a comment

Thanks for this update @DaveCTurner. Great writing as usual. I left some non-blocking feedback that you can ignore if wanted.

docs/reference/high-availability/cluster-design.asciidoc (outdated, resolved)
docs/reference/high-availability/cluster-design.asciidoc (outdated, resolved)
docs/reference/high-availability/cluster-design.asciidoc (outdated, resolved)
docs/reference/high-availability/cluster-design.asciidoc (outdated, resolved)
Comment on lines 246 to 248
cluster may be partly unavailable during a partition and will need to spend
time and resources to resynchronize any missing data and rebalance itself once
the partition heals. Recovering from a failure may involve copying a large
Contributor

It seems like we're talking around reallocation and shard recovery here. Is there a reason we don't just directly mention and xref those two concepts?

DaveCTurner (Contributor, Author)

I didn't think we had any docs on those topics, at least not concept-level ones that would be suitable for linking from here. If you have some in mind then sure we can add links.

Contributor

I think you're right. We have some settings reference docs.

However, I don't think those are great links to use here. I've opened #77515 to track this gap and add those docs.

This looks fine to me in the meantime.

docs/reference/high-availability/cluster-design.asciidoc (outdated, resolved)
docs/reference/high-availability/cluster-design.asciidoc (outdated, resolved)
docs/reference/high-availability/cluster-design.asciidoc (outdated, resolved)
@DaveCTurner DaveCTurner merged commit 1fb1ad2 into elastic:master Sep 9, 2021
@DaveCTurner DaveCTurner deleted the 2021-09-07-stretched-cluster-docs branch September 9, 2021 19:11
@DaveCTurner (Contributor, Author)

Merging this in a spirit of progress over perfection; we can continue to iterate (adding links and following up on thoughts from Michael and Dan and so on) as needed.

DaveCTurner added a commit that referenced this pull request Sep 9, 2021
Today the multi-zone-cluster design docs say to keep all the nodes in a
single datacenter. This doesn't really reflect what we do in practice:
each zone in AWS/GCP/Azure/etc is a separate datacenter with decent
connectivity to the other zones in the same region. This commit adjusts
the docs to allow for this.

Co-authored-by: James Rodewig <[email protected]>
DaveCTurner added a commit that referenced this pull request Sep 9, 2021
Today the multi-zone-cluster design docs say to keep all the nodes in a
single datacenter. This doesn't really reflect what we do in practice:
each zone in AWS/GCP/Azure/etc is a separate datacenter with decent
connectivity to the other zones in the same region. This commit adjusts
the docs to allow for this.

Co-authored-by: James Rodewig <[email protected]>
DaveCTurner added a commit that referenced this pull request Sep 9, 2021
Today the multi-zone-cluster design docs say to keep all the nodes in a
single datacenter. This doesn't really reflect what we do in practice:
each zone in AWS/GCP/Azure/etc is a separate datacenter with decent
connectivity to the other zones in the same region. This commit adjusts
the docs to allow for this.

Co-authored-by: James Rodewig <[email protected]>
jrodewig added a commit that referenced this pull request Sep 13, 2021
PR #77360 clarifies that a cluster's nodes don't need to be in the same data
center. This adds a similar clarification to the ES introduction docs.

Co-authored-by: David Turner <[email protected]>
jrodewig added a commit that referenced this pull request Sep 14, 2021
PR #77360 clarifies that a cluster's nodes don't need to be in the same data
center. This adds a similar clarification to the ES introduction docs.

Co-authored-by: David Turner <[email protected]>
jrodewig added a commit that referenced this pull request Sep 14, 2021
PR #77360 clarifies that a cluster's nodes don't need to be in the same data
center. This adds a similar clarification to the ES introduction docs.

Co-authored-by: David Turner <[email protected]>
jrodewig added a commit that referenced this pull request Sep 14, 2021
PR #77360 clarifies that a cluster's nodes don't need to be in the same data
center. This adds a similar clarification to the ES introduction docs.

Co-authored-by: David Turner <[email protected]>