
stability: Need to improve performance of widely distributed clusters under chaos #17611

Closed · a-robinson opened this issue Aug 12, 2017 · 17 comments

Labels: A-kv-distribution (relating to rebalancing and leasing) · A-partitioning · C-performance (perf of queries or internals; solution not expected to change functional behavior) · X-stale

Comments

@a-robinson (Contributor) commented Aug 12, 2017

Indigo gets pretty decimated by even the friendliest of disruptions. I'm running it on eef2e04 with no custom settings other than trace.debug.enable=true, and this happens when I simply run supervisorctl restart cockroach on one of the nodes:

[screenshot: 2017-08-11, 8:50 PM]

[screenshot: 2017-08-11, 8:52 PM]

The process spent as much time as possible draining before hitting its deadline and hard-stopping itself, and was then restarted by supervisor in under a second (a sketch of this drain-with-deadline pattern is below). I assume that more antagonistic forms of chaos would be even worse for the cluster.
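
For context, what's described here is a drain-with-deadline pattern: drain gracefully for as long as allowed, then hard-stop so the supervisor can restart the process. A minimal Go sketch of the idea, with hypothetical names and timeouts (not CockroachDB's actual drain code):

package main

import (
	"context"
	"fmt"
	"time"
)

// drain stands in for a graceful shutdown (transferring leases,
// finishing in-flight requests, ...). Entirely hypothetical.
func drain(ctx context.Context) error {
	select {
	case <-time.After(30 * time.Second): // pretend draining takes this long
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Give the drain a hard deadline; once it fires, stop regardless
	// and let the supervisor restart the process.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := drain(ctx); err != nil {
		fmt.Println("drain deadline hit, hard-stopping:", err)
	}
}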

We should devote some time to this for 1.2, but it's unlikely to be a small project. cc @nstewart

@a-robinson added this to the 1.2 milestone Aug 12, 2017
@a-robinson changed the title from "stability: Need to do improve performance of widely distributed clusters under chaos" to "stability: Need to improve performance of widely distributed clusters under chaos" Aug 12, 2017
@tbg (Member) commented Oct 5, 2017

Chatted with Masha and she'll look into this because there's little hope of even reproducing #14768 cleanly as long as this is happening.

@tbg (Member) commented Oct 13, 2017

@a-robinson assigning you per your comment on #19236. @m-schneider is focusing on the case in which the node is gone for more than 10 minutes (#19165), though you may want to touch base to see if there are any common problems.

@a-robinson (Contributor, Author)

We've very much been touching base on her investigations. I don't plan to look at distributed clusters in the near term, so I'm going to unassign myself for now. If #19236 happens to be the same thing, then great, otherwise I don't want to make any promises.

@petermattis (Collaborator)

@awoods187 just mentioned that the stability of widely distributed clusters is going to affect partitioning. More specifically, stabilizing partitioning will be more difficult if it also involves stabilizing widely distributed clusters, which aren't stable even in the absence of partitioning.

@cuongdo assigned tbg and m-schneider and unassigned a-robinson and m-schneider Oct 17, 2017
@tbg (Member) commented Nov 7, 2017

@mberhault having checked in with @cuongdo, I'd like to set up a long-lived geo-replicated cluster (SF, US, EU? Six nodes is probably enough) similar to the PM cluster, which we can use for principled testing. I'm happy to set this up myself, but I'm pretty sure I'd get nowhere without you getting me started first. Think you could send me off this week or next?

Once the cluster is ready, my first items would be setting up various tables with different zone configurations, loading a bunch of data, adding load, and checking that the latencies and behavior under chaos are in line with expectations.

As partitioning features land and we learn more about prospective users of the feature, those details would be incorporated into the testing as well.
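
To make that concrete, here's a rough Go sketch of such a chaos-plus-latency loop; the node names, connection string, and ssh-based restart are hypothetical stand-ins for whatever the real harness would do:

package main

import (
	"database/sql"
	"fmt"
	"math/rand"
	"os/exec"
	"time"

	_ "github.com/lib/pq"
)

// Hypothetical node addresses; in practice these come from the
// cluster-creation script.
var nodes = []string{"node1", "node2", "node3", "node4", "node5", "node6"}

// restartNode is the chaos step: restart the cockroach process on one
// node, as with `supervisorctl restart cockroach` above.
func restartNode(host string) error {
	return exec.Command("ssh", host, "supervisorctl", "restart", "cockroach").Run()
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@"+nodes[0]+":26257/?sslmode=disable")
	if err != nil {
		panic(err)
	}
	for {
		victim := nodes[rand.Intn(len(nodes))]
		if err := restartNode(victim); err != nil {
			fmt.Println("restart failed:", err)
		}
		// Watch latencies for a while after the disruption.
		for i := 0; i < 30; i++ {
			start := time.Now()
			if _, err := db.Exec("SELECT 1"); err == nil {
				fmt.Printf("latency after restarting %s: %s\n", victim, time.Since(start))
			}
			time.Sleep(time.Second)
		}
	}
}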

@mberhault (Contributor)

You may want to take a look at https://github.com/cockroachlabs/production/issues/500, which did some of this.
A reasonably easy way to get started would be to resurrect indigo (decommissioned in https://github.com/cockroachlabs/production/pull/504), or at least to look at what was going on there.

@spencerkimball (Member) commented Nov 7, 2017 via email

@tbg (Member) commented Nov 7, 2017

@mberhault indigo was on Azure; is that still what we'd want to use?

@mberhault (Contributor)

Probably not; we may want to go with DO (DigitalOcean) for that. We also don't need the hardware profile it had, since I don't think this is performance work.

@tbg (Member) commented Nov 10, 2017

Small update: I have a script ready that can create geo-partitioned clusters on DigitalOcean. Since I won't be looking at this until next week anyway, I won't create a cluster just yet.

@tbg (Member) commented Nov 10, 2017

@danhhz with the upcoming partitioning work, what do you think a reasonable distributed cluster topology looks like? I'm currently running a New York - San Francisco - Singapore - Bangalore - Frankfurt cluster (latencies below), but that seems a little crazy; I'm thinking three localities are enough?

For now I'm running N block writers, each writing to its own table (simulating an N-way partitioned table; see the sketch below), so most reads should be fast assuming the leaseholders realign suitably, while writes, well, can eat pretty bad latency. I'm planning to set up chaos, automated zone config changes, and a "follow the sun" workload. Let me know if you think there's another dimension I should look into testing.

[image: inter-region latency matrix]
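
For reference, a minimal sketch of the block-writer setup described above (one writer per table, simulating an N-way partitioned table); the connection string, table names, and row shape are made up:

package main

import (
	"database/sql"
	"fmt"
	"math/rand"

	_ "github.com/lib/pq"
)

// blockWriter inserts random rows into its own table forever.
func blockWriter(db *sql.DB, table string) {
	stmt := fmt.Sprintf("INSERT INTO %s (id, payload) VALUES ($1, $2)", table)
	for {
		if _, err := db.Exec(stmt, rand.Int63(), make([]byte, 512)); err != nil {
			fmt.Println(table, "write error:", err)
		}
	}
}

func main() {
	const n = 5 // one table per locality in the test cluster
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		panic(err)
	}
	for i := 0; i < n; i++ {
		table := fmt.Sprintf("blocks_%d", i)
		if _, err := db.Exec(fmt.Sprintf(
			"CREATE TABLE IF NOT EXISTS %s (id INT PRIMARY KEY, payload BYTES)", table)); err != nil {
			panic(err)
		}
		go blockWriter(db, table)
	}
	select {} // run the writers until killed
}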

@tbg (Member) commented Nov 11, 2017

I don't understand why I see a zero flatline for local request optimization here:

[image: local request optimization metric, flat at zero]

Running kv locally against a single-node cluster, I see 100% of requests optimized, and here I'd expect to at least hit the optimization for some fraction of them.

@tbg (Member) commented Nov 11, 2017

Ok, I know why.

// GetLocalInternalServerForAddr returns the context's internal batch server
// for target, if it exists.
func (ctx *Context) GetLocalInternalServerForAddr(target string) roachpb.InternalServer {
	// Debug print added while investigating. It shows the mismatch:
	// target carries the FQDN while ctx.Addr carries the short hostname,
	// so the string comparison below never matches and the local fast
	// path is never taken.
	// Prints: cockroach-celebes-0005.crdb.io:26257 cockroach-celebes-0005:26257
	fmt.Println(target, ctx.Addr)
	if target == ctx.Addr {
		return ctx.localInternalServer
	}
	return nil
}

It seems unfortunate that we're not comparing resolved IP addresses here. I suppose I could work around it with --advertise-addr, but it's a shame this doesn't "just work".
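
For illustration, a sketch of what an IP-based comparison could look like (assumes import "net"; just the idea, not the fix that actually landed):

// sameAddr reports whether two host:port addresses resolve to a common
// IP on the same port, falling back to string equality on any error.
func sameAddr(a, b string) bool {
	hostA, portA, errA := net.SplitHostPort(a)
	hostB, portB, errB := net.SplitHostPort(b)
	if errA != nil || errB != nil || portA != portB {
		return a == b
	}
	ipsA, err := net.LookupHost(hostA)
	ipsB, err2 := net.LookupHost(hostB)
	if err != nil || err2 != nil {
		return a == b
	}
	for _, x := range ipsA {
		for _, y := range ipsB {
			if x == y {
				return true
			}
		}
	}
	return false
}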

@tbg (Member) commented Nov 11, 2017

Correction: it's already using --advertise-host with cockroach-celebes-0005.crdb.io, so I'm confused about why this doesn't work. The node also uses cockroach-celebes-0005 in its start output.

I made this "work" by passing --host cockroach-celebes-0005.crdb.io instead of --advertise-host cockroach-celebes-0005.crdb.io with no --host. I'll file a separate issue about this; I just happened to run into it here.

Edit: filed #19991

@danhhz (Contributor) commented Nov 29, 2017

> @danhhz with the upcoming partitioning work, what do you think a reasonable distributed cluster topology looks like?

Sorry it took so long for me to get to this! I think 3 locations is good enough. I'm a little more interested in 2 nodes per location than 1, so you can end up with local write quorums.
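
To spell out the quorum arithmetic: with replication factor 3 and 2 replicas in one location, a majority (2 of 3) can acknowledge a write without leaving the location. A tiny Go illustration (the function name is made up):

package main

import "fmt"

// localWriteQuorum reports whether `local` replicas out of `rf` total
// form a majority, i.e. whether writes can commit within one location.
func localWriteQuorum(local, rf int) bool {
	return local > rf/2
}

func main() {
	fmt.Println(localWriteQuorum(1, 3)) // false: every write needs a remote ack
	fmt.Println(localWriteQuorum(2, 3)) // true: writes can commit locally
}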

@spencerkimball (Member)

I think it makes sense to start simple:

3 nodes in each of two localities. For the meta record ranges, make sure the zone config requires one node in each of the two localities.

@tbg (Member) commented Mar 3, 2018

Once roachtest supports chaos (#20651) we should just make this a reproducible test.

@jordanlewis modified the milestones: 2.0, 2.1 Mar 13, 2018
@nvanbenschoten added the C-performance and A-kv-distribution labels Apr 24, 2018
@nvanbenschoten modified the milestones: 2.1, 2.2 Sep 25, 2018
@petermattis removed this from the 2.2 milestone Oct 5, 2018