
stability: Need to improve performance of widely distributed clusters under chaos #17611

Closed · a-robinson opened this issue Aug 12, 2017 · 17 comments

Labels: A-kv-distribution (relating to rebalancing and leasing) · A-partitioning · C-performance (perf of queries or internals; solution not expected to change functional behavior) · X-stale

Comments

@a-robinson (Contributor) commented Aug 12, 2017

Indigo gets pretty decimated by even the friendliest of disruptions. I'm running it on eef2e04 with no custom settings other than trace.debug.enable=true, and this happens when I simply run supervisorctl restart cockroach on one of the nodes:

[screenshot: 2017-08-11, 8:50 PM]

[screenshot: 2017-08-11, 8:52 PM]

The process spent as much time as possible draining before hitting its deadline and hard-stopping itself, and was then restarted by supervisor in under a second (a sketch of this drain-with-deadline pattern is below). I assume that more antagonistic forms of chaos would be even worse for the cluster.
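
For context, what's described here is a drain-with-deadline pattern: drain gracefully for as long as allowed, then hard-stop so the supervisor can restart the process. A minimal Go sketch of the idea, with hypothetical names and timeouts (not CockroachDB's actual drain code):

package main

import (
	"context"
	"fmt"
	"time"
)

// drain stands in for a graceful shutdown (transferring leases,
// finishing in-flight requests, ...). Entirely hypothetical.
func drain(ctx context.Context) error {
	select {
	case <-time.After(30 * time.Second): // pretend draining takes this long
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Give the drain a hard deadline; once it fires, stop regardless
	// and let the supervisor restart the process.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := drain(ctx); err != nil {
		fmt.Println("drain deadline hit, hard-stopping:", err)
	}
}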

We should devote some time to this for 1.2, but it's unlikely to be a small project. cc @nstewart

@a-robinson added this to the 1.2 milestone Aug 12, 2017
@a-robinson changed the title from "stability: Need to do improve performance of widely distributed clusters under chaos" to "stability: Need to improve performance of widely distributed clusters under chaos" Aug 12, 2017
@tbg (Member) commented Oct 5, 2017

Chatted with Masha and she'll look into this because there's little hope of even reproducing #14768 cleanly as long as this is happening.

@tbg (Member) commented Oct 13, 2017

@a-robinson assigning you per your comment on #19236. @m-schneider is focusing on the case in which the node is gone for more than 10 minutes (#19165), though you may want to touch base to see if there are any common problems.

@a-robinson (Contributor, Author)

We've very much been touching base on her investigations. I don't plan to look at distributed clusters in the near term, so I'm going to unassign myself for now. If #19236 happens to be the same thing, then great, otherwise I don't want to make any promises.

@petermattis (Collaborator)

@awoods187 just mentioned that the stability of widely distributed clusters is going to affect partitioning. More specifically, stabilizing partitioning will be more difficult if it also involves stabilizing widely distributed clusters, which aren't stable even in the absence of partitioning.

@cuongdo assigned tbg and m-schneider and unassigned a-robinson and m-schneider Oct 17, 2017
@tbg (Member) commented Nov 7, 2017

@mberhault having checked in with @cuongdo, I'd like to set up a long-lived geo-replicated cluster (SF, US, EU? Six nodes is probably enough) similar to the PM cluster, which we can use for principled testing. I'm happy to set this up myself, but I'm pretty sure I'd get nowhere without you getting me started first. Think you could send me off this week or next?

Once the cluster is ready, my first items would be setting up various tables with different zone configurations, loading a bunch of data, adding load, and checking that the latencies and behavior under chaos are in line with expectations.

As partitioning features land and we learn more about prospective users of the feature, those details would be incorporated into the testing as well.
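
To make that concrete, here's a rough Go sketch of such a chaos-plus-latency loop; the node names, connection string, and ssh-based restart are hypothetical stand-ins for whatever the real harness would do:

package main

import (
	"database/sql"
	"fmt"
	"math/rand"
	"os/exec"
	"time"

	_ "github.com/lib/pq"
)

// Hypothetical node addresses; in practice these come from the
// cluster-creation script.
var nodes = []string{"node1", "node2", "node3", "node4", "node5", "node6"}

// restartNode is the chaos step: restart the cockroach process on one
// node, as with `supervisorctl restart cockroach` above.
func restartNode(host string) error {
	return exec.Command("ssh", host, "supervisorctl", "restart", "cockroach").Run()
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@"+nodes[0]+":26257/?sslmode=disable")
	if err != nil {
		panic(err)
	}
	for {
		victim := nodes[rand.Intn(len(nodes))]
		if err := restartNode(victim); err != nil {
			fmt.Println("restart failed:", err)
		}
		// Watch latencies for a while after the disruption.
		for i := 0; i < 30; i++ {
			start := time.Now()
			if _, err := db.Exec("SELECT 1"); err == nil {
				fmt.Printf("latency after restarting %s: %s\n", victim, time.Since(start))
			}
			time.Sleep(time.Second)
		}
	}
}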

@mberhault (Contributor)

You may want to take a look at https://github.com/cockroachlabs/production/issues/500, which did some of this.
A reasonably easy way to get started would be to resurrect indigo (decommissioned in https://github.com/cockroachlabs/production/pull/504), or at least to look at what was going on there.

@spencerkimball (Member) commented Nov 7, 2017 via email

@tbg (Member) commented Nov 7, 2017

@mberhault indigo was on Azure; is that still what we'd want to use?

@mberhault (Contributor)

Probably not; we may want to go with DO (DigitalOcean) for that. We also don't need the hardware profile it had, since I don't think this is performance work.

@tbg (Member) commented Nov 10, 2017

Small update: I have a script ready that can create geo-partitioned clusters on DigitalOcean. Since I won't be looking at this until next week anyway, I won't create a cluster just yet.

@tbg (Member) commented Nov 10, 2017

@danhhz with the upcoming partitioning work, what do you think a reasonable distributed cluster topology looks like? I'm currently running a New York - San Francisco - Singapore - Bangalore - Frankfurt cluster (latencies below), but that seems a little crazy; I'm thinking three localities are enough?

For now I'm running N block writers, each writing to its own table (simulating an N-way partitioned table; see the sketch below), so most reads should be fast assuming the leaseholders realign suitably, while writes, well, can eat pretty bad latency. I'm planning to set up chaos, automated zone config changes, and a "follow the sun" workload. Let me know if you think there's another dimension I should look into testing.

[image: inter-region latency matrix]
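
For reference, a minimal sketch of the block-writer setup described above (one writer per table, simulating an N-way partitioned table); the connection string, table names, and row shape are made up:

package main

import (
	"database/sql"
	"fmt"
	"math/rand"

	_ "github.com/lib/pq"
)

// blockWriter inserts random rows into its own table forever.
func blockWriter(db *sql.DB, table string) {
	stmt := fmt.Sprintf("INSERT INTO %s (id, payload) VALUES ($1, $2)", table)
	for {
		if _, err := db.Exec(stmt, rand.Int63(), make([]byte, 512)); err != nil {
			fmt.Println(table, "write error:", err)
		}
	}
}

func main() {
	const n = 5 // one table per locality in the test cluster
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		panic(err)
	}
	for i := 0; i < n; i++ {
		table := fmt.Sprintf("blocks_%d", i)
		if _, err := db.Exec(fmt.Sprintf(
			"CREATE TABLE IF NOT EXISTS %s (id INT PRIMARY KEY, payload BYTES)", table)); err != nil {
			panic(err)
		}
		go blockWriter(db, table)
	}
	select {} // run the writers until killed
}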

@tbg (Member) commented Nov 11, 2017

I don't understand why I see a zero flatline for local request optimization here:

[image: local request optimization metric, flat at zero]

Running kv locally against a single-node cluster, I see 100% of requests optimized, and here I'd expect to at least hit the optimization for some fraction of them.

@tbg (Member) commented Nov 11, 2017

Ok, I know why.

// GetLocalInternalServerForAddr returns the context's internal batch server
// for target, if it exists.
func (ctx *Context) GetLocalInternalServerForAddr(target string) roachpb.InternalServer {
	// Debug print added while investigating. It shows the mismatch:
	// target carries the FQDN while ctx.Addr carries the short hostname,
	// so the string comparison below never matches and the local fast
	// path is never taken.
	// Prints: cockroach-celebes-0005.crdb.io:26257 cockroach-celebes-0005:26257
	fmt.Println(target, ctx.Addr)
	if target == ctx.Addr {
		return ctx.localInternalServer
	}
	return nil
}

It seems unfortunate that we're not comparing resolved IP addresses here. I suppose I could work around it with --advertise-addr, but it's a shame this doesn't "just work".
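
For illustration, a sketch of what an IP-based comparison could look like (assumes import "net"; just the idea, not the fix that actually landed):

// sameAddr reports whether two host:port addresses resolve to a common
// IP on the same port, falling back to string equality on any error.
func sameAddr(a, b string) bool {
	hostA, portA, errA := net.SplitHostPort(a)
	hostB, portB, errB := net.SplitHostPort(b)
	if errA != nil || errB != nil || portA != portB {
		return a == b
	}
	ipsA, err := net.LookupHost(hostA)
	ipsB, err2 := net.LookupHost(hostB)
	if err != nil || err2 != nil {
		return a == b
	}
	for _, x := range ipsA {
		for _, y := range ipsB {
			if x == y {
				return true
			}
		}
	}
	return false
}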

@tbg (Member) commented Nov 11, 2017

Correction: it's already using --advertise-host with cockroach-celebes-0005.crdb.io, so I'm confused about why this doesn't work. The node also uses cockroach-celebes-0005 in its start output.

I made this "work" by passing --host cockroach-celebes-0005.crdb.io instead of --advertise-host cockroach-celebes-0005.crdb.io with no --host. I'll file a separate issue about this; I just happened to run into it here.

Edit: filed #19991

@danhhz (Contributor) commented Nov 29, 2017

> @danhhz with the upcoming partitioning work, what do you think a reasonable distributed cluster topology looks like?

Sorry it took so long for me to get to this! I think 3 locations is good enough. I'm a little more interested in 2 nodes per location than 1, so you can end up with local write quorums.
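
To spell out the quorum arithmetic: with replication factor 3 and 2 replicas in one location, a majority (2 of 3) can acknowledge a write without leaving the location. A tiny Go illustration (the function name is made up):

package main

import "fmt"

// localWriteQuorum reports whether `local` replicas out of `rf` total
// form a majority, i.e. whether writes can commit within one location.
func localWriteQuorum(local, rf int) bool {
	return local > rf/2
}

func main() {
	fmt.Println(localWriteQuorum(1, 3)) // false: every write needs a remote ack
	fmt.Println(localWriteQuorum(2, 3)) // true: writes can commit locally
}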

@spencerkimball (Member)

I think it makes sense to start simple:

3 nodes in each of two localities. For the meta record ranges, make sure the zone config requires one node in each of the two localities.

@tbg (Member) commented Mar 3, 2018

Once roachtest supports chaos (#20651) we should just make this a reproducible test.

@jordanlewis modified the milestones: 2.0, 2.1 Mar 13, 2018
@nvanbenschoten added the C-performance and A-kv-distribution labels Apr 24, 2018
@nvanbenschoten modified the milestones: 2.1, 2.2 Sep 25, 2018
@petermattis removed this from the 2.2 milestone Oct 5, 2018