stability: Need to improve performance of widely distributed clusters under chaos #17611
Chatted with Masha and she'll look into this because there's little hope of even reproducing #14768 cleanly as long as this is happening.
@a-robinson assigning you per your comment on #19236. @m-schneider is focusing on the case in which the node is gone for >10min, #19165, though you may wanna touch base to see if there are any common problems.
We've very much been touching base on her investigations. I don't plan to look at distributed clusters in the near term, so I'm going to unassign myself for now. If #19236 happens to be the same thing, then great, otherwise I don't want to make any promises.
@awoods187 just mentioned that stability of widely distributed clusters is going to affect partitioning. More specifically, stabilizing partitioning is going to be more difficult if it also involves stabilizing distributed clusters in the absence of partitioning.
@mberhault having checked in with @cuongdo, I'd like to set up a long-lived geo-replicated cluster (SF, US, EU? Six nodes is probably enough) similar to the PM cluster which we can use for principled testing. I'm happy to set this up myself but am pretty sure I'd get nowhere without having you get me started first. Think you could send me off this week or next? Once the cluster is ready, my first items would be setting up various tables with different zone configurations, loading a bunch of data, adding load, and checking that the latencies and behavior under chaos are in line with expectations. As partitioning features and details of our prospective users for this feature become available, they would be incorporated into testing as well.
You may want to take a look at https://github.com/cockroachlabs/production/issues/500, which did some of this.
Yes, we definitely need a cluster with at least three localities – we need this for serious admin UI testing as well. I think a cluster that spans the east coast and one that spans continents would be two very useful cases. The continent-spanning cluster will be critical to geo-partitioning testing.
…On Tue, Nov 7, 2017 at 4:33 PM marc wrote:
> You may want to take a look at cockroachlabs/production#500 which did some of this. A reasonably easy way to get started would be to resurrect indigo (decommissioned in cockroachlabs/production#504), or at least to look at what was going on there.
@mberhault indigo was on
Probably not, we may want to go for DO (DigitalOcean) for that. We also don't need the hardware profile that it had; I don't think this is for performance work.
Small update: I have a script ready that can create geo-partitioned clusters on DigitalOcean. Since I won't be looking until next week anyway, I won't create a cluster just yet, though.
@danhhz with the upcoming partitioning work, what do you think a reasonable distributed cluster topology looks like? I'm currently running a New York - San Francisco - Singapore - Bangalore - Frankfurt cluster (latencies below) but that seems a little crazy. I'm thinking that three localities are enough? For now I'm running N block writers which each write to their own table (simulating an N-fold partitioned table, so most reads should be fast assuming the leaseholders realign suitably, and writes, well, can eat pretty bad latency). I'm planning to set up chaos, automated zone config changes, and make the workload of type "follow the sun". Let me know if you think there's another dimension I should look into testing.
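[Editor's note] As a rough illustration of that workload shape — not the actual block_writer load generator; the table names, block size, and connection string below are made up — a minimal sketch of N writers that each hammer their own table:

```go
package main

import (
	"crypto/rand"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // assumes a Postgres wire protocol driver
)

// runBlockWriter simulates one "partition" of an N-fold partitioned workload
// by writing random blocks into its own table in a tight loop.
func runBlockWriter(db *sql.DB, table string) {
	create := fmt.Sprintf(
		`CREATE TABLE IF NOT EXISTS %s (id UUID DEFAULT gen_random_uuid() PRIMARY KEY, payload BYTES)`, table)
	if _, err := db.Exec(create); err != nil {
		log.Fatal(err)
	}
	insert := fmt.Sprintf(`INSERT INTO %s (payload) VALUES ($1)`, table)
	for {
		block := make([]byte, 256) // arbitrary block size
		rand.Read(block)
		if _, err := db.Exec(insert, block); err != nil {
			log.Printf("%s: write failed: %v", table, err)
		}
	}
}

func main() {
	// Hypothetical connection string; each writer would really point at a
	// gateway node in "its" locality.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	const numWriters = 3 // one writer, and one table, per locality
	for i := 0; i < numWriters; i++ {
		go runBlockWriter(db, fmt.Sprintf("blocks_%d", i))
	}
	select {} // run until killed
}
```

The idea, per the comment above, is that each table's leaseholder can settle near its writer, so most reads stay local while writes may still pay cross-locality commit latency.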
Ok, I know why:

```go
// GetLocalInternalServerForAddr returns the context's internal batch server
// for target, if it exists.
func (ctx *Context) GetLocalInternalServerForAddr(target string) roachpb.InternalServer {
	// Prints: cockroach-celebes-0005.crdb.io:26257 cockroach-celebes-0005:26257
	fmt.Println(target, ctx.Addr)
	if target == ctx.Addr {
		return ctx.localInternalServer
	}
	return nil
}
```

Seems unfortunate that we're not comparing IP addresses here. Suppose I could fix this by using
correction: it's already using
I made this "work" by using
Edit: filed #19991
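[Editor's note] For what it's worth, a sketch of what comparing resolved IPs rather than raw strings could look like; `addrsMatch` is a hypothetical helper for illustration, not something in the codebase, and it ignores hosts that resolve to multiple addresses:

```go
package main

import (
	"fmt"
	"net"
)

// addrsMatch reports whether two host:port strings refer to the same endpoint
// once hostnames are resolved, rather than relying on string equality.
func addrsMatch(a, b string) (bool, error) {
	ra, err := net.ResolveTCPAddr("tcp", a)
	if err != nil {
		return false, err
	}
	rb, err := net.ResolveTCPAddr("tcp", b)
	if err != nil {
		return false, err
	}
	return ra.IP.Equal(rb.IP) && ra.Port == rb.Port, nil
}

func main() {
	// The two spellings from the debug print above; these names only resolve
	// inside that cluster's environment.
	match, err := addrsMatch("cockroach-celebes-0005.crdb.io:26257", "cockroach-celebes-0005:26257")
	fmt.Println(match, err)
}
```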
Sorry it took so long for me to get to this! I think 3 locations is good enough. I'm a little more interested in 2 nodes per location than 1, so you can end up with local write quorums (with three-way replication, a zone config can keep two of a range's replicas in one locality, letting writes there reach quorum without a cross-locality round trip).
I think it makes sense to start simple: 3 nodes in each of two localities. For the meta record ranges, make sure the zone config requires one node in each of the two localities. |
Once roachtest supports chaos (#20651) we should just make this a reproducible test. |
Indigo gets pretty decimated by even the friendliest of disruptions. I'm running it on eef2e04 with no custom settings other than `trace.debug.enable=true`, and this happens when I simply run `supervisorctl restart cockroach` on one of the nodes:

The process took as much time as possible draining before hitting its limit and hard-stopping itself, then was restarted by supervisor in less than a second. I assume that more antagonistic forms of chaos would be very bad for the cluster as well.
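[Editor's note] To make "friendliest of disruptions" concrete, here is a hedged sketch of a chaos loop in that spirit: restart the cockroach process on a random node over SSH, then watch how long trivial queries take afterwards. Host names, the probe query, and the timings are placeholders, not existing tooling; roachtest chaos (#20651) would supersede this.

```go
package main

import (
	"database/sql"
	"log"
	"math/rand"
	"os/exec"
	"time"

	_ "github.com/lib/pq" // assumes a Postgres wire protocol driver
)

// Placeholder hosts eligible for restarts (excluding the gateway we probe through).
var nodes = []string{"cockroach-celebes-0002", "cockroach-celebes-0003", "cockroach-celebes-0005"}

// restartRandomNode mirrors the manual step above: restart the cockroach
// process under supervisor on one arbitrary node, via ssh.
func restartRandomNode() string {
	node := nodes[rand.Intn(len(nodes))]
	if out, err := exec.Command("ssh", node, "supervisorctl", "restart", "cockroach").CombinedOutput(); err != nil {
		log.Printf("restart on %s failed: %v (%s)", node, err, out)
	}
	return node
}

// probeLatency times a trivial statement through the fixed gateway as a
// crude signal of how disrupted the cluster is.
func probeLatency(db *sql.DB) time.Duration {
	start := time.Now()
	if _, err := db.Exec("SELECT 1"); err != nil {
		log.Printf("probe failed: %v", err)
	}
	return time.Since(start)
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@cockroach-celebes-0001:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	for {
		node := restartRandomNode()
		for i := 0; i < 30; i++ {
			log.Printf("after restarting %s: probe latency %s", node, probeLatency(db))
			time.Sleep(time.Second)
		}
		time.Sleep(5 * time.Minute) // let the cluster settle before the next disruption
	}
}
```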
We should devote some time to this for 1.2, but it's unlikely to be a small project. cc @nstewart