sql: update demo for new multi-region abstractions #62025

Open
awoods187 opened this issue Mar 15, 2021 · 14 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-multiregion

Comments

@awoods187
Contributor

awoods187 commented Mar 15, 2021

We currently have both cockroach demo --geo-partitioned-replicas and cockroach demo --global. Neither exactly makes sense for 21.1 with our new abstractions.

I'm proposing a new --multi-region flag that shows the idealized state using movr, like the previous --geo-partitioned-replicas (which would then be deprecated).

I'm also proposing a clean demo slate with locality set up, for general docs usage or for building up demos. I currently use ./cockroach demo movr --nodes 9, but perhaps without the movr?
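
Once such a demo cluster is up, a quick sanity check of the locality setup from the SQL shell might look like this (illustrative only; assumes the 21.1 multi-region syntax):

-- List the regions the cluster knows about, derived from the node localities.
SHOW REGIONS FROM CLUSTER;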

Jira issue: CRDB-2689

@awoods187 awoods187 added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Mar 15, 2021
@awoods187
Contributor Author

awoods187 commented Mar 16, 2021

It's a bit late in the cycle to squeeze in changes, and the team is hard at work on bug fixes, so we are proposing the following plan of attack:

  • Deprecating --geo-partitioned-replicas in our docs (but not in our codebase) for 21.1
  • Using ./cockroach demo --nodes 9 --empty --global to start from scratch (it contains 3 nodes in each of us-east1, us-west1, and europe-west1). This can be the starting point for non-movr-related demos.
  • Using ./cockroach demo --movr --nodes 9 --global for demonstrating how to convert movr from a single-region database into a multi-region database. It will come with limitations, including the pre-21.1 schema (e.g., compound primary keys), but can be used to demonstrate all of the DDL (a sketch follows this list) as well as auto-homing inserts and locality-optimized searches. This will not show latency.
  • Adding in 21.1.X (e.g., backporting) a --multi-region flag with a new movr schema and showing the correct multi-region abstractions by default.
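
For reference, the kind of 21.1 DDL the movr walkthrough above could exercise might look like the following (a sketch only; the region names come from the demo localities mentioned above, and the table name is from the stock movr schema):

-- Convert movr into a multi-region database.
ALTER DATABASE movr SET PRIMARY REGION "us-east1";
ALTER DATABASE movr ADD REGION "us-west1";
ALTER DATABASE movr ADD REGION "europe-west1";

-- Home each row in the region of the gateway that inserts it, which is what
-- makes inserts "auto-homing".
ALTER TABLE users SET LOCALITY REGIONAL BY ROW;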

What do you think @rmloveland and @jseldess?

@awoods187
Contributor Author

@jordanlewis any concerns with including --global here? I think it's really nice to include, but we could potentially drop it if there are concerns and use ./cockroach demo --nodes 9 --empty instead.

@andreimatei
Contributor

Consider roachprod local in this conversation. At least for development, that's a much more useful platform because you don't lose all the state across restarts.

@jordanlewis
Member

--global is unmaintained and fairly brittle and non-extensible. It might work okay for this purpose, but I'm concerned that people playing with it at our behest will run into confusing issues that we won't have the bandwidth to fix. I think the demo pipeline is pretty important to keep clean, because people will likely judge our product based on it after downloading.

If the docs team is okay with "testing" --global's suitability for this project in its current state, I feel reasonably okay about using it for the docs. But it's worth noting that we don't quality-test this feature past a very basic test that ensures that the database starts up.

As for brittleness, this is what we do for the latencies. They're hardcoded and not really changeable for now. Note that we also wouldn't have the flexibility to change the number of nodes beyond 9.

func init() {
	regionToRegionToLatency = make(map[string]map[string]int)
	// Latencies collected from http://cloudping.co on 2019-09-11.
	for pair, latency := range map[regionPair]int{
		{regionA: "us-east1", regionB: "us-west1"}:     66,
		{regionA: "us-east1", regionB: "europe-west1"}: 64,
		{regionA: "us-west1", regionB: "europe-west1"}: 146,
	} {
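		// insertPair (a helper assumed to be defined alongside this init) stores
		// the latency for one direction of the pair; the mirrored call below
		// makes the map symmetric.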
		insertPair(pair, latency)
		insertPair(regionPair{
			regionA: pair.regionB,
			regionB: pair.regionA,
		}, latency)
	}
}

@awoods187
Contributor Author

I think it would be fine for us to use this behind the scenes for Katacoda, since we can control the number of nodes, etc. I also think it would be fine for docs and demos to list the known limitation that it provides a hard-coded configuration. cc @jseldess @rmloveland

Can we add some tests in the stability period now to harden this up? I'm not asking to extend the capabilities, just to make 9 nodes with --global well-tested enough for us to use in docs and to back Katacoda, like we do for the spatial tutorial.

@rmloveland
Collaborator

> If the docs team is okay with "testing" --global's suitability for this project in its current state, I feel reasonably okay about using it for the docs.

I'll take a look at it. I'm not feeling super psyched about it due to the old tech-writer saying: "you have to build it in your basement". For a multi-region deployment, this is pretty much the opposite of that. It feels like we are not giving users the real dogfood. (That's also why roachprod is DQ'd: not dogfood.)

However, if it can do the following, it could be made to ~work, maybe?

  1. start cluster, latencies are "high"
  2. run some multi-region SQL where we say "make this table GLOBAL, make this table REGIONAL" (see the sketch after this list)
  3. cockroach moves some things around
  4. the "high" latencies start to "drop" in the DB Console
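
For step 2, the statements involved might look like the following (a sketch only; it assumes the database has already been given a primary region and the demo's three regions, as in the earlier comment, and uses stock movr table names as examples):

-- Readable with low latency from every region, at the cost of slower writes.
ALTER TABLE promo_codes SET LOCALITY GLOBAL;

-- Served from a single home region.
ALTER TABLE rides SET LOCALITY REGIONAL BY TABLE IN "us-west1";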

Will look at it.

@jseldess
Contributor

Thanks for investigating, Rich. Overall, I agree with your point in the linked ticket that setup/provisioning shouldn't be the focus of the tutorial, so if we can get this to work via demo or docker, that'd be awesome.

@rmloveland
Collaborator

Did some testing and wrote up my findings so far in a wiki page: https://cockroachlabs.atlassian.net/wiki/x/BQDpXg

tl;dr: I think it works? But I may be tricking myself, and it's not super easy to set up (still easier than doing everything via docker, though, IMO).

craig bot pushed a commit that referenced this issue Apr 27, 2021
62435: cli: make --global flag for demo public r=knz a=otan

See individual commits for details.

Refs: #62025

63971: kv: commit-wait before running commit triggers and resolving intents r=nvanbenschoten a=nvanbenschoten

Fixes a serious bug revealed by #63747.

This commit fixes a bug revealed by kvnemesis where a range-merge watcher on the
right-hand side of a range merge could incorrectly determine that a range merge
txn had succeeded, when in reality, it had failed. The watcher would then put
the RHS leaseholder replica into a stalled state by setting `r.mu.destroyStatus`
to `destroyReasonMergePending`, effectively stalling any operation on the range
indefinitely.

The setup for this bug was that a range was operating with a `global_reads` zone
configuration attribute, so it was pushing all writers into the future. The
range was split and then rapidly merged back together. During the merge txn, a
range-merge watcher (see `maybeWatchForMergeLocked`) began monitoring the state
of the range merge txn. The problem was that at the time that the range merge
txn committed, neither the meta descriptor version written by the merge nor even
the meta descriptor version written by the split was visible to the watcher's
follow-up query. Because the watcher read below the split txn's descriptor, it
came to the wrong conclusion about the merge.

It is interesting to think about what is going wrong here, because it's not
immediately obvious who is at fault. If a transaction has a commit timestamp in
the future of present time, it will need to commit-wait before acknowledging the
client. Typically, this is performed in the TxnCoordSender after the transaction
has committed and resolved its intents (see TxnCoordSender.maybeCommitWait). It
is safe to wait after a future-time transaction has committed and resolved
intents without compromising linearizability because the uncertainty interval of
concurrent and later readers ensures atomic visibility of the effects of the
transaction. In other words, all of the transaction's intents will become
visible and will remain visible at once, which is sometimes called "monotonic
reads". This is true even if the resolved intents are at a high enough timestamp
such that they are not visible to concurrent readers immediately after they are
resolved, but only become visible sometime during the writer's commit-wait
sleep. This property is central to the correctness of non-blocking transactions.
See: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20200811_non_blocking_txns.md

However, if a transaction has a commit trigger, the side-effects of the trigger
will go into effect immediately upon applying the corresponding Raft log entry.
This poses a problem, because we do not want part of a transaction's effects
(e.g. its commit trigger) to become visible to onlookers before the rest of its
effects do (e.g. its intent writes).

To avoid this problem, this commit adds special server-side logic to perform the
commit-wait stage of a transaction with a commit trigger early, before its
EndTxn evaluates and its commit trigger fires. This results in the transaction
waiting longer to commit, run its commit trigger, and resolve its intents, but
it is otherwise safe and effective.

Interestingly, this is quite similar to how Spanner handles its commit-wait rule:

> Before allowing any coordinator replica to apply the commit record, the
> coordinator leader waits until TT.after(s), so as to obey the commit-wait rule
> described in Section 4.1.2. Because the coordinator leader chose s based on
> TT.now().latest, and now waits until that timestamp is guaranteed to be in the
> past, the expected wait is at least 2 ∗ ε̄. This wait is typically overlapped with
> Paxos communication. After commit wait, the coordinator sends the commit
> timestamp to the client and all other participant leaders. Each participant
> leader logs the transaction’s outcome through Paxos. All participants apply at
> the same timestamp and then release locks.

Of course, the whole point of non-blocking transactions is that we release locks
early and use clocks (through uncertainty intervals + a reader-side commit-wait
rule) to enforce consistency, so we don't want to make this change for standard
transactions.

Before this change, I could hit the bug in about 5 minutes of stressing
kvnemesis on a roachprod cluster. After this change, I've been able to run
kvnemesis for a few hours without issue.

Release note (bug fix): Fixed a rare bug present in betas where rapid range
splits and merges on a GLOBAL table could lead to a stuck leaseholder replica.
The situation is no longer possible.

cc. @cockroachdb/kv 

64012: roachtest: update expected failures for pgjdbc r=RichardJCai a=rafiss

When it was easy, I also added the issue number that is causing the
failure.

fixes #63890 

Release note: None

Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@awoods187
Contributor Author

@ajstorm have our concerns with demo been sufficiently addressed to check some parts of this issue off?

@ajstorm
Collaborator

ajstorm commented May 5, 2021

@awoods187 I checked off the one item we got to in 21.1 above. We need your movr changes before we can make progress on the rest of the items.

@awoods187
Contributor Author

Is the --global work all complete now? I somehow lost the tracking issue for that.

@ajstorm
Collaborator

ajstorm commented May 6, 2021 via email

@otan
Contributor

otan commented May 7, 2021

--global is slightly less experimental than before, and it is ready to be consumed by users, with many caveats built in.

@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
