
roachtest: restore2TB/nodes=10 failed [stuck in kv] #61396

Closed
cockroach-teamcity opened this issue Mar 3, 2021 · 24 comments · Fixed by #64060

Labels: branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-roachtest O-robot Originated from a bot.

Comments

cockroach-teamcity (Member) commented:

(roachtest).restore2TB/nodes=10 failed on master@6de4313ec216161c79fe725fcc31fc87ef1804ea:

		  | 				WITH into_db = 'restore2tb'" returned
		  | stderr:
		  | ERROR: communication error: rpc error: code = Canceled desc = context canceled
		  | SQLSTATE: 58C01
		  | Failed running "sql"
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ```
		  |   | ./cockroach sql --insecure -e "
		  |   | 				RESTORE csv.bank FROM
		  |   | 				'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/bank'
		  |   | 				WITH into_db = 'restore2tb'"
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *exec.ExitError

	cluster.go:2688,restore.go:264,test_runner.go:767: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2676
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2684
		  | main.registerRestore.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:264
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2732
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2646
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:5652
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:191
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError


Artifacts: /restore2TB/nodes=10

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 3, 2021

dt (Member) commented Mar 3, 2021

huh, this looks suspect: slow range RPC: have been waiting 2735.59s (1 attempts) for RPC AdminSplit [‹×›,‹×›) to r32156

Looks like it likely got stuck waiting on stuck KV requests?

dt (Member) commented Mar 3, 2021

huh, lots and lots of

E210303 10:08:02.580388 24687451 kv/kvserver/queue.go:1093 ⋮ [n10,raftsnapshot,s10,r32467/2:‹/Table/53/1/5710{0571…-3843}›] 65300  (n4,s4):3: remote couldn't accept VIA_SNAPSHOT_QUEUE snapshot
 ‹11eb4e47› at applied index 18 with error: ‹[n4,s4],r32467: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n4,s4,r32154/3:/Table/53/1/5{6991216-877977…}] (incoming /Table/53/1/5710{0571/0-3843})›

https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_WorkloadNightly/2738824:id/restore2TB/nodes%3D10/run_1/logs/10.unredacted/cockroach.log

dt (Member) commented Mar 3, 2021

I210303 09:29:25.123568 603520 kv/kvserver/replica_command.go:397 ⋮ 
[n8,s8,r32156/1:‹/Table/53/1/5{2896598-318337…}›] 21720  initiating a split of this range at key ‹/Table/53/1/52896598/0› 
[r32158] (‹manual›)‹; replica r32156/2 not caught up: StateSnapshot match=0 next=1 paused; 
delayed split for 50.0s to avoid Raft snapshot (without success); r32156/2 is waiting for a Raft snapshot›

https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_WorkloadNightly/2738824:id/restore2TB/nodes%3D10/run_1/logs/8.unredacted/cockroach.teamcity-2738824-1614754482-21-n10cpu4-0008.ubuntu.2021-03-03T09_10_02Z.007756.log

tbg (Member) commented Mar 8, 2021

health: 10:14:11 restore.go:126: n1/s1  498.00   metrics  queue.raftsnapshot.pending
n2/s2  564.00   metrics  queue.raftsnapshot.pending
n3/s3  1369.00  metrics  queue.raftsnapshot.pending
n4/s4  776.00   metrics  queue.raftsnapshot.pending
n5/s5  1072.00  metrics  queue.raftsnapshot.pending
n6/s6  951.00   metrics  queue.raftsnapshot.pending
n7/s7  1059.00  metrics  queue.raftsnapshot.pending
n8/s8  1235.00  metrics  queue.raftsnapshot.pending
n9/s9  358.00   metrics  queue.raftsnapshot.pending

Just at first glance, I wonder if this is related to #61541.

tbg (Member) commented Mar 8, 2021

It does look similar, though not the same. I have some of the same questions, e.g. how DistSender RPCs got to spend hundreds of seconds in a single attempt without our also seeing a slow replication attempt.
Either way, independently, we also definitely did cause 359 raft snapshots in this test via splits:

grep split combined.log  | grep "is waiting for a" | wc -l
359

It worries me that #61541 and this issue are popping up now, as it's likely that we have regressed in some way.

@tbg tbg added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 29, 2021
@tbg tbg changed the title roachtest: restore2TB/nodes=10 failed roachtest: restore2TB/nodes=10 failed [stuck in kv] Mar 29, 2021

aliher1911 (Contributor) commented:

Capturing what I saw so far.
I ran this test 4 times with the same build; it always succeeded, but the running time varied wildly, from under 2 hours (2 times), which is in line with the current successful nightly runs, to over 4 hours. The raft snapshot queue appears to correlate with the duration of the run: for long runs the queue builds up and stays high for the whole test, while for shorter runs it builds up initially but then gradually drains. I captured debug artifacts close to the end of the test for both a slow and a fast run.

tbg (Member) commented Mar 31, 2021

[The test history suggests that the nightly test times also vary wildly](https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_Nightlies&testNameId=741854490244253129&tab=testDetails&branch_Cockroach_Nightlies=%3Cdefault%3E)

image

I spent a lot of time a few years ago trying to get the number of snapshots down. For a good (if a bit outdated) read on why avoiding snapshots is unusually complex, see here (internal link).

tbg (Member) commented Mar 31, 2021

Internal link to debug artifacts for a slow run (not the failing one here): https://drive.google.com/file/d/1Zm4l6TEk6IiEWpCx4vGG9qQigerXksnD/view?usp=sharing

tbg (Member) commented Mar 31, 2021

In a group session with @aliher1911 @erikgrinaker @angelapwen we looked at logs of a recent slow restore2TB run and found evidence of ranges being slow to apply the split trigger. This then leads the right-hand sides of the splits (on the slow nodes) to ask for a snapshot (indicated via the message "would have dropped incoming MsgApp to wait for split trigger, but allowing due to ..."). That snapshot is then blocked by the slow pre-split range on the recipient follower (since it is in all likelihood not GC-able, just slow/stuck for whatever reason).

@aliher1911 will try to get a repro of this with --debug and a restore2TB version that disables merges (to keep more state around) and which fails when it would usually signal success. That should give us a cluster to poke at on which we can look at timeseries to investigate why the follower was slow to apply the splits. One cause could be overload or a poor state of the storage layer.
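
To illustrate why such a snapshot sits blocked, here is a minimal, purely illustrative sketch (plain string keys and half-open [start, end) spans, which are an assumption for readability; real keys are encoded roachpb.Keys and this is not the actual kvserver overlap check): as long as the follower's pre-split replica still covers the right-hand side's keyspace, an incoming snapshot for the RHS intersects it and is refused, just like the "snapshot intersects existing range" rejection quoted earlier in the thread.

```go
package main

import "fmt"

// overlaps reports whether two half-open key spans [aStart, aEnd) and
// [bStart, bEnd) intersect.
func overlaps(aStart, aEnd, bStart, bEnd string) bool {
	return aStart < bEnd && bStart < aEnd
}

func main() {
	// The slow follower's pre-split replica still spans the keys of the new
	// right-hand side, so the RHS snapshot cannot be applied until the local
	// split trigger has run (spans loosely modeled on the r32154/r32467 log
	// lines quoted above).
	lhsStart, lhsEnd := "/Table/53/1/56991216", "/Table/53/1/58779770"
	rhsStart, rhsEnd := "/Table/53/1/57100571", "/Table/53/1/57103843"
	fmt.Println(overlaps(lhsStart, lhsEnd, rhsStart, rhsEnd)) // true -> snapshot rejected
}
```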

tbg (Member) commented Apr 16, 2021

I'm taking a look at this as well to be able to team up with @aliher1911 a bit better, working off my repro branch #63784. In the first attempt, I made this a fatal error

log.Fatalf(
	ctx,
	"would have dropped incoming MsgApp to wait for split trigger, "+
		"but allowing due to %d (>%d) ticks",
	ticks, maxDelaySplitTriggerTicks)
return false

on the assumption that once this fires, something worth investigating has already gone wrong. I also added some extra logging around when each replica applies the split.
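
For readers following along, a minimal sketch of what that extra split-apply logging might look like; the real instrumentation only ever lived on the repro branch, and the helper name and signature below are hypothetical:

```go
import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/cockroach/pkg/util/timeutil"
)

// logSplitApply records how old a split already is by the time a replica gets
// around to applying it, and how long the apply itself takes. The repro-branch
// logging was added inline in pkg/kv/kvserver/store_split.go rather than via a
// helper like this.
func logSplitApply(
	ctx context.Context, rhsID roachpb.RangeID, splitInitiatedAt time.Time, apply func(),
) {
	start := timeutil.Now()
	log.Infof(ctx, "splitPreApplyStart rhs=r%d age=%.2fs", rhsID, start.Sub(splitInitiatedAt).Seconds())
	apply() // the actual splitPreApply/splitPostApply work
	log.Infof(ctx, "splitPostApply rhs=r%d done after %.2fs", rhsID, timeutil.Since(start).Seconds())
}
```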

A node crashed within a minute or so of starting the restore. Looking through the logs, I saw

// t=0s
// r1623 is split off, and n5 and n1 promptly apply it. Not n10!
09:08:16.37  [n5,s5,r1622/2g332  initiating a split of this range at key ‹/Table/53/1/3063609› [r1623] (‹manual›)‹›
09:08:16.39  [n5,s5,r1622/2g332  TBG splitPreApplyStart rhs=r1623
[...]
09:08:16.39  [n1,s1,r1622/3g333  TBG splitPostApplyStart rhs=r1623 done after 0.00s

// t=0s still
// r1624 is split off, n5 and n1 promptly apply it. Not n10!
09:08:16.40  [n5,s5,r1623/2g333  initiating a split of this range at key ‹/Table/53/1/3063609/0› [r1624] (‹manual›)‹›
09:08:16.42  [n5,s5,r1623/2g333  TBG splitPreApplyStart rhs=r1624
[...]
09:08:16.42  [n5,s5,r1623/2g334  TBG splitPostApplyStart rhs=r1624 done after 0.00s

// t=0s
// n5 adds and rolls back learner on n9 (this is just noise, unrelated)
[...]
09:08:16.45  [n5,s5,r1624/2g336  rolled back LEARNER n9,s9 in r1624:‹/Table/53/1/3{063609/0-119787}› [(n10,s10):1, (n5,s5):2, (n1,s1):3, (n9,s9):4LEARNER, next=5, gen=335, sticky=1618567696.402001139,0]

// t=17s (!!!) n10 finally applies the *first* split, but not the second
09:08:35.43  [n10,s10,r1622/1g332 TBG splitPreApplyStart rhs=r1623
09:08:35.43  [n10,s10,r1622/1g332 TBG splitPreApplyStart rhs=r1623 done after 0.00s
09:08:35.43  [n10,s10,r1622/1g332 TBG splitPostApplyStart rhs=r1623
09:08:35.43  [n10,s10,r1622/1g333 TBG splitPostApplyStart rhs=r1623 done after 0.00s

// t=18s the replica for the second split on n10 gets impatient and asks for a snapshot
09:08:36.42  [n10,s10,r1624/1g0:{-}] the server is terminating due to a fatal error (see the DEV channel for details)
09:08:36.42  [n10,s10,r1624/1g0:{-}] would have dropped incoming MsgApp to wait for split trigger, but allowing due to 101 (>100) ticks

This does seem like a reasonable reduction of the problem, one we can iterate on quickly. Replicas falling that far behind on command application is an issue and we ought to figure out why. I'll keep poking.

tbg (Member) commented Apr 16, 2021

At the same time, we're not seeing this one fire

// remote replicas will likely start campaigning.
if elapsed >= defaultReplicaRaftMuWarnThreshold {
	log.Warningf(ctx, "handle raft ready: %.1fs [applied=%d, batches=%d, state_assertions=%d]",
		elapsed.Seconds(), stats.entriesProcessed, stats.batchesProcessed, stats.stateAssertions)
}

so no individual ready handling (which includes appending to the log as well as applying entries to the state machine and any sync required to do so) takes more than 500ms on any of the nodes.

I just did a second repro and the history was very similar. Hard to tell where to go from here. I'll add some more logging about the raft progress during splits (to see how many log indexes behind the followers are when we split). Note that the splits here weren't delayed. I think the splits just happen to exacerbate the problem; they're not a vital ingredient. They just cause snapshots, which is why we're here looking at them. I suspect followers fall behind quite a bit on their own under restore2TB, and I'd like to understand why.

tbg (Member) commented Apr 16, 2021

In another run where n9 hit the fatal error (i.e. was not catching up on its splits), I looked into whether n9 is generally unable to apply splits; doesn't seem like it:

$ grep -E 'splitPostApply.*done' combined.log | grep -Eo 'n[0-9],s' | sort | uniq -c
    525 n1,s
    537 n2,s
    549 n3,s
    585 n4,s
    584 n5,s
    558 n6,s
    578 n7,s
    550 n8,s
    593 n9,s

From when a split is initiated to when it is applied (per replica), a noticeable amount of time routinely passes, and it's not limited to any single node.

$ cat combined.log | grep -E 'age=[1-9][0-9]*' | tail
tobias-1618567963-01-n10cpu4-0003> I210416 10:16:36.786717 169 kv/kvserver/store_split.go:48 ⋮ [n3,s3,r1473/1g175:‹/Table/53/1/{394198…-625724}›] 3893  TBG splitPreApplyStart rhs=r2082 age=14.16s
tobias-1618567963-01-n10cpu4-0002> I210416 10:16:36.790529 134 kv/kvserver/store_split.go:48 ⋮ [n2,s2,r1473/2g175:‹/Table/53/1/{394198…-625724}›] 4083  TBG splitPreApplyStart rhs=r2082 age=14.16s
tobias-1618567963-01-n10cpu4-0009> I210416 10:16:36.832990 132 kv/kvserver/store_split.go:48 ⋮ [n9,s9,r1678/3g193:‹/Table/53/1/3{507726…-742894}›] 4382  TBG splitPreApplyStart rhs=r1680 age=12.16s
tobias-1618567963-01-n10cpu4-0008> I210416 10:16:36.840412 159 kv/kvserver/store_split.go:48 ⋮ [n8,s8,r1745/2g216:‹/Table/53/1/1{378429-559247}›] 3657  TBG splitPreApplyStart rhs=r1746 age=6.10s
tobias-1618567963-01-n10cpu4-0008> I210416 10:16:36.940464 159 kv/kvserver/store_split.go:48 ⋮ [n8,s8,r2061/2g233:‹/Table/53/1/{199786…-2173917}›] 3691  TBG splitPreApplyStart rhs=r2062 age=2.90s
tobias-1618567963-01-n10cpu4-0009> I210416 10:16:37.007813 160 kv/kvserver/store_split.go:48 ⋮ [n9,s9,r1800/3g241:‹/Table/53/1/1{709222…-869234}›] 4423  TBG splitPreApplyStart rhs=r1801 age=9.27s
tobias-1618567963-01-n10cpu4-0005> I210416 10:16:37.008088 161 kv/kvserver/store_split.go:48 ⋮ [n5,s5,r1800/1g241:‹/Table/53/1/1{709222…-869234}›] 3886  TBG splitPreApplyStart rhs=r1801 age=9.27s
tobias-1618567963-01-n10cpu4-0004> I210416 10:16:37.010207 168 kv/kvserver/store_split.go:48 ⋮ [n4,s4,r1800/2g241:‹/Table/53/1/1{709222…-869234}›] 5060  TBG splitPreApplyStart rhs=r1801 age=9.27s
tobias-1618567963-01-n10cpu4-0008> I210416 10:16:37.038766 158 kv/kvserver/store_split.go:48 ⋮ [n8,s8,r1986/2g186:‹/Table/53/1/4{183127…-372292}›] 3721  TBG splitPreApplyStart rhs=r1987 age=3.01s
tobias-1618567963-01-n10cpu4-0008> I210416 10:16:37.039076 135 kv/kvserver/store_split.go:48 ⋮ [n8,s8,r2009/3g184:‹/Table/53/1/4{378838…-685733…}›] 3724  TBG splitPreApplyStart rhs=r2010 age=3.07s
$ cat combined.log | grep -E 'age=[1-9][0-9]*' | wc -l
777

In another repro, I printed the behindCount at the time of proceeding to the split as computed here

https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/replica_metrics.go#L218-L232

It didn't show anything obvious. The groups are generally very well caught up.

$ grep -Eo 'behindCount=[0-9]+' combined.log | sort | uniq -c
     69 behindCount=1
     15 behindCount=2
      8 behindCount=3
      3 behindCount=4
      2 behindCount=5
      2 behindCount=6
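
(For reference, the linked behindCount is roughly the following computation. This is a simplified sketch, not the actual code in replica_metrics.go, which reads the commit and match indexes out of the raft status and the range descriptor.)

```go
// behindCount sums, over the followers the leader is tracking, how many raft
// log entries each one is behind the leader's committed index.
func behindCount(commitIndex uint64, matchIndexes []uint64) int64 {
	var behind int64
	for _, match := range matchIndexes {
		if match > 0 && match < commitIndex {
			behind += int64(commitIndex - match)
		}
	}
	return behind
}
```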

tbg (Member) commented Apr 16, 2021

Ok, this new logging is helpful:

12:35:37.508591 9899 kv/kvserver/replica_command.go:397 ⋮ [n9,s9,r708/1g97:‹/Table/53/1/{197637…-2173917}›] 1536  initiating a split of this range at key ‹/Table/53/1/1979647/PrefixEnd› [r709] (‹manual›)‹›
12:35:39.545864 184 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 2070  start key is contained in replica [n5,s5,r329/4g70:‹/Table/53/1/{1940277-2173917}›]
12:35:39.545884 184 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 2072  dropping MsgApp at index 10 to wait for split trigger
12:35:40.546299 176 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 2633  start key is contained in replica [n5,s5,r330/4g71:‹/Table/53/1/{194027…-2173917}›]
12:35:40.546319 176 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 2636  dropping MsgApp at index 10 to wait for split trigger
12:35:41.556895 196 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 3306  start key is contained in replica [n5,s5,r331/4g72:‹/Table/53/1/{1943550-2173917}›]
12:35:41.556914 196 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 3307  dropping MsgApp at index 10 to wait for split trigger
12:35:42.550614 181 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 4027  start key is contained in replica [n5,s5,r471/4g74:‹/Table/53/1/{1946823-2173917}›]
12:35:42.550649 181 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 4028  dropping MsgApp at index 10 to wait for split trigger
12:35:43.550642 180 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 4860  start key is contained in replica [n5,s5,r475/4g76:‹/Table/53/1/{1950096-2173917}›]
12:35:43.550654 180 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 4861  dropping MsgApp at index 10 to wait for split trigger
12:35:44.545900 185 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 5530  start key is contained in replica [n5,s5,r519/4g78:‹/Table/53/1/{1951071-2173917}›]
12:35:44.545922 185 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 5533  dropping MsgApp at index 10 to wait for split trigger
12:35:45.548291 192 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 6248  start key is contained in replica [n5,s5,r521/4g79:‹/Table/53/1/{195107…-2173917}›]
12:35:45.548312 192 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 6249  dropping MsgApp at index 10 to wait for split trigger
12:35:46.570405 194 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 6989  start key is contained in replica [n5,s5,r524/4g80:‹/Table/53/1/{1954344-2173917}›]
12:35:46.570435 194 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 6990  dropping MsgApp at index 10 to wait for split trigger
12:35:47.546035 171 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 7638  start key is contained in replica [n5,s5,r525/4g81:‹/Table/53/1/{195434…-2173917}›]
12:35:47.546055 171 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 7639  dropping MsgApp at index 10 to wait for split trigger
12:35:48.545334 169 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 8257  start key is contained in replica [n5,s5,r607/4g82:‹/Table/53/1/{1957617-2173917}›]
12:35:48.545356 169 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 8260  dropping MsgApp at index 10 to wait for split trigger
12:35:49.546440 166 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 8911  start key is contained in replica [n5,s5,r619/4g83:‹/Table/53/1/{195761…-2173917}›]
12:35:49.546460 166 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 8913  dropping MsgApp at index 10 to wait for split trigger
12:35:50.547489 178 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 9525  start key is contained in replica [n5,s5,r622/4g84:‹/Table/53/1/{1959073-2173917}›]
12:35:50.547519 178 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 9526  dropping MsgApp at index 10 to wait for split trigger
12:35:51.545363 176 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 10390  start key is contained in replica [n5,s5,r625/4g86:‹/Table/53/1/{1962346-2173917}›]
12:35:51.545384 176 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 10392  dropping MsgApp at index 10 to wait for split trigger
12:35:52.546513 193 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 11053  start key is contained in replica [n5,s5,r669/4g88:‹/Table/53/1/{1965619-2173917}›]
12:35:52.546538 193 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 11055  dropping MsgApp at index 10 to wait for split trigger
12:35:53.546374 181 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 11781  start key is contained in replica [n5,s5,r676/4g91:‹/Table/53/1/{196889…-2173917}›]
12:35:53.546400 181 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 11782  dropping MsgApp at index 10 to wait for split trigger
12:35:54.546599 172 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 12139  start key is contained in replica [n5,s5,r677/4g92:‹/Table/53/1/{1969829-2173917}›]
12:35:54.546619 172 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 12140  dropping MsgApp at index 10 to wait for split trigger
12:35:55.563511 176 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 12495  start key is contained in replica [n5,s5,r681/4g94:‹/Table/53/1/{1973102-2173917}›]
12:35:55.563530 176 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 12496  dropping MsgApp at index 10 to wait for split trigger
12:35:56.565695 169 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 12836  start key is contained in replica [n5,s5,r684/4g95:‹/Table/53/1/{197310…-2173917}›]
12:35:56.565714 169 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 12837  dropping MsgApp at index 10 to wait for split trigger
12:35:57.545996 169 kv/kvserver/split_trigger_helper.go:140 ⋮ [n5,s5,r709/4g0:{-}] 13148  start key is contained in replica [n5,s5,r686/4g96:‹/Table/53/1/{1976375-2173917}›]
12:35:57.546022 169 kv/kvserver/split_trigger_helper.go:143 ⋮ [n5,s5,r709/4g0:{-}] 13150  dropping MsgApp at index 10 to wait for split trigger
12:35:57.546065 169 1@kv/kvserver/split_trigger_helper.go:150 ⋮ [n5,s5,r709/4g0:{-}] 13152  the server is terminating due to a fatal error (see the DEV channel for details)
12:35:57.605651 169 kv/kvserver/split_trigger_helper.go:150 ⋮ [n5,s5,r709/4g0:{-}] 13165  would have dropped incoming MsgApp to wait for split trigger, but allowing due to 101 (>100) ticks

Each line logged is the follower dropping communication from r709's leader because it's still expecting to be initialized locally via split triggers. The interesting bit is that it logs the local replica that it hopes will initialize it. Each generation (gXX) roughly corresponds to one split. Making this a bit terser:

$ cat 709.log | grep -Eo 'in replica.*' | grep -Eo '4g[0-9]+'
4g70
4g71
4g72
4g74
4g76
4g78
4g79
4g80
4g81
4g82
4g83
4g84
4g86
4g88
4g91
4g92
4g94
4g95
4g96

The probing happens at 1s intervals, so we see here that we're executing between one and two split triggers per second (on the chain that will ultimately create r709).

Ok, I think I am starting to understand what's happening here. I looked into r329 which is the earliest logged candidate for initializing the final replica for r709/4. I think what's happening is the following. Restore roughly does

for i := 1; i <= 100; i++ {
	split(i)   // make range i out of range i-1
	scatter(i) // shuffle the new range's replicas/leases
}

Every now and then, it will naturally happen that a follower applies the split after the right-hand side leader is already trying to append; let's say this happens on the first split. This append to r2 is suppressed ("dropping MsgApp at index 10") to avoid the leader of r2 thinking a snapshot is necessary. However, a side effect is that the leader will only append once per second going forward (as it's probing the follower). The follower quickly catches up on the split, but now there's still most of a second remaining until the leader tries again.

In the meantime, restore is busy doing split/scatter/split/scatter/...; let's ignore the scatter for now and focus on the splits. For each subsequent split, since r2 still has most of a second to go, the next split is highly likely to hit the same problem: the leader creates its RHS and reaches out to the follower, but the follower replica for r3, r4, etc. is not there yet because we're still waiting for the chain to catch up into r2, which then has to sit out another good fraction of a second, and so on. It becomes a cascade: once you hit this common race, every subsequent rapid split eats approximately a 500ms delay (consistent with seeing about a generation or two pass per second in the logging above).

Mixing in the scatter both helps and hurts: it does backpressure the rate of splits (since scatter takes time), but it also removes replicas, and followers take a moment to realize they've been removed, during which time the split-catch-up chain stalls. In practice they realize pretty quickly, so I'm not actually sure the scatter is a real negative here. I will try without scatter to see if the problem persists.
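
To make the arithmetic concrete, here's a tiny back-of-the-envelope simulation, not CockroachDB code; the 1s probe interval and 10ms apply latency are assumptions. It shows how a burst of back-to-back splits turns into tens of seconds of delay once each link in the chain is gated on the leader's probe timer:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	const (
		probeInterval = time.Second           // leader re-probes an uninitialized follower roughly once per second
		applyLatency  = 10 * time.Millisecond // time to apply the split trigger once the parent range exists locally
		numSplits     = 100                   // a burst of back-to-back splits, as restore issues them
	)

	var total time.Duration
	for i := 0; i < numSplits; i++ {
		// The RHS of split i can only initialize after split i-1 has applied
		// locally, and the leader only retries on its next probe tick, which
		// lands a uniformly random fraction of the interval later -- about
		// probeInterval/2 per split on average.
		wait := time.Duration(rand.Int63n(int64(probeInterval)))
		total += wait + applyLatency
	}
	fmt.Printf("chain of %d splits caught up after ~%s\n", numSplits, total.Round(time.Second))
	// Expect output on the order of 50s, i.e. one to two split triggers per
	// second, matching the generations ticking by in the log above.
}
```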

tbg (Member) commented Apr 16, 2021

Yeah, the problem also occurs with a version of SCATTER that only ever randomizes leases, and it didn't take any longer (though to be fair, scatter is pretty much instant all the time anyway). Curiously, in that last run I also saw the number of pending snapshots shoot up into the hundreds very early on. I do wonder why that happened; none of the "initiating a split" messages indicated any reason for it. Will add some more logging around that to try to track it down.

tbg (Member) commented Apr 16, 2021

I looked into the raft snapshot queue buildup I saw in one repro and I think it's just the known problem: while we're 100 splits backed up in a chain, sometimes a root replica gets GC'ed, at which point the remainder of the chain will rightfully ask for snapshots. So this issue is fallout from the chain buildup.

tbg (Member) commented Apr 16, 2021

Logs for a run including this diff (some boring plumbing diffs omitted) look very promising. The cluster isn't hitting the fatal error from timed-out snapshot suppression, and the health alerts cleared out quickly. There was an "underreplicated range" for a while, but I verified that this was just the known problem of a removed replica thinking of itself as the range counter (which we should fix anyway; there might also be a question of why the replica stuck around for a minute, probably because it was quiesced).

@@ -397,8 +400,11 @@ func (r *Replica) adminSplitWithDescriptor(
        log.Infof(ctx, "initiating a split of this range at key %s [r%d] (%s)%s",
                splitKey.StringWithDirs(nil /* valDirs */, 50 /* maxLen */), rightRangeID, reason, extra)
 
+       var rightDesc *roachpb.RangeDescriptor
        if err := r.store.DB().Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
-               return splitTxnAttempt(ctx, r.store, txn, rightRangeID, splitKey, args.ExpirationTime, desc)
+               var err error
+               _, rightDesc, err = splitTxnAttempt(ctx, r.store, txn, rightRangeID, splitKey, args.ExpirationTime, desc)
+               return err
        }); err != nil {
                // The ConditionFailedError can occur because the descriptors acting
                // as expected values in the CPuts used to update the left or right
@@ -412,6 +418,12 @@ func (r *Replica) adminSplitWithDescriptor(
                }
                return reply, errors.Wrapf(err, "split at key %s failed", splitKey)
        }
+       if err := waitForReplicasInit(
+               ctx, r.store.cfg.NodeDialer, rightDesc.RangeID, rightDesc.Replicas().AsProto(),
+       ); err != nil {
+               log.Infof(ctx, "failed to wait for right hand side to initialize after split: %s", err)
+               _ = err // handled
+       }
        return reply, nil
 }

tbg (Member) commented Apr 16, 2021

There are occasionally liveness failures. I assume one must have happened around this time, when appending suddenly took 10+s:

image

I am not sure what's up with that but it seems like something we should investigate once this test has been updated on master with my findings so far.

tbg (Member) commented Apr 16, 2021

image

^- this isn't a good graph. It should track the per-node info, or the top K nodes instead; an average across the cluster isn't useful.

tbg (Member) commented Apr 16, 2021

The spike in log commit latency correlates with a slew of these for various ranges and files

range/pebble.go:596 ⋮ [n2,pebble] 28341 disk slowness detected: write to file ‹/mnt/data1/cockroach/auxiliary/sideloading/r1XXXX/r18271/i21.t6› >

This message comes from here:

eventListener.DiskSlow = func(info pebble.DiskSlowInfo) {
	oldDiskSlow(info)

@cockroachdb/storage this is my first time trying to dig into disk slowness; what are the metrics that would have exposed this separately? I.e., is there a metric that could have "proven" that the slow raft ready handling was very likely caused by something that is also slow at the pebble VFS layer? We could hook this "slow write" event up to a metric, but it would be good to incorporate a more meaningful set of metrics from pebble that reflect disk slowness. (My angle here is: given unhappy KV, how can we narrow it down to unhappy storage, or directly to unhappy I/O?)
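
For what it's worth, a hedged sketch of what hooking that event up to a counter could look like. This is illustrative only: the package name, wrapper function, and the plain counter below are stand-ins, not the actual cockroach metric plumbing; it simply mirrors the DiskSlow hook quoted above.

```go
package diskmetrics

import (
	"fmt"
	"sync/atomic"

	"github.com/cockroachdb/pebble"
)

// diskSlowEvents stands in for a real gauge/counter; the actual metrics live
// in kvserver/metrics.go.
var diskSlowEvents int64

// wrapDiskSlow decorates an existing pebble event listener so every DiskSlow
// callback also bumps the counter before delegating to the original handler.
func wrapDiskSlow(l pebble.EventListener) pebble.EventListener {
	oldDiskSlow := l.DiskSlow
	l.DiskSlow = func(info pebble.DiskSlowInfo) {
		atomic.AddInt64(&diskSlowEvents, 1)
		fmt.Printf("disk slowness detected: write to %s ongoing for %s\n", info.Path, info.Duration)
		if oldDiskSlow != nil {
			oldDiskSlow(info)
		}
	}
	return l
}
```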

sumeerbhola (Collaborator) commented:

I believe there are two metrics, storage.disk-slow and storage.disk-stalled, defined in kvserver/metrics.go.

tbg (Member) commented Apr 19, 2021

I'm running 10 instances of the test now and it looks as though major lock-ups during restore are common. Here's another one:

image

This then causes the node to lose liveness.

The disk-slow metric also shows this (n10 is the only affected node):

store_id	name	value
10	storage.disk-slow	6
bble] 22003  disk slowness detected: write to file ‹/mnt/data1/cockroach/auxiliary/sideloading/r1XXXX/r16121/i30.t6› has been ongoing for 5.8s
bble] 22004  disk slowness detected: write to file ‹/mnt/data1/cockroach/auxiliary/sideloading/r1XXXX/r16121/i30.t6› has been ongoing for 7.8s
bble] 22005  disk slowness detected: write to file ‹/mnt/data1/cockroach/auxiliary/sideloading/r1XXXX/r16121/i30.t6› has been ongoing for 9.8s
bble] 22006  disk slowness detected: write to file ‹/mnt/data1/cockroach/auxiliary/sideloading/r1XXXX/r16121/i30.t6› has been ongoing for 11.8s
bble] 22007  disk slowness detected: write to file ‹/mnt/data1/cockroach/auxiliary/sideloading/r1XXXX/r16121/i30.t6› has been ongoing for 13.8s
bble] 22008  disk slowness detected: write to file ‹/mnt/data1/cockroach/auxiliary/sideloading/r1XXXX/r16121/i30.t6› has been ongoing for 15.8s

@sumeerbhola / @dt do we want to upgrade this test to fail in such a scenario?

tbg (Member) commented Apr 19, 2021

The ten test runs passed:

restore2TB/nodes=10 (2370.18s)
restore2TB/nodes=10 (2437.61s)
restore2TB/nodes=10 (2673.72s)
restore2TB/nodes=10 (2697.96s)
restore2TB/nodes=10 (2720.99s)
restore2TB/nodes=10 (2880.92s)
restore2TB/nodes=10 (3265.31s)
restore2TB/nodes=10 (3369.12s)
restore2TB/nodes=10 (3679.54s)
restore2TB/nodes=10 (3691.63s)

This is the total duration of the roachtests, not of the job itself. The longest import took just north of 1h, the shortest 40 minutes. This is in contrast to the current test (without my provisional fix), which has the same best case but will often take multiple hours:

image

aliher1911 (Contributor) commented:

Interestingly, in my testing I got down to the same test run times, with no snapshot rejections, by bumping maxDelaySplitTriggerTicks to 1000 instead. That allowed splits to propagate even in cases where we try to change raft group membership "mid-split".

tbg (Member) commented Apr 22, 2021

> Interestingly, in my testing I got down to the same test run times, with no snapshot rejections, by bumping maxDelaySplitTriggerTicks to 1000 instead.

This isn't entirely surprising. If the restore bursts a given number N of splits, then the natural delay for the entire sequence of splits to fully apply is roughly N/2 seconds. If maxDelaySplitTriggerTicks is in excess of that, no snapshot will be triggered and things look "normal", except that the splits still behaved terribly. In theory we could possibly keep dropping messages forever and get away with it (given the strong prior knowledge that there is a local replica containing our start key), but in practice we need to make sure delays during split application stay within reason anyway.
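
Rough arithmetic on those two settings, assuming the default 200ms raft tick interval and the ~0.5s-per-split cascade estimated earlier in the thread (both assumptions for illustration, not measurements from this run):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		raftTick      = 200 * time.Millisecond // assumed default raft tick interval
		perSplitDelay = 500 * time.Millisecond // estimated average delay per chained split
	)
	for _, ticks := range []int{100, 1000} {
		budget := time.Duration(ticks) * raftTick
		fmt.Printf("maxDelaySplitTriggerTicks=%d -> %s budget -> absorbs a burst of ~%d chained splits\n",
			ticks, budget, int(budget/perSplitDelay))
	}
	// 100 ticks  ~ 20s  -> bursts beyond roughly 40 splits start triggering snapshots.
	// 1000 ticks ~ 200s -> bursts up to roughly 400 splits are hidden, but the
	// splits themselves apply no faster.
}
```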
