roachtest: add AS OF SYSTEM TIME to SCRUB test #33152

thoszhang · 2018-12-14T02:03:30Z

Add AS OF SYSTEM TIME to the SCRUB roachtest to reduce contention. I ran
this version of the test and didn't get the deadlock that showed up in #33149.

Eventually we'll want to have AS OF SYSTEM TIME be present by default in the
SELECT query that is run during SCRUB instead of having it be manually added.

Release note: None

cockroach-teamcity · 2018-12-14T02:03:37Z

This change is

vivekmenezes

Thanks for figuring out the problem here.

vivekmenezes · 2018-12-14T23:03:50Z

pkg/cmd/roachtest/scrub.go

@@ -57,9 +57,9 @@ func makeScrubTPCCTest(
 	for i := 0; i < len(stmts); i += len(tpccTables) {
 		for j, t := range tpccTables {
 			if indexOnly {
-				stmts[i+j] = fmt.Sprintf(`EXPERIMENTAL SCRUB TABLE tpcc.%s WITH OPTIONS INDEX ALL`, t)
+				stmts[i+j] = fmt.Sprintf(`EXPERIMENTAL SCRUB TABLE tpcc.%s AS OF SYSTEM TIME '-5s' WITH OPTIONS INDEX ALL`, t)


Have you tried running it with '-1ms' to see if providing it with a specific timestamp itself is enough as opposed to running it in the past.

I changed it to -0s and ran it a few times over the weekend, and it finished within a few hours, so I've updated the PR. (I actually tried this originally but aborted the test run because it still looked like it was slow, but it turned out I'd spoken too soon.)

Add `AS OF SYSTEM TIME` to the `SCRUB` roachtest to reduce contention. I ran this version of the test and didn't get the deadlock that showed up in cockroachdb#33149. Eventually we'll want to have `AS OF SYSTEM TIME` be present by default in the `SELECT` query that is run during SCRUB instead of having it be manually added. Release note: None

petermattis

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/cmd/roachtest/scrub.go, line 60 at r1 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

I changed it to -0s and ran it a few times over the weekend, and it finished within a few hours, so I've updated the PR. (I actually tried this originally but aborted the test run because it still looked like it was slow, but it turned out I'd spoken too soon.)

What does -0s even do? I would think we'd want -1m to make sure that scrub queries never interfere with foreground traffic.

thoszhang

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/cmd/roachtest/scrub.go, line 60 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What does -0s even do? I would think we'd want -1m to make sure that scrub queries never interfere with foreground traffic.

My understanding was that setting the time to -0s (as opposed to not having AS OF SYSTEM TIME at all) prevents the SCRUB query from being retried with a later timestamp due to contention with writes happening at around the same time - is that not the case? (It does seem that adding AOST makes a difference, since the updated test has reliably finished within 5 hours, though I could be wrong about the reason for that.)

thoszhang

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/cmd/roachtest/scrub.go, line 60 at r1 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

My understanding was that setting the time to -0s (as opposed to not having AS OF SYSTEM TIME at all) prevents the SCRUB query from being retried with a later timestamp due to contention with writes happening at around the same time - is that not the case? (It does seem that adding AOST makes a difference, since the updated test has reliably finished within 5 hours, though I could be wrong about the reason for that.)

(To follow up on that comparison, when I ran the roachtest separately for my original PR for this test without AOST, I never saw timeouts as bad as what we're seeing in the nightly roachtests, but the tests did take closer to 10 hours to complete IIRC.)

vivekmenezes

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/cmd/roachtest/scrub.go, line 60 at r1 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

(To follow up on that comparison, when I ran the roachtest separately for my original PR for this test without AOST, I never saw timeouts as bad as what we're seeing in the nightly roachtests, but the tests did take closer to 10 hours to complete IIRC.)

I think using -0s makes sense for this test because we have no idea how far in the past the test started. Using -1m could deliver a time before the start of the test.

thoszhang · 2018-12-17T20:09:17Z

bors r+

petermattis

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/cmd/roachtest/scrub.go, line 60 at r1 (raw file):

Previously, vivekmenezes wrote…

I think using -0s makes sense for this test because we have no idea how far in the past the test started. Using -1m could deliver a time before the start of the test.

Hmm, ok. I'd prefer to leave this with a non-zero time. You can add a time.Sleep() to before runAndLogStmts if you want to make sure the time in the past is valid. A setting of -1m seems intuitively right to me, as it will guarantee a lack of contention with foreground operations.

33152: roachtest: add AS OF SYSTEM TIME to SCRUB test r=lucy-zhang a=lucy-zhang Add `AS OF SYSTEM TIME` to the `SCRUB` roachtest to reduce contention. I ran this version of the test and didn't get the deadlock that showed up in #33149. Eventually we'll want to have `AS OF SYSTEM TIME` be present by default in the `SELECT` query that is run during SCRUB instead of having it be manually added. Release note: None Co-authored-by: Lucy Zhang <[email protected]>

petermattis

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/cmd/roachtest/scrub.go, line 38 at r2 (raw file):

func registerScrubIndexOnlyTPCC(r *registry) {
	// TODO (lucy): increase numScrubRuns once we improve SCRUB performance
	r.Add(makeScrubTPCCTest(5, 1000, time.Hour*3, true, 2))

Is it necessary to run each scrub test for 3hrs? AFAICT, after the scrub queries run you don't care about continuing to run the test. Also, the scrub checks should be quick(-ish), right? Perhaps the scrub queries should be run at the end of very runTPCC invocation, just as workload check is.

craig · 2018-12-17T20:26:29Z

Build succeeded

GitHub CI (Cockroach)

thoszhang

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/cmd/roachtest/scrub.go, line 60 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Hmm, ok. I'd prefer to leave this with a non-zero time. You can add a time.Sleep() to before runAndLogStmts if you want to make sure the time in the past is valid. A setting of -1m seems intuitively right to me, as it will guarantee a lack of contention with foreground operations.

We do want to test how the foreground operations affect the operation of SCRUB, so I think it should be fine to keep the -0s as long as the tests don't continue to cause problems. Although maybe using -1m would be more realistic for customers in production, so I'm definitely willing to revisit this.

pkg/cmd/roachtest/scrub.go, line 38 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is it necessary to run each scrub test for 3hrs? AFAICT, after the scrub queries run you don't care about continuing to run the test. Also, the scrub checks should be quick(-ish), right? Perhaps the scrub queries should be run at the end of very runTPCC invocation, just as workload check is.

@petermattis Sorry, I merged before I saw your earlier comment. Part of the purpose is to check that SCRUB runs correctly while other reads/writes are happening, which is why the SCRUB checks are running at the same time as TPCC (and run repeatedly in a loop instead of just once). Unfortunately the SCRUB checks aren't that fast - it takes around 3 hours to actually check all the indexes for all the tables once (though they are much faster if we do the checks afterward instead of concurrently). Our plan was to get the roachtests running without timeouts, even if SCRUB is slow, and then try to improve on performance from there.

petermattis

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/cmd/roachtest/scrub.go, line 60 at r1 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

We do want to test how the foreground operations affect the operation of SCRUB, so I think it should be fine to keep the -0s as long as the tests don't continue to cause problems. Although maybe using -1m would be more realistic for customers in production, so I'm definitely willing to revisit this.

I can guess how foreground operations will interact with SCRUB: badly. But that is because any large scan will interact with foreground operations by causing them to block or retry. In production, I'd recommend a customer use -1m which is why I'm suggesting it here.

pkg/cmd/roachtest/scrub.go, line 38 at r2 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

@petermattis Sorry, I merged before I saw your earlier comment. Part of the purpose is to check that SCRUB runs correctly while other reads/writes are happening, which is why the SCRUB checks are running at the same time as TPCC (and run repeatedly in a loop instead of just once). Unfortunately the SCRUB checks aren't that fast - it takes around 3 hours to actually check all the indexes for all the tables once (though they are much faster if we do the checks afterward instead of concurrently). Our plan was to get the roachtests running without timeouts, even if SCRUB is slow, and then try to improve on performance from there.

I'm surprised that SCRUB tests take 3 hrs. The builtin tpcc checks take less than an hour and they look at all of the table data:

I181216 20:24:21.885005 1 workload/tpcc/tpcc.go:290  check 3.3.2.1 took 193.900364ms
I181216 20:24:56.145593 1 workload/tpcc/tpcc.go:290  check 3.3.2.2 took 34.260534646s
I181216 20:25:11.544024 1 workload/tpcc/tpcc.go:290  check 3.3.2.3 took 15.398370937s
I181216 20:36:28.778891 1 workload/tpcc/tpcc.go:290  check 3.3.2.4 took 11m17.234808977s
I181216 20:42:39.154640 1 workload/tpcc/tpcc.go:290  check 3.3.2.5 took 6m10.375704355s
I181216 21:03:23.750121 1 workload/tpcc/tpcc.go:290  check 3.3.2.7 took 20m44.595429993s
I181216 21:05:03.434510 1 workload/tpcc/tpcc.go:290  check 3.3.2.8 took 1m39.684337978s
I181216 21:06:27.820092 1 workload/tpcc/tpcc.go:290  check 3.3.2.9 took 1m24.385527396s

I'd expect SCRUB to take similar time.

thoszhang

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/cmd/roachtest/scrub.go, line 60 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I can guess how foreground operations will interact with SCRUB: badly. But that is because any large scan will interact with foreground operations by causing them to block or retry. In production, I'd recommend a customer use -1m which is why I'm suggesting it here.

I'm opening a PR soon to make some other updates to this test - I'll update there.

pkg/cmd/roachtest/scrub.go, line 38 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm surprised that SCRUB tests take 3 hrs. The builtin tpcc checks take less than an hour and they look at all of the table data:

I181216 20:24:21.885005 1 workload/tpcc/tpcc.go:290  check 3.3.2.1 took 193.900364ms
I181216 20:24:56.145593 1 workload/tpcc/tpcc.go:290  check 3.3.2.2 took 34.260534646s
I181216 20:25:11.544024 1 workload/tpcc/tpcc.go:290  check 3.3.2.3 took 15.398370937s
I181216 20:36:28.778891 1 workload/tpcc/tpcc.go:290  check 3.3.2.4 took 11m17.234808977s
I181216 20:42:39.154640 1 workload/tpcc/tpcc.go:290  check 3.3.2.5 took 6m10.375704355s
I181216 21:03:23.750121 1 workload/tpcc/tpcc.go:290  check 3.3.2.7 took 20m44.595429993s
I181216 21:05:03.434510 1 workload/tpcc/tpcc.go:290  check 3.3.2.8 took 1m39.684337978s
I181216 21:06:27.820092 1 workload/tpcc/tpcc.go:290  check 3.3.2.9 took 1m24.385527396s

I'd expect SCRUB to take similar time.

Checking all the secondary indexes for all tpcc tables only takes about an hour when the checks run by themselves (I don't have an exact number, but this is based on running a subset of the index checks). It only degrades to 3 hours when TPCC is running at the same time.

thoszhang requested a review from vivekmenezes December 14, 2018 02:03

thoszhang mentioned this pull request Dec 14, 2018

SCRUB index check hangs when run concurrently with TPCC #33173

Closed

vivekmenezes reviewed Dec 14, 2018

View reviewed changes

thoszhang force-pushed the scrub-aost branch from 5445a4a to 42a7771 Compare December 17, 2018 18:23

petermattis reviewed Dec 17, 2018

View reviewed changes

thoszhang commented Dec 17, 2018

View reviewed changes

vivekmenezes approved these changes Dec 17, 2018

View reviewed changes

petermattis reviewed Dec 17, 2018

View reviewed changes

craig bot merged commit 42a7771 into cockroachdb:master Dec 17, 2018

thoszhang commented Dec 17, 2018

View reviewed changes

petermattis reviewed Dec 17, 2018

View reviewed changes

thoszhang commented Dec 17, 2018

View reviewed changes

thoszhang deleted the scrub-aost branch February 12, 2019 21:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: add AS OF SYSTEM TIME to SCRUB test #33152

roachtest: add AS OF SYSTEM TIME to SCRUB test #33152

thoszhang commented Dec 14, 2018

cockroach-teamcity commented Dec 14, 2018

vivekmenezes left a comment

vivekmenezes Dec 14, 2018

thoszhang Dec 17, 2018

petermattis left a comment

thoszhang left a comment

thoszhang left a comment

vivekmenezes left a comment

thoszhang commented Dec 17, 2018

petermattis left a comment

petermattis left a comment

craig bot commented Dec 17, 2018

thoszhang left a comment

petermattis left a comment

thoszhang left a comment

roachtest: add AS OF SYSTEM TIME to SCRUB test #33152

roachtest: add AS OF SYSTEM TIME to SCRUB test #33152

Conversation

thoszhang commented Dec 14, 2018

cockroach-teamcity commented Dec 14, 2018

vivekmenezes left a comment

Choose a reason for hiding this comment

vivekmenezes Dec 14, 2018

Choose a reason for hiding this comment

thoszhang Dec 17, 2018

Choose a reason for hiding this comment

petermattis left a comment

Choose a reason for hiding this comment

thoszhang left a comment

Choose a reason for hiding this comment

thoszhang left a comment

Choose a reason for hiding this comment

vivekmenezes left a comment

Choose a reason for hiding this comment

thoszhang commented Dec 17, 2018

petermattis left a comment

Choose a reason for hiding this comment

petermattis left a comment

Choose a reason for hiding this comment

craig bot commented Dec 17, 2018

Build succeeded

thoszhang left a comment

Choose a reason for hiding this comment

petermattis left a comment

Choose a reason for hiding this comment

thoszhang left a comment

Choose a reason for hiding this comment