Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: add AS OF SYSTEM TIME to SCRUB test #33152

Merged
merged 1 commit into from
Dec 17, 2018

Conversation

thoszhang
Copy link
Contributor

Add AS OF SYSTEM TIME to the SCRUB roachtest to reduce contention. I ran
this version of the test and didn't get the deadlock that showed up in #33149.

Eventually we'll want to have AS OF SYSTEM TIME be present by default in the
SELECT query that is run during SCRUB instead of having it be manually added.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@vivekmenezes vivekmenezes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for figuring out the problem here.

@@ -57,9 +57,9 @@ func makeScrubTPCCTest(
for i := 0; i < len(stmts); i += len(tpccTables) {
for j, t := range tpccTables {
if indexOnly {
stmts[i+j] = fmt.Sprintf(`EXPERIMENTAL SCRUB TABLE tpcc.%s WITH OPTIONS INDEX ALL`, t)
stmts[i+j] = fmt.Sprintf(`EXPERIMENTAL SCRUB TABLE tpcc.%s AS OF SYSTEM TIME '-5s' WITH OPTIONS INDEX ALL`, t)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried running it with '-1ms' to see if providing it with a specific timestamp itself is enough as opposed to running it in the past.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to -0s and ran it a few times over the weekend, and it finished within a few hours, so I've updated the PR. (I actually tried this originally but aborted the test run because it still looked like it was slow, but it turned out I'd spoken too soon.)

Add `AS OF SYSTEM TIME` to the `SCRUB` roachtest to reduce contention. I ran
this version of the test and didn't get the deadlock that showed up in cockroachdb#33149.

Eventually we'll want to have `AS OF SYSTEM TIME` be present by default in the
`SELECT` query that is run during SCRUB instead of having it be manually added.

Release note: None
Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/scrub.go, line 60 at r1 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

I changed it to -0s and ran it a few times over the weekend, and it finished within a few hours, so I've updated the PR. (I actually tried this originally but aborted the test run because it still looked like it was slow, but it turned out I'd spoken too soon.)

What does -0s even do? I would think we'd want -1m to make sure that scrub queries never interfere with foreground traffic.

Copy link
Contributor Author

@thoszhang thoszhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/scrub.go, line 60 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What does -0s even do? I would think we'd want -1m to make sure that scrub queries never interfere with foreground traffic.

My understanding was that setting the time to -0s (as opposed to not having AS OF SYSTEM TIME at all) prevents the SCRUB query from being retried with a later timestamp due to contention with writes happening at around the same time - is that not the case? (It does seem that adding AOST makes a difference, since the updated test has reliably finished within 5 hours, though I could be wrong about the reason for that.)

Copy link
Contributor Author

@thoszhang thoszhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/scrub.go, line 60 at r1 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

My understanding was that setting the time to -0s (as opposed to not having AS OF SYSTEM TIME at all) prevents the SCRUB query from being retried with a later timestamp due to contention with writes happening at around the same time - is that not the case? (It does seem that adding AOST makes a difference, since the updated test has reliably finished within 5 hours, though I could be wrong about the reason for that.)

(To follow up on that comparison, when I ran the roachtest separately for my original PR for this test without AOST, I never saw timeouts as bad as what we're seeing in the nightly roachtests, but the tests did take closer to 10 hours to complete IIRC.)

Copy link
Contributor

@vivekmenezes vivekmenezes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/scrub.go, line 60 at r1 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

(To follow up on that comparison, when I ran the roachtest separately for my original PR for this test without AOST, I never saw timeouts as bad as what we're seeing in the nightly roachtests, but the tests did take closer to 10 hours to complete IIRC.)

I think using -0s makes sense for this test because we have no idea how far in the past the test started. Using -1m could deliver a time before the start of the test.

@thoszhang
Copy link
Contributor Author

bors r+

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/scrub.go, line 60 at r1 (raw file):

Previously, vivekmenezes wrote…

I think using -0s makes sense for this test because we have no idea how far in the past the test started. Using -1m could deliver a time before the start of the test.

Hmm, ok. I'd prefer to leave this with a non-zero time. You can add a time.Sleep() to before runAndLogStmts if you want to make sure the time in the past is valid. A setting of -1m seems intuitively right to me, as it will guarantee a lack of contention with foreground operations.

craig bot pushed a commit that referenced this pull request Dec 17, 2018
33152: roachtest: add AS OF SYSTEM TIME to SCRUB test r=lucy-zhang a=lucy-zhang

Add `AS OF SYSTEM TIME` to the `SCRUB` roachtest to reduce contention. I ran
this version of the test and didn't get the deadlock that showed up in #33149.

Eventually we'll want to have `AS OF SYSTEM TIME` be present by default in the
`SELECT` query that is run during SCRUB instead of having it be manually added.

Release note: None

Co-authored-by: Lucy Zhang <[email protected]>
Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/scrub.go, line 38 at r2 (raw file):

func registerScrubIndexOnlyTPCC(r *registry) {
	// TODO (lucy): increase numScrubRuns once we improve SCRUB performance
	r.Add(makeScrubTPCCTest(5, 1000, time.Hour*3, true, 2))

Is it necessary to run each scrub test for 3hrs? AFAICT, after the scrub queries run you don't care about continuing to run the test. Also, the scrub checks should be quick(-ish), right? Perhaps the scrub queries should be run at the end of very runTPCC invocation, just as workload check is.

@craig
Copy link
Contributor

craig bot commented Dec 17, 2018

Build succeeded

@craig craig bot merged commit 42a7771 into cockroachdb:master Dec 17, 2018
Copy link
Contributor Author

@thoszhang thoszhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/scrub.go, line 60 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Hmm, ok. I'd prefer to leave this with a non-zero time. You can add a time.Sleep() to before runAndLogStmts if you want to make sure the time in the past is valid. A setting of -1m seems intuitively right to me, as it will guarantee a lack of contention with foreground operations.

We do want to test how the foreground operations affect the operation of SCRUB, so I think it should be fine to keep the -0s as long as the tests don't continue to cause problems. Although maybe using -1m would be more realistic for customers in production, so I'm definitely willing to revisit this.


pkg/cmd/roachtest/scrub.go, line 38 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is it necessary to run each scrub test for 3hrs? AFAICT, after the scrub queries run you don't care about continuing to run the test. Also, the scrub checks should be quick(-ish), right? Perhaps the scrub queries should be run at the end of very runTPCC invocation, just as workload check is.

@petermattis Sorry, I merged before I saw your earlier comment. Part of the purpose is to check that SCRUB runs correctly while other reads/writes are happening, which is why the SCRUB checks are running at the same time as TPCC (and run repeatedly in a loop instead of just once). Unfortunately the SCRUB checks aren't that fast - it takes around 3 hours to actually check all the indexes for all the tables once (though they are much faster if we do the checks afterward instead of concurrently). Our plan was to get the roachtests running without timeouts, even if SCRUB is slow, and then try to improve on performance from there.

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/scrub.go, line 60 at r1 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

We do want to test how the foreground operations affect the operation of SCRUB, so I think it should be fine to keep the -0s as long as the tests don't continue to cause problems. Although maybe using -1m would be more realistic for customers in production, so I'm definitely willing to revisit this.

I can guess how foreground operations will interact with SCRUB: badly. But that is because any large scan will interact with foreground operations by causing them to block or retry. In production, I'd recommend a customer use -1m which is why I'm suggesting it here.


pkg/cmd/roachtest/scrub.go, line 38 at r2 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

@petermattis Sorry, I merged before I saw your earlier comment. Part of the purpose is to check that SCRUB runs correctly while other reads/writes are happening, which is why the SCRUB checks are running at the same time as TPCC (and run repeatedly in a loop instead of just once). Unfortunately the SCRUB checks aren't that fast - it takes around 3 hours to actually check all the indexes for all the tables once (though they are much faster if we do the checks afterward instead of concurrently). Our plan was to get the roachtests running without timeouts, even if SCRUB is slow, and then try to improve on performance from there.

I'm surprised that SCRUB tests take 3 hrs. The builtin tpcc checks take less than an hour and they look at all of the table data:

I181216 20:24:21.885005 1 workload/tpcc/tpcc.go:290  check 3.3.2.1 took 193.900364ms
I181216 20:24:56.145593 1 workload/tpcc/tpcc.go:290  check 3.3.2.2 took 34.260534646s
I181216 20:25:11.544024 1 workload/tpcc/tpcc.go:290  check 3.3.2.3 took 15.398370937s
I181216 20:36:28.778891 1 workload/tpcc/tpcc.go:290  check 3.3.2.4 took 11m17.234808977s
I181216 20:42:39.154640 1 workload/tpcc/tpcc.go:290  check 3.3.2.5 took 6m10.375704355s
I181216 21:03:23.750121 1 workload/tpcc/tpcc.go:290  check 3.3.2.7 took 20m44.595429993s
I181216 21:05:03.434510 1 workload/tpcc/tpcc.go:290  check 3.3.2.8 took 1m39.684337978s
I181216 21:06:27.820092 1 workload/tpcc/tpcc.go:290  check 3.3.2.9 took 1m24.385527396s

I'd expect SCRUB to take similar time.

Copy link
Contributor Author

@thoszhang thoszhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/scrub.go, line 60 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I can guess how foreground operations will interact with SCRUB: badly. But that is because any large scan will interact with foreground operations by causing them to block or retry. In production, I'd recommend a customer use -1m which is why I'm suggesting it here.

I'm opening a PR soon to make some other updates to this test - I'll update there.


pkg/cmd/roachtest/scrub.go, line 38 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm surprised that SCRUB tests take 3 hrs. The builtin tpcc checks take less than an hour and they look at all of the table data:

I181216 20:24:21.885005 1 workload/tpcc/tpcc.go:290  check 3.3.2.1 took 193.900364ms
I181216 20:24:56.145593 1 workload/tpcc/tpcc.go:290  check 3.3.2.2 took 34.260534646s
I181216 20:25:11.544024 1 workload/tpcc/tpcc.go:290  check 3.3.2.3 took 15.398370937s
I181216 20:36:28.778891 1 workload/tpcc/tpcc.go:290  check 3.3.2.4 took 11m17.234808977s
I181216 20:42:39.154640 1 workload/tpcc/tpcc.go:290  check 3.3.2.5 took 6m10.375704355s
I181216 21:03:23.750121 1 workload/tpcc/tpcc.go:290  check 3.3.2.7 took 20m44.595429993s
I181216 21:05:03.434510 1 workload/tpcc/tpcc.go:290  check 3.3.2.8 took 1m39.684337978s
I181216 21:06:27.820092 1 workload/tpcc/tpcc.go:290  check 3.3.2.9 took 1m24.385527396s

I'd expect SCRUB to take similar time.

Checking all the secondary indexes for all tpcc tables only takes about an hour when the checks run by themselves (I don't have an exact number, but this is based on running a subset of the index checks). It only degrades to 3 hours when TPCC is running at the same time.

@thoszhang thoszhang deleted the scrub-aost branch February 12, 2019 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants