Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: randomly run with runtime assertions by default #111949

Merged
merged 2 commits into from
Nov 2, 2023

Conversation

DarrylWong
Copy link
Contributor

@DarrylWong DarrylWong commented Oct 6, 2023

This changes the semantics of t.Cockroach() to use a binary with runtime assertions enabled by default. Performance tests (indicated by the benchmark TestSpec) will be able to continue using the standard binary, without runtime assertions or metamorphic constants, as usual.

This commit also opts-out other tests that cannot easily be run with runtime assertions or metamorphic constants enabled, most often due to timeouts or metamorphic constant incompatibility.

Resolves #86678.
Informs #94986.

Epic: none

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@@ -96,6 +96,7 @@ func registerKVBenchSpec(r registry.Registry, b kvBenchSpec) {
Suites: registry.ManualOnly,
Tags: registry.Tags("manual"),
Owner: registry.OwnerKV,
Benchmark: true, // TODO: Comments imply this is a benchmark test but this was not set
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test name and comments imply this is a benchmark test so I assume the missing benchmark spec is a mistake but if there is a reason why it shouldn't be set let me know.

@DarrylWong
Copy link
Contributor Author

DarrylWong commented Oct 26, 2023

Should be RFAL

Started up Test Run here

And another here

@DarrylWong DarrylWong marked this pull request as ready for review October 26, 2023 20:08
@DarrylWong DarrylWong requested review from a team as code owners October 26, 2023 20:08
@DarrylWong DarrylWong requested review from srosenberg and renatolabs and removed request for a team October 26, 2023 20:08
cockroach string // path to main cockroach binary
cockroachShort string // path to cockroach-short binary compiled with --crdb_test build tag
cockroach string // path to main cockroach binary
cockroachShort string // path to cockroach-short binary compiled with --crdb_test build tag
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: cockroachAssertionsEnabled (or similar, it seems elsewhere we use cockroachEA for "enabled assertions") is a better name than cockroachShort. Yes, the comment explains the real difference between these two things, but everyone sees the variable name, not everyone sees the comment.

@@ -83,7 +83,7 @@ case "$component" in
;;
roachtest)
# Roachtest binary.
bazel_args=(//pkg/cmd/roachtest)
bazel_args=(//pkg/cmd/roachtest --crdb_test)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What benefit does this give exactly?

Copy link
Contributor Author

@DarrylWong DarrylWong Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stealing from what Stan said:

Note, to fully support interop between a roachtest runner, (executing the roachtest binary), and a node, executing the cockroach binary compiled with assertions (i.e., crdb_test), the roachtest binary must also be compiled with assertions to establish the same serialization on the wire. E.g., KVNemesisSeq is used on the wire only under crdb_test builds. The corresponding (sequence) Container type is conditional on the binary being compiled under crdb_test. (See pkg/kv/kvpb/api.proto) Thus, a request serialized from the roachtest runner is compatible with the cockroach binary iff both binaries are built under crdb_test.

The solution here was to just compile all roachtest binaries with crdb_test since it touches very little code, rather than try and figure out a way to have team city selectively use roachtest binaries based on what the cockroach binary is compiled with.

I can add a comment though to make it more clear.

Copy link
Contributor Author

@DarrylWong DarrylWong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but is there a specific reason why we're skipping runtime checks in a few tests that legitimately break?

Specific comments on kv/split tests below but in general no there isn't and I agree with the sentiment we should let them fail.

I'm confused about schemachange/bulkingest`

I want to say the connection failure was just an infra flake since I can't repro it with the same seed and it's never failed before w/ runtime assertions. But I also don't see anything that jumps out to me in the logs so I don't want to write it off just yet. Let me dig into this some more.

PR suggests the timeout was bumped to 60mins

The duration of how long to run the workload was bumped to 60 mins, the timeout is calculated as 2 * duration.

Would be interesting to run a full nightly build in this branch forcing metamorphic builds for all tests, to reveal some flakes we might not have noticed. Did you run a build like that? I'm interested to see the outcome.

Yeah, that's actually what I was doing most of the time. Here's two runs where I ran all of them #1 and #2. Unfortunately you can see they timed out, which is why I reverted it back for the final few runs.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @rickystewart, and @srosenberg)


pkg/cmd/roachtest/tests/asyncpg.go line 54 at r5 (raw file):

Previously, renatolabs (Renato Costa) wrote…

In pgjdbc, we're forcing the value of this session var -- can't we do this here too?

The asyncpg tests drops the entire database several times throughout the test which also drops session vars. So even setting the default session value like in pgjdbc isn't enough.


pkg/cmd/roachtest/tests/multitenant_distsql.go line 65 at r5 (raw file):

Previously, renatolabs (Renato Costa) wrote…

We shouldn't skip runtime assertions because this test fails IMO: it's precisely why we're doing this! I think the issue will get a lot more eyes when it is reported as a roachtest failure. My reading of the issue you created is that it is a legitimate bug.

SGTM


pkg/cmd/roachtest/tests/multitenant_utils.go line 281 at r5 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Should we delete this?

Done.


pkg/cmd/roachtest/tests/split.go line 434 at r5 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Same comment re: letting this break.

The root cause of this issue was found and from my understanding it's not really an urgent issue. Also from what I've been told the faulty QPS function is deprecated anyway. When we have a metamorphic exclusion list, we can just disable kv-batch-size and let the other constants/assertions be on.

I'm not opposed to letting this break, I guess I'm more wondering what we do if KV determines the bug isn't worth fixing? Switch it back to t.StandardCockroach()?

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specific comments on kv/split tests below

👍 sounds good, the exception seems reasonable.

I want to say the connection failure was just an infra flake since I can't repro it

I think the connection failure is a red herring, the test timed out after 2h.

Here's two runs where I ran all of them #1 and #2

These are full runs, but are tests running with metamorphic builds all the time? I don't see any custom parameters to force that. Also, when I look at the logs for some tests in that #2 link, I only see the Running with runtime assertions enabled on global seed message on some tests, not all.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, @rickystewart, and @srosenberg)


pkg/cmd/roachtest/test_impl.go line 166 at r6 (raw file):

			// CI build, but we can't assume it exists in every roachtest call.
			if path := t.RuntimeAssertionsCockroach(); path != "" {
				t.l.Printf("Runtime assertions enabled")

For instant knowledge of whether we are running with assertions enabled or not, we could add a log message to both branches.


pkg/cmd/roachtest/tests/asyncpg.go line 54 at r5 (raw file):

Previously, DarrylWong wrote…

The asyncpg tests drops the entire database several times throughout the test which also drops session vars. So even setting the default session value like in pgjdbc isn't enough.

Got it, thanks. Would it make sense to run these tests with assertions enabled, but disabling metamorphic constants?

@DarrylWong DarrylWong force-pushed the roachtest-ea branch 2 times, most recently from c58ba1a to 339f002 Compare October 31, 2023 21:57
Copy link
Contributor Author

@DarrylWong DarrylWong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the connection failure is a red herring, the test timed out after 2h.

Yeah after running it 5+ times I finally got it to timeout on creating indexes... Seems like an actual issue and not a flake but seems weird why it doesn't consistently fail on the same seed. Will have to investigate more.

These are full runs, but are tests running with metamorphic builds all the time?

Yeah I did it in an admittedly not very obvious way of just temporarily replacing t.StandardCockroach() with the runtime binary in the code itself partly because I was tired of typing in a flag every time. I could do another run if you want, more testing wouldn't hurt but it will probably time out.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @rickystewart, and @srosenberg)


pkg/cmd/roachtest/test_impl.go line 166 at r6 (raw file):

Previously, renatolabs (Renato Costa) wrote…

For instant knowledge of whether we are running with assertions enabled or not, we could add a log message to both branches.

Done.


pkg/cmd/roachtest/tests/asyncpg.go line 54 at r5 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Got it, thanks. Would it make sense to run these tests with assertions enabled, but disabling metamorphic constants?

I was under the impression that just disabling metamorphic constants wasn't easily doable in roachtests since they get enabled on init. If it's simple functionality I'd be happy to try adding, but otherwise I wonder how much benefit it would be over a metamorphic exclusion list .

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

temporarily replacing t.StandardCockroach() with the runtime binary in the code itself

Ah, I see it in the commit now!

Seems like an actual issue

Could be, the last we hear from the index creation is

executing declarative schema change PostCommitPhase stage 2 of 7 with 1 BackfillType op (rollback=false) for CREATE INDEX

2h before the test timed out, so it got stuck there for a really long time. We should probably let sql-foundations investigate this issue when it occurs on master, as it doesn't seem related to anything we're doing.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, @rickystewart, and @srosenberg)


pkg/cmd/roachtest/tests/asyncpg.go line 54 at r5 (raw file):

Previously, DarrylWong wrote…

I was under the impression that just disabling metamorphic constants wasn't easily doable in roachtests since they get enabled on init. If it's simple functionality I'd be happy to try adding, but otherwise I wonder how much benefit it would be over a metamorphic exclusion list .

You can set COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING=true when starting the cockroach process, and that will disable use of metamorphic constants in CRDB. Runtime checks (gated with buildutil.CrdbTestBuild) will still run.

I'm saying there's still value in running tests with these runtime checks (but without changing any constants) instead of always using "standard cockroach" in cases like asyncpg. But this can be done in follow-up work too.

Copy link
Contributor Author

@DarrylWong DarrylWong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably let sql-foundations investigate this issue when it occurs on master,

Sounds good to me

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @rickystewart, and @srosenberg)


pkg/cmd/roachtest/tests/asyncpg.go line 54 at r5 (raw file):

Previously, renatolabs (Renato Costa) wrote…

You can set COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING=true when starting the cockroach process, and that will disable use of metamorphic constants in CRDB. Runtime checks (gated with buildutil.CrdbTestBuild) will still run.

I'm saying there's still value in running tests with these runtime checks (but without changing any constants) instead of always using "standard cockroach" in cases like asyncpg. But this can be done in follow-up work too.

Yeah that seems easy enough. In that case, we should only be using StandardCockroach for excessive memory usage and timeouts.

Added it in for asyncpg, kv/split and copy/bank/rows=1000000 and made them all t.Cockroach() again. Initial runs seem good.

This changes the semantics of `t.Cockroach()` to use a binary with
runtime assertions and metamorphic constants enabled by default.
Performance tests (indicated by the benchmark TestSpec) will continue
using the standard binary, without runtime assertions, as usual.

This commit also opts-out other tests that cannot easily be run with
runtime assertions or metamorphic constants enabled, most often due to
timeouts and metamorphic constant incompatibility.

Resolves cockroachdb#86678.
Informs cockroachdb#94986.

Epic: none

Release note: None
Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, @rickystewart, and @srosenberg)

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something I was thinking about: I think it shouldn't be too difficult to add a "cluster param" in [1] indicating whether the run used metamorphic builds or not. I think it would be nice to give teams an instant way to know if we used metamorphic builds without even having to click through TC artifacts. Thoughts?

[1]

clusterParams := map[string]string{

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, @rickystewart, and @srosenberg)

Adds a label indicating that a roachtest failure was
on a metamorphic build on github issue posts.
Copy link
Contributor Author

@DarrylWong DarrylWong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but unfortunately it's not the cleanest. The binary used is decided at runtime so it's not available through testSpec like the other params are. I had to pass it in as a separate argument which seems a little messy.

It should work but I'm happy to revert the commit and go back to the drawing board if you have any better suggestions.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @renatolabs, @rickystewart, and @srosenberg)

@DarrylWong DarrylWong added T-testeng TestEng Team backport-23.2.x Flags PRs that need to be backported to 23.2. labels Nov 2, 2023
Copy link
Contributor

@mikeCRL mikeCRL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed no impacts to docs dependencies within the relevant files.

@renatolabs
Copy link
Contributor

It should work but I'm happy to revert the commit and go back to the drawing board

No, what you did is kind of what I had in mind originally. There's no elegant way to aggregate random decision made at test run-time and solving that problem is out of scope for this PR. Given the value that this brings (immediately knowing if we used metamorphic binaries), I think this is a good trade-off.

@DarrylWong
Copy link
Contributor Author

TFTRs!

bors r=renatolabs, mikeCRL, rickystewart, herkolategan

@craig
Copy link
Contributor

craig bot commented Nov 2, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Nov 2, 2023

Build succeeded:

@craig craig bot merged commit 722f902 into cockroachdb:master Nov 2, 2023
3 checks passed
Copy link

blathers-crl bot commented Nov 2, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 581e671 to blathers/backport-release-23.2-111949: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.2.x Flags PRs that need to be backported to 23.2. T-testeng TestEng Team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

enable running non-perf roachtests with runtime assertions (crdb_test build tag)
6 participants