
pkg/server: TestDrain is flakey #86974

Closed
fantapop opened this issue Aug 27, 2022 · 7 comments · Fixed by #99543
Assignees
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker skipped-test T-kv KV Team

Comments

@fantapop
Collaborator

fantapop commented Aug 27, 2022

Describe the problem
[Screenshot: TestDrain CI failure history, 2022-08-26]

TestDrain has been flaking a lot today.

Failed

```
=== RUN   TestDrain
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/ab5462d2cb3989e2a33018db970f23de/logTestDrain523638963
    test_log_scope.go:79: use -show-logs to present logs inline
    drain_test.go:254: expected remaining false, got true
    panic.go:500: -- test log scope end --
--- FAIL: TestDrain (74.31s)
```

Jira issue: CRDB-19064

@fantapop fantapop added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-server-and-security DB Server & Security labels Aug 27, 2022
@rafiss
Collaborator

rafiss commented Aug 29, 2022

Do you have a link to a failure?

I looked at the Test History on Bazel CI and GitHub CI, and it says 1000 of the last 1000 runs have passed.

@fantapop
Collaborator Author

It's failing in Merged Bazel Extended CI. This is one of the build statuses that posts to the flaky-test-fighter-ops channel.

@adityamaru
Contributor

adityamaru commented Aug 29, 2022

This test is failing pretty consistently under testrace so I'm going to skip it under race as flaky-test oncall. @rafiss is sql-experience going to own unskipping this or should I rope someone else in?
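Skipping a test only under the race detector relies on knowing, at runtime, whether the binary was built with `-race`. As a hedged sketch of the usual Go pattern (CockroachDB wraps this in test helpers; the constant-plus-build-tag arrangement below is the common idiom, not the project's exact code):

```go
package main

import "fmt"

// raceEnabled mirrors the standard Go idiom for detecting the race
// detector: this file declares the constant as false, and a sibling
// file guarded by `//go:build race` redeclares it as true. Here it is
// hard-coded to false so the example is self-contained.
const raceEnabled = false

// maybeSkip reports what a test harness would do: under race builds,
// a flaky test like TestDrain gets skipped instead of run.
func maybeSkip() string {
	if raceEnabled {
		return "skipping TestDrain under race"
	}
	return "running TestDrain"
}

func main() {
	fmt.Println(maybeSkip())
}
```

In a real test, the check would call `t.Skip(...)` from the `testing` package rather than return a string; the string form is used here only to keep the sketch runnable on its own.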

adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 29, 2022
Informs: cockroachdb#86974

Release note: None

Release justification: test only change
@rafiss
Collaborator

rafiss commented Aug 29, 2022

KV is a better home for this

@rafiss rafiss removed the T-server-and-security DB Server & Security label Aug 29, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Aug 29, 2022
craig bot pushed a commit that referenced this issue Aug 29, 2022
86499: sql: prevent dropping roles with synthetic privileges r=rafiss a=RichardJCai

Followup work to be done to support DROP OWNED BY and
REASSIGN OWNED BY here.

Release justification: Bug fix to newly introduced feature
Release note: None

87052: server: skip TestDrain under testrace r=srosenberg a=adityamaru

Informs: #86974

Release note: None

Release justification: test only change

87070: roachpb: clarify parameter name in ConstructStatementFingerprintID r=rafiss a=rafiss

"anonymizedStmt" is incorrect here -- the value passed in should be a
query with constants removed, but identifiers present.

Release note: None

Release justification: low risk change

Co-authored-by: richardjcai <[email protected]>
Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 6, 2022
The error messages in cases like cockroachdb#86974 are not useful otherwise.
This change allows us to see where the assertion method was called
from.

Release justification: testing only.
craig bot pushed a commit that referenced this issue Sep 6, 2022
87271: sql: change when txn is checked for refresh materialized views r=rafiss a=e-mbrown

Because the transaction was checked during planning rather than during
execution, refresh materialized views would fail from the client. Now
the transaction is checked during execution.

Release justification: Bug fix for refresh materialized views

Release note: None

87327: bazel: detect go standard library deprecation errors r=rickystewart a=healthy-pod

Previously, `staticcheck` only raised deprecation errors
from non-stdlib packages because the analysis pass used
by the deprecation analyzer is created by `nogo` which doesn't
analyze the standard library packages.

This code change patches `staticcheck` to let it detect deprecation
errors from the standard library and raise them. It also removes two
uses of `tls.Config.PreferServerCipherSuites` because it's deprecated
since Go 1.18 and is a legacy field that is ignored and has no effect.

Closes #84877

Release justification: Non-production code change
Release note: None

87426: sql: disable agg funcs in costfuzz and unoptimized query oracle r=mgartner a=mgartner

The false-positives of costfuzz and unoptimized-query-oracle caused by
aggregate functions are overwhelming. This commit prevents aggregate
functions from being generated for these tests.

Informs #87353

Release justification: This is a test-only change.

Release note: None

87431: server: mark testDrainContext assertion methods as test helpers r=nvanbenschoten a=nvanbenschoten

The error messages in cases like #86974 are not useful otherwise. This change allows us to see where the assertion method was called from.

Release justification: testing only.

Co-authored-by: e-mbrown <[email protected]>
Co-authored-by: healthy-pod <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@shralex
Contributor

shralex commented Mar 3, 2023

@AlexTalks can you take a look please?

@blathers-crl

blathers-crl bot commented Mar 17, 2023

Hi @nvanbenschoten, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.

@nvanbenschoten nvanbenschoten added the branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 label Mar 17, 2023
@AlexTalks
Contributor

So here is what I am seeing when I run this under race:

```
# test: sendDrainNoShutdown()
I230325 00:39:15.227264 14158 1@server/drain.go:145 ⋮ [T1,n1] 373  drain request received with doDrain = true, shutdown = false
...
I230325 00:39:17.782437 14158 1@server/drain.go:399 ⋮ [T1,n1] 377  (DEBUG) initiating kvserver node drain
I230325 00:39:17.782596 14158 1@kv/kvserver/store.go:1559 ⋮ [T1,drain,n1,s1] 378  (DEBUG) store marked as draining
I230325 00:39:18.149121 14158 1@server/drain.go:293 ⋮ [T1,n1] 379  drain remaining: 51
I230325 00:39:18.149263 14158 1@server/drain.go:295 ⋮ [T1,n1] 380  drain details: descriptor leases: 22, liveness record: 1, range lease iterations: 28
I230325 00:39:18.149475 14158 1@server/drain.go:175 ⋮ [T1,n1] 381  drain request completed without server shutdown

# test: sendDrainNoShutdown() again
I230325 00:39:18.151604 14728 1@server/drain.go:145 ⋮ [T1,n1] 383  drain request received with doDrain = true, shutdown = false
...
I230325 00:39:18.155547 986 kv/kvserver/replica_proposal.go:272 ⋮ [T1,n1,s1,r51/1:‹/Table/5{0-1}›,raft] 385  new range lease repl=(n1,s1):1 seq=1 start=0,0 exp=1679704764.152223164,0 pro=1679704758.152223164,0 following repl=(n1,s1):1 seq=1 start=0,0 exp=1679704746.135729956,0 pro=1679704740.135729956,0
I230325 00:39:18.172450 14728 1@server/drain.go:399 ⋮ [T1,n1] 386  (DEBUG) initiating kvserver node drain
I230325 00:39:18.172613 14728 1@kv/kvserver/store.go:1559 ⋮ [T1,drain,n1,s1] 387  (DEBUG) store marked as draining
I230325 00:39:18.182123 14728 1@server/drain.go:293 ⋮ [T1,n1] 388  drain remaining: 1
I230325 00:39:18.182249 14728 1@server/drain.go:295 ⋮ [T1,n1] 389  drain details: range lease iterations: 1
I230325 00:39:18.182404 14728 1@server/drain.go:175 ⋮ [T1,n1] 390  drain request completed without server shutdown
```

I have confirmed this across multiple test runs. It seems that after the first drain completes, and before the second drain calls store.SetDraining(...) again, a lease extension occurs. When this happens, the second drain ends up with 1 lease remaining. I'm not sure whether that lease should be transferred away (so the count would be 0) before the remaining count is reported by store.SetDraining(...), or whether this is entirely acceptable behavior and we simply need to call sendDrainNoShutdown() repeatedly until no leases remain, much like the CLI does.

I have a fix I'm putting up that changes the test to reissue the drain request until there are no more leases remaining. But if there is actually an issue with the lease extension occurring (or with store.SetDraining(...) not reporting the remaining count correctly), we'll want to fix that instead.
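The retry shape described above can be sketched as a small loop: keep issuing the drain request and stop once a call reports zero remaining work, with an attempt cap as a safety net. This is a minimal self-contained illustration, not the actual test code; `drainAndReport` and its canned counts (mimicking the 51-then-1-then-0 progression in the logs) are hypothetical stand-ins for the real Drain RPC:

```go
package main

import (
	"errors"
	"fmt"
)

// remainingPerCall fakes what the Drain RPC reports on each call.
// The counts echo the logs above, where a lease extension racing with
// the second drain left one lease behind.
var remainingPerCall = []int{51, 1, 0}

// drainAndReport stands in for a single drain request; it returns how
// much work (leases, sessions, etc.) is still left on the node.
func drainAndReport(call int) int {
	if call < len(remainingPerCall) {
		return remainingPerCall[call]
	}
	return 0
}

// drainUntilIdle reissues the drain request until a call reports zero
// remaining work, rather than asserting after a fixed number of calls.
// It returns how many calls were needed.
func drainUntilIdle(maxAttempts int) (int, error) {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if remaining := drainAndReport(attempt); remaining == 0 {
			return attempt + 1, nil
		}
	}
	return maxAttempts, errors.New("drain did not converge")
}

func main() {
	calls, err := drainUntilIdle(10)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("drain converged after %d calls\n", calls)
	// prints "drain converged after 3 calls"
}
```

This is also how `cockroach node drain` behaves from the CLI: it polls until the server reports no remaining work, which is why making the test do the same removes the race.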

craig bot pushed a commit that referenced this issue Mar 27, 2023
99458: jobs,*: stop writing payload and progress to system.jobs r=adityamaru a=adityamaru

This change introduces a cluster version after which the
payload and progress of a job will not be written to the
system.jobs table. This will ensure that the system.job_info
table is the single, source of truth for these two pieces of
information.

This cluster version has an associated upgrade that schema changes
the `payload` column of the `system.jobs` table to be nullable,
thereby allowing us to stop writing to it. This upgrade step
is necessary for a future patch where we will drop the payload
and progress columns. Without this intermediate upgrade step the
`ALTER TABLE ... DROP COLUMN` upgrade job will attempt to write
to dropped columns as part of its execution thereby failing to
run the upgrade.

Informs: #97762

Release note: None

99543: server: fix flaky drain test under race r=AlexTalks a=AlexTalks

While previously `TestDrain` would issue a drain request twice, and expect that after the second drain request there would be no remaining leases, we have seen in some race builds that a lease extension can occur before that second drain, leaving one lease remaining after the second drain request. This can be seen in the following log example:
```
I230325 00:39:18.151604 14728 1@server/drain.go:145 ⋮ [T1,n1] 383  drain request received with doDrain = true, shutdown = false
...
I230325 00:39:18.155547 986 kv/kvserver/replica_proposal.go:272 ⋮ [T1,n1,s1,r51/1:‹/Table/5{0-1}›,raft] 385  new range lease repl=(n1,s1):1 seq=1 start=0,0 exp=1679704764.152223164,0 pro=1679704758.152223164,0 following repl=(n1,s1):1 seq=1 start=0,0 exp=1679704746.135729956,0 pro=1679704740.135729956,0
I230325 00:39:18.172450 14728 1@server/drain.go:399 ⋮ [T1,n1] 386  (DEBUG) initiating kvserver node drain
I230325 00:39:18.172613 14728 1@kv/kvserver/store.go:1559 ⋮ [T1,drain,n1,s1] 387  (DEBUG) store marked as draining
I230325 00:39:18.182123 14728 1@server/drain.go:293 ⋮ [T1,n1] 388  drain remaining: 1
I230325 00:39:18.182249 14728 1@server/drain.go:295 ⋮ [T1,n1] 389  drain details: range lease iterations: 1
I230325 00:39:18.182404 14728 1@server/drain.go:175 ⋮ [T1,n1] 390  drain request completed without server shutdown
```
This change modifies the test to repeatedly issue drain requests until there is no remaining work, allowing the drain to complete upon subsequent requests.

Fixes: #86974

Release note: None

99665: sql/gc_job,sqlerrors: make GC job robust to missing descriptors r=fqazi a=ajwerner

### sql: do not drop table descriptor independently if we're in drop schema

If we have dropped schema IDs, we know that this is not an individual drop table
schema change. We only have more than one dropped table when we drop a database
or a schema. Before this change, we'd drop the table on its own, and then create
another GC job to drop all the tables. This is not actually a bug because we
should be robust to this, but it's also bad.

### sql/gc_job,sqlerrors: make GC job robust to missing descriptors

The check used for missing descriptors became incorrect in the course of
#94695. That change updated
the underlying error code used in getters by the GC job. The GC job would
subsequently retry forever when the descriptor was missing. This bug
has not been shipped yet, so not writing a release note.

Fixes: #99590

Release note (bug fix): DROP SCHEMA ... CASCADE could create multiple
GC jobs: one for every table and one for the cascaded drop itself. This has
been fixed.


Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: ajwerner <[email protected]>
@craig craig bot closed this as completed in 9b2229b Mar 27, 2023
rharding6373 pushed a commit to rharding6373/cockroach that referenced this issue Mar 27, 2023
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue Mar 28, 2023
AlexTalks added a commit that referenced this issue Mar 29, 2023
AlexTalks added a commit that referenced this issue Mar 29, 2023