pkg/server: TestDrain is flakey #86974
Comments
It's failing in Merged Bazel Extended CI. This is one of the build statuses that posts to the flaky-test-fighter-ops channel.
This test is failing pretty consistently under testrace, so I'm going to skip it under race as flaky-test oncall. @rafiss, is sql-experience going to own unskipping this, or should I rope someone else in?
Informs: cockroachdb#86974
Release note: None
Release justification: test only change
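For reference, the skip is typically a one-line guard at the top of the test. Here is a minimal sketch of that shape, assuming the repository's `testutils/skip` package exposes a `skip.UnderRace` helper; this is illustrative, not the exact diff from the skip commit:

```go
package server_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils/skip"
)

func TestDrain(t *testing.T) {
	// Skip under the race detector until #86974 is resolved; the test's
	// timing assumptions do not hold when the race runtime slows the server.
	skip.UnderRace(t, "flaky under testrace; see #86974")

	// ... rest of the test body unchanged ...
}
```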
KV is a better home for this.
86499: sql: prevent dropping roles with synthetic privileges r=rafiss a=RichardJCai

Followup work to be done to support DROP OWNED BY and REASSIGN OWNED BY here.

Release justification: Bug fix to newly introduced feature
Release note: None

87052: server: skip TestDrain under testrace r=srosenberg a=adityamaru

Informs: #86974

Release note: None
Release justification: test only change

87070: roachpb: clarify parameter name in ConstructStatementFingerprintID r=rafiss a=rafiss

"anonymizedStmt" is incorrect here -- the value passed in should be a query with constants removed, but identifiers present.

Release note: None
Release justification: low risk change

Co-authored-by: richardjcai <[email protected]>
Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
The error messages in cases like cockroachdb#86974 are not useful otherwise. This change allows us to see where the assertion method was called from. Release justification: testing only.
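In Go, this is what `(*testing.T).Helper()` provides: a helper marked this way causes failure messages to report the caller's file and line instead of the helper's. A minimal sketch of the idea follows; the `assertRemaining` method and the struct layout are illustrative stand-ins, not the actual code from the PR:

```go
package server_test

import "testing"

// testDrainContext stands in for the drain test's helper state; the single
// field here is illustrative, not the real struct layout.
type testDrainContext struct {
	t *testing.T
}

// assertRemaining fails the test when the drain response reports an
// unexpected amount of remaining work. Calling t.Helper() first makes the
// failure message point at the caller's line rather than this method, which
// is what makes failures like #86974 actionable.
func (c *testDrainContext) assertRemaining(got, want int) {
	c.t.Helper()
	if got != want {
		c.t.Fatalf("expected %d remaining drain items, got %d", want, got)
	}
}
```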
87271: sql: change when txn is checked refresh materialized views r=rafiss a=e-mbrown

Due to the transaction being checked during planning and not during execution, refresh materialized views would fail from the client. Now the transaction is checked during execution.

Release justification: Bug fix for refresh materialized views
Release note: None

87327: bazel: detect go standard library deprecation errors r=rickystewart a=healthy-pod

Previously, `staticcheck` only raised deprecation errors from non-stdlib packages because the analysis pass used by the deprecation analyzer is created by `nogo`, which doesn't analyze the standard library packages. This code change patches `staticcheck` to let it detect deprecation errors from the standard library and raise them. It also removes two uses of `tls.Config.PreferServerCipherSuites`, which has been deprecated since Go 1.18 and is a legacy field that is ignored and has no effect.

Closes #84877
Release justification: Non-production code change
Release note: None

87426: sql: disable agg funcs in costfuzz and unoptimized query oracle r=mgartner a=mgartner

The false positives of costfuzz and unoptimized-query-oracle caused by aggregate functions are overwhelming. This commit prevents aggregate functions from being generated for these tests.

Informs #87353
Release justification: This is a test-only change.
Release note: None

87431: server: mark testDrainContext assertion methods as test helpers r=nvanbenschoten a=nvanbenschoten

The error messages in cases like #86974 are not useful otherwise. This change allows us to see where the assertion method was called from.

Release justification: testing only.

Co-authored-by: e-mbrown <[email protected]>
Co-authored-by: healthy-pod <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
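For context on the `tls.Config.PreferServerCipherSuites` cleanup: the field has been ignored by the Go runtime since Go 1.18, so deleting it changes nothing at runtime. A minimal sketch of that kind of removal, with illustrative package and function names:

```go
package example

import "crypto/tls"

// newTLSConfig shows the shape of the cleanup: PreferServerCipherSuites is a
// legacy field that has been ignored since Go 1.18, so dropping it does not
// change behavior.
func newTLSConfig() *tls.Config {
	return &tls.Config{
		MinVersion: tls.VersionTLS12,
		// PreferServerCipherSuites: true, // deprecated and ignored since Go 1.18; safe to delete.
	}
}
```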
@AlexTalks can you take a look please?
Hi @nvanbenschoten, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
So here is what I am seeing when I run this under race:
Have confirmed this across multiple test runs, and it seems like the lease extension happens after the first drain / before the second drain starts. I have a fix I'm putting up that changes the test to reissue the drain request until there are no more leases remaining, but if there's actually an issue here with the lease extension occurring (or
99458: jobs,*: stop writing payload and progress to system.jobs r=adityamaru a=adityamaru

This change introduces a cluster version after which the payload and progress of a job will not be written to the system.jobs table. This will ensure that the system.job_info table is the single source of truth for these two pieces of information. This cluster version has an associated upgrade that schema changes the `payload` column of the `system.jobs` table to be nullable, thereby allowing us to stop writing to it. This upgrade step is necessary for a future patch where we will drop the payload and progress columns. Without this intermediate upgrade step, the `ALTER TABLE ... DROP COLUMN` upgrade job will attempt to write to dropped columns as part of its execution, thereby failing to run the upgrade.

Informs: #97762
Release note: None

99543: server: fix flaky drain test under race r=AlexTalks a=AlexTalks

While `TestDrain` previously issued a drain request twice and expected that there would be no remaining leases after the second request, we have seen in some race builds that a lease extension can occur before that second drain, leaving one lease remaining after the second drain request. This can be seen in the following log example:

```
I230325 00:39:18.151604 14728 1@server/drain.go:145 ⋮ [T1,n1] 383 drain request received with doDrain = true, shutdown = false
...
I230325 00:39:18.155547 986 kv/kvserver/replica_proposal.go:272 ⋮ [T1,n1,s1,r51/1:‹/Table/5{0-1}›,raft] 385 new range lease repl=(n1,s1):1 seq=1 start=0,0 exp=1679704764.152223164,0 pro=1679704758.152223164,0 following repl=(n1,s1):1 seq=1 start=0,0 exp=1679704746.135729956,0 pro=1679704740.135729956,0
I230325 00:39:18.172450 14728 1@server/drain.go:399 ⋮ [T1,n1] 386 (DEBUG) initiating kvserver node drain
I230325 00:39:18.172613 14728 1@kv/kvserver/store.go:1559 ⋮ [T1,drain,n1,s1] 387 (DEBUG) store marked as draining
I230325 00:39:18.182123 14728 1@server/drain.go:293 ⋮ [T1,n1] 388 drain remaining: 1
I230325 00:39:18.182249 14728 1@server/drain.go:295 ⋮ [T1,n1] 389 drain details: range lease iterations: 1
I230325 00:39:18.182404 14728 1@server/drain.go:175 ⋮ [T1,n1] 390 drain request completed without server shutdown
```

This change modifies the test to repeatedly issue drain requests until there is no remaining work, allowing the drain to complete upon subsequent requests.

Fixes: #86974
Release note: None

99665: sql/gc_job,sqlerrors: make GC job robust to missing descriptors r=fqazi a=ajwerner

### sql: do not drop table descriptor independently if we're in drop schema

If we have dropped schema IDs, we know that this is not an individual drop table schema change. We only have more than one dropped table when we drop a database or a schema. Before this change, we'd drop the table on its own, and then create another GC job to drop all the tables. This is not actually a bug because we should be robust to this, but it's also bad.

### sql/gc_job,sqlerrors: make GC job robust to missing descriptors

The check used for missing descriptors became incorrect in the course of #94695. That change updated the underlying error code used in getters by the GC job. The GC job would subsequently retry forever when the descriptor was missing. This bug has not been shipped yet, so not writing a release note.

Fixes: #99590
Release note (bug fix): DROP SCHEMA ... CASCADE could create multiple GC jobs: one for every table and one for the cascaded drop itself. This has been fixed.
Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: ajwerner <[email protected]>
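Conceptually, the fix for 99543 replaces the fixed two-request sequence with a retry loop that keeps draining until the server reports nothing left to shed. Here is a minimal sketch of that shape; the `drainUntilIdle` helper, the callback signature, and the deadline are illustrative assumptions, not code taken from the PR:

```go
package server_test

import (
	"context"
	"testing"
	"time"
)

// drainUntilIdle repeatedly invokes drain until it reports no remaining work,
// failing the test if it does not converge before the deadline. The drain
// callback stands in for issuing an admin Drain request and returning the
// "remaining work" count from its response.
func drainUntilIdle(
	ctx context.Context,
	t *testing.T,
	drain func(context.Context) (remaining uint64, err error),
) {
	t.Helper()
	deadline := time.Now().Add(time.Minute)
	for {
		remaining, err := drain(ctx)
		if err != nil {
			t.Fatalf("drain request failed: %v", err)
		}
		if remaining == 0 {
			return // nothing left to shed; the drain has converged
		}
		if time.Now().After(deadline) {
			t.Fatalf("drain did not converge; %d items still remaining", remaining)
		}
		// A lease extension may have raced with the previous request, as in
		// the log above; back off briefly and reissue the drain.
		time.Sleep(100 * time.Millisecond)
	}
}
```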
Describe the problem
TestDrain has been flaking a lot today.
Jira issue: CRDB-19064