
pkg/server: TestDrain is flakey #86974

Closed
fantapop opened this issue Aug 27, 2022 · 7 comments · Fixed by #99543
Assignees
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker skipped-test T-kv KV Team

Comments

@fantapop
Collaborator

fantapop commented Aug 27, 2022

Describe the problem
[Screenshot: TestDrain CI failure history, 2022-08-26]

TestDrain has been flaking a lot today.

Failed

```
=== RUN   TestDrain
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/ab5462d2cb3989e2a33018db970f23de/logTestDrain523638963
    test_log_scope.go:79: use -show-logs to present logs inline
    drain_test.go:254: expected remaining false, got true
    panic.go:500: -- test log scope end --
--- FAIL: TestDrain (74.31s)
```

Jira issue: CRDB-19064

@fantapop fantapop added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-server-and-security DB Server & Security labels Aug 27, 2022
@rafiss
Collaborator

rafiss commented Aug 29, 2022

Do you have a link to a failure?

I looked at the Test History on Bazel CI and GitHub CI, and it says 1000 of the last 1000 runs have passed.

@fantapop
Collaborator Author

It's failing in Merged Bazel Extended CI. This is one of the build statuses that posts to the flaky-test-fighter-ops channel.

@adityamaru
Contributor

adityamaru commented Aug 29, 2022

This test is failing pretty consistently under testrace so I'm going to skip it under race as flaky-test oncall. @rafiss is sql-experience going to own unskipping this or should I rope someone else in?
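Skipping a test only under the race detector relies on knowing, at runtime, whether the binary was built with `-race`. As a hedged sketch of the usual Go pattern (CockroachDB wraps this in test helpers; the constant-plus-build-tag arrangement below is the common idiom, not the project's exact code):

```go
package main

import "fmt"

// raceEnabled mirrors the standard Go idiom for detecting the race
// detector: this file declares the constant as false, and a sibling
// file guarded by `//go:build race` redeclares it as true. Here it is
// hard-coded to false so the example is self-contained.
const raceEnabled = false

// maybeSkip reports what a test harness would do: under race builds,
// a flaky test like TestDrain gets skipped instead of run.
func maybeSkip() string {
	if raceEnabled {
		return "skipping TestDrain under race"
	}
	return "running TestDrain"
}

func main() {
	fmt.Println(maybeSkip())
}
```

In a real test, the check would call `t.Skip(...)` from the `testing` package rather than return a string; the string form is used here only to keep the sketch runnable on its own.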

adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 29, 2022
Informs: cockroachdb#86974

Release note: None

Release justification: test only change
@rafiss
Collaborator

rafiss commented Aug 29, 2022

KV is a better home for this

@rafiss rafiss removed the T-server-and-security DB Server & Security label Aug 29, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Aug 29, 2022
craig bot pushed a commit that referenced this issue Aug 29, 2022
86499: sql: prevent dropping roles with synthetic privileges r=rafiss a=RichardJCai

Followup work to be done to support DROP OWNED BY and
REASSIGN OWNED BY here.

Release justification: Bug fix to newly introduced feature
Release note: None

87052: server: skip TestDrain under testrace r=srosenberg a=adityamaru

Informs: #86974

Release note: None

Release justification: test only change

87070: roachpb: clarify parameter name in ConstructStatementFingerprintID r=rafiss a=rafiss

"anonymizedStmt" is incorrect here -- the value passed in should be a
query with constants removed, but identifiers present.

Release note: None

Release justification: low risk change

Co-authored-by: richardjcai <[email protected]>
Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 6, 2022
The error messages in cases like cockroachdb#86974 are not useful otherwise.
This change allows us to see where the assertion method was called
from.

Release justification: testing only.
craig bot pushed a commit that referenced this issue Sep 6, 2022
87271: sql: change when txn is checked for refresh materialized views r=rafiss a=e-mbrown

Because the transaction was checked during planning rather than during
execution, refresh materialized views would fail from the client. Now
the transaction is checked during execution.

Release justification: Bug fix for refresh materialized views

Release note: None

87327: bazel: detect go standard library deprecation errors r=rickystewart a=healthy-pod

Previously, `staticcheck` only raised deprecation errors
from non-stdlib packages because the analysis pass used
by the deprecation analyzer is created by `nogo` which doesn't
analyze the standard library packages.

This code change patches `staticcheck` to let it detect deprecation
errors from the standard library and raise them. It also removes two
uses of `tls.Config.PreferServerCipherSuites` because it's deprecated
since Go 1.18 and is a legacy field that is ignored and has no effect.

Closes #84877

Release justification: Non-production code change
Release note: None

87426: sql: disable agg funcs in costfuzz and unoptimized query oracle r=mgartner a=mgartner

The false-positives of costfuzz and unoptimized-query-oracle caused by
aggregate functions are overwhelming. This commit prevents aggregate
functions from being generated for these tests.

Informs #87353

Release justification: This is a test-only change.

Release note: None

87431: server: mark testDrainContext assertion methods as test helpers r=nvanbenschoten a=nvanbenschoten

The error messages in cases like #86974 are not useful otherwise. This change allows us to see where the assertion method was called from.

Release justification: testing only.

Co-authored-by: e-mbrown <[email protected]>
Co-authored-by: healthy-pod <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@shralex
Contributor

shralex commented Mar 3, 2023

@AlexTalks can you take a look please?

@blathers-crl

blathers-crl bot commented Mar 17, 2023

Hi @nvanbenschoten, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.

@nvanbenschoten nvanbenschoten added the branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 label Mar 17, 2023
@AlexTalks
Contributor

So here is what I am seeing when I run this under race:

```
# test: sendDrainNoShutdown()
I230325 00:39:15.227264 14158 1@server/drain.go:145 ⋮ [T1,n1] 373  drain request received with doDrain = true, shutdown = false
...
I230325 00:39:17.782437 14158 1@server/drain.go:399 ⋮ [T1,n1] 377  (DEBUG) initiating kvserver node drain
I230325 00:39:17.782596 14158 1@kv/kvserver/store.go:1559 ⋮ [T1,drain,n1,s1] 378  (DEBUG) store marked as draining
I230325 00:39:18.149121 14158 1@server/drain.go:293 ⋮ [T1,n1] 379  drain remaining: 51
I230325 00:39:18.149263 14158 1@server/drain.go:295 ⋮ [T1,n1] 380  drain details: descriptor leases: 22, liveness record: 1, range lease iterations: 28
I230325 00:39:18.149475 14158 1@server/drain.go:175 ⋮ [T1,n1] 381  drain request completed without server shutdown

# test: sendDrainNoShutdown() again
I230325 00:39:18.151604 14728 1@server/drain.go:145 ⋮ [T1,n1] 383  drain request received with doDrain = true, shutdown = false
...
I230325 00:39:18.155547 986 kv/kvserver/replica_proposal.go:272 ⋮ [T1,n1,s1,r51/1:‹/Table/5{0-1}›,raft] 385  new range lease repl=(n1,s1):1 seq=1 start=0,0 exp=1679704764.152223164,0 pro=1679704758.152223164,0 following repl=(n1,s1):1 seq=1 start=0,0 exp=1679704746.135729956,0 pro=1679704740.135729956,0
I230325 00:39:18.172450 14728 1@server/drain.go:399 ⋮ [T1,n1] 386  (DEBUG) initiating kvserver node drain
I230325 00:39:18.172613 14728 1@kv/kvserver/store.go:1559 ⋮ [T1,drain,n1,s1] 387  (DEBUG) store marked as draining
I230325 00:39:18.182123 14728 1@server/drain.go:293 ⋮ [T1,n1] 388  drain remaining: 1
I230325 00:39:18.182249 14728 1@server/drain.go:295 ⋮ [T1,n1] 389  drain details: range lease iterations: 1
I230325 00:39:18.182404 14728 1@server/drain.go:175 ⋮ [T1,n1] 390  drain request completed without server shutdown
```

I have confirmed this across multiple test runs. It seems that after the first drain completes, and before the second drain calls store.SetDraining(...) again, a lease extension occurs. When this happens, the second drain ends up with 1 lease remaining. I'm not sure whether that lease should be transferred away (so the count would be 0) before the remaining count is reported by store.SetDraining(...), or whether this is entirely acceptable behavior and we simply need to call sendDrainNoShutdown() repeatedly until no leases remain, much like the CLI does.

I have a fix I'm putting up that changes the test to reissue the drain request until there are no more leases remaining. But if there is actually an issue with the lease extension occurring (or with store.SetDraining(...) not reporting the remaining count correctly), we'll want to fix that instead.
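The retry shape described above can be sketched as a small loop: keep issuing the drain request and stop once a call reports zero remaining work, with an attempt cap as a safety net. This is a minimal self-contained illustration, not the actual test code; `drainAndReport` and its canned counts (mimicking the 51-then-1-then-0 progression in the logs) are hypothetical stand-ins for the real Drain RPC:

```go
package main

import (
	"errors"
	"fmt"
)

// remainingPerCall fakes what the Drain RPC reports on each call.
// The counts echo the logs above, where a lease extension racing with
// the second drain left one lease behind.
var remainingPerCall = []int{51, 1, 0}

// drainAndReport stands in for a single drain request; it returns how
// much work (leases, sessions, etc.) is still left on the node.
func drainAndReport(call int) int {
	if call < len(remainingPerCall) {
		return remainingPerCall[call]
	}
	return 0
}

// drainUntilIdle reissues the drain request until a call reports zero
// remaining work, rather than asserting after a fixed number of calls.
// It returns how many calls were needed.
func drainUntilIdle(maxAttempts int) (int, error) {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if remaining := drainAndReport(attempt); remaining == 0 {
			return attempt + 1, nil
		}
	}
	return maxAttempts, errors.New("drain did not converge")
}

func main() {
	calls, err := drainUntilIdle(10)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("drain converged after %d calls\n", calls)
	// prints "drain converged after 3 calls"
}
```

This is also how `cockroach node drain` behaves from the CLI: it polls until the server reports no remaining work, which is why making the test do the same removes the race.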

craig bot pushed a commit that referenced this issue Mar 27, 2023
99458: jobs,*: stop writing payload and progress to system.jobs r=adityamaru a=adityamaru

This change introduces a cluster version after which the
payload and progress of a job will not be written to the
system.jobs table. This will ensure that the system.job_info
table is the single, source of truth for these two pieces of
information.

This cluster version has an associated upgrade that schema changes
the `payload` column of the `system.jobs` table to be nullable,
thereby allowing us to stop writing to it. This upgrade step
is necessary for a future patch where we will drop the payload
and progress columns. Without this intermediate upgrade step the
`ALTER TABLE ... DROP COLUMN` upgrade job will attempt to write
to dropped columns as part of its execution thereby failing to
run the upgrade.

Informs: #97762

Release note: None

99543: server: fix flaky drain test under race r=AlexTalks a=AlexTalks

While previously `TestDrain` would issue a drain request twice, and expect that after the second drain request there would be no remaining leases, we have seen in some race builds that a lease extension can occur before that second drain, leaving one lease remaining after the second drain request. This can be seen in the following log example:
```
I230325 00:39:18.151604 14728 1@server/drain.go:145 ⋮ [T1,n1] 383  drain request received with doDrain = true, shutdown = false
...
I230325 00:39:18.155547 986 kv/kvserver/replica_proposal.go:272 ⋮ [T1,n1,s1,r51/1:‹/Table/5{0-1}›,raft] 385  new range lease repl=(n1,s1):1 seq=1 start=0,0 exp=1679704764.152223164,0 pro=1679704758.152223164,0 following repl=(n1,s1):1 seq=1 start=0,0 exp=1679704746.135729956,0 pro=1679704740.135729956,0
I230325 00:39:18.172450 14728 1@server/drain.go:399 ⋮ [T1,n1] 386  (DEBUG) initiating kvserver node drain
I230325 00:39:18.172613 14728 1@kv/kvserver/store.go:1559 ⋮ [T1,drain,n1,s1] 387  (DEBUG) store marked as draining
I230325 00:39:18.182123 14728 1@server/drain.go:293 ⋮ [T1,n1] 388  drain remaining: 1
I230325 00:39:18.182249 14728 1@server/drain.go:295 ⋮ [T1,n1] 389  drain details: range lease iterations: 1
I230325 00:39:18.182404 14728 1@server/drain.go:175 ⋮ [T1,n1] 390  drain request completed without server shutdown
```
This change modifies the test to repeatedly issue drain requests until there is no remaining work, allowing the drain to complete upon subsequent requests.

Fixes: #86974

Release note: None

99665: sql/gc_job,sqlerrors: make GC job robust to missing descriptors r=fqazi a=ajwerner

### sql: do not drop table descriptor independently if we're in drop schema

If we have dropped schema IDs, we know that this is not an individual drop table
schema change. We only have more than one dropped table when we drop a database
or a schema. Before this change, we'd drop the table on its own, and then create
another GC job to drop all the tables. This is not actually a bug because we
should be robust to this, but it's also bad.

### sql/gc_job,sqlerrors: make GC job robust to missing descriptors

The check used for missing descriptors became incorrect in the course of
#94695. That change updated
the underlying error code used in getters by the GC job. The GC job would
subsequently retry forever when the descriptor was missing. This bug
has not been shipped yet, so not writing a release note.

Fixes: #99590

Release note (bug fix): DROP SCHEMA ... CASCADE could create multiple
GC jobs: one for every table and one for the cascaded drop itself. This has
been fixed.


Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: ajwerner <[email protected]>
@craig craig bot closed this as completed in 9b2229b Mar 27, 2023
rharding6373 pushed a commit to rharding6373/cockroach that referenced this issue Mar 27, 2023
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue Mar 28, 2023
AlexTalks added a commit that referenced this issue Mar 29, 2023
AlexTalks added a commit that referenced this issue Mar 29, 2023