
backupccl: TestDataDriven/restore-grants is occasionally hanging #87129

Closed
adityamaru opened this issue Aug 30, 2022 · 13 comments · Fixed by #89901
Assignees
Labels
A-disaster-recovery branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. skipped-test T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@adityamaru
Contributor

adityamaru commented Aug 30, 2022

Example build: https://teamcity.cockroachdb.com/viewLog.html?buildId=6293928&buildTypeId=Cockroach_UnitTests_BazelUnitTests&tab=artifacts#%2Ftmp%2F_tmp%2F73ed31399472b19fb4ec4de3ae2e0b2c%2FlogTestDataDriven2672313397;%2Ftmp%2F_tmp%2F73ed31399472b19fb4ec4de3ae2e0b2c%2FTestBackupRestoreDataDriven2133966575;%2Ftmp%2F_tmp%2F73ed31399472b19fb4ec4de3ae2e0b2c%2FStartServer3283493967;%2Fbazel-testlogs%2Fpkg%2Fccl%2Fbackupccl%2Fbackupccl_test (shard 4 of 16)

It looks like we're hanging on this statement:

GRANT SELECT ON testdb.sc.othertable TO user1;

According to the logs we queue 3 schema change jobs but then do not seem to execute them until the timeout:

I220830 14:51:25.210956 249216 sql/table.go:98 ⋮ [n1,client=127.0.0.1:33332,user=root] 9978  queued new database schema change job 792332485291933697 for database 107
I220830 14:51:25.213329 249216 sql/schema.go:81 ⋮ [n1,client=127.0.0.1:33332,user=root] 9979  queued new schema change job 792332485307006977 for schema 108
I220830 14:51:25.216511 249216 sql/type_change.go:147 ⋮ [n1,client=127.0.0.1:33332,user=root] 9980  queued new type change job 792332485317427201 for type 109

In a successful local run of this test we see these jobs queued as well, but they quickly resume execution and complete.

Jira issue: CRDB-19166

@adityamaru adityamaru added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery T-sql-schema-deprecated Use T-sql-foundations instead labels Aug 30, 2022
@blathers-crl

blathers-crl bot commented Aug 30, 2022

cc @cockroachdb/bulk-io

adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 31, 2022
Informs: cockroachdb#87129

Release note: None

Release justification: low risk test only change
craig bot pushed a commit that referenced this issue Sep 1, 2022
87231: backupccl: skipping restore-grant datadriven test r=adityamaru a=adityamaru

Informs: #87129

Release note: None

Release justification: low risk test only change

Co-authored-by: adityamaru <[email protected]>
@ajwerner
Contributor

ajwerner commented Sep 6, 2022

@adityamaru did you ever get a stack trace from a hang?

@adityamaru
Contributor Author

Nope, I didn't 😞 I haven't been able to reproduce this locally either, the few times I've tried stressing it.

@blathers-crl

blathers-crl bot commented Sep 20, 2022

Hi @ajwerner, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@ajwerner ajwerner added the branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 label Sep 20, 2022
@ajwerner
Contributor

@adityamaru it seems that this fell off your plate. AFAICT this is new functionality and it needs to get sorted or we need to decide that it's not a problem. Can you take a look and decide whether this matters or not?

@adityamaru
Contributor Author

Yup, looking at it now. Last I checked I wasn't able to reproduce this locally to grab stacks, but let me stress it again.

@adityamaru adityamaru self-assigned this Sep 27, 2022
@adityamaru
Contributor Author

1500 stress runs over an hour and it hasn't hung. I've opened #88851 in the hope that CI times out and dumps its stacks.

@stevendanna
Collaborator

I can occasionally get this test to hang for a few minutes. For example, if I set the per-test timeout to 3 minutes, I can typically get a timeout eventually. In the cases I've caught so far, it looks like we are waiting on leases:

goroutine 1462 [select, 2 minutes]:
github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.acquireNodeLease({0x60a5790?, 0xc003965f20}, 0xc0009f6f00, 0x72)
        github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:537 +0x2de
github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.(*Manager).Acquire.func1(0xc00348c360, {0x60a5790, 0xc003965f20}, 0xc000785578?, 0x20b0880?)
        github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:988 +0xab
github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.(*Manager).Acquire(0xc0009f6f00, {0x60a5790, 0xc003965f20}, {0x60a5790?, 0x3965f20?, 0xc0?}, 0x1a95f6f?)
        github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:990 +0xef
github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.(*Manager).AcquireByName(0xc0009f6f00, {0x60a5790, 0xc003965f20}, {0xc001836fe0?, 0x6b?, 0x71?}, 0x18b51db?, 0xc0?, {0xc0018b51db, 0xa})
        github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:843 +0x36e
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*leasedDescriptors).getByName(0xc003696f28, {0x60a5790, 0xc003965f20}, {0x607eea8, 0xc003070fd0?}, 0x6b, 0x71, {0xc0018b51db, 0xa})
        github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/leased_descriptors.go:106 +0x2e3
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getNonVirtualDescriptorID.func6()
        github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/descriptor.go:484 +0x13f
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getNonVirtualDescriptorID(0xc003696f00?, {0x60a5790?, 0xc003965f20?}, 0xc003070fd0?, {0x612d888?, 0xc003494210?}, {0x612cbc0?, 0xc00395c820?}, {0xc0018b51db, 0xa}, ...)
        github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/descriptor.go:511 +0x4db
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getDescriptorByName(0xc003965f20?, {0x60a5790, 0xc003965f20}, 0x6?, {0x612d888?, 0xc003494210}, {0x612cbc0, 0xc00395c820}, {0xc0018b51db, 0xa}, ...)
        github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/descriptor.go:316 +0x15b
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).GetObjectByName(0x38?, {0x60a5790, 0xc003965f20}, 0x8c11620?, {0xc0018b51d1, 0x6}, {0xc0018b51d8, 0x2}, {0xc0018b51db, 0xa}, ...)
        github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/object.go:53 +0x1d2
github.com/cockroachdb/cockroach/pkg/sql.(*schemaResolver).LookupObject(0x0?, {0x60a5790?, 0xc003965f20?}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, ...}, ...)
        github.com/cockroachdb/cockroach/pkg/sql/schema_resolver.go:118 +0x5b1
github.com/cockroachdb/cockroach/pkg/sql/catalog/resolver.ResolveExisting({0x60a5790, 0xc003965f20}, 0xc003962f00, {0x7fa6b1077ed0, 0xc00177cad0}, {{0x1, 0x0, 0x0, 0x0, 0x0, ...}, ...}, ...)
        github.com/cockroachdb/cockroach/pkg/sql/catalog/resolver/resolver.go:388 +0x451
github.com/cockroachdb/cockroach/pkg/sql/catalog/resolver.ResolveExistingObject({0x60a5790, 0xc003965f20}, {0x60efa20, 0xc00177cad0}, 0xc003962f00, {{0x1, 0x0, 0x0, 0x0, 0x0, ...}, ...})
        github.com/cockroachdb/cockroach/pkg/sql/catalog/resolver/resolver.go:196 +0x16a
github.com/cockroachdb/cockroach/pkg/sql/catalog/resolver.ResolveExistingTableObject({0x60a5790, 0xc003965f20}, {0x60efa20, 0xc00177cad0}, 0xc00395a5b8, {{0x1, 0x0, 0x0, 0x0, 0x0, ...}, ...})
        github.com/cockroachdb/cockroach/pkg/sql/catalog/resolver/resolver.go:124 +0x159
github.com/cockroachdb/cockroach/pkg/sql.(*optCatalog).ResolveDataSource(0xc00395a5a0, {0x60a5790, 0xc003965f20}, {0xc0?, 0x18?}, 0x428d5f?)
        github.com/cockroachdb/cockroach/pkg/sql/opt_catalog.go:232 +0x1e5
github.com/cockroachdb/cockroach/pkg/sql/opt/cat.ExpandDataSourceGlob({0x60a5790?, 0xc003965f20?}, {0x6101430?, 0xc00395a5a0?}, {0x2?, 0x0?}, {0x6099508?, 0xc003962ec0?})
        github.com/cockroachdb/cockroach/pkg/sql/opt/cat/utils.go:31 +0xba
github.com/cockroachdb/cockroach/pkg/sql.expandTableGlob({0x60a5790, 0xc003965f20}, 0xc00177cad0, {0x6099508, 0xc003962ec0})
        github.com/cockroachdb/cockroach/pkg/sql/resolver.go:896 +0x145
github.com/cockroachdb/cockroach/pkg/sql.(*planner).getTablePatternsComposition(0xc00177cad0, {0x60a5790, 0xc003965f20}, {{0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, {0x0, ...}, ...})
        github.com/cockroachdb/cockroach/pkg/sql/grant_revoke.go:537 +0x10c
github.com/cockroachdb/cockroach/pkg/sql.(*planner).getGrantOnObject(0x4c3d0c0?, {0x60a5790?, 0xc003965f20?}, {{0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, {0x0, ...}, ...}, ...)
        github.com/cockroachdb/cockroach/pkg/sql/grant_revoke.go:485 +0xf8
github.com/cockroachdb/cockroach/pkg/sql.(*planner).Grant(0xc00177cad0, {0x60a5790?, 0xc003965f20?}, 0xc0032e0500)
        github.com/cockroachdb/cockroach/pkg/sql/grant_revoke.go:47 +0x9f
github.com/cockroachdb/cockroach/pkg/sql.planOpaque({0x60a5790, 0xc003965f20}, 0xc003965f20?, {0x60cd1c0?, 0xc0032e0500?})
        github.com/cockroachdb/cockroach/pkg/sql/opaque.go:202 +0xa45
github.com/cockroachdb/cockroach/pkg/sql.buildOpaque({0x60a5790, 0xc003965f20}, 0xc00177cd30, 0xc4c875?, {0x60cd1c0?, 0xc0032e0500?})
        github.com/cockroachdb/cockroach/pkg/sql/opaque.go:63 +0x232
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).tryBuildOpaque(0xc001cb0a50, {0x60cd1c0?, 0xc0032e0500}, 0xc0034f3000)
        github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/opaque.go:70 +0x103
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildStmt(0xc001cb0a50, {0x60cd1c0?, 0xc0032e0500?}, {0x0, 0x0, 0x0}, 0x0?)
        github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/builder.go:404 +0xaaf
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildStmtAtRoot(0xc001cb0a50, {0x60cd1c0?, 0xc0032e0500?}, {0x0?, 0x0?, 0x0?})
        github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/builder.go:252 +0x1ad
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).Build(0xc001cb0a50)
        github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/builder.go:226 +0x314
github.com/cockroachdb/cockroach/pkg/sql.(*optPlanningCtx).buildExecMemo(0xc00177d3f8, {0x60a5790?, 0xc003965f20})
        github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:563 +0x508
github.com/cockroachdb/cockroach/pkg/sql.(*planner).makeOptimizerPlan(0xc00177cad0, {0x60a5790?, 0xc003965ec0?})
        github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:231 +0x1b1
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).makeExecPlan(0xc00177c600, {0x60a5790, 0xc003965ec0}, 0xc00177cad0)
        github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1477 +0x4e
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(0xc00177c600, {0x60a5790, 0xc003965ec0}, 0xc00177cad0, {0x7fa6ad483670, 0xc0006b7148})
        github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1102 +0x294
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc00177c600, {0x60a5790?, 0xc0032dbc20?}, {{0x60cd1c0, 0xc0032e0500}, {0xc0018b51c1, 0x2d}, 0x0, 0x1}, 0x0, ...)
        github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:731 +0x24a8
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt.func1({0x60a5790?, 0xc0032dbc20?})
        github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:129 +0xfc
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithProfiling(0x60a5790?, {0x60a5790?, 0xc0032dbc20?}, {0x60cd1c0?, 0xc0032e0500?}, 0x0?, 0x0?)
        github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:2423 +0x330
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc00177c600, {0x60a5790, 0xc0032dbc20}, {{0x60cd1c0, 0xc0032e0500}, {0xc0018b51c1, 0x2d}, 0x0, 0x1}, 0x0, ...)
        github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:128 +0x49f
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd.func1({{{0x60cd1c0, 0xc0032e0500}, {0xc0018b51c1, 0x2d}, 0x0, 0x1}, {0xc0c998e961c4245e, 0x1b39b2b4, 0x0}, {0xc0c998e961c42d65, ...}, ...}, ...)
        github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1921 +0x2ff
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd(0xc00177c600)
        github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1926 +0x1128
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc00177c600, {0x60a56e8, 0xc001db9ec0}, 0x4155bc53c0000000?, 0xc003f0bdd8?, 0xc003c1c160?)
        github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1846 +0x20b
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc001aae0a0?, {0x60a56e8?, 0xc001db9ec0?}, {0xc000f19c00?}, 0x3?, 0xc000ccebc0?)
        github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:828 +0xe6
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1()
        github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:728 +0x405
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync
        github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:639 +0x22e

But I suppose that should resolve itself within about 5 minutes at worst, so I'm not sure it would time out the test overall.

@stevendanna
Collaborator

stevendanna commented Oct 12, 2022

Adding some logging, it looks like the transaction to insert the lease is retrying repeatedly because of an intent that touches the parent ID.

I221012 10:24:20.198393 1706 sql/catalog/lease/storage.go:144  [n1,client=127.0.0.1:54572,user=root] 232  storage attempting to acquire lease tabledesc.immutable: {ID: 114, Version: 1, ModificationTime: "1665570259.263377182,1", ParentID: 107, ParentSchemaID: 113, State: PUBLIC, NextColumnID: 3, Columns: [{ID: 1, TypeID: 20, Null: true}, {ID: 2, TypeID: 20, Null: false, Hidden: true, HasDefault: true}], NextFamilyID: 1, Families: [{ID: 0, Columns: [1, 2]}], PrimaryIndex: 1, NextIndexID: 2, Indexes: [{ID: 1, Unique: true, KeyColumns: [{ID: 2, Dir: ASC}], StoreColumns: [1]}]}@1665570555.099930179,0
I221012 10:24:20.200000 1706 kv/txn.go:1027  [n1,client=127.0.0.1:54572,user=root] 233  retrying transaction: unnamed (id: faecacb9-a7fe-4e67-86fd-98411cccbb00) because of a retryable error: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: intent on key /Table/3/1/107/2/1): "unnamed" meta={id=faecacb9 key=/Table/11/1/114/1/2022-10-12T10:29:16.283917Z/1/0 pri=100.00000000 epo=40 ts=1665570260.187960517,2 min=1665570259.305919836,0 seq=2} lock=true stat=PENDING rts=1665570260.177933462,0 wto=false gul=1665570259.805919836,0

I'm not 100% sure yet what involves 107 in this transaction, but from a brief peek at the code it looks like it could get read during the Validate call in MustGetDescriptorsByID.

107 is the database we created right before the stalled GRANT statement, and the one we have an enqueued schema change job for.

Even in successful cases, I see a few of these retries, so it seems plausible that whatever is causing this other transaction to sit around for so long could cause this test to hang for a fairly long time.

This is fairly reproducible on master for me just with:

./dev test ./pkg/ccl/backupccl/ --filter TestDataDriven/restore-grants --show-logs --verbose --stress --test-args='-test.timeout 3m' --stress-args '-p 3' 2>&1

Lowering the timeout here probably helps reproduce it because, most of the time, the conflicting transaction eventually does get pushed by the high-priority txn used by the leasing code.

@stevendanna
Collaborator

@ajwerner I'm going to leave GA-blocker on this for now. It seems to be a real issue based on what I've seen this morning.

@stevendanna
Collaborator

stevendanna commented Oct 13, 2022

I ran this again with some additional logging. It definitely looks like we are caught in a loop where the lease manager's txn pushes the txn that updated the descriptor, but it doesn't appear to push it far enough, so it just conflicts again:

I221013 01:17:41.845031 1599 sql/catalog/lease/storage.go:145  [n1,client=127.0.0.1:47964,user=root] 467  storage attempting to acquire lease tabledesc.immutable: {ID: 114, Version: 1, ModificationTime: "1665623860.865910134,2", ParentID: 107, ParentSchemaID: 113, State: PUBLIC, NextColumnID: 3, Columns: [{ID: 1, TypeID: 20, Null: true}, {ID: 2, TypeID: 20, Null: false, Hidden: true, HasDefault: true}], NextFamilyID: 1, Families: [{ID: 0, Columns: [1, 2]}], PrimaryIndex: 1, NextIndexID: 2, Indexes: [{ID: 1, Unique: true, KeyColumns: [{ID: 2, Dir: ASC}], StoreColumns: [1]}]}@1665624151.187712056,0
I221013 01:17:41.846632 2127 kv/kvserver/batcheval/cmd_refresh.go:74  [n1,s1,r7/1:/Table/{3-4}] 468  refresh failed because of intent from txn: id=de34aa4c key=/Table/15/1/804629580697272321/0 pri=0.00256076 epo=0 ts=1665623861.823871195,1 min=1665623860.904576818,0 seq=8
I221013 01:17:41.846999 1599 kv/txn.go:1027  [n1,client=127.0.0.1:47964,user=root] 469  retrying transaction: unnamed (id: 9786d77c-b479-4de4-ba56-8d8cbb5a9744) because of a retryable error: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: intent on key /Table/3/1/107/2/1): "unnamed" meta={id=9786d77c key=/Table/11/1/114/1/2022-10-13T01:22:47.49004Z/1/0 pri=100.00000000 epo=41 ts=1665623861.830141350,2 min=1665623860.911339152,0 seq=2} lock=true stat=PENDING rts=1665623861.823871195,0 wto=false gul=1665623861.411339152,0
I221013 01:17:41.848568 1599 kv/kvserver/concurrency/lock_table_waiter.go:499  [n1,s1,r7/1:/Table/{3-4}] 470  pushing timestamp of txn de34aa4c above 1665623861.846918850,0
I221013 01:17:41.854364 1599 kv/kvserver/concurrency/lock_table_waiter.go:499  [n1,s1,r7/1:/Table/{3-4}] 471  pushing timestamp of txn de34aa4c above 1665623861.846918850,0
I221013 01:17:41.860886 1599 kv/kvserver/concurrency/lock_table_waiter.go:499  [n1,s1,r7/1:/Table/{3-4}] 472  pushing timestamp of txn de34aa4c above 1665623861.846918850,0
I221013 01:17:41.867171 1599 sql/catalog/lease/storage.go:145  [n1,client=127.0.0.1:47964,user=root] 473  storage attempting to acquire lease tabledesc.immutable: {ID: 114, Version: 1, ModificationTime: "1665623860.865910134,2", ParentID: 107, ParentSchemaID: 113, State: PUBLIC, NextColumnID: 3, Columns: [{ID: 1, TypeID: 20, Null: true}, {ID: 2, TypeID: 20, Null: false, Hidden: true, HasDefault: true}], NextFamilyID: 1, Families: [{ID: 0, Columns: [1, 2]}], PrimaryIndex: 1, NextIndexID: 2, Indexes: [{ID: 1, Unique: true, KeyColumns: [{ID: 2, Dir: ASC}], StoreColumns: [1]}]}@1665624161.855587831,0
I221013 01:17:41.868672 2199 kv/kvserver/batcheval/cmd_refresh.go:74  [n1,s1,r7/1:/Table/{3-4}] 474  refresh failed because of intent from txn: id=de34aa4c key=/Table/15/1/804629580697272321/0 pri=0.00256076 epo=0 ts=1665623861.846918850,1 min=1665623860.904576818,0 seq=8
I221013 01:17:41.869189 1599 kv/txn.go:1027  [n1,client=127.0.0.1:47964,user=root] 475  retrying transaction: unnamed (id: 9786d77c-b479-4de4-ba56-8d8cbb5a9744) because of a retryable error: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: intent on key /Table/3/1/107/2/1): "unnamed" meta={id=9786d77c key=/Table/11/1/114/1/2022-10-13T01:22:47.49004Z/1/0 pri=100.00000000 epo=42 ts=1665623861.851039102,2 min=1665623860.911339152,0 seq=2} lock=true stat=PENDING rts=1665623861.846918850,0 wto=false gul=1665623861.411339152,0
I221013 01:17:41.871336 1599 kv/kvserver/concurrency/lock_table_waiter.go:499  [n1,s1,r7/1:/Table/{3-4}] 476  pushing timestamp of txn de34aa4c above 1665623861.869062936,0
I221013 01:17:41.877133 1599 kv/kvserver/concurrency/lock_table_waiter.go:499  [n1,s1,r7/1:/Table/{3-4}] 477  pushing timestamp of txn de34aa4c above 1665623861.869062936,0
I221013 01:17:41.883343 1599 kv/kvserver/concurrency/lock_table_waiter.go:499  [n1,s1,r7/1:/Table/{3-4}] 478  pushing timestamp of txn de34aa4c above 1665623861.869062936,0
I221013 01:17:41.889526 1599 sql/catalog/lease/storage.go:145  [n1,client=127.0.0.1:47964,user=root] 479  storage attempting to acquire lease tabledesc.immutable: {ID: 114, Version: 1, ModificationTime: "1665623860.865910134,2", ParentID: 107, ParentSchemaID: 113, State: PUBLIC, NextColumnID: 3, Columns: [{ID: 1, TypeID: 20, Null: true}, {ID: 2, TypeID: 20, Null: false, Hidden: true, HasDefault: true}], NextFamilyID: 1, Families: [{ID: 0, Columns: [1, 2]}], PrimaryIndex: 1, NextIndexID: 2, Indexes: [{ID: 1, Unique: true, KeyColumns: [{ID: 2, Dir: ASC}], StoreColumns: [1]}]}@1665624159.441595899,0
I221013 01:17:41.891070 2212 kv/kvserver/batcheval/cmd_refresh.go:74  [n1,s1,r7/1:/Table/{3-4}] 480  refresh failed because of intent from txn: id=de34aa4c key=/Table/15/1/804629580697272321/0 pri=0.00256076 epo=0 ts=1665623861.869062936,1 min=1665623860.904576818,0 seq=8
I221013 01:17:41.891414 1599 kv/txn.go:1027  [n1,client=127.0.0.1:47964,user=root] 481  retrying transaction: unnamed (id: 9786d77c-b479-4de4-ba56-8d8cbb5a9744) because of a retryable error: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: intent on key /Table/3/1/107/2/1): "unnamed" meta={id=9786d77c key=/Table/11/1/114/1/2022-10-13T01:22:47.49004Z/1/0 pri=100.00000000 epo=43 ts=1665623861.872580702,2 min=1665623860.911339152,0 seq=2} lock=true stat=PENDING rts=1665623861.869062936,0 wto=false gul=1665623861.411339152,0
I221013 01:17:41.892957 1599 kv/kvserver/concurrency/lock_table_waiter.go:499  [n1,s1,r7/1:/Table/{3-4}] 482  pushing timestamp of txn de34aa4c above 1665623861.891330033,0
I221013 01:17:41.899113 1599 kv/kvserver/concurrency/lock_table_waiter.go:499  [n1,s1,r7/1:/Table/{3-4}] 483  pushing timestamp of txn de34aa4c above 1665623861.891330033,0
I221013 01:17:41.904488 2148 kv/kvserver/concurrency/lock_table_waiter.go:499  [n1,s1,r16/1:/Table/1{5-6}] 484  pushing timestamp of txn de34aa4c above 1665623861.904478018,0
I221013 01:17:41.905191 1599 kv/kvserver/concurrency/lock_table_waiter.go:499  [n1,s1,r7/1:/Table/{3-4}] 485  pushing timestamp of txn de34aa4c above 1665623861.891330033,0
I221013 01:17:41.911035 1599 sql/catalog/lease/storage.go:145  [n1,client=127.0.0.1:47964,user=root] 486  storage attempting to acquire lease tabledesc.immutable: {ID: 114, Version: 1, ModificationTime: "1665623860.865910134,2", ParentID: 107, ParentSchemaID: 113, State: PUBLIC, NextColumnID: 3, Columns: [{ID: 1, TypeID: 20, Null: true}, {ID: 2, TypeID: 20, Null: false, Hidden: true, HasDefault: true}], NextFamilyID: 1, Families: [{ID: 0, Columns: [1, 2]}], PrimaryIndex: 1, NextIndexID: 2, Indexes: [{ID: 1, Unique: true, KeyColumns: [{ID: 2, Dir: ASC}], StoreColumns: [1]}]}@1665624174.429981986,0
I221013 01:17:41.912492 2190 kv/kvserver/batcheval/cmd_refresh.go:74  [n1,s1,r7/1:/Table/{3-4}] 487  refresh failed because of intent from txn: id=de34aa4c key=/Table/15/1/804629580697272321/0 pri=0.00256076 epo=0 ts=1665623861.891330033,1 min=1665623860.904576818,0 seq=8
I221013 01:17:41.912831 1599 kv/txn.go:1027  [n1,client=127.0.0.1:47964,user=root] 488  retrying transaction: unnamed (id: 9786d77c-b479-4de4-ba56-8d8cbb5a9744) because of a retryable error: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: intent on key /Table/3/1/107/2/1): "unnamed" meta={id=9786d77c key=/Table/11/1/114/1/2022-10-13T01:22:47.49004Z/1/0 pri=100.00000000 epo=44 ts=1665623861.893269755,2 min=1665623860.911339152,0 seq=2} lock=true stat=PENDING rts=1665623861.891330033,0 wto=false gul=1665623861.411339152,0
I221013 01:17:41.914387 1599 kv/kvserver/concurrency/lock_table_waiter.go:499  [n1,s1,r7/1:/Table/{3-4}] 489  pushing timestamp of txn de34aa4c above 1665623861.912750573,0
I221013 01:17:41.921070 1599 kv/kvserver/concurrency/lock_table_waiter.go:499  [n1,s1,r7/1:/Table/{3-4}] 490  pushing timestamp of txn de34aa4c above 1665623861.912750573,0
I221013 01:17:41.927691 1599 kv/kvserver/concurrency/lock_table_waiter.go:499  [n1,s1,r7/1:/Table/{3-4}] 491  pushing timestamp of txn de34aa4c above 1665623861.912750573,0
I221013 01:17:41.933828 1599 sql/catalog/lease/storage.go:145  [n1,client=127.0.0.1:47964,user=root] 492  storage attempting to acquire lease tabledesc.immutable: {ID: 114, Version: 1, ModificationTime: "1665623860.865910134,2", ParentID: 107, ParentSchemaID: 113, State: PUBLIC, NextColumnID: 3, Columns: [{ID: 1, TypeID: 20, Null: true}, {ID: 2, TypeID: 20, Null: false, Hidden: true, HasDefault: true}], NextFamilyID: 1, Families: [{ID: 0, Columns: [1, 2]}], PrimaryIndex: 1, NextIndexID: 2, Indexes: [{ID: 1, Unique: true, KeyColumns: [{ID: 2, Dir: ASC}], StoreColumns: [1]}]}@1665624167.998078354,0
I221013 01:17:41.935414 2243 kv/kvserver/batcheval/cmd_refresh.go:74  [n1,s1,r7/1:/Table/{3-4}] 493  refresh failed because of intent from txn: id=de34aa4c key=/Table/15/1/804629580697272321/0 pri=0.00256076 epo=0 ts=1665623861.912750573,1 min=1665623860.904576818,0 seq=8
I221013 01:17:41.935741 1599 kv/txn.go:1027  [n1,client=127.0.0.1:47964,user=root] 494  retrying transaction: unnamed (id: 9786d77c-b479-4de4-ba56-8d8cbb5a9744) because of a retryable error: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: intent on key /Table/3/1/107/2/1): "unnamed" meta={id=9786d77c key=/Table/11/1/114/1/2022-10-13T01:22:47.49004Z/1/0 pri=100.00000000 epo=45 ts=1665623861.922009370,2 min=1665623860.911339152,0 seq=2} lock=true stat=PENDING rts=1665623861.912750573,0 wto=false gul=1665623861.411339152,0
I221013 01:17:41.937273 1599 kv/kvserver/concurrency/lock_table_waiter.go:499  [n1,s1,r7/1:/Table/{3-4}] 495  pushing timestamp of txn de34aa4c above 1665623861.935666604,0

If I comment out this code to force it to be pushed a bit further out:

if !h.Timestamp.Synthetic {
	// Because we intend to read on the same node, we can limit this to a
	// clock reading from the local clock, relying on the fact that an
	// observed timestamp from this node will limit our local uncertainty
	// limit when we return to read.
	//
	// We intentionally do not use an observed timestamp directly to limit
	// the push timestamp, because observed timestamps are not applicable in
	// some cases (e.g. across lease changes). So to avoid an infinite loop
	// where we continue to push to an unusable observed timestamp and
	// continue to find the pushee in our uncertainty interval, we instead
	// use the present time to limit the push timestamp, which is less
	// optimal but is guaranteed to progress.
	//
	// There is some inherent raciness here, because the lease may move
	// between when we push and when we later read. In such cases, we may
	// need to push again, but expect to eventually succeed in reading,
	// either after lease movement subsides or after the reader's read
	// timestamp surpasses its global uncertainty limit.
	//
	// However, this argument only holds if we expect to be able to use a
	// local uncertainty limit when we return to read the pushed intent.
	// Notably, local uncertainty limits can not be used to ignore intents
	// with synthetic timestamps that would otherwise be in a reader's
	// uncertainty interval. This is because observed timestamps do not
	// apply to intents/values with synthetic timestamps. So if we know
	// that we will be pushing an intent to a synthetic timestamp, we
	// don't limit the value to a clock reading from the local clock.
	uncertaintyLimit.Backward(w.clock.Now())
}

Then I am no longer able to reproduce the issue.
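To make the livelock shape concrete, here is a toy model I put together (my own sketch, not CockroachDB code; `attemptsUntilRefresh` and its timestamp arithmetic are invented for illustration). It models only the retry pattern in the logs: each attempt the lease txn refreshes at the current clock, fails because the intent is still at or below that timestamp, and pushes the intent by some margin. A small margin (pushing only to roughly the local clock) never outruns the advancing clock; a larger margin does.

```go
package main

import "fmt"

// attemptsUntilRefresh simulates the retry loop: each iteration the lease txn
// attempts a preemptive refresh at the current clock reading, which fails if
// the conflicting intent sits at or below that timestamp; on failure it pushes
// the intent to refreshTS + pushMargin and retries. It returns the iteration
// on which the refresh first succeeds, or maxIters if it never does.
func attemptsUntilRefresh(pushMargin, maxIters int64) int64 {
	clock := int64(0)
	intentTS := int64(0) // conflicting txn's provisional timestamp
	for i := int64(1); i <= maxIters; i++ {
		clock += 2 // time advances between attempts
		refreshTS := clock
		if intentTS > refreshTS {
			return i // refresh succeeds: intent no longer conflicts
		}
		intentTS = refreshTS + pushMargin // push the intent, then retry
	}
	return maxIters // never converged: livelock
}

func main() {
	// A margin smaller than the clock advance per attempt never escapes the
	// loop; a larger margin lands the intent past the next refresh timestamp.
	fmt.Println("capped push:", attemptsUntilRefresh(1, 100))
	fmt.Println("further push:", attemptsUntilRefresh(10, 100))
}
```

Under this model the capped push retries until the iteration cap, while the larger push converges on the second attempt, which matches the observed effect of commenting out the clock-limiting branch.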

On the DR side, we can probably re-enable this test so that we have coverage for restoring grants. We can substantially lower the probability of it hanging by setting SET enable_implicit_transaction_for_batch_statements = 'false', so that we aren't running as many DDL statements in the same transaction and are therefore unlikely to need a lease on a descriptor we've already written to.

@stevendanna
Collaborator

closedts.TargetDuration.Override(context.Background(), &settings.SV, 10*time.Millisecond)
closedts.SideTransportCloseInterval.Override(context.Background(), &settings.SV, 10*time.Millisecond)

@stevendanna
Collaborator

I'm removing GA-blocker from this issue and have opened #89901 to unskip the test with a small mitigation.

I'll let schema decide if #89900 needs to be a GA-blocker, but it seems that lease acquisition transactions contending with DDL transactions is not a new issue in general:

#46414
#86324

craig bot pushed a commit that referenced this issue Oct 17, 2022
89901: backupccl: unskip restore-grants r=adityamaru a=stevendanna

This test was skipped because it timed out.

I believe #89900 is the likely cause of the timeout. Since this test doesn't depend on the shorter closed timestamp settings, we can reset them to make the timeout much less likely.

Fixes #87129

Release note: None

Co-authored-by: Steven Danna <[email protected]>
@craig craig bot closed this as completed in 300eef3 Oct 17, 2022
@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023