Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: failure in TestLearnerReplicateQueueRace #94993

Closed
knz opened this issue Jan 10, 2023 · 4 comments · Fixed by #95512
Closed

kv: failure in TestLearnerReplicateQueueRace #94993

knz opened this issue Jan 10, 2023 · 4 comments · Fixed by #95512
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-test-failure Broken test (automatically or manually discovered).

Comments

@knz
Copy link
Contributor

knz commented Jan 10, 2023

Found here: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/8262010?showRootCauses=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

=== RUN   TestLearnerReplicateQueueRace
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/7438ab096787a15a8ce5f502b5724798/logTestLearnerReplicateQueueRace1790270096
    test_log_scope.go:79: use -show-logs to present logs inline
    replica_learner_test.go:1117:
          Error Trace:  /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/7671/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/kv/kvserver/kvserver_test_/kvserver_test.runfiles/com_github_cockroachdb_cockroach/pkg/kv/kvserver/replica_learner_test.go:1117
          Error:        Received unexpected error:
                        AdminChangeReplicas error: trying to remove a replica that doesn't exist: {ChangeType:REMOVE_VOTER Target:n3,s3}
                        (1) AdminChangeReplicas error: trying to remove a replica that doesn't exist: {ChangeType:REMOVE_VOTER Target:n3,s3}
                          | -- cause hidden behind barrier
                          | AdminChangeReplicas error: trying to remove a replica that doesn't exist: {ChangeType:REMOVE_VOTER Target:n3,s3}
                          | (1) attached stack trace
                          |   -- stack trace:
                          |   | github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).changeReplicas
                          |   |   github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:762
                          |   | github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).RemoveVoters
                          |   |   github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:918
                          |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestLearnerReplicateQueueRace
                          |   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/replica_learner_test.go:1116
                          |   | testing.tRunner
                          |   |   GOROOT/src/testing/testing.go:1446
                          |   | runtime.goexit
                          |   |   GOROOT/src/runtime/asm_amd64.s:1594
                          | Wraps: (2) AdminChangeReplicas error
                          | Wraps: (3) forced error mark
                          |   | "invalid replication change"
                          |   | github.com/cockroachdb/errors/withstack/*withstack.withStack::
                          | Wraps: (4) assertion failure
                          | Wraps: (5)
                          |   | (opaque error wrapper)
                          |   | type name: github.com/cockroachdb/errors/withstack/*withstack.withStack
                          |   | reportable 0:
                          |   |
                          |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.validateRemovals
                          |   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_command.go:1457
                          |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.validateReplicationChanges
                          |   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_command.go:1626
                          |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).changeReplicasImpl
                          |   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_command.go:1013
                          |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).ChangeReplicas
                          |   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_command.go:991
                          |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeAdminBatch
                          |   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:955
                          |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).SendWithWriteBytes
                          |   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:187
                          |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).SendWithWriteBytes
                          |   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_send.go:205
                          |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).SendWithWriteBytes
                          |   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:203
                          |   | github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal
                          |   |   github.com/cockroachdb/cockroach/pkg/server/node.go:1121
                          |   | github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch
                          |   |   github.com/cockroachdb/cockroach/pkg/server/node.go:1173
                          |   | github.com/cockroachdb/cockroach/pkg/rpc.makeInternalClientAdapter.func1
                          |   |   github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:741
                          |   | github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.ServerInterceptor.func1
                          |   |   github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:95
                          |   | github.com/cockroachdb/cockroach/pkg/rpc.bindUnaryServerInterceptorToHandler.func1
                          |   |   github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:813
                          |   | github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func3
                          |   |   github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:260
                          |   | github.com/cockroachdb/cockroach/pkg/rpc.bindUnaryServerInterceptorToHandler.func1
                          |   |   github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:813
                          |   | github.com/cockroachdb/cockroach/pkg/rpc.kvAuth.unaryInterceptor
                          |   |   github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/auth.go:73
                          |   | github.com/cockroachdb/cockroach/pkg/rpc.bindUnaryServerInterceptorToHandler.func1
                          |   |   github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:813
                          |   | github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func1.1
                          |   |   github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:229
                          |   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr
                          |   |   github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:322
                          |   | github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func1
                          |   |   github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:227
                          |   | github.com/cockroachdb/cockroach/pkg/rpc.bindUnaryServerInterceptorToHandler.func1
                          |   |   github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:813
                          |   | github.com/cockroachdb/cockroach/pkg/rpc.makeInternalClientAdapter.func2
                          |   |   github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:751
                          |   | github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.ClientInterceptor.func2
                          |   |   github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:225
                          |   | github.com/cockroachdb/cockroach/pkg/rpc.getChainUnaryInvoker.func1
                          |   |   github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:897
                          |   | github.com/cockroachdb/cockroach/pkg/rpc.makeInternalClientAdapter.func3
                          |   |   github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:782
                          |   | github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.Batch
                          |   |   github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:905
                          |   | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch
                          |   |   github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:209
                          |   | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext
                          |   |   github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:188
                          |   | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas
                          |   |   github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:2142
                          |   | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch
                          |   |   github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1668
                          |   | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges
                          |   |   github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1240
                          |   | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send
                          |   |   github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:861
                          | Wraps: (6) trying to remove a replica that doesn't exist: {ChangeType:REMOVE_VOTER Target:n3,s3}
                          | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *markers.withMark (4) *assert.withAssertionFailure (5) *errbase.opaqueWrapper (6) *errutil.leafError
                        Error types: (1) *barriers.barrierErr
          Test:         TestLearnerReplicateQueueRace

cc @erikgrinaker @tbg for triage

Jira issue: CRDB-23264

@knz knz added C-test-failure Broken test (automatically or manually discovered). A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Jan 10, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jan 10, 2023

cc @cockroachdb/replication

@erikgrinaker
Copy link
Contributor

@aliher1911 can you have a look at this?

@aliher1911
Copy link
Contributor

Test tries to upreplicate while blocking the snapshot processing.

When failure happens it looks like change replicas fails and retries (again unsuccessfully).

I230116 21:15:56.229477 191599 13@kv/kvserver/replicate_queue.go:1208 ⋮ [T1,n1,s1,r55/1:‹/{Table/Max-Max}›] 7  adding voter n3,s3: ‹[1*:15, 2:0]›
I230116 21:15:56.230642 191599 13@kv/kvserver/replica_command.go:2333 ⋮ [T1,n1,s1,r55/1:‹/{Table/Max-Max}›] 8  change replicas (add [(n3,s3):3LEARNER] remove []): existing descriptor r55:‹/{Table/Max-Max}› [(n1,s1):1, (n2,s2):2, next=3, gen=3, sticky=9223372036.854775807,2147483647]
I230116 21:16:57.254873 191599 13@kv/kvserver/replica_command.go:2333 ⋮ [T1,n1,s1,r55/1:‹/{Table/Max-Max}›] 9  change replicas (add [(n3,s3):3LEARNER] remove []): existing descriptor r55:‹/{Table/Max-Max}› [(n1,s1):1, (n2,s2):2, next=3, gen=3, sticky=9223372036.854775807,2147483647]
I230116 21:16:57.275710 191599 13@kv/kvserver/replicate_queue.go:818 ⋮ [T1,n1,replicate,s1,r55/1:‹/{Table/Max-Max}›] 10  error processing replica: change replicas of r55 failed: ‹kv/kvserver/intentresolver/intent_resolver.go›:957: stopping‹›

And we see that failure is happening because we have conflicting transaction:

W230116 21:16:11.233392 187375 kv/kvserver/spanlatch/manager.go:558 ⋮ [T1,n1,s1,r55/1:‹/{Table/Max-Max}›] 197  have been waiting 15s to acquire ‹read› latch ‹/Local/Range/Table/Max/RangeDescriptor@0,0›, held by ‹write› latch ‹/Local/Range/Table/Max/RangeDescriptor@0,0›
W230116 21:16:11.252047 191599 kv/kvserver/spanlatch/manager.go:558 ⋮ [T1,n1,s1,r55/1:‹/{Table/Max-Max}›] 198  have been waiting 15s to acquire ‹read› latch ‹/Local/Range/Table/Max/RangeDescriptor@0,0›, held by ‹write› latch ‹/Local/Range/Table/Max/RangeDescriptor@0,0›
W230116 21:16:12.229424 191707 kv/kvserver/spanlatch/manager.go:558 ⋮ [T1,n1,s1,r55/1:‹/{Table/Max-Max}›] 200  have been waiting 15s to acquire ‹read› latch ‹/Local/Range/Table/Max/RangeDescriptor@0,0›, held by ‹write› latch ‹/Local/Range/Table/Max/RangeDescriptor@0,0›
W230116 21:16:12.605475 191737 kv/kvserver/spanlatch/manager.go:558 ⋮ [T1,n1,s1,r55/1:‹/{Table/Max-Max}›] 201  have been waiting 15s to acquire ‹write› latch ‹/Local/RangeID/55/r/AbortSpan/"ff3bb908-2ccf-43c8-ab5f-89cfa7aacc97"@0,0›, held by ‹read› latch ‹/Local/RangeID/55/r/AbortSpan/"ff3bb908-2ccf-43c8-ab5f-89cfa7aacc97"@0,0›
...
W230116 21:16:16.239261 191595 kv/kvserver/intentresolver/intent_resolver.go:799 ⋮ [T1] 207  failed to gc transaction record: ‹operation "cleanup txn record" timed out after 20.012s (given timeout 20s)›: could not GC completed transaction anchored at /Local/Range/Table/Max/‹RangeDescriptor›: context deadline exceeded

craig bot pushed a commit that referenced this issue Jan 19, 2023
95461: ui: remove reset sql stats for non-admin r=maryliag a=maryliag

Continuation from #95303

The previous PR missed the reset on the Transactions tab. This PR removes the reset sql stats for non-admin users.

Fixes #95213

Release note (ui change): Remove `reset sql stats` from Transactions page for non-admins.

95466: ingesting: fixup privileges granted during database restore r=rafiss a=adityamaru

Previously, all schemas and tables that were ingested as part of a database restore would "inherit" the privileges of the database. The database would be granted `CONNECT` for the `public` role and `ALL` to `admin` and `root`, and so all ingested schemas would have `ALL` for `admin` and `root`. Since 21.2 we have moved away from tables/schemas inheriting privileges from the parent database and so this logic is stale and partly incorrect. It is incorrect because the restored `public` schema does not have `CREATE` and `USAGE` granted to the `public` role. These privileges are always granted to `public` schemas of a database and so there is a discrepancy in restore's behaviour.

This change simplifies the logic to grant schemas and tables their default set of privileges if ingested via a database restore. It leaves the logic for cluster and table restores unchanged.

Release note (bug fix): fixes a bug where a database restore would not grant `CREATE` and `USAGE` on the public schema to the public role

Fixes: #95456

95467: schemachanger: a bunch of small fixes r=postamar a=postamar

Informs #88294.

Release note: None

95504: schemachangerccl: rename generated tests r=postamar a=postamar

This commit adds an underscore in the generated tests' name where there previously was one missing.

Informs #88294.

Release note: None

95512: kvserver: fix flaky TestLearnerReplicateQueueRace test r=tbg a=aliher1911

Test was not expecting raft snapshots and injected failure at wrong moments.

Fixes #94993

Release note: None

Co-authored-by: maryliag <[email protected]>
Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Marius Posta <[email protected]>
Co-authored-by: Oleg Afanasyev <[email protected]>
@craig craig bot closed this as completed in 41b07fa Jan 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-test-failure Broken test (automatically or manually discovered).
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants