roachtest: acceptance/version-upgrade failed [not using applied state] #62267
This looks like an instance of #58378.
@tbg and @irfansharif what do we want to do here? Should we mark #58378 as a GA-blocker?
Looks like a somewhat different issue:
There is no snapshot application racing with the cluster version bump here. Instead, it seems as though there's some invalid interleaving of replicaGC and (lease) request evaluation.
@tbg, I have cycles, I'll take a look here.
Well, it's not exactly that, but I think it's possible for replica GC to purge a replica and then, if that replica receives a raft message, for an uninitialized replica to be re-instated in the store map. I'm not sure yet though; this is pretty difficult to reproduce. I'm going to try and see what reverting #60429 tickles, since I suspect it's the same issue (where we're dealing with uninitialized replicas, and thus panicking). Hopefully that makes it easier to repro.
Hm, I'm not having much luck with this one. It's pretty difficult to repro (read: I couldn't) and none of my theories have panned out. My best guess as to what's happening is that the replicaGC process GCs an old replica, writing the range tombstone key and removing it from the store's replicas map. Perhaps concurrently, we're processing an old raft request, calling into store.tryGetOrCreateReplica. I don't see how (we appropriately check for the tombstone and the store map), but if that GC-ed replica were missing from the map, we might try to initialize a second incarnation of it here:
In doing so it'll read from the engine to find its replica state, but won't find any if it was GC-ed away, so it's as if it was "unset": cockroach/pkg/kv/kvserver/replica_init.go, lines 171 to 173, at 8b137b4
cockroach/pkg/kv/kvserver/replica_proposal.go, line 831, at 8b137b4
But again, I'm not actually seeing evidence of any of that. I tried to work backwards from the replica state possibly being unset, and this seems to be the only possible narrative for it, but I don't see how. A third thing that would need to happen is a lease request from this node, which would need to evaluate on the follower node. In the failure above, r21/3 on n3 is being GC-ed, and is also evaluating a proposal. Looking at the descriptor for r21, the replicas + leaseholders are elsewhere:
This "unset" replica state, coming about from an uninitialized replica being installed in the stores map (again, not even sure if that's what's happening), would also fit into what #58378 is seeing. I'm going to look at other things for now, maybe I'll think of something else while I do. I'd be very curious to see if anyone else has better luck (+cc @tbg). |
(roachtest).acceptance/version-upgrade failed on master@d891594d3c998f153b88f631e3c89ac7d12c2a6e:
Artifacts: /acceptance/version-upgrade
See this test on roachdash
The same kind of GC race as before, except it happened during teardown after the test had already successfully completed.
Another instance of this failure happening to me during merge: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_Roachtest/2824498
I'm still unable to repro; I've had about 50 runs with nothing. I'm still trying to tease out how it's possible for an uninitialized range to evaluate a proposal. My only guess is that if we're somehow not able to read the range tombstone the replicaGC process writes out, we could misconstrue that and instantiate a replica without the right replica state. It's also not clear why r21 (and r84) is even being GC-ed. Somewhat surprisingly, it's those very two replicas that are GC-ed in both failures, but there aren't any indications as to why (no merges/splits/change triggers). Actually, as I type that out, maybe we aren't logging things the same way we do in 21.1. I'll double check; I was able to see the same GC behavior for r21 and r84 on successful runs, so at least that should be easier to track down (and maybe hints at what's happening here).
So at least that's how the evaluation is happening. |
(roachtest).acceptance/version-upgrade failed on master@0dd303e2234c5efe21ba242aff74a5edfd5f5c40:
Artifacts: /acceptance/version-upgrade
See this test on roachdash
Looks like the same thing.
Hmm, latest (unverified) shower thought on how we could be running into this bug.
The close occurrence of replicaGC logs is explainable: r42/3 should be GC-ed. We'd only trigger this assertion if we evaluated anything against that replica, like a lease request, which we also know is happening (#62267 (comment)). I'll try writing tests for it, and for #58378. I think they're the same issue.
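For reference, here's a rough sketch of the kind of check that would trip here; this is an analogue with made-up names, not the real kvserver assertion: the finalized version gate says every replica must be using the applied state key, but the lease request evaluates on a replica whose in-memory state predates the migration.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical versions of the pieces involved: a cluster-version gate that
// has been finalized, and a replica whose in-memory state was never migrated.
type clusterSettings struct{ truncatedAndRangeAppliedStateActive bool }

type replica struct {
	usingAppliedStateKey bool
}

// evaluateLease stands in for request evaluation: once the version gate is
// active, evaluating on a replica that isn't using the applied state key is
// the inconsistency the assertion in this failure complains about.
func evaluateLease(st clusterSettings, r *replica) error {
	if st.truncatedAndRangeAppliedStateActive && !r.usingAppliedStateKey {
		return errors.New("not using applied state key in v21.1") // assertion analogue
	}
	return nil
}

func main() {
	st := clusterSettings{truncatedAndRangeAppliedStateActive: true}
	staleReplica := &replica{usingAppliedStateKey: false} // e.g. the un-GC-ed r42/3
	fmt.Println(evaluateLease(st, staleReplica))
}
```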
Fixes cockroachdb#58378. Fixes cockroachdb#62267. Previously it was possible for us to have replicas in-memory, with pre-migrated state, even after a migration was finalized. This led to the kind of badness we were observing in cockroachdb#62267, where it appeared that a replica was not using the applied state key despite us having migrated into it (see TruncatedAndRangeAppliedState, introduced in cockroachdb#58088).

To see how, consider the following set of events:
- Say r42 starts off on n1, n2, and n3.
- n3 flaps, and so we place a replica for r42 on n4.
- n3's replica, r42/3, is now GC-able, but still un-GC-ed.
- We run the applied state migration, first migrating all ranges into it and then purging outdated replicas.
- We should want to purge r42/3, because it's un-migrated and evaluating anything on it (say a lease request) is unsound: we've bumped version gates that tell the kvserver to always expect post-migration state.
- What happens when we try to purge r42/3? Prior to this PR, if it didn't have a replica version, we'd skip over it (!).
- Was it possible for r42/3 to not have a replica version? Shouldn't it have been accounted for when we migrated all ranges? No, and that's precisely why the migration infrastructure purges outdated replicas. The migrate request only returns once it's applied on all followers; in our example that wouldn't include r42/3 since it was no longer one.
- The stop-gap in cockroachdb#60429 made it so that we didn't GC r42/3, when we should've been doing the opposite. When iterating over a store's replicas for purging purposes, an empty replica version is fine and expected; we should interpret it as a signal that we're dealing with a replica that was obviously never migrated (to even start using replica versions in the first place). Because it didn't have a valid replica version installed, we can infer that it's soon to be GC-ed (else we wouldn't have been able to finalize the applied state + replica version migration).
- The conditions above made it possible for us to evaluate requests on replicas with migration state out of date relative to the store's version.
- Boom.

Release note: None
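A minimal sketch of the purging behavior described above, with illustrative types rather than the real kvserver API: a missing replica version is treated as purgeable instead of being skipped.

```go
package main

import "fmt"

// roachpbVersion is an illustrative stand-in for a cluster/replica version.
type roachpbVersion struct{ Major, Minor int32 }

type replica struct {
	rangeID int
	version *roachpbVersion // nil: the replica never had a version installed
}

// shouldPurge models the fixed iteration logic: a versionless replica can only
// be a stale, soon-to-be-GC-ed copy, so it must be purged rather than skipped.
func shouldPurge(r *replica, migrationVersion roachpbVersion) bool {
	if r.version == nil {
		return true // previously this case was skipped
	}
	return r.version.Major < migrationVersion.Major ||
		(r.version.Major == migrationVersion.Major && r.version.Minor < migrationVersion.Minor)
}

func main() {
	target := roachpbVersion{Major: 21, Minor: 1}
	replicas := []*replica{
		{rangeID: 42, version: nil}, // the r42/3 case from the commit message
		{rangeID: 7, version: &roachpbVersion{Major: 20, Minor: 2}},
		{rangeID: 9, version: &roachpbVersion{Major: 21, Minor: 1}},
	}
	for _, r := range replicas {
		fmt.Printf("r%d purge=%v\n", r.rangeID, shouldPurge(r, target))
	}
}
```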
60835: kv: bump timestamp cache to Pushee.MinTimestamp on PUSH_ABORT r=nvanbenschoten a=nvanbenschoten

Fixes #60779. Fixes #60580. We were only checking that the batch header timestamp was equal to or greater than this pushee's min timestamp, so this is as far as we can bump the timestamp cache.

62832: geo: minor performance improvement for looping over edges r=otan a=andyyang890

This patch slightly improves the performance of many spatial builtins by storing the number of edges used in for-loop conditions in a variable. We discovered this was taking a lot of time when profiling the point-in-polygon optimization.

Release note: None

62838: kvserver: purge gc-able, unmigrated replicas during migrations r=irfansharif a=irfansharif

Fixes #58378. Fixes #62267. (The full commit message is quoted in the PR comment above.)

62839: zonepb: make subzone DiffWithZone more accurate r=ajstorm a=otan

* Subzones may be defined in a different order. We did not take this into account, which can cause bugs when e.g. ADD REGION adds a subzone at the end rather than in the old "expected" location in the subzones array. This has been fixed by comparing subzones using an unordered map.
* The ApplyZoneConfig we previously did overwrote subzone fields on the original subzone array element, meaning that if there was a mismatch it would not be reported through validation. This is now fixed by applying the expected zone config to *zonepb.NewZoneConfig() instead.
* Added logic to only check zone config matches for subzones from active subzone IDs.
* Improved the error messaging when a subzone config is mismatched: namely, added index and partitioning information and differentiated between missing fields and missing/extraneous zone configs.

Resolves #62790.

Release note (bug fix): Fixed validation bugs during ALTER TABLE ... SET LOCALITY / crdb_internal.validate_multi_region_zone_config where validation errors could occur when the database of a REGIONAL BY ROW table has a new region added. Also fixed a validation bug where partition zone config mismatches were not caught.

62872: build: use -json for RandomSyntax test r=otan a=rafiss

I'm hoping this will help out with an issue where the test failures seem to be missing helpful logs.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
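As an aside on the #62832 change quoted above, the shape of that optimization is just hoisting a repeated edge-count call out of the loop condition. Here's a hedged sketch with a stand-in `shape` interface and `polyline` type (not the actual geo package API):

```go
package main

import "fmt"

// shape is a stand-in for an S2-style geometry with a potentially non-trivial
// edge-count accessor; the real types in the geo package differ.
type shape interface {
	NumEdges() int
	Edge(i int) [2]float64
}

// polyline is a toy implementation used only to exercise the two loops below.
type polyline [][2]float64

func (p polyline) NumEdges() int         { return len(p) - 1 }
func (p polyline) Edge(i int) [2]float64 { return p[i] }

// Before: the loop condition calls s.NumEdges() on every iteration.
func visitEdgesSlow(s shape) (n int) {
	for i := 0; i < s.NumEdges(); i++ {
		_ = s.Edge(i)
		n++
	}
	return n
}

// After: the edge count is computed once and reused, which is the gist of the
// change described in the commit message above.
func visitEdgesFast(s shape) (n int) {
	numEdges := s.NumEdges()
	for i := 0; i < numEdges; i++ {
		_ = s.Edge(i)
		n++
	}
	return n
}

func main() {
	p := polyline{{0, 0}, {1, 0}, {1, 1}, {0, 1}}
	fmt.Println(visitEdgesSlow(p), visitEdgesFast(p))
}
```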
(roachtest).acceptance/version-upgrade failed on master@24e76d399047857a3230cd0c3dd2dcea42c4292a:
Artifacts: /acceptance/version-upgrade
Related:
- #58307 roachtest: acceptance/version-upgrade failed: invalid attempted write of database descriptor (A-schema-descriptors C-bug C-test-failure O-roachtest branch-release-20.2)
- #53812 roachtest: acceptance/version-upgrade failed (C-test-failure O-roachtest O-robot branch-release-19.2)
See this test on roachdash