roachtest: acceptance/version-upgrade failed [not using applied state] #62267

Closed
cockroach-teamcity opened this issue Mar 19, 2021 · 17 comments · Fixed by #62838
Labels: branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-roachtest O-robot Originated from a bot.

@cockroach-teamcity
Member

(roachtest).acceptance/version-upgrade failed on master@24e76d399047857a3230cd0c3dd2dcea42c4292a:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: artifacts/acceptance/version-upgrade/run_1
	versionupgrade.go:296,versionupgrade.go:471,versionupgrade.go:453,versionupgrade.go:200,versionupgrade.go:188,acceptance.go:63,acceptance.go:104,test_runner.go:768: dial tcp 127.0.0.1:26261: connect: connection refused

	cluster.go:1667,context.go:140,cluster.go:1656,test_runner.go:849: dead node detection: /go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor local --oneshot --ignore-empty-nodes: exit status 1 3: dead
		2: 23562
		1: 23903
		4: 23677
		Error: UNCLASSIFIED_PROBLEM: 3: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 3: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError

Artifacts: /acceptance/version-upgrade
See this test on roachdash

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 19, 2021
@nvanbenschoten
Member

kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123  not using applied state key in v21.1
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !goroutine 2574 [running]:
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x8447f01, 0xc002f51328, 0x10000000049cd4d, 0x7fd58bb7da10)
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb9
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0xc0003745e0, 0xc00227f410, 0x24, 0x3, 0x0, 0x0, 0x0, 0x166dd6f14f1e3fcb, 0x400000000, 0x0, ...)
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:279 +0xc32
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepth(0x5a106e0, 0xc00278d4c0, 0x1, 0x4, 0x4c5c41b, 0x24, 0x0, 0x0, 0x0)
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/channels.go:58 +0x198
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !github.com/cockroachdb/cockroach/pkg/util/log.Fatal(...)
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log_channels_generated.go:850
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateProposal(0xc00181b600, 0x5a106e0, 0xc00278d4c0, 0xc0026b6ab0, 0x8, 0xc003266fa0, 0x0, 0x0, 0xc001d92300, 0x0, ...)
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_proposal.go:841 +0x63f
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).requestToProposal(0xc00181b600, 0x5a106e0, 0xc00278d4c0, 0xc0026b6ab0, 0x8, 0xc003266fa0, 0x1601fb530f0b5318, 0x0, 0xc0019424a0, 0x400000004, ...)
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_proposal.go:876 +0xb6
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evalAndPropose(0xc00181b600, 0x5a106e0, 0xc00278d4c0, 0xc003266fa0, 0xc000fe82a0, 0x1601fb530f0b5318, 0x0, 0xc0019424a0, 0x400000004, 0x2, ...)
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:84 +0x197
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeWriteBatch(0xc00181b600, 0x5a106e0, 0xc00278d4c0, 0xc003266fa0, 0xc000fe82a0, 0x0, 0x0, 0x0)
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_write.go:138 +0x7c5
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc00181b600, 0x5a106e0, 0xc00278d4c0, 0xc003266fa0, 0x52c8d58, 0x0, 0x0)
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:275 +0x336
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc00181b600, 0x5a106e0, 0xc00278d4c0, 0x15, 0xc003266fa0, 0x0, 0x0)
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:95 +0x55d
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(...)
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:34
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*pendingLeaseRequest).requestLeaseAsync.func2(0x5a106e0, 0xc00278d4c0)
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_range_lease.go:421 +0x6b0
kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123 !github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc001000100, 0x5a106e0, 0xc00278d4c0, 0xc00113b1c0, 

@nvanbenschoten
Member

This looks like an instance of #58378.

@nvanbenschoten
Member

@tbg and @irfansharif what do we want to do here? Should we mark #58378 as a GA-blocker?

@tbg
Member

tbg commented Mar 23, 2021

Looks like a somewhat different issue:

I210319 20:02:53.198611 2827 kv/kvserver/store_remove_replica.go:127 ⋮ [n3,replicaGC,s3,r21/3:‹/Table/6{0-1}›] 121  removing replica r21/3
I210319 20:02:53.198658 2574 1@kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 122  the server is terminating due to a fatal error (see the DEV channel for details)
I210319 20:02:53.208924 2827 kv/kvserver/replica_destroy.go:170 ⋮ [n3,replicaGC,s3,r21/3:‹/Table/6{0-1}›] 129  removed 10 (0+10) keys in 10ms [clear=0ms commit=10ms]
F210319 20:02:53.198720 2574 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 123  not using applied state key in v21.1

There is no snapshot application racing with the cluster version bump here. Instead, it seems as though there's some invalid interleaving of replicaGC and (lease) request evaluation.

@tbg tbg added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 23, 2021
@tbg tbg changed the title roachtest: acceptance/version-upgrade failed roachtest: acceptance/version-upgrade failed [replicaGC/eval race] Mar 23, 2021
@irfansharif
Contributor

@tbg, I have cycles, I'll take a look here.

@irfansharif irfansharif self-assigned this Mar 24, 2021
@irfansharif
Contributor

irfansharif commented Mar 25, 2021

This looks like an instance of #58378.

Well, it's not exactly that, but I think it's possible for replica GC to purge a replica and, if that replica then receives a raft message, for that to re-instate an uninitialized replica in the store map. I'm not sure yet though; this is pretty difficult to reproduce. I'm going to try and see what reverting #60429 tickles; I suspect it's the same issue (where we're dealing with uninitialized replicas, and thus panicking). Hopefully that makes it easier to repro.

@irfansharif
Contributor

irfansharif commented Mar 26, 2021

Hm, I'm not having much luck with this one. It's pretty difficult to repro (read: I couldn't) and none of my theories have panned out. My best guess as to what's happening is that the replicaGC process GCs an old replica, writing the range tombstone key and removing it from the store's replicas map. Perhaps concurrently, we're processing an old raft request, calling into store.tryGetOrCreateReplica. I don't see how (we appropriately check for the tombstone and the store map), but if that GC-ed replica were missing from the map, we might try to initialize a second incarnation of it here:

return repl.loadRaftMuLockedReplicaMuLocked(uninitializedDesc)

In doing so it'll read from the engine to find its replica state, but won't find any if it was GC-ed away, so it's as if it was "unset":

if r.mu.state, err = r.mu.stateLoader.Load(ctx, r.Engine(), desc); err != nil {
    return err
}

usingAppliedStateKey := r.mu.state.UsingAppliedStateKey
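
To make the "unset" claim concrete, here's a toy model (replicaState, loadState, and the map-backed engine below are illustrative stand-ins, not kvserver code): a state load over a keyspace that replicaGC already cleared has nothing to read, and leaves the caller with the zero value.

package main

import "fmt"

// replicaState is a toy stand-in for kvserver's ReplicaState; only the
// field relevant to the fatal assertion is modeled.
type replicaState struct {
    UsingAppliedStateKey bool
}

// loadState models a state load over a possibly GC-ed keyspace: a missing
// entry yields the zero value, i.e. UsingAppliedStateKey == false.
func loadState(engine map[string]replicaState, rangeID string) replicaState {
    return engine[rangeID] // missing key => zero value, as if "unset"
}

func main() {
    engine := map[string]replicaState{} // r21's state was GC-ed away
    st := loadState(engine, "r21")
    fmt.Println(st.UsingAppliedStateKey) // false => would trip the v21.1 assertion
}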

But again, I'm not actually seeing evidence of any of that. I tried to work backwards from the replica state possibly being unset, and this seems to be the only possible narrative for it, but I don't see how. A third thing that would need to happen is a lease request from this node, which would need to evaluate on the follower node. In the failure above, r21/3 on n3 is being GC-ed, and is also evaluating a proposal. Looking at the descriptor for r21, the replicas + leaseholders are elsewhere:

      "desc": {
        "range_id": 21,
        "start_key": "xA==",
        "end_key": "xQ==",
        "internal_replicas": [
          {
            "node_id": 2,
            "store_id": 2,
            "replica_id": 4
          },
          {
            "node_id": 4,
            "store_id": 4,
            "replica_id": 2
          },
          {
            "node_id": 1,
            "store_id": 1,
            "replica_id": 5
          }
        ],
        "next_replica_id": 6,
        "generation": 3,
        "deprecated_generation_comparable": true
      },

This "unset" replica state, coming about from an uninitialized replica being installed in the stores map (again, not even sure if that's what's happening), would also fit into what #58378 is seeing. I'm going to look at other things for now, maybe I'll think of something else while I do. I'd be very curious to see if anyone else has better luck (+cc @tbg).

@cockroach-teamcity
Member Author

(roachtest).acceptance/version-upgrade failed on master@d891594d3c998f153b88f631e3c89ac7d12c2a6e:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/version-upgrade/run_1
	cluster.go:1667,context.go:140,cluster.go:1656,test_runner.go:849: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2823051-1616997898-13-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 3: dead
		4: 7239
		2: 7363
		1: 7825
		Error: UNCLASSIFIED_PROBLEM: 3: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 3: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError

Artifacts: /acceptance/version-upgrade
See this test on roachdash

@irfansharif
Contributor

irfansharif commented Mar 29, 2021

I210329 06:10:20.339270 9269 1@kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 242  the server is terminating due to a fatal error (see the DEV channel for details)
I210329 06:10:20.347575 9282 kv/kvserver/store_remove_replica.go:127 ⋮ [n3,replicaGC,s3,r21/3:‹/Table/6{0-1}›] 244  removing replica r21/3
I210329 06:10:20.348110 9282 kv/kvserver/replica_destroy.go:170 ⋮ [n3,replicaGC,s3,r21/3:‹/Table/6{0-1}›] 245  removed 10 (0+10) keys in 0ms [clear=0ms commit=0ms]
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243  not using applied state key in v21.1

The same kind of GC race as before, except it happened during teardown after the test had already successfully completed.

@aliher1911
Contributor

Another instance of this failure happening to me during merge: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_Roachtest/2824498

@irfansharif
Contributor

I'm still unable to repro; I've had about 50 runs with nothing. I'm still trying to tease out how it's possible for an uninitialized range to evaluate a proposal. My only guess is that if we're somehow not able to read the range tombstone the replicaGC process writes out, we could misconstrue that and instantiate a replica without the right replica state. It's also not clear why r21 (and r84) is even being GC-ed. Somewhat surprisingly, the same two replicas are GC-ed in both failures, but there aren't any indications as to why (no merges/splits/change triggers).

Actually, as I type that out, maybe we aren't logging things in the same way we do in 21.1. I'll double check; I was able to see the same GC behavior for r21 and r84 on successful runs, so at least that should be easier to track down (and maybe hints at what's happening here).

@irfansharif
Contributor

A third thing that would need to happen is a lease request from this node, which would need to evaluate on the follower node.

F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243  not using applied state key in v21.1
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !goroutine 9269 [running]:
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x8661d01, 0x203001, 0x203000, 0x10)
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb9
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0xc0004b7b80, 0xc0026abce0, 0x24, 0x3, 0x0, 0x0, 0x0, 0x1670bb506dc3ea5a, 0x400000000, 0x0, ...)
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:279 +0xc32
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepth(0x5a1fb20, 0xc001740998, 0x1, 0x4, 0x4c45415, 0x24, 0x0, 0x0, 0x0)
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/channels.go:58 +0x198
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !github.com/cockroachdb/cockroach/pkg/util/log.Fatal(...)
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log_channels_generated.go:850
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateProposal(0xc001c89200, 0x5a1fb20, 0xc001740998, 0xc000b21110, 0x8, 0xc002bd80a0, 0x0, 0x0, 0xc003301500, 0x0, ...)
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_proposal.go:841 +0x63f
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).requestToProposal(0xc001c89200, 0x5a1fb20, 0xc001740998, 0xc000b21110, 0x8, 0xc002bd80a0, 0x1601fb530f0b5318, 0x0, 0xc001c863e0, 0x400000004, ...)
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_proposal.go:876 +0xb6
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evalAndPropose(0xc001c89200, 0x5a1fb20, 0xc001740998, 0xc002bd80a0, 0xc0008fd490, 0x1601fb530f0b5318, 0x0, 0xc001c863e0, 0x400000004, 0x2, ...)
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:84 +0x197
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeWriteBatch(0xc001c89200, 0x5a1fb20, 0xc001740998, 0xc002bd80a0, 0xc0008fd490, 0x0, 0x0, 0x0)
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_write.go:138 +0x7c5
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc001c89200, 0x5a1fb20, 0xc001740998, 0xc002bd80a0, 0x52b2e70, 0x0, 0x0)
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:275 +0x336
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc001c89200, 0x5a1fb20, 0xc001740998, 0x15, 0xc002bd80a0, 0x0, 0xc001ca5f40)
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:95 +0x55d
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(...)
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:34
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*pendingLeaseRequest).requestLeaseAsync.func2(0x5a1fb20, 0xc001740998)
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_range_lease.go:421 +0x6b0
F210329 06:10:20.339341 9269 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 243 !github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc0003d6c00, 0x5a1fb20, 0xc001740998, 0xc001740820, 0xc002bd8000)

So at least that's how the evaluation is happening.

@cockroach-teamcity
Member Author

(roachtest).acceptance/version-upgrade failed on master@0dd303e2234c5efe21ba242aff74a5edfd5f5c40:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: artifacts/acceptance/version-upgrade/run_1
	versionupgrade.go:296,versionupgrade.go:471,versionupgrade.go:453,versionupgrade.go:200,versionupgrade.go:188,acceptance.go:63,acceptance.go:104,test_runner.go:768: pq: operation "show cluster setting version" timed out after 2m0s: value differs between local setting ([18 8 8 20 16 2 24 0 32 48]) and KV ([18 8 8 20 16 2 24 0 32 0]); try again later (<nil> after 1m59.87588872s)

	cluster.go:1667,context.go:140,cluster.go:1656,test_runner.go:849: dead node detection: /go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor local --oneshot --ignore-empty-nodes: exit status 1 3: dead
		2: 25233
		1: 25003
		4: 25119
		Error: UNCLASSIFIED_PROBLEM: 3: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 3: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError

Artifacts: /acceptance/version-upgrade
See this test on roachdash

@tbg
Member

tbg commented Mar 30, 2021

Looks like the same thing:

I210329 22:02:02.221682 4379 1@kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 294  the server is terminating due to a fatal error (see the DEV channel for details)
I210329 22:02:02.221776 4362 kv/kvserver/store_remove_replica.go:127 ⋮ [n3,replicaGC,s3,r21/3:‹/Table/6{0-1}›] 295  removing replica r21/3
W210329 22:02:02.238841 3855 kv/kvserver/intentresolver/intent_resolver.go:758 ⋮ [-] 297  failed to gc transaction record: could not GC completed transaction anchored at ‹/Table/15/1/645480192251330563›: ‹node unavailable; try another peer›
W210329 22:02:02.238839 4335 kv/kvserver/replica_write.go:206 ⋮ [n3,s3,r10/3:‹/Table/1{5-6}›] 298  during async intent resolution: ‹node unavailable; try another peer›
W210329 22:02:02.238950 3713 kv/kvserver/closedts/sidetransport/receiver.go:125 ⋮ [n3] 299  closed timestamps side-transport connection dropped from node: 1
W210329 22:02:02.239198 457 jobs/registry.go:729 ⋮ [-] 301  canceling all adopted jobs due to stopper quiescing
I210329 22:02:02.239075 4387 1@vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 ⋮ [n3] 300  circuitbreaker: ‹rpc 127.0.0.1:26261 [n2]› tripped: failed to connect to n2 at ‹127.0.0.1:26259›: stopped
I210329 22:02:02.239260 4387 1@vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 ⋮ [n3] 302  circuitbreaker: ‹rpc 127.0.0.1:26261 [n2]› event: ‹BreakerTripped›
W210329 22:02:02.239251 3827 kv/kvserver/closedts/sidetransport/receiver.go:125 ⋮ [n3] 303  closed timestamps side-transport connection dropped from node: 2
W210329 22:02:02.239383 3664 kv/kvserver/closedts/sidetransport/receiver.go:125 ⋮ [n3] 304  closed timestamps side-transport connection dropped from node: 4
W210329 22:02:02.239432 456 sql/sqlliveness/slinstance/slinstance.go:183 ⋮ [n3] 305  exiting heartbeat loop
W210329 22:02:02.239261 4087 kv/txn.go:635 ⋮ [n3,intExec=‹poll-show-jobs›,migration-mgr] 306  failure aborting transaction: ‹node unavailable; try another peer›; abort caused by: query execution canceled
W210329 22:02:02.239854 4333 kv/txn.go:635 ⋮ [n3,job=‹645480192964231171›,migration=20.2-50] 307  failure aborting transaction: ‹node unavailable; try another peer›; abort caused by: aborted during DistSender.Send: context canceled
W210329 22:02:02.239952 1524 kv/txn.go:635 ⋮ [n3,intExec=‹set-version›] 308  failure aborting transaction: ‹node unavailable; try another peer›; abort caused by: polling for queued jobs to complete: ‹poll-show-jobs›: context canceled
W210329 22:02:02.240167 1524 kv/txn.go:635 ⋮ [n3,intExec=‹set-version›] 309  failure aborting transaction: ‹node unavailable; try another peer›; abort caused by: query execution canceled
I210329 22:02:02.240271 476 server/auto_upgrade.go:75 ⋮ [n3] 310  error when finalizing cluster version upgrade: ‹set-version›: context canceled
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296  not using applied state key in v21.1
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !goroutine 4379 [running]:
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x8492201, 0xc001ae7328, 0x10000000049cd4d, 0x7f28a69a6420)
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb9
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0xc000d60940, 0xc0028af1d0, 0x24, 0x3, 0x0, 0x0, 0x0, 0x1670ef3f7f2f5d59, 0x400000000, 0x0, ...)
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:279 +0xc32
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepth(0x5a45f00, 0xc0025d9640, 0x1, 0x4, 0x4c8994b, 0x24, 0x0, 0x0, 0x0)
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/channels.go:58 +0x198
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !github.com/cockroachdb/cockroach/pkg/util/log.Fatal(...)
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log_channels_generated.go:850
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateProposal(0xc001c32000, 0x5a45f00, 0xc0025d9640, 0xc001cfc178, 0x8, 0xc001244e60, 0x0, 0x0, 0xc001e64420, 0x0, ...)
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_proposal.go:841 +0x63f
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).requestToProposal(0xc001c32000, 0x5a45f00, 0xc0025d9640, 0xc001cfc178, 0x8, 0xc001244e60, 0x1601fb530f0b5318, 0x0, 0xc00231b910, 0x400000004, ...)
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_proposal.go:876 +0xb6
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evalAndPropose(0xc001c32000, 0x5a45f00, 0xc0025d9640, 0xc001244e60, 0xc000eb4c40, 0x1601fb530f0b5318, 0x0, 0xc00231b910, 0x400000004, 0x2, ...)
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:84 +0x197
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeWriteBatch(0xc001c32000, 0x5a45f00, 0xc0025d9640, 0xc001244e60, 0xc000eb4c40, 0x0, 0x0, 0x0)
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_write.go:138 +0x7c5
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc001c32000, 0x5a45f00, 0xc0025d9640, 0xc001244e60, 0x52f7688, 0x0, 0x0)
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:275 +0x336
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc001c32000, 0x5a45f00, 0xc0025d9640, 0x15, 0xc001244e60, 0xc0010ce2b0, 0xc0010ce31a)
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:95 +0x55d
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(...)
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:34
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*pendingLeaseRequest).requestLeaseAsync.func2(0x5a45f00, 0xc0025d9640)
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_range_lease.go:421 +0x6b0
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc000c8e880, 0x5a45f00, 0xc0025d9640, 0xc000ca3680, 0xc0032a15e0)
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:351 +0xb9
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
F210329 22:02:02.221796 4379 kv/kvserver/replica_proposal.go:841 ⋮ [n3,s3,r21/3:‹/Table/6{0-1}›] 296 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:346 +0xfc

@irfansharif
Contributor

It's probably unrelated, but because I'm out of ideas here, I'm going to try poking at #58378 instead. It's a lot more reproducible and, I think, in the same ballpark of "uninitialized raft state on a replica" (running a build that reverts #60429).

@irfansharif
Contributor

irfansharif commented Mar 30, 2021

Hmmmmmmmm, latest (unverified) shower thought on how we could be running into this bug (a minimal sketch of the purge loop in question follows the list):

  • Say r42 starts off on n1, n2, and n3.
  • This test restarts nodes, so eventually we get a placement for r42 on n1, n2, and n4.
  • n3's replica, r42/3, is now GC-able, but say that n3 hasn't GC-ed it yet
  • As we run the applied state migration, we first migrate all ranges and then purge outdated replicas
  • Well, we should want to purge r42/3, cause it's un-migrated and evaluating anything on it (say a lease request) is unsound because we've bumped version gates that tell the server to always expect post-migration state
  • What happens when we try to purge r42/3? If it doesn't have a replica version (because #58378 is still unaddressed), we'll skip over it (!)
  • Is it possible for it to not have a replica version? Shouldn't it be accounted for when we migrate all ranges? No, that's precisely why we have to purge outdated replicas. The migrate request returns once it's applied on all followers; in our example that won't include r42/3 since it's no longer a follower
  • The stop-gap in #60429 makes it so that we don't GC r42/3, when we should be doing the opposite. When iterating over a store's replicas for purging purposes, a nil replica version is fine and expected; we should read that as a signal that we're dealing with a replica that is definitely to be GC-ed and was never migrated (and thus doesn't have a valid replica version installed)
  • We can now evaluate requests on a replica that's unmigrated relative to the store's version. Boom.
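
A minimal, self-contained sketch of that purge loop (every type and helper here is an illustrative stand-in, not the kvserver API; the zero version value stands in for a missing replica version):

package main

import "fmt"

// version is a toy stand-in for roachpb.Version; the zero value means "no
// replica version persisted", i.e. the replica was never migrated.
type version struct{ Major, Minor int32 }

func (v version) less(o version) bool {
    return v.Major < o.Major || (v.Major == o.Major && v.Minor < o.Minor)
}

type replica struct {
    id string
    v  version
}

// purgeOutdated models the behavior argued for above: an empty replica
// version is expected and means "definitely GC-able", so such replicas are
// enqueued for GC instead of being skipped (the skip is what left a stale,
// unmigrated replica around to evaluate lease requests).
func purgeOutdated(replicas []replica, minVersion version, gc func(replica)) {
    for _, r := range replicas {
        if r.v == (version{}) {
            gc(r) // the stop-gap behavior skipped these (!)
            continue
        }
        if r.v.less(minVersion) {
            gc(r)
        }
    }
}

func main() {
    rs := []replica{
        {id: "r42/3", v: version{}},                   // stale, never migrated
        {id: "r7/1", v: version{Major: 20, Minor: 2}}, // outdated
        {id: "r9/2", v: version{Major: 21, Minor: 1}}, // current
    }
    purgeOutdated(rs, version{Major: 21, Minor: 1}, func(r replica) {
        fmt.Println("enqueueing", r.id, "for replicaGC")
    })
}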

@irfansharif
Contributor

The close occurrence of replicaGC logs is explainable: r42/3 should be GC-ed. We'd only trigger this assertion if we evaluated anything against that replica, like a lease request, which we also know is happening (#62267 (comment)). I'll try writing tests for it, and for #58378. I think they're the same issue.

@irfansharif irfansharif changed the title roachtest: acceptance/version-upgrade failed [replicaGC/eval race] roachtest: acceptance/version-upgrade failed [not using applied state] Mar 31, 2021
irfansharif added a commit to irfansharif/cockroach that referenced this issue Mar 31, 2021
Fixes cockroachdb#58378.
Fixes cockroachdb#62267.

Previously it was possible for us to have replicas in-memory, with
pre-migrated state, even after a migration was finalized. This led to
the kind of badness we were observing in cockroachdb#62267, where it appeared that
a replica was not using the applied state key despite us having migrated
into it (see TruncatedAndRangeAppliedState, introduced in cockroachdb#58088).

---

To see how, consider the following set of events:

- Say r42 starts off on n1, n2, and n3
- n3 flaps and so we place a replica for r42 on n4
- n3's replica, r42/3, is now GC-able, but still un-GC-ed
- We run the applied state migration, first migrating all ranges into it
  and then purging outdated replicas
- Well, we should want to purge r42/3, cause it's un-migrated and
  evaluating anything on it (say a lease request) is unsound because
  we've bumped version gates that tell the kvserver to always expect
  post-migration state
- What happens when we try to purge r42/3? Previous to this PR if it
  didn't have a replica version, we'd skip over it (!)
- Was it possible for r42/3 to not have a replica version? Shouldn't it
  have been accounted for when we migrated all ranges? No, that's precisely
  why the migration infrastructure purges outdated replicas. The migrate
  request only returns once it's applied on all followers; in our example
  that wouldn't include r42/3 since it was no longer one
- The stop-gap in cockroachdb#60429 made it so that we didn't GC r42/3, when we
  should've been doing the opposite. When iterating over a store's
  replicas for purging purposes, an empty replica version is fine and
  expected; we should interpret that as signal that we're dealing with a
  replica that was obviously never migrated (to even start using replica
  versions in the first place). Because it didn't have a valid replica
  version installed, we can infer that it's soon to be GC-ed (else we
  wouldn't have been able to finalize the applied state + replica
  version migration)
- The conditions above made it possible for us to evaluate requests on
  replicas with migration state out-of-date relative to the store's
  version
- Boom

Release note: None
craig bot pushed a commit that referenced this issue Mar 31, 2021
60835: kv: bump timestamp cache to Pushee.MinTimestamp on PUSH_ABORT r=nvanbenschoten a=nvanbenschoten

Fixes #60779.
Fixes #60580.

We were only checking that the batch header timestamp was equal to or
greater than this pushee's min timestamp, so this is as far as we can
bump the timestamp cache.
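
To make that concrete, a toy sketch (timestamp, tsCache, and onPushAbort are illustrative stand-ins, not the actual timestamp-cache API): the pushee's MinTimestamp is the earliest timestamp its record could ever be written at, so it's the highest watermark the pusher can safely install.

package main

import "fmt"

type timestamp struct{ wallNanos int64 }

func (t timestamp) less(o timestamp) bool { return t.wallNanos < o.wallNanos }

// tsCache is a toy stand-in for the replica's timestamp cache, keyed here
// by the pushee's transaction record key.
type tsCache map[string]timestamp

// onPushAbort ratchets the cache entry up to the pushee's MinTimestamp so
// an aborted txn can't later recreate its record below that watermark.
func (c tsCache) onPushAbort(pusheeRecordKey string, pusheeMinTS timestamp) {
    if cur, ok := c[pusheeRecordKey]; !ok || cur.less(pusheeMinTS) {
        c[pusheeRecordKey] = pusheeMinTS
    }
}

func main() {
    c := tsCache{}
    c.onPushAbort("/Txn/abc", timestamp{wallNanos: 100})
    fmt.Println(c["/Txn/abc"]) // {100}
}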

62832: geo: minor performance improvement for looping over edges r=otan a=andyyang890

This patch slightly improves the performance of many
spatial builtins by storing the number of edges used
in for-loop conditions in a variable.
We discovered this was taking a lot of time when
profiling the point-in-polygon optimization.
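
Illustratively (the loop type below is a toy, not the geo package's API), the change amounts to hoisting the edge-count call out of the loop condition so it's evaluated once rather than on every iteration:

package main

import "fmt"

// loop is a toy geometry type; imagine NumEdges is expensive enough that
// calling it on every iteration shows up in profiles.
type loop struct{ edges []float64 }

func (l *loop) NumEdges() int { return len(l.edges) }

func sumEdges(l *loop) float64 {
    var total float64
    n := l.NumEdges() // hoisted: the condition used to be i < l.NumEdges()
    for i := 0; i < n; i++ {
        total += l.edges[i]
    }
    return total
}

func main() {
    fmt.Println(sumEdges(&loop{edges: []float64{1, 2, 3}}))
}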

Release note: None

62838: kvserver: purge gc-able, unmigrated replicas during migrations r=irfansharif a=irfansharif

Fixes #58378.
Fixes #62267.

Previously it was possible for us to have replicas in-memory, with
pre-migrated state, even after a migration was finalized. This led to
the kind of badness we were observing in #62267, where it appeared that
a replica was not using the applied state key despite us having migrated
into it (see TruncatedAndRangeAppliedState, introduced in #58088).

---

To see how, consider the following set of events:

- Say r42 starts off on n1, n2, and n3
- n3 flaps and so we place a replica for r42 on n4
- n3's replica, r42/3, is now GC-able, but still un-GC-ed
- We run the applied state migration, first migrating all ranges into it
  and then purging outdated replicas
- Well, we should want to purge r42/3, cause it's un-migrated and
  evaluating anything on it (say a lease request) is unsound because
  we've bumped version gates that tell the kvserver to always expect
  post-migration state
- What happens when we try to purge r42/3? Previous to this PR if it
  didn't have a replica version, we'd skip over it (!)
- Was it possible for r42/3 to not have a replica version? Shouldn't it
  have been accounted for when we migrated all ranges? No, that's precisely
  why the migration infrastructure purges outdated replicas. The migrate
  request only returns once it's applied on all followers; in our example
  that wouldn't include r42/3 since it was no longer one
- The stop-gap in #60429 made it so that we didn't GC r42/3, when we
  should've been doing the opposite. When iterating over a store's
  replicas for purging purposes, an empty replica version is fine and
  expected; we should interpret that as signal that we're dealing with a
  replica that was obviously never migrated (to even start using replica
  versions in the first place). Because it didn't have a valid replica
  version installed, we can infer that it's soon to be GC-ed (else we
  wouldn't have been able to finalize the applied state + replica
  version migration)
- The conditions above made it possible for us to evaluate requests on
  replicas with migration state out-of-date relative to the store's
  version
- Boom

Release note: None


62839: zonepb: make subzone DiffWithZone more accurate r=ajstorm a=otan

* Subzones may be defined in a different order. We did not take this
  into account, which could cause bugs when e.g. ADD REGION adds a subzone
  at the end rather than in the old "expected" location in the subzones
  array. This has been fixed by comparing subzones using an unordered
  map (sketched after this list).
* The ApplyZoneConfig we previously did overwrote subzone fields on the
  original subzone array element, meaning that if there was a mismatch
  it would not be reported through validation. This is now fixed by
  applying the expected zone config to *zonepb.NewZoneConfig() instead.
* Added logic to only check for zone config matches subzones from
  active subzone IDs.
* Improved the error messaging when a subzone config is mismatching:
  namely, added index and partitioning information and differentiated
  between missing fields and missing / extraneous zone configs
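
A hedged sketch of the unordered comparison from the first bullet (subzone and subzonesMatch are toy stand-ins for the zonepb types): keying subzones by (index ID, partition) makes the diff insensitive to array order.

package main

import "fmt"

// subzone is a toy stand-in for zonepb.Subzone.
type subzone struct {
    IndexID   uint32
    Partition string
    Config    string
}

func key(s subzone) string { return fmt.Sprintf("%d/%s", s.IndexID, s.Partition) }

// subzonesMatch compares two subzone sets irrespective of array order, so
// e.g. an ADD REGION that appends a subzone at the end still matches.
func subzonesMatch(expected, actual []subzone) bool {
    if len(expected) != len(actual) {
        return false
    }
    m := make(map[string]string, len(expected))
    for _, s := range expected {
        m[key(s)] = s.Config
    }
    for _, s := range actual {
        if cfg, ok := m[key(s)]; !ok || cfg != s.Config {
            return false // missing or mismatching subzone config
        }
    }
    return true
}

func main() {
    a := []subzone{{1, "p1", "cfg1"}, {2, "", "cfg2"}}
    b := []subzone{{2, "", "cfg2"}, {1, "p1", "cfg1"}} // same set, new order
    fmt.Println(subzonesMatch(a, b))                   // true
}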

Resolves #62790

Release note (bug fix): Fixed validation bugs during ALTER TABLE ... SET
LOCALITY / crdb_internal.validate_multi_region_zone_config where
validation errors could occur when the database of a REGIONAL BY ROW
table has a new region added. Also fixed a validation bug where
partition zone config mismatches were not caught.

62872: build: use -json for RandomSyntax test r=otan a=rafiss

I'm hoping this will help out with an issue where the test failures seem
to be missing helpful logs.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@craig craig bot closed this as completed in 1416a4e Apr 1, 2021