
roachtest: transfer-leases/drain-other-node failed #83261

Closed
cockroach-teamcity opened this issue Jun 23, 2022 · 67 comments
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-roachtest O-robot Originated from a bot. T-kv KV Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jun 23, 2022

roachtest.transfer-leases/drain-other-node failed with artifacts on master @ 7a1ced4a49d28f66f381e78733bcab510e2cbafe:

test artifacts and logs in: /artifacts/transfer-leases/drain-other-node/run_1
	quit.go:72,quit.go:323,soon.go:69,retry.go:207,soon.go:75,soon.go:48,quit.go:228,quit.go:95,quit.go:154,context.go:91,quit.go:153,quit.go:95,quit.go:54,quit.go:359,test_runner.go:896: (1) ranges with no lease outside of node 2: []string{"36"}

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-16958

Epic CRDB-19172

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jun 23, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Jun 23, 2022
@tbg
Member

tbg commented Jun 27, 2022

nodes/3/ranges/36.json shows that n2 had the lease and that n3 now has it, but via a non-cooperative lease acquisition (acquisition_type 2):

  "lease_history": [
    {
      "start": {
        "wall_time": 1655973822389206298
      },
      "replica": {
        "node_id": 2,
        "store_id": 2,
        "replica_id": 2
      },
      "proposed_ts": {
        "wall_time": 1655973822389068391
      },
      "epoch": 3,
      "sequence": 7,
      "acquisition_type": 1
    },
    {
      "start": {
        "wall_time": 1655973843754591536
      },
      "replica": {
        "node_id": 3,
        "store_id": 3,
        "replica_id": 3
      },
      "proposed_ts": {
        "wall_time": 1655973843754591536
      },
      "epoch": 3,
      "sequence": 8,
      "acquisition_type": 2
    }

This means that when n2 had drained and gone down, it still held the lease for r36.
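For reference, a minimal Go sketch of how to read the acquisition_type field in the JSON above, assuming the convention described here (1 for a cooperative transfer, 2 for a direct, non-cooperative request); the names below are illustrative, not CockroachDB's actual identifiers:

package main

import "fmt"

// leaseAcquisitionType mirrors the acquisition_type values seen above. The
// mapping (1 = cooperative transfer, 2 = direct request) is an assumption
// based on the comment above, not the real enum definition.
type leaseAcquisitionType int

const (
    acquisitionTransfer leaseAcquisitionType = 1 // cooperative hand-off from the old leaseholder
    acquisitionRequest  leaseAcquisitionType = 2 // non-cooperative acquisition by a replica
)

func (t leaseAcquisitionType) String() string {
    switch t {
    case acquisitionTransfer:
        return "Transfer"
    case acquisitionRequest:
        return "Request"
    default:
        return "Unspecified"
    }
}

func main() {
    fmt.Println("sequence 7:", leaseAcquisitionType(1)) // Transfer (n2)
    fmt.Println("sequence 8:", leaseAcquisitionType(2)) // Request (n3, non-cooperative)
}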

Here are the logs for r36

Just before n2 goes down (right at the end of the file, before its liveness epoch increments), we repeatedly log

kv/kvserver/store.go:1521 ⋮ [drain,n2,s2,r365/3:‹/Table/104/1/18{5-6}›] 3375 not moving out

This indicates that the needsLeaseTransfer condition at the top of this block is false:

// Learner replicas aren't allowed to become the leaseholder or raft
// leader, so only consider the `Voters` replicas.
needsLeaseTransfer := len(r.Desc().Replicas().VoterDescriptors()) > 1 &&
    drainingLeaseStatus.IsValid() &&
    drainingLeaseStatus.OwnedBy(s.StoreID())
// Note that this code doesn't deal with transferring the Raft
// leadership. Leadership tries to follow the lease, so when leases
// are transferred, leadership will be transferred too. For ranges
// without leases we probably should try to move the leadership
// manually to a non-draining replica.
if !needsLeaseTransfer {
    if verbose || log.V(1) {
        // This logging is useful to troubleshoot incomplete drains.
        log.Info(ctx, "not moving out")
    }
    atomic.AddInt32(&numTransfersAttempted, -1)
    return
}

An obvious answer would be that the lease wasn't valid. There's something going on with the liveness epochs here: why do we increment n2's epoch to 4 after it went down? Didn't it start out at epoch one? So I suspect n2 failed some heartbeats at some point (maybe I am missing a nuance of what the test does), and r36 simply didn't get touched by anything afterwards, so it stuck around with the invalid lease. We don't transfer it, and nobody else tries to acquire it for some time. So when n2 goes down, we see what looks like a lease that would have caused a period of unavailability, but I suspect the lease wasn't valid, and at any moment another node could have acquired it, if only it had tried.
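To make the epoch reasoning concrete, here is a minimal, self-contained sketch (hypothetical Go types, not the actual kvserver code) of why a bumped liveness epoch would make drainingLeaseStatus.IsValid() return false and cause the drain to skip the transfer:

package main

import (
    "fmt"
    "time"
)

// liveness and epochLease are toy stand-ins: an epoch-based lease is only
// usable while the holder's liveness record is unexpired and still at the
// epoch the lease was acquired under.
type liveness struct {
    epoch      int64
    expiration time.Time
}

type epochLease struct {
    holderNodeID int
    epoch        int64
}

// isValid is a hypothetical simplification of the epoch-lease validity rule.
func isValid(l epochLease, liv liveness, now time.Time) bool {
    return l.epoch == liv.epoch && now.Before(liv.expiration)
}

func main() {
    now := time.Now()
    lease := epochLease{holderNodeID: 2, epoch: 3}

    // Healthy case: n2's liveness epoch still matches the lease epoch.
    fmt.Println(isValid(lease, liveness{epoch: 3, expiration: now.Add(5 * time.Second)}, now)) // true

    // n2 missed heartbeats and its epoch was bumped: the old lease is no
    // longer valid, so the drain's needsLeaseTransfer check comes out false.
    fmt.Println(isValid(lease, liveness{epoch: 4, expiration: now.Add(5 * time.Second)}, now)) // false
}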

This seems like a limitation of the test, then, and not one that I easily know how to fix. Assigning to the incoming column so that @mwang1026 and @nvanbenschoten can take a look during triage.

@tbg tbg removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jun 27, 2022
@cockroach-teamcity
Member Author

roachtest.transfer-leases/drain failed with artifacts on master @ 7f8fef4c3f24ec2b6f9a3db1eb38756da68b3d5a:

test artifacts and logs in: /artifacts/transfer-leases/drain/run_1
	quit.go:72,quit.go:323,soon.go:69,retry.go:207,soon.go:75,soon.go:48,quit.go:228,quit.go:95,quit.go:154,context.go:91,quit.go:153,quit.go:95,quit.go:54,quit.go:359,test_runner.go:896: (1) ranges with no lease outside of node 3: []string{"15"}

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.transfer-leases/drain failed with artifacts on master @ e1e99da678e29db4ef21813423f012273c99add2:

test artifacts and logs in: /artifacts/transfer-leases/drain/run_1
	quit.go:72,quit.go:323,soon.go:69,retry.go:207,soon.go:75,soon.go:48,quit.go:228,quit.go:95,quit.go:154,context.go:91,quit.go:153,quit.go:95,quit.go:54,quit.go:359,test_runner.go:896: (1) ranges with no lease outside of node 2: []string{"49", "50"}

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.transfer-leases/drain failed with artifacts on master @ 78caa5c3a8ecd0d0a612798a106b626edb3db2ee:

test artifacts and logs in: /artifacts/transfer-leases/drain/run_1
	quit.go:72,quit.go:323,soon.go:69,retry.go:207,soon.go:75,soon.go:48,quit.go:228,quit.go:95,quit.go:154,context.go:91,quit.go:153,quit.go:95,quit.go:54,quit.go:359,test_runner.go:896: (1) ranges with no lease outside of node 2: []string{"18"}

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.transfer-leases/drain-other-node failed with artifacts on master @ 7a8cac9b11036dcf401453eb36e531fe381b70b4:

test artifacts and logs in: /artifacts/transfer-leases/drain-other-node/run_1
	quit.go:72,quit.go:323,soon.go:69,retry.go:207,soon.go:75,soon.go:48,quit.go:228,quit.go:95,quit.go:154,context.go:91,quit.go:153,quit.go:95,quit.go:54,quit.go:359,test_runner.go:896: (1) ranges with no lease outside of node 2: []string{"19"}

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.transfer-leases/drain failed with artifacts on master @ 33d70998719051ee058bc9e516afa238ea7b7451:

test artifacts and logs in: /artifacts/transfer-leases/drain/run_1
	quit.go:72,quit.go:323,soon.go:69,retry.go:207,soon.go:75,soon.go:48,quit.go:228,quit.go:95,quit.go:154,context.go:91,quit.go:153,quit.go:95,quit.go:54,quit.go:359,test_runner.go:896: (1) ranges with no lease outside of node 2: []string{"33"}

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@nvanbenschoten
Member

I'm bisecting this now. I have a sneaking suspicion it is related to #82758.

@nvanbenschoten
Member

That was it. Specifically, the bisect points at 034611b. I'll dig further to understand what we're seeing.

@nvanbenschoten nvanbenschoten self-assigned this Jul 7, 2022
@cockroach-teamcity
Member Author

roachtest.transfer-leases/drain failed with artifacts on master @ 03d5260a467b8640de41ee08054c1ce05f91966c:

test artifacts and logs in: /artifacts/transfer-leases/drain/run_1
	quit.go:72,quit.go:323,soon.go:69,retry.go:207,soon.go:75,soon.go:48,quit.go:228,quit.go:95,quit.go:154,context.go:91,quit.go:153,quit.go:95,quit.go:54,quit.go:359,test_runner.go:896: (1) ranges with no lease outside of node 2: []string{"23"}

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@nvanbenschoten
Member

I see what's going on here. The lease transfer fails because the draining node is not the Raft leader for a range for which it holds the lease:

I220707 18:49:15.007755 217 kv/kvserver/replica_proposal_buf.go:1248 ⋮ [n3,s3,r37/2:‹/Table/3{6-7}›,raft] 1637  not proposing lease transfer because the target (n2,s2):3 may need a snapshot: ‹local replica not raft leader›

034611b introduced a best-effort, graceful version of this check and an airtight, ungraceful version of the check. The former performs the check before revoking the outgoing leaseholder's lease while the latter performs the check after revoking the outgoing leaseholder's lease.

In this case, we're hitting the airtight, ungraceful check. So we're failing the lease transfer after revoking the old lease, moving the lease into a PROSCRIBED state.
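Schematically, the ordering of the two checks looks roughly like the sketch below (hypothetical Go, not the real replica_proposal_buf code); the simulated loss of Raft leadership between the two checks is just one way the second check can fail after the first one passed:

package main

import (
    "errors"
    "fmt"
)

// rangeState is a toy stand-in for the replica state the real checks consult.
type rangeState struct {
    isRaftLeader bool
    leaseRevoked bool // a revoked-but-unreplaced lease is what ends up PROSCRIBED
}

// targetMayNeedSnapshot stands in for the safety check: the real code inspects
// Raft progress, which only the Raft leader can do reliably, hence the
// "local replica not raft leader" error in the log line above.
func targetMayNeedSnapshot(r *rangeState) bool {
    return !r.isRaftLeader
}

func transferLease(r *rangeState, loseLeadershipMidway bool) error {
    // Best-effort, graceful check: bail out before touching the lease.
    if targetMayNeedSnapshot(r) {
        return errors.New("skipping transfer: target may need a snapshot")
    }
    // Revoke the outgoing lease so it cannot keep serving requests.
    r.leaseRevoked = true
    // State can change between the two checks; losing Raft leadership here
    // is one way to end up in the failure mode described above.
    if loseLeadershipMidway {
        r.isRaftLeader = false
    }
    // Airtight, ungraceful check after revocation: failing here leaves the
    // old lease revoked (proscribed) with no replacement proposed.
    if targetMayNeedSnapshot(r) {
        return errors.New("not proposing lease transfer: local replica not raft leader")
    }
    return nil
}

func main() {
    r := &rangeState{isRaftLeader: true}
    err := transferLease(r, true)
    fmt.Println("err:", err, "lease revoked:", r.leaseRevoked) // transfer failed, lease left revoked
}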

We don't immediately retry and we then hit the code @tbg linked to above:

// Learner replicas aren't allowed to become the leaseholder or raft
// leader, so only consider the `Voters` replicas.
needsLeaseTransfer := len(r.Desc().Replicas().VoterDescriptors()) > 1 &&
    drainingLeaseStatus.IsValid() &&
    drainingLeaseStatus.OwnedBy(s.StoreID())
// Note that this code doesn't deal with transferring the Raft
// leadership. Leadership tries to follow the lease, so when leases
// are transferred, leadership will be transferred too. For ranges
// without leases we probably should try to move the leadership
// manually to a non-draining replica.
if !needsLeaseTransfer {
    if verbose || log.V(1) {
        // This logging is useful to troubleshoot incomplete drains.
        log.Info(ctx, "not moving out")
    }
    atomic.AddInt32(&numTransfersAttempted, -1)
    return
}

We see that the lease is still PROSCRIBED, meaning `IsValid()` returns false, so we never complete the lease transfer before finishing the drain.
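Put differently, a minimal sketch (hypothetical types, not the actual code) of why the drain loop gives up at this point:

package main

import "fmt"

// leaseState is a toy version of the lease status states referenced above.
type leaseState int

const (
    leaseValid      leaseState = iota
    leaseExpired
    leaseProscribed // revoked by its own holder; other replicas still treat it as live
)

type leaseStatus struct {
    state      leaseState
    ownerStore int
}

func (s leaseStatus) IsValid() bool          { return s.state == leaseValid }
func (s leaseStatus) OwnedBy(store int) bool { return s.ownerStore == store }

// needsLeaseTransfer mirrors the shape of the condition in the drain code
// quoted above: more than one voter, a valid lease, and local ownership.
func needsLeaseTransfer(voters int, st leaseStatus, storeID int) bool {
    return voters > 1 && st.IsValid() && st.OwnedBy(storeID)
}

func main() {
    // After the failed transfer, the draining store's lease is proscribed,
    // so the condition is false and the drain logs "not moving out".
    st := leaseStatus{state: leaseProscribed, ownerStore: 2}
    fmt.Println(needsLeaseTransfer(3, st, 2)) // false
}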

@cockroach-teamcity

This comment was marked as duplicate.

@nvanbenschoten
Member

Live footage of the discovery above: https://twitter.com/dmofengineering/status/1545928022791843844.

@github-actions

We have marked this test failure issue as stale because it has been
inactive for 1 month. If this failure is still relevant, removing the
stale label or adding a comment will keep it active. Otherwise,
we'll close it in 5 days to keep the test failure queue tidy.

@irfansharif
Contributor

irfansharif commented Sep 20, 2022

Is there a gap in the testing that we added in #85629?

Looks like I got away with it by not thinking too hard about a testing knob + using manual clocks.

LeaseRenewalDurationOverride: 10 * time.Millisecond, // speed up the test

tc.IncrClockForLeaseUpgrade(t, manualClock)
tc.WaitForLeaseUpgrade(ctx, t, desc)

irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 20, 2022
In cockroachdb#85629 we changed our lease transfer protocol to only ever
transfer expiration-based leases, and have recipients later upgrade
them to the more efficient epoch-based ones. This was done to limit
the effects of ill-advised lease transfers, since the incoming
leaseholder would need to recognize itself as such within a few seconds
-- so we wanted this upgrade to happen after having received
the lease.

In cockroachdb#83261, however, we noticed that the upgrade was not immediate -- we
were waiting until the current lease's expiration was within its renewal
duration -- 4.5s. When the lease was eventually renewed the upgrade did
happen, but only then. We fix this here and remove the manual
clock advancement in the supporting test that masked this issue. The test now
demonstrates that we no longer rely on upgrades happening as part of
the (slow) renewal process.

Release note: None
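For illustration, a sketch of the timing problem this commit message describes (hypothetical helper, not the real code): under the old behavior the expiration-to-epoch upgrade piggybacked on lease renewal, which only fires once the lease is within its renewal window, rather than happening immediately after the transfer:

package main

import (
    "fmt"
    "time"
)

// renewalDuration approximates the 4.5s renewal window mentioned above.
const renewalDuration = 4500 * time.Millisecond

// shouldUpgradeViaRenewal models the old behavior: the upgrade only triggers
// once the current expiration-based lease enters its renewal window.
func shouldUpgradeViaRenewal(expiration, now time.Time) bool {
    return expiration.Sub(now) <= renewalDuration
}

func main() {
    now := time.Now()
    exp := now.Add(6 * time.Second) // a freshly transferred expiration-based lease

    fmt.Println(shouldUpgradeViaRenewal(exp, now))                    // false: upgrade deferred
    fmt.Println(shouldUpgradeViaRenewal(exp, now.Add(2*time.Second))) // true: only once inside the window
}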
@cockroach-teamcity
Member Author

roachtest.transfer-leases/drain-other-node failed with artifacts on master @ 5862e9b3c1e24b3f644716789c96108a8e92fd71:

test artifacts and logs in: /artifacts/transfer-leases/drain-other-node/run_1
	quit.go:72,quit.go:323,soon.go:69,retry.go:207,soon.go:75,soon.go:48,quit.go:228,quit.go:95,quit.go:154,context.go:91,quit.go:153,quit.go:95,quit.go:54,quit.go:359,test_runner.go:917: (1) ranges with no lease outside of node 1: []string{"37", "33"}

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 20, 2022
Related to cockroachdb#83261.

This commit removes "protection" that avoided lease acquisitions on
draining nodes. This protection had already been effectively disabled by
acc1ad1, which allowed Raft leaders to bypass the check. As the comment
here (added in 5ffaa9e) explained, Raft followers are already unable to
acquire the lease. If leaders bypass the check and followers (and
candidates) don't need it, the check is useless, so we remove it.

The commit also removes `TestReplicaDrainLease`, which was supposed to
test this behavior. We remove the test not because it started failing
after the change, but because it did not. It must not have been testing
anything real anymore after acc1ad1.

Release justification: low risk change related to release blocker.

Release note: None
@irfansharif irfansharif removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. blocks-22.2.0-beta.1 labels Sep 20, 2022
craig bot pushed a commit that referenced this issue Sep 20, 2022
85859: pgwire: Add support for cursors with special characters r=knz,rafiss a=rimadeodhar

In CockroachDB and Postgres, it is possible to declare
cursors with special characters enclosed within double
quotes, e.g. "1-2-3". Currently, we store the name
as an unescaped string, which causes problems during the
pgwire DESCRIBE step for looking up the cursor. We should
be using the tree.Name datatype for the cursor name
when storing and looking up cursors. This PR updates the code
to start using tree.Name instead of raw strings for handling
cursor names. This fixes the issue where the pgwire DESCRIBE
step fails while attempting to look up cursors with names
containing special characters.

Resolves #84261

Release note (bug fix): The pgwire DESCRIBE step no longer
fails with an error while attempting to look up cursors
declared with names containing special characters.

86492: bulk: perform meta lookup on range cache miss during index backfill r=dt a=nvanbenschoten

Fixes #84290.

This commit addresses the sustained slowdown described in #84290 by replacing
the call in `SSTBatcher.flushIfNeeded` to `RangeCache.GetCached` with a call to
`RangeCache.Lookup`. The former method checks the cache but returns no range
descriptor if the cache currently has no overlapping key. This is possible if
the descriptor was recently evicted because it was stale. The latter method
performs a meta lookup if the cache currently has no overlapping key, so it
should always return _some_ range descriptor.

There's a question of whether we should be logging a warning but proceeding if
this meta lookup fails. For now, I've decided not to change that behavior.

Release justification: None. Don't merge yet.

87885: kv: remove broken attempt to reject lease acquisitions on draining nodes r=nvanbenschoten a=nvanbenschoten

Related to #83261.

This commit removes "protection" that avoided lease acquisitions on draining nodes. This protection had already been effectively disabled by acc1ad1, which allowed Raft leaders to bypass the check. As the comment here (added in 5ffaa9e) explained, Raft followers are already unable to acquire the lease. If leaders bypass the check and followers (and candidates) don't need it, the check is useless, so we remove it.

The commit also removes `TestReplicaDrainLease`, which was supposed to test this behavior. We remove the test not because it started failing after the change, but because it did not. It must not have been testing anything real anymore after acc1ad1.

Release justification: low risk change related to release blocker.

Release note: None

88205: kvserver: (partially) deflake transfer-leases/drain-other-node r=irfansharif a=irfansharif

In #85629 we changed our lease transfer protocol to only ever transfer expiration-based leases, and have recipients later upgrade them to the more efficient epoch-based ones. This was done to limit the effects of ill-advised lease transfers, since the incoming leaseholder would need to recognize itself as such within a few seconds -- so we wanted this upgrade to happen after having received the lease.

In #83261, however, we noticed that the upgrade was not immediate -- we were waiting until the current lease's expiration was within its renewal duration -- 4.5s. When the lease was eventually renewed the upgrade did happen, but only then. We fix this here and remove the manual clock advancement in the supporting test that masked this issue. The test now demonstrates that we no longer rely on upgrades happening as part of the (slow) renewal process.

Release note: None

Co-authored-by: rimadeodhar <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Sep 21, 2022
blathers-crl bot pushed a commit that referenced this issue Sep 21, 2022
@irfansharif
Contributor

Two fixes were merged above. Nathan mentioned that this test still has a baseline 10% failure rate, but let's have that filed separately.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 17, 2022
Fixes cockroachdb#83372.
Fixes cockroachdb#90022.
Fixes cockroachdb#89963.
Fixes cockroachdb#89962.

This commit instructs stores to reacquire proscribed leases when draining in
order to subsequently transfer them away. This addresses a source of flakiness
in `transfer-lease` roachtests where some lease would not be transferred away
before the drain completed. This could result in the range being unavailable for
up to 9 seconds while other replicas waited out the lease's expiration. This is because
only the previous leaseholder knows that a proscribed lease is invalid. All
other replicas still consider the lease to be valid.

This failure mode was always present if a lease transfer failed during a drain.
However, it became more likely with 034611b. With that change, we began
rejecting lease transfers that were deemed to be "unsafe" more frequently.
034611b introduced a best-effort, graceful version of this check and an
airtight, ungraceful version of the check. The former performs the check before
revoking the outgoing leaseholder's lease while the latter performs the check
after revoking the outgoing leaseholder's lease. In rare cases, it was possible
to hit the airtight, ungraceful check and cause the lease to be proscribed. See
cockroachdb#83261 (comment)
for more details on how this led to test flakiness in the `transfer-lease`
roachtest suite.

Release notes: None.

Release justification: Avoids GA-blocking roachtest failures.
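For illustration, a sketch of the shape of this fix (hypothetical Go, not the actual kvserver API): during drain, a lease the local replica has proscribed is first reacquired so that it is valid again, and only then transferred away, so the other replicas never have to wait out the old lease's expiration:

package main

import "fmt"

// replica is a toy model; only the previous leaseholder knows its lease is
// proscribed, so it is the one that must reacquire before handing it off.
type replica struct {
    leaseProscribed bool
    leaseHolder     int
}

func (r *replica) reacquireLease(self int) {
    r.leaseProscribed = false
    r.leaseHolder = self
}

func (r *replica) transferLease(target int) { r.leaseHolder = target }

// drainReplica sketches the fixed drain path: reacquire a proscribed lease,
// then transfer it to a non-draining replica.
func drainReplica(r *replica, self, target int) {
    if r.leaseProscribed {
        r.reacquireLease(self)
    }
    r.transferLease(target)
}

func main() {
    r := &replica{leaseProscribed: true, leaseHolder: 2}
    drainReplica(r, 2, 3)
    fmt.Println("lease now on node", r.leaseHolder) // lease now on node 3
}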
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 18, 2022
craig bot pushed a commit that referenced this issue Oct 18, 2022
90007: execbuilder: enforce_home_region should only apply to DML r=rytaft a=msirek

Fixes #89875
Fixes #88789

This fixes a problem where the enforce_home_region session flag might
cause non-DML statements to error out, such as SHOW CREATE, if those
statements utilize scans or joins of multiregion tables.

This also fixes issues with proper erroring out of mutation DML like
UPDATE AND DELETE. For example, the following previously did not error:
```
CREATE TABLE messages_rbr (
    account_id INT NOT NULL,
    message_id   UUID DEFAULT gen_random_uuid(),
    message    STRING NOT NULL,
    PRIMARY KEY (account_id),
    INDEX msg_idx(message)
)
LOCALITY REGIONAL BY ROW;

SET enforce_home_region = true;

DELETE FROM messages_rbr WHERE message = 'Hello World!'
ERROR: Query has no home region. Try adding a filter on messages_rbr.crdb_region and/or on key column (messages_rbr.account_id).
SQLSTATE: XCHR2
```

Release note (bug fix): This patch fixes an issue with the
enforce_home_region session setting which may cause SHOW CREATE TABLE or
other non-DML statements to error out if the optimizer plan for the
statement involves accessing a multiregion table.


90106: kv: reacquire proscribed leases on drain, then transfer r=shralex a=nvanbenschoten

Fixes #83372.
Fixes #90022.
Fixes #89963.
Fixes #89962.

This commit instructs stores to reacquire proscribed leases when draining in order to subsequently transfer them away. This addresses a source of flakiness in `transfer-lease` roachtests where some lease would not be transferred away before the drain completed. This could result in the range being unavailable for up to 9 seconds while other replicas waited out the lease's expiration. This is because only the previous leaseholder knows that a proscribed lease is invalid. All other replicas still consider the lease to be valid.

This failure mode was always present if a lease transfer failed during a drain. However, it became more likely with 034611b. With that change, we began rejecting lease transfers that were deemed to be "unsafe" more frequently. 034611b introduced a best-effort, graceful version of this check and an airtight, ungraceful version of the check. The former performs the check before revoking the outgoing leaseholder's lease while the latter performs the check after revoking the outgoing leaseholder's lease. In rare cases, it was possible to hit the airtight, ungraceful check and cause the lease to be proscribed. See #83261 (comment) for more details on how this led to test flakiness in the `transfer-lease` roachtest suite.

Release notes: None.

Release justification: Avoids GA-blocking roachtest failures.

90107: execbuilder: fix enforce_home_region erroring of input table to LOJ r=rytaft a=msirek

Fixes #88788

This fixes erroring out of locality-optimized join when the input table's home region does not match the gateway region and session flag `enforce_home_region` is true.

Release note (bug fix): This patch fixes detection and erroring out of queries using locality-optimized join when session setting enforce_home_region is true and the input table to the join has no home region or its home region does not match the gateway region.

90165: sql,server: increase severity of upgraded-related logging r=ajwerner a=knz

Informs #90148.

This increases the severity from INFO in the following cases:

- in the case when `SET CLUSTER SETTING version` is issued from a SQL client (WARNING in case of failure).
- in the case when the server spontaneously decides to upgrade in the background (ERROR in case of failure).

Release note: None

90166: sql/rowenc: remove leftover log in test r=mgartner a=mgartner

Epic: None
Release note: None

Co-authored-by: Mark Sirek <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Oct 18, 2022
blathers-crl bot pushed a commit that referenced this issue Oct 18, 2022
@lunevalex lunevalex added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Dec 9, 2022