-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: transfer-leases/drain-other-node failed #83261
Comments
nodes/3/ranges/36.json shows that n2 had the lease and now n3 has it, but with a non-cooperative lease acquisition (2):
This means that when n2 had drained & went down, it still held the lease for r36. Just before n2 goes down (right at the end of the file, before its liveness epoch increments) we log repeatedly
This indicates that the condition at the top is false: cockroach/pkg/kv/kvserver/store.go Lines 1506 to 1525 in b86fd4f
An obvious answer would be that the lease wasn't valid. There's something going on with the liveness epochs here, why do we increment n2's to 4 after it went down? Didn't it start out at epoch one? So I suspect somehow n2 failed some heartbeats (maybe I am missing a nuance of what the test does), and r36 simply didn't get touched by anything after, so it stuck around with the invalid lease. We're not transferring it, and nobody else is trying to get it for some time, so when n2 goes down, we see what looks like a lease that would've caused a period of unavailability, but I suspect the lease wasn't valid and so at any moment in time another node could have gotten the lease, if only it had tried. This seems like a limitation of the test, then, and not one that I easily know how to fix. Assigning to the incoming column so that @mwang1026 and @nvanbenschoten can take a look during triage. |
roachtest.transfer-leases/drain failed with artifacts on master @ 7f8fef4c3f24ec2b6f9a3db1eb38756da68b3d5a:
Parameters: |
roachtest.transfer-leases/drain failed with artifacts on master @ e1e99da678e29db4ef21813423f012273c99add2:
Parameters: |
roachtest.transfer-leases/drain failed with artifacts on master @ 78caa5c3a8ecd0d0a612798a106b626edb3db2ee:
Parameters: |
roachtest.transfer-leases/drain-other-node failed with artifacts on master @ 7a8cac9b11036dcf401453eb36e531fe381b70b4:
Parameters: |
roachtest.transfer-leases/drain failed with artifacts on master @ 33d70998719051ee058bc9e516afa238ea7b7451:
Parameters: |
I'm bisecting this now. I have a sneaking suspicion it is related to #82758. |
That was it. Specifically, the bisect points at 034611b. I'll try to understand what we're seeing further. |
roachtest.transfer-leases/drain failed with artifacts on master @ 03d5260a467b8640de41ee08054c1ce05f91966c:
Parameters: |
I see what's going on here. The lease transfer fails because the draining node is not the Raft leader for a range which it holds the lease:
034611b introduced a best-effort, graceful version of this check and an airtight, ungraceful version of the check. The former performs the check before revoking the outgoing leaseholder's lease while the latter performs the check after revoking the outgoing leaseholder's lease. In this case, we're hitting the airtight, ungraceful check. So we're failing the lease transfer after revoking the old lease, moving the lease into a PROSCRIBED state. We don't immediately retry and we then hit the code @tbg linked to above: cockroach/pkg/kv/kvserver/store.go Lines 1506 to 1525 in b86fd4f
We see that the lease is still PROSCRIBED, meaning `IsValid is false, so we don't ever complete the lease transfer before finishing the drain. |
This comment was marked as duplicate.
This comment was marked as duplicate.
Live footage of the discovery above: https://twitter.com/dmofengineering/status/1545928022791843844. |
We have marked this test failure issue as stale because it has been |
Looks like I got away with it by not thinking too hard about a testing knob + using manual clocks. cockroach/pkg/kv/kvserver/client_lease_test.go Line 1321 in 2675c7c
cockroach/pkg/kv/kvserver/client_lease_test.go Lines 1349 to 1350 in 2675c7c
|
In cockroachdb#85629 we changed our lease transfer protocol to only ever transfer expiration-based leases, and have recipients later upgrade them to the more efficient epoch based ones. This was done to limit the effects of ill-advised lease transfers since the incoming leaseholder would need to recognize itself as such within a few seconds -- so we wanted this upgrade happen after having received the lease. In cockroachdb#83261 however we noticed that the upgrade was not immediate -- we were waiting until the current lease's expiration was within its renewal duration -- 4.5s. When the lease was eventually renewed the upgrade did happen, but it was not immediate. We fix this here and remove the manual clock advancing the supporting test had that masked this issue. It now demonstrates that we're no longer relying on upgrades happen as part of the (slow) renewal process. Release note: None
roachtest.transfer-leases/drain-other-node failed with artifacts on master @ 5862e9b3c1e24b3f644716789c96108a8e92fd71:
Parameters: Same failure on other branches
|
Related to cockroachdb#83261. This commit removes "protection" that avoided lease acquisitions on draining nodes. This protection had already been effectively disabled by acc1ad1, which allowed Raft leaders to bypass the check. As the comment here (added in 5ffaa9e) explained, Raft followers are already unable to acquire the lease. If leaders bypass the check and follower (and candidates) don't need it, the check is useless, so we remove it. The commit also removes `TestReplicaDrainLease`, which was supposed to test this behavior. We remove the test not because it started failing after the change, but because it did not. It must not have been testing anything real anymore after acc1ad1. Release justification: low risk change related to release blocker. Release note: None
Related to cockroachdb#83261. This commit removes "protection" that avoided lease acquisitions on draining nodes. This protection had already been effectively disabled by acc1ad1, which allowed Raft leaders to bypass the check. As the comment here (added in 5ffaa9e) explained, Raft followers are already unable to acquire the lease. If leaders bypass the check and follower (and candidates) don't need it, the check is useless, so we remove it. The commit also removes `TestReplicaDrainLease`, which was supposed to test this behavior. We remove the test not because it started failing after the change, but because it did not. It must not have been testing anything real anymore after acc1ad1. Release justification: low risk change related to release blocker. Release note: None
In cockroachdb#85629 we changed our lease transfer protocol to only ever transfer expiration-based leases, and have recipients later upgrade them to the more efficient epoch based ones. This was done to limit the effects of ill-advised lease transfers since the incoming leaseholder would need to recognize itself as such within a few seconds -- so we wanted this upgrade happen after having received the lease. In cockroachdb#83261 however we noticed that the upgrade was not immediate -- we were waiting until the current lease's expiration was within its renewal duration -- 4.5s. When the lease was eventually renewed the upgrade did happen, but it was not immediate. We fix this here and remove the manual clock advancing the supporting test had that masked this issue. It now demonstrates that we're no longer relying on upgrades happen as part of the (slow) renewal process. Release note: None
85859: pgwire: Add support for cursors with special characters r=knz,rafiss a=rimadeodhar IIn CockroachDB and Postgres, it is possible to declare cursors with special characters enclosed within double quotes, for e.g. "1-2-3". Currently, we store the name as an unescaped string which causes problems during the pgwire DESCRIBE step for looking up the cursor. We should be storing using the tree.Name datatype for the cursor name while storing and looking up cursors. This PR updates the code to start using tree.Name instead of raw strings for handling cursor names. This fixes the issue where the pgwire DESCRIBE step fails while attempting to look up cursors with names containing special characters. Resolves #84261 Release note (bug fix): The pgwire DESCRIBE step no longer fails with an error while attempting to look up cursors declared with names containing special characters. 86492: bulk: perform meta lookup on range cache miss during index backfill r=dt a=nvanbenschoten Fixes #84290. This commit addresses the sustained slowdown described in #84290 by replacing the call in `SSTBatcher.flushIfNeeded` to `RangeCache.GetCached` with a call to `RangeCache.Lookup`. The former method checks the cache but returns no range descriptor if the cache currently has no overlapping key. This is possible if the descriptor was recently evicted because it was stale. The later method performs a meta lookup if the cache currently has no overlapping key, so it should always return _some_ range descriptor. There's a question of whether we should be logging a warning but proceeding if this meta lookup fails. For now, I've decided not to change that behavior. Release justification: None. Don't merge yet. 87885: kv: remove broken attempt to reject lease acquisitions on draining nodes r=nvanbenschoten a=nvanbenschoten Related to #83261. This commit removes "protection" that avoided lease acquisitions on draining nodes. This protection had already been effectively disabled by acc1ad1, which allowed Raft leaders to bypass the check. As the comment here (added in 5ffaa9e) explained, Raft followers are already unable to acquire the lease. If leaders bypass the check and follower (and candidates) don't need it, the check is useless, so we remove it. The commit also removes `TestReplicaDrainLease`, which was supposed to test this behavior. We remove the test not because it started failing after the change, but because it did not. It must not have been testing anything real anymore after acc1ad1. Release justification: low risk change related to release blocker. Release note: None 88205: kvserver: (partially) deflake transfer-leases/drain-other-node r=irfansharif a=irfansharif In #85629 we changed our lease transfer protocol to only ever transfer expiration-based leases, and have recipients later upgrade them to the more efficient epoch based ones. This was done to limit the effects of ill-advised lease transfers since the incoming leaseholder would need to recognize itself as such within a few seconds -- so we wanted this upgrade happen after having received the lease. In #83261 however we noticed that the upgrade was not immediate -- we were waiting until the current lease's expiration was within its renewal duration -- 4.5s. When the lease was eventually renewed the upgrade did happen, but it was not immediate. We fix this here and remove the manual clock advancing the supporting test had that masked this issue. It now demonstrates that we're no longer relying on upgrades happen as part of the (slow) renewal process. Release note: None Co-authored-by: rimadeodhar <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: irfan sharif <[email protected]>
In #85629 we changed our lease transfer protocol to only ever transfer expiration-based leases, and have recipients later upgrade them to the more efficient epoch based ones. This was done to limit the effects of ill-advised lease transfers since the incoming leaseholder would need to recognize itself as such within a few seconds -- so we wanted this upgrade happen after having received the lease. In #83261 however we noticed that the upgrade was not immediate -- we were waiting until the current lease's expiration was within its renewal duration -- 4.5s. When the lease was eventually renewed the upgrade did happen, but it was not immediate. We fix this here and remove the manual clock advancing the supporting test had that masked this issue. It now demonstrates that we're no longer relying on upgrades happen as part of the (slow) renewal process. Release note: None
Related to #83261. This commit removes "protection" that avoided lease acquisitions on draining nodes. This protection had already been effectively disabled by acc1ad1, which allowed Raft leaders to bypass the check. As the comment here (added in 5ffaa9e) explained, Raft followers are already unable to acquire the lease. If leaders bypass the check and follower (and candidates) don't need it, the check is useless, so we remove it. The commit also removes `TestReplicaDrainLease`, which was supposed to test this behavior. We remove the test not because it started failing after the change, but because it did not. It must not have been testing anything real anymore after acc1ad1. Release justification: low risk change related to release blocker. Release note: None
Two fixes were merged above. Nathan mentioned that this test still has a baseline 10% failure rate, but let's have that get filed separately. |
Fixes cockroachdb#83372. Fixes cockroachdb#90022. Fixes cockroachdb#89963. Fixes cockroachdb#89962. This commit instructs stores to reacquire proscribed leases when draining in order to subsequently transfer them away. This addresses a source of flakiness in `transfer-lease` roachtests where some lease would not be transferred away before the drain completed. This could result in range unavailable for up to 9 seconds while other replicas waited out the lease'S expiration. This is because only the previous leaseholder knows that a proscribed lease is invalid. All other replicas still consider the lease to be valid. This failure mode was always present if a lease transfer failed during a drain. However, it became more likely with 034611b. With that change, we began rejecting lease transfers that were deemed to be "unsafe" more frequently. 034611b introduced a best-effort, graceful version of this check and an airtight, ungraceful version of the check. The former performs the check before revoking the outgoing leaseholder's lease while the latter performs the check after revoking the outgoing leaseholder's lease. In rare cases, it was possible to hit the airtight, ungraceful check and cause the lease to be proscribed. See cockroachdb#83261 (comment) for more details on how this led to test flakiness in the `transfer-lease` roachtest suite. Release notes: None. Release justification: Avoids GA-blocking roachtest failures.
Fixes cockroachdb#83372. Fixes cockroachdb#90022. Fixes cockroachdb#89963. Fixes cockroachdb#89962. This commit instructs stores to reacquire proscribed leases when draining in order to subsequently transfer them away. This addresses a source of flakiness in `transfer-lease` roachtests where some lease would not be transferred away before the drain completed. This could result in range unavailable for up to 9 seconds while other replicas waited out the lease'S expiration. This is because only the previous leaseholder knows that a proscribed lease is invalid. All other replicas still consider the lease to be valid. This failure mode was always present if a lease transfer failed during a drain. However, it became more likely with 034611b. With that change, we began rejecting lease transfers that were deemed to be "unsafe" more frequently. 034611b introduced a best-effort, graceful version of this check and an airtight, ungraceful version of the check. The former performs the check before revoking the outgoing leaseholder's lease while the latter performs the check after revoking the outgoing leaseholder's lease. In rare cases, it was possible to hit the airtight, ungraceful check and cause the lease to be proscribed. See cockroachdb#83261 (comment) for more details on how this led to test flakiness in the `transfer-lease` roachtest suite. Release notes: None. Release justification: Avoids GA-blocking roachtest failures.
90007: execbuilder: enforce_home_region should only apply to DML r=rytaft a=msirek Fixes #89875 Fixes #88789 This fixes a problem where the enforce_home_region session flag might cause non-DML statements to error out, such as SHOW CREATE, if those statements utilize scans or joins of multiregion tables. This also fixes issues with proper erroring out of mutation DML like UPDATE AND DELETE. For example, the following previously did not error: ``` CREATE TABLE messages_rbr ( account_id INT NOT NULL, message_id UUID DEFAULT gen_random_uuid(), message STRING NOT NULL, PRIMARY KEY (account_id), INDEX msg_idx(message) ) LOCALITY REGIONAL BY ROW; SET enforce_home_region = true; DELETE FROM messages_rbr WHERE message = 'Hello World!' ERROR: Query has no home region. Try adding a filter on messages_rbr.crdb_region and/or on key column (messages_rbr.account_id). SQLSTATE: XCHR2 ``` Release note (bug fix): This patch fixes an issue with the enforce_home_region session setting which may cause SHOW CREATE TABLE or other non-DML statements to error out if the optimizer plan for the statement involves accessing a multiregion table. 90106: kv: reacquire proscribed leases on drain, then transfer r=shralex a=nvanbenschoten Fixes #83372. Fixes #90022. Fixes #89963. Fixes #89962. This commit instructs stores to reacquire proscribed leases when draining in order to subsequently transfer them away. This addresses a source of flakiness in `transfer-lease` roachtests where some lease would not be transferred away before the drain completed. This could result in range unavailable for up to 9 seconds while other replicas waited out the lease'S expiration. This is because only the previous leaseholder knows that a proscribed lease is invalid. All other replicas still consider the lease to be valid. This failure mode was always present if a lease transfer failed during a drain. However, it became more likely with 034611b. With that change, we began rejecting lease transfers that were deemed to be "unsafe" more frequently. 034611b introduced a best-effort, graceful version of this check and an airtight, ungraceful version of the check. The former performs the check before revoking the outgoing leaseholder's lease while the latter performs the check after revoking the outgoing leaseholder's lease. In rare cases, it was possible to hit the airtight, ungraceful check and cause the lease to be proscribed. See #83261 (comment) for more details on how this led to test flakiness in the `transfer-lease` roachtest suite. Release notes: None. Release justification: Avoids GA-blocking roachtest failures. 90107: execbuilder: fix enforce_home_region erroring of input table to LOJ r=rytaft a=msirek Fixes #88788 This fixes erroring out of locality-optimized join when the input table's home region does not match the gateway region and session flag `enforce_home_region` is true. Release note (bug fix): This patch fixes detection and erroring out of queries using locality-optimized join when session setting enforce_home_region is true and the input table to the join has no home region or its home region does not match the gateway region. 90165: sql,server: increase severity of upgraded-related logging r=ajwerner a=knz Informs #90148. This increases the severity from INFO in the following cases: - in the case when `SET CLUSTER SETTING version` is issued from a SQL client (WARNING in case of failure). - in the case when the server spontaneously decides to upgrade in the background (ERROR in case of failure). Release note: None 90166: sql/rowenc: remove leftover log in test r=mgartner a=mgartner Epic: None Release note: None Co-authored-by: Mark Sirek <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Raphael 'kena' Poss <[email protected]> Co-authored-by: Marcus Gartner <[email protected]>
Fixes #83372. Fixes #90022. Fixes #89963. Fixes #89962. This commit instructs stores to reacquire proscribed leases when draining in order to subsequently transfer them away. This addresses a source of flakiness in `transfer-lease` roachtests where some lease would not be transferred away before the drain completed. This could result in range unavailable for up to 9 seconds while other replicas waited out the lease'S expiration. This is because only the previous leaseholder knows that a proscribed lease is invalid. All other replicas still consider the lease to be valid. This failure mode was always present if a lease transfer failed during a drain. However, it became more likely with 034611b. With that change, we began rejecting lease transfers that were deemed to be "unsafe" more frequently. 034611b introduced a best-effort, graceful version of this check and an airtight, ungraceful version of the check. The former performs the check before revoking the outgoing leaseholder's lease while the latter performs the check after revoking the outgoing leaseholder's lease. In rare cases, it was possible to hit the airtight, ungraceful check and cause the lease to be proscribed. See #83261 (comment) for more details on how this led to test flakiness in the `transfer-lease` roachtest suite. Release notes: None. Release justification: Avoids GA-blocking roachtest failures.
Fixes #83372. Fixes #90022. Fixes #89963. Fixes #89962. This commit instructs stores to reacquire proscribed leases when draining in order to subsequently transfer them away. This addresses a source of flakiness in `transfer-lease` roachtests where some lease would not be transferred away before the drain completed. This could result in range unavailable for up to 9 seconds while other replicas waited out the lease'S expiration. This is because only the previous leaseholder knows that a proscribed lease is invalid. All other replicas still consider the lease to be valid. This failure mode was always present if a lease transfer failed during a drain. However, it became more likely with 034611b. With that change, we began rejecting lease transfers that were deemed to be "unsafe" more frequently. 034611b introduced a best-effort, graceful version of this check and an airtight, ungraceful version of the check. The former performs the check before revoking the outgoing leaseholder's lease while the latter performs the check after revoking the outgoing leaseholder's lease. In rare cases, it was possible to hit the airtight, ungraceful check and cause the lease to be proscribed. See #83261 (comment) for more details on how this led to test flakiness in the `transfer-lease` roachtest suite. Release notes: None. Release justification: Avoids GA-blocking roachtest failures.
roachtest.transfer-leases/drain-other-node failed with artifacts on master @ 7a1ced4a49d28f66f381e78733bcab510e2cbafe:
Parameters:
ROACHTEST_cloud=gce
,ROACHTEST_cpu=4
,ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-16958
Epic CRDB-19172
The text was updated successfully, but these errors were encountered: