kvserver: eager lease preference enforcement occasionally fails when acquiring node fails liveness #108512

Open
kvoli opened this issue Aug 10, 2023 · 0 comments
Labels
A-kv-distribution Relating to rebalancing and leasing. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@kvoli
Collaborator

kvoli commented Aug 10, 2023

Describe the problem

#107507 added eager lease transfers for the case where the node acquiring a lease violates the lease preferences. That mechanism can fail, however, if the acquiring node then fails a liveness heartbeat, as we saw in #108425.

To Reproduce

See the lease-preferences/full-first-preference-down roachtest: #108425.

Expected behavior

Once (if) the node successfully heartbeats again, the leases should be transferred at that point. Otherwise, they should be acquired by another node.
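
A minimal sketch of the first half of that expectation (retry the transfer once a heartbeat succeeds), written as a toy model in Go. Every name here (`enforcer`, `onAcquire`, `onHeartbeat`, `errNotLive`) is hypothetical and does not correspond to any kvserver API; the "acquired by another node" path is not modeled.

```go
package main

import (
	"errors"
	"fmt"
)

// rangeID identifies a range in this toy model (hypothetical, not a kvserver type).
type rangeID int

// errNotLive stands in for a transfer that fails because the lease holder
// failed its liveness heartbeat (an assumption made for illustration).
var errNotLive = errors.New("holder is not live")

// enforcer remembers leases that violate preferences but could not be
// transferred yet, so the attempt can be repeated later.
type enforcer struct {
	live    bool      // outcome of the most recent heartbeat
	pending []rangeID // leases still waiting to be transferred
	moved   []rangeID // leases that were transferred away
}

// transfer models a lease transfer; it only succeeds while the holder is live.
func (e *enforcer) transfer(r rangeID) error {
	if !e.live {
		return errNotLive
	}
	e.moved = append(e.moved, r)
	return nil
}

// onAcquire is called when this node acquires a lease that violates the
// preferences. A failed transfer is queued instead of being dropped.
func (e *enforcer) onAcquire(r rangeID) {
	if err := e.transfer(r); err != nil {
		e.pending = append(e.pending, r)
	}
}

// onHeartbeat records the heartbeat outcome and, on success, retries every
// queued transfer.
func (e *enforcer) onHeartbeat(ok bool) {
	e.live = ok
	if !ok {
		return
	}
	retry := e.pending
	e.pending = nil
	for _, r := range retry {
		e.onAcquire(r)
	}
}

func main() {
	e := &enforcer{} // the node starts out having failed its heartbeat
	e.onAcquire(1)
	e.onAcquire(2)
	fmt.Println(len(e.pending), len(e.moved)) // prints "2 0"
	e.onHeartbeat(true)
	fmt.Println(len(e.pending), len(e.moved)) // prints "0 2"
}
```

The point of the sketch is only that a failed transfer is remembered and retried on the next successful heartbeat, rather than being attempted once at acquisition time and then forgotten.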

Environment:

  • CockroachDB version: 23.1

Jira issue: CRDB-30503

@kvoli kvoli added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-distribution Relating to rebalancing and leasing. labels Aug 10, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Aug 10, 2023
kvoli added a commit to kvoli/cockroach that referenced this issue Aug 10, 2023
The lease preferences roachtest could occasionally fail if the liveness
leaseholder was on a stopped node. We should address the underlying
issue; for now, pin the liveness lease to a live node to prevent flakes.

Informs: cockroachdb#108512
Resolves: cockroachdb#108425
Release note: None
craig bot pushed a commit that referenced this issue Aug 10, 2023
107302: storage: add method to ingest external files, rename IngestExternalFiles r=RaduBerinde a=itsbilal

Requires cockroachdb/pebble#2753

This change renames the existing IngestExternalFiles method on storage.Engine to IngestLocalFiles, and adds a new IngestExternalFiles that ingests pebble.ExternalFile, for use with online restore.

Depends on cockroachdb/pebble#2753.

Epic: none

Release note: None

108402: serverutils: remove ad-hoc code from StartNewTestCluster r=yuzefovich a=knz

This function is a convenience alias for NewTestCluster+Start. This should not contain custom logic specific to certain tests. Any custom logic should be conditional on testing knobs and put inside `(*testcluster.TestCluster).Start()` instead.

(The code removed here was mistakenly added in the wrong place in 70f85cd).

Release note: None
Needed for #107986.
Epic: CRDB-18499

108446: kv: skip TestConstraintConformanceReportIntegration under deadlock r=erikgrinaker a=nvanbenschoten

Fixes #108430.

This commit avoids flakiness in `TestConstraintConformanceReportIntegration` by skipping the test under deadlock builds. It has been observed to run slowly and flake under stress, and we see the same kinds of behavior under deadlock builds.

Release notes: None

108451: schemachanger: Refactor tests for concurrent schema changer behaviors r=Xiang-Gu a=Xiang-Gu

1. It cleans up some redundant tests of concurrent schema changer behavior and refactors them into a new, simpler, cleaner test.
2. It adds an integration-style test of concurrent schema change behavior, in which we run many schema changes for an extended period of time and assert that all of them eventually succeed and the descriptors end up in the expected state.

Fix #108140
Fix #107223

Epic: None
Release note: None

108492: kv: remove errSavepointInvalidAfterTxnRestart r=knz a=nvanbenschoten

This commit simplifies logic in `checkSavepointLocked`.

Epic: None
Release note: None

108497: sql: don't start default test tenant in MT admin function tests r=yuzefovich a=yuzefovich

These tests themselves start multiple tenants, so there is no need to create a default test tenant. Doing so is also confusing, because the default tenant and the first test tenant share the same TenantID, effectively making them two SQL pod configs for one tenant. Starting the default test tenant was enabled recently in c899661 when we enabled the CCL license, and we have seen at least one confusing failure that is possibly related to this.

Starting the default test tenant was originally added in cfa4375, but I don't see a good reason for it.

This PR is an opportunistic fix for #108081.

Fixes: #108081.

Release note: None

108502: kvstreamer: add more assertions to RequestsProvider.enqueue r=yuzefovich a=michae2

If we ever enqueue zero-length requests, it could cause a deadlock where the `workerCoordinator` is waiting for more requests and the enqueuer is waiting for results. Add assertions that we never do this.

Informs: #101823
Release note: None

108517: roachtest: pin liveness lease to live node in lease prefs test r=erikgrinaker a=kvoli

The lease preferences roachtest could occasionally fail if the liveness leaseholder was on a stopped node. We should address the underlying issue; for now, pin the liveness lease to a live node to prevent flakes.

Informs: #108512
Resolves: #108425
Release note: None

Co-authored-by: Bilal Akhtar
Co-authored-by: Raphael 'kena' Poss
Co-authored-by: Nathan VanBenschoten
Co-authored-by: Xiang Gu
Co-authored-by: Yahor Yuzefovich
Co-authored-by: Michael Erickson
Co-authored-by: Austen McClernon
blathers-crl bot pushed a commit that referenced this issue Aug 10, 2023
blathers-crl bot pushed a commit that referenced this issue Aug 21, 2023