Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: TestLeasePreferencesDuringOutage is flaky #88769

Closed
renatolabs opened this issue Sep 26, 2022 · 8 comments · Fixed by #108175
Closed

kvserver: TestLeasePreferencesDuringOutage is flaky #88769

renatolabs opened this issue Sep 26, 2022 · 8 comments · Fixed by #108175
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. skipped-test T-kv KV Team

Comments

@renatolabs
Copy link
Contributor

renatolabs commented Sep 26, 2022

Failures generally look like the following assertion error:

        	Error:      	Not equal: 
        	            	expected: 1
        	            	actual  : 5
        	Test:       	TestLeasePreferencesDuringOutage

Some failures in TC:

I was also able to see a failure in under 10 minutes using:

./dev test ./pkg/kv/kvserver/ -f TestLeasePreferencesDuringOutage --stress --timeout 10m --ignore-cache

Jira issue: CRDB-20009

@renatolabs renatolabs added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv Anything in KV that doesn't belong in a more specific category. T-kv KV Team labels Sep 26, 2022
@mgartner
Copy link
Collaborator

Seen here too: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests/6679457?showRootCauses=true&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

------- Stdout: -------
=== RUN   TestLeasePreferencesDuringOutage
    test_log_scope.go:161: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestLeasePreferencesDuringOutage2950465332
    test_log_scope.go:79: use -show-logs to present logs inline
    client_lease_test.go:929:
          Error Trace:  /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/client_lease_test.go:929
          Error:        Not equal:
                        expected: 1
                        actual  : 4
          Test:         TestLeasePreferencesDuringOutage
    panic.go:522: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestLeasePreferencesDuringOutage2950465332
--- FAIL: TestLeasePreferencesDuringOutage (10.76s)

mgartner added a commit to mgartner/cockroach that referenced this issue Sep 30, 2022
craig bot pushed a commit that referenced this issue Sep 30, 2022
89004: storage: add `TestMVCCHistories` cases r=erikgrinaker a=erikgrinaker

**storage: add `TestMVCCHistories` metamorphic param for peek bounds**

This patch adds a metamorphic test parameter that enables peek bounds
for MVCC range key-related commands. These peek bounds will always
result in identical, correct MVCC stats.

Unfortunately, we can't assert that the peek bounds are enforced,
because importing the `spanset` package causes a cyclic dependency.

Release note: None
  
**storage: use `del_range_ts` in `TestMVCCHistories`**

Many of the early `TestMVCCHistories` tests for MVCC range tombstones
used the direct engine method `put_rangekey` rather than the MVCC range
tombstone write `del_range_ts` (which does conflict checks, MVCC stats
adjustments, etc), because the latter did not exist yet.

This patch migrates most tests to `del_range_ts`, and also renames some
of the test files to refer to range tombstone rather than range key.
Stats assertions have also been enabled for some tests.

There are no significant changes to the tests themselves.

Release note: None
  
**storage: add `TestMVCCHistories` cases**

This patch adds additional test cases for `TestMVCCHistories`. These
were primarily designed by deliberately introducing bugs in MVCC code
that did not cause existing test cases to fail.

Resolves #86655.

Release note: None

89116: kvserver: skip TestLeasePreferencesDuringOutage r=mgartner a=mgartner

See #88769.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
@shralex
Copy link
Contributor

shralex commented Oct 3, 2022

@irfansharif can you please take a look / triage ?

@aliher1911
Copy link
Contributor

Fails on master for different reasons than 22.2. I'm fixing 22 cause in #103636 but master failure is happening for other reasons.
First is that it doesn't verify if process in enqueue actually succeeds, and it doesn't because it always fails with 'store throttled' error, so we always check against underreplicated range and fail.
If this operation is retried, then we end up in non-conformance situation after upreplication, but I'm not sure why. Someone from allocator might have a better idea.

@irfansharif
Copy link
Contributor

I'm going to let this issue lapse even though it's flakey on master. I expect it appear through TC-generated issues and fall into our L2 triage queues, but I'll ping @cockroachdb/kv-distribution if they want to get ahead of it anyway.

@kvoli
Copy link
Collaborator

kvoli commented Jul 19, 2023

This test is still skipped on master. Reopening and assigning to myself.

skip.WithIssue(t, 88769, "flaky test")

@kvoli kvoli reopened this Jul 19, 2023
@kvoli kvoli self-assigned this Jul 19, 2023
kvoli added a commit to kvoli/cockroach that referenced this issue Aug 11, 2023
Previously, `TestLeasePreferenceDuringOutage` would force replication
queue processing of the test range, then assert that the range
up-replicated and lease transferred to a preferred locality.

This test was skipped, and two of the assumptions it relied on to pass
were no longer true.

After cockroachdb#85219, the replicate queue no longer
re-processes replicas. Instead, the queue requeues replicas after
processing, at the appropriate priority. This broke the test due to the
replicate queue being disabled, making the re-queue a no-op.

After cockroachdb#94023, the replicate queue no longer looked for lease transfers,
after processing a replication action. Combined with cockroachdb#85219, the queue
would now be guaranteed to not process both up-replication and lease
transfers from a single enqueue.

Update the test to not require a manual process, instead using a queue
range filter, which allows tests which disable automatic replication, to
still process filtered ranges via the various replica queues. Also,
ensure that the non-stopped stores are considered live targets, after
simulating an outage (bumping manual clocks, stopping servers) -- so
that the expected up-replication, then lease transfer can proceed.

Fixes: cockroachdb#88769
Release note: None
@kvoli kvoli added the branch-master Failures and bugs on the master branch. label Aug 25, 2023
kvoli added a commit to kvoli/cockroach that referenced this issue Aug 29, 2023
Previously, `TestLeasePreferenceDuringOutage` would force replication
queue processing of the test range, then assert that the range
up-replicated and lease transferred to a preferred locality.

This test was skipped, and two of the assumptions it relied on to pass
were no longer true.

After cockroachdb#85219, the replicate queue no longer
re-processes replicas. Instead, the queue requeues replicas after
processing, at the appropriate priority. This broke the test due to the
replicate queue being disabled, making the re-queue a no-op.

After cockroachdb#94023, the replicate queue no longer looked for lease transfers,
after processing a replication action. Combined with cockroachdb#85219, the queue
would now be guaranteed to not process both up-replication and lease
transfers from a single enqueue.

Update the test to not require a manual process, instead using a queue
range filter, which allows tests which disable automatic replication, to
still process filtered ranges via the various replica queues. Also,
ensure that the non-stopped stores are considered live targets, after
simulating an outage (bumping manual clocks, stopping servers) -- so
that the expected up-replication, then lease transfer can proceed.

Fixes: cockroachdb#88769
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Sep 7, 2023
Previously, `TestLeasePreferenceDuringOutage` would force replication
queue processing of the test range, then assert that the range
up-replicated and lease transferred to a preferred locality.

This test was skipped, and two of the assumptions it relied on to pass
were no longer true.

After cockroachdb#85219, the replicate queue no longer
re-processes replicas. Instead, the queue requeues replicas after
processing, at the appropriate priority. This broke the test due to the
replicate queue being disabled, making the re-queue a no-op.

After cockroachdb#94023, the replicate queue no longer looked for lease transfers,
after processing a replication action. Combined with cockroachdb#85219, the queue
would now be guaranteed to not process both up-replication and lease
transfers from a single enqueue.

Update the test to not require a manual process, instead using a queue
range filter, which allows tests which disable automatic replication, to
still process filtered ranges via the various replica queues. Also,
ensure that the non-stopped stores are considered live targets, after
simulating an outage (bumping manual clocks, stopping servers) -- so
that the expected up-replication, then lease transfer can proceed.

Fixes: cockroachdb#88769
Release note: None
craig bot pushed a commit that referenced this issue Sep 11, 2023
108175: kvserver: unskip lease preferences during outage  r=andrewbaptist a=kvoli

Previously, `TestLeasePreferenceDuringOutage` would force replication
queue processing of the test range, then assert that the range
up-replicated and lease transferred to a preferred locality.

This test was skipped, and two of the assumptions it relied on to pass
were no longer true.

After #85219, the replicate queue no longer
re-processes replicas. Instead, the queue requeues replicas after
processing, at the appropriate priority. This broke the test due to the
replicate queue being disabled, making the re-queue a no-op.

After #94023, the replicate queue no longer looked for lease transfers,
after processing a replication action. Combined with #85219, the queue
would now be guaranteed to not process both up-replication and lease
transfers from a single enqueue.

Update the test to not require a manual process, instead using a queue
range filter, which allows tests which disable automatic replication, to
still process filtered ranges via the various replica queues. Also,
ensure that the non-stopped stores are considered live targets, after
simulating an outage (bumping manual clocks, stopping servers) -- so
that the expected up-replication, then lease transfer can proceed.

Fixes: #88769
Release note: None

109432: cluster-ui: handle partial response errors on the database details page r=THardy98 a=THardy98

Part of: #102386

**Demos** (Note: these demos show this same logic applied to both the databases and database table pages as well):
DB-Console
- https://www.loom.com/share/5108dd655ad342f28323e72eaf68219c
- https://www.loom.com/share/1973383dacd7494a84e10bf39e5b85a3

This change applies the same error handling ideas from #109245 to the
database details page, enabling non-admin users to use the database
details page and providing better transparency to data fetching issues.

Errors encountered while fetching table details can be viewed via the
tooltip provided by the `Caution` icon at the table's name.
`unavailable` cells also provide a tooltip that displays the error
impacting that exact cell.

Release note (ui change): Non-admin users are able to use the database
details page.

110292: c2c: use seperate spanConfigEventStreamSpec in the span config event stream r=stevendanna a=msbutler

Previously, the spanConfigEventStream used a streamPartitionSpec, which contained a bunch of fields unecessary for span config streaming. This patch creates a new spanConfigEventStreamSpec which contains the fields only necessary for span config event streaming.

Informs #109059

Release note: None

110309: teamcity-trigger: ensure that `race` tag is only passed once r=healthy-pod a=healthy-pod

By running under `-race`, the go command defines the `race` build tag for us [1].

Previously, we defined it under `TAGS` to let the issue poster know that this is a failure under `race` and indicate that in the issue.

At the time, defining the tag twice didn't cause issues but after #109773, it led to build failures [2].

To reproduce locally:
```
bazel test -s --config=race pkg/util/ctxgroup:all --test_env=GOTRACEBACK=all --define gotags=bazel,gss,race
```

As a follow-up, we should find another way to let the issue poster know that a failure was running under `race`.

[1] https://go.dev/doc/articles/race_detector#Excluding_Tests
[2] #109994 (comment)

Epic: none
Release note: None

Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Thomas Hardy <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: healthy-pod <[email protected]>
@craig craig bot closed this as completed in fe8b67b Sep 11, 2023
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. skipped-test T-kv KV Team
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

8 participants