-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: TestLeasePreferencesDuringOutage is flaky #88769
Comments
|
See cockroachdb#88769. Release note: None
89004: storage: add `TestMVCCHistories` cases r=erikgrinaker a=erikgrinaker **storage: add `TestMVCCHistories` metamorphic param for peek bounds** This patch adds a metamorphic test parameter that enables peek bounds for MVCC range key-related commands. These peek bounds will always result in identical, correct MVCC stats. Unfortunately, we can't assert that the peek bounds are enforced, because importing the `spanset` package causes a cyclic dependency. Release note: None **storage: use `del_range_ts` in `TestMVCCHistories`** Many of the early `TestMVCCHistories` tests for MVCC range tombstones used the direct engine method `put_rangekey` rather than the MVCC range tombstone write `del_range_ts` (which does conflict checks, MVCC stats adjustments, etc), because the latter did not exist yet. This patch migrates most tests to `del_range_ts`, and also renames some of the test files to refer to range tombstone rather than range key. Stats assertions have also been enabled for some tests. There are no significant changes to the tests themselves. Release note: None **storage: add `TestMVCCHistories` cases** This patch adds additional test cases for `TestMVCCHistories`. These were primarily designed by deliberately introducing bugs in MVCC code that did not cause existing test cases to fail. Resolves #86655. Release note: None 89116: kvserver: skip TestLeasePreferencesDuringOutage r=mgartner a=mgartner See #88769. Release note: None Co-authored-by: Erik Grinaker <[email protected]> Co-authored-by: Marcus Gartner <[email protected]>
@irfansharif can you please take a look / triage ? |
Fails on master for different reasons than 22.2. I'm fixing 22 cause in #103636 but master failure is happening for other reasons. |
I'm going to let this issue lapse even though it's flakey on master. I expect it appear through TC-generated issues and fall into our L2 triage queues, but I'll ping @cockroachdb/kv-distribution if they want to get ahead of it anyway. |
This test is still skipped on master. Reopening and assigning to myself.
|
Previously, `TestLeasePreferenceDuringOutage` would force replication queue processing of the test range, then assert that the range up-replicated and lease transferred to a preferred locality. This test was skipped, and two of the assumptions it relied on to pass were no longer true. After cockroachdb#85219, the replicate queue no longer re-processes replicas. Instead, the queue requeues replicas after processing, at the appropriate priority. This broke the test due to the replicate queue being disabled, making the re-queue a no-op. After cockroachdb#94023, the replicate queue no longer looked for lease transfers, after processing a replication action. Combined with cockroachdb#85219, the queue would now be guaranteed to not process both up-replication and lease transfers from a single enqueue. Update the test to not require a manual process, instead using a queue range filter, which allows tests which disable automatic replication, to still process filtered ranges via the various replica queues. Also, ensure that the non-stopped stores are considered live targets, after simulating an outage (bumping manual clocks, stopping servers) -- so that the expected up-replication, then lease transfer can proceed. Fixes: cockroachdb#88769 Release note: None
Previously, `TestLeasePreferenceDuringOutage` would force replication queue processing of the test range, then assert that the range up-replicated and lease transferred to a preferred locality. This test was skipped, and two of the assumptions it relied on to pass were no longer true. After cockroachdb#85219, the replicate queue no longer re-processes replicas. Instead, the queue requeues replicas after processing, at the appropriate priority. This broke the test due to the replicate queue being disabled, making the re-queue a no-op. After cockroachdb#94023, the replicate queue no longer looked for lease transfers, after processing a replication action. Combined with cockroachdb#85219, the queue would now be guaranteed to not process both up-replication and lease transfers from a single enqueue. Update the test to not require a manual process, instead using a queue range filter, which allows tests which disable automatic replication, to still process filtered ranges via the various replica queues. Also, ensure that the non-stopped stores are considered live targets, after simulating an outage (bumping manual clocks, stopping servers) -- so that the expected up-replication, then lease transfer can proceed. Fixes: cockroachdb#88769 Release note: None
Previously, `TestLeasePreferenceDuringOutage` would force replication queue processing of the test range, then assert that the range up-replicated and lease transferred to a preferred locality. This test was skipped, and two of the assumptions it relied on to pass were no longer true. After cockroachdb#85219, the replicate queue no longer re-processes replicas. Instead, the queue requeues replicas after processing, at the appropriate priority. This broke the test due to the replicate queue being disabled, making the re-queue a no-op. After cockroachdb#94023, the replicate queue no longer looked for lease transfers, after processing a replication action. Combined with cockroachdb#85219, the queue would now be guaranteed to not process both up-replication and lease transfers from a single enqueue. Update the test to not require a manual process, instead using a queue range filter, which allows tests which disable automatic replication, to still process filtered ranges via the various replica queues. Also, ensure that the non-stopped stores are considered live targets, after simulating an outage (bumping manual clocks, stopping servers) -- so that the expected up-replication, then lease transfer can proceed. Fixes: cockroachdb#88769 Release note: None
108175: kvserver: unskip lease preferences during outage r=andrewbaptist a=kvoli Previously, `TestLeasePreferenceDuringOutage` would force replication queue processing of the test range, then assert that the range up-replicated and lease transferred to a preferred locality. This test was skipped, and two of the assumptions it relied on to pass were no longer true. After #85219, the replicate queue no longer re-processes replicas. Instead, the queue requeues replicas after processing, at the appropriate priority. This broke the test due to the replicate queue being disabled, making the re-queue a no-op. After #94023, the replicate queue no longer looked for lease transfers, after processing a replication action. Combined with #85219, the queue would now be guaranteed to not process both up-replication and lease transfers from a single enqueue. Update the test to not require a manual process, instead using a queue range filter, which allows tests which disable automatic replication, to still process filtered ranges via the various replica queues. Also, ensure that the non-stopped stores are considered live targets, after simulating an outage (bumping manual clocks, stopping servers) -- so that the expected up-replication, then lease transfer can proceed. Fixes: #88769 Release note: None 109432: cluster-ui: handle partial response errors on the database details page r=THardy98 a=THardy98 Part of: #102386 **Demos** (Note: these demos show this same logic applied to both the databases and database table pages as well): DB-Console - https://www.loom.com/share/5108dd655ad342f28323e72eaf68219c - https://www.loom.com/share/1973383dacd7494a84e10bf39e5b85a3 This change applies the same error handling ideas from #109245 to the database details page, enabling non-admin users to use the database details page and providing better transparency to data fetching issues. Errors encountered while fetching table details can be viewed via the tooltip provided by the `Caution` icon at the table's name. `unavailable` cells also provide a tooltip that displays the error impacting that exact cell. Release note (ui change): Non-admin users are able to use the database details page. 110292: c2c: use seperate spanConfigEventStreamSpec in the span config event stream r=stevendanna a=msbutler Previously, the spanConfigEventStream used a streamPartitionSpec, which contained a bunch of fields unecessary for span config streaming. This patch creates a new spanConfigEventStreamSpec which contains the fields only necessary for span config event streaming. Informs #109059 Release note: None 110309: teamcity-trigger: ensure that `race` tag is only passed once r=healthy-pod a=healthy-pod By running under `-race`, the go command defines the `race` build tag for us [1]. Previously, we defined it under `TAGS` to let the issue poster know that this is a failure under `race` and indicate that in the issue. At the time, defining the tag twice didn't cause issues but after #109773, it led to build failures [2]. To reproduce locally: ``` bazel test -s --config=race pkg/util/ctxgroup:all --test_env=GOTRACEBACK=all --define gotags=bazel,gss,race ``` As a follow-up, we should find another way to let the issue poster know that a failure was running under `race`. [1] https://go.dev/doc/articles/race_detector#Excluding_Tests [2] #109994 (comment) Epic: none Release note: None Co-authored-by: Austen McClernon <[email protected]> Co-authored-by: Thomas Hardy <[email protected]> Co-authored-by: Michael Butler <[email protected]> Co-authored-by: healthy-pod <[email protected]>
Failures generally look like the following assertion error:
Some failures in TC:
I was also able to see a failure in under 10 minutes using:
Jira issue: CRDB-20009
The text was updated successfully, but these errors were encountered: