kvserver: flake with TestStoreCapacityAfterSplit #92677

Closed
yuzefovich opened this issue Nov 29, 2022 · 4 comments · Fixed by #98515
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). GA-blocker skipped-test T-kv KV Team

Comments

@yuzefovich
Member

yuzefovich commented Nov 29, 2022

Probably on unrelated PR:

Failed
=== RUN   TestStoreCapacityAfterSplit
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/74bb89d499fdd22d37c620d66231ea85/logTestStoreCapacityAfterSplit3212094442
    test_log_scope.go:79: use -show-logs to present logs inline
    client_split_test.go:2858: expected cap.WritesPerSecond >= 0.200000, got 0.000000
    client_split_test.go:2870: expected WritesPerReplica to have increased from p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00 pMax=0.00, but got p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00 pMax=0.00
    client_split_test.go:2908: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/74bb89d499fdd22d37c620d66231ea85/logTestStoreCapacityAfterSplit3212094442
--- FAIL: TestStoreCapacityAfterSplit (1.57s)

Jira issue: CRDB-21917

@yuzefovich yuzefovich added the C-test-failure Broken test (automatically or manually discovered). label Nov 29, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Nov 29, 2022
@github-actions

github-actions bot commented Jan 2, 2023

We have marked this test failure issue as stale because it has been
inactive for 1 month. If this failure is still relevant, removing the
stale label or adding a comment will keep it active. Otherwise,
we'll close it in 5 days to keep the test failure queue tidy.

@DrewKimball
Collaborator

Saw this in #94671

@msbutler
Collaborator

msbutler commented Jan 9, 2023

and here too. Opening a PR to skip.

msbutler added a commit to msbutler/cockroach that referenced this issue Jan 9, 2023
TestStoreCapacityAfterSplit, TestMergeQueueDoesNotInterruptReplicationChange,
testserver_test_22.1_22.2, testserver_upgrade_node.

Informs: cockroachdb#94871, cockroachdb#94951, cockroachdb#92677, cockroachdb#94956

Release note: None
craig bot pushed a commit that referenced this issue Jan 9, 2023
94954: kv/server: skip a bunch of flakey tests r=adityamaru a=msbutler

TestStoreCapacityAfterSplit, TestMergeQueueDoesNotInterruptReplicationChange,
testserver_test_22.1_22.2, testserver_upgrade_node.

Informs: #94871, #94951, #92677, #94956

Release note: None

Co-authored-by: Michael Butler
@blathers-crl

blathers-crl bot commented Mar 1, 2023

Hi @nvanbenschoten, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@nvanbenschoten nvanbenschoten added the branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 label Mar 2, 2023
craig bot pushed a commit that referenced this issue Mar 14, 2023
95789: pkg/util/log: don't falsify tenant ID tag in logs if none in ctx r=andreimatei a=abarganier

Previously, I made the decision to always tag a log entry with a tenant ID, even if no tenant ID was found in the context associated with the log entry. In this case, the system tenant ID was used in the tag, instead of omitting a tenant ID tag altogether.

I received some feedback that this is confusing. For example, imagine testing a feature, expecting log entries to come from a secondary tenant, and the context being used in that feature is not annotated with a tenant ID. With the previous behavior, the log entry would default to being tagged with the system tenant ID instead of having empty tags (or at least, no tenant ID tag). In this scenario, how do I tell the actual state of the log entry? Did the log entry indeed come from a goroutine belonging to the system tenant? Or was the context just missing the tenant ID annotation, but otherwise came from the correct tenant?

This ambiguity is not helpful. By falsifying a tenant ID tag we confuse the log reader about the actual state of the system. Furthermore, our eventual goal should be that almost no context objects in the system exist without a tenant ID (except for perhaps at startup before tenant initialization). Tagging with the system tenant ID in the case of a missing tenant ID annotation in the context makes it difficult to track down offending context objects.

This patch removes this default behavior from the logging package. Now, if no tenant ID is found in the context, we do not tag the entry with a tenant ID. Note however that on the *decode* side, we will maintain this default tenant ID tagging behavior. If a log entry does not have a tenant ID tag, then we must assume that only the system tenant has privilege to view said log entry, since the owner is ambiguous.
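The asymmetry between the write side and the read side can be sketched as follows. This is a minimal illustration only: the `entry` type, the `encode` and `decodeTenantID` helpers, and the `systemTenantID` value are hypothetical names chosen for the example and are not the logging package's actual API.

```go
package main

import "fmt"

// systemTenantID is a stand-in for the system tenant's identifier (assumed value).
const systemTenantID = "1"

// entry is a hypothetical, simplified log entry; the real types differ.
type entry struct {
	TenantID string // empty means no tenant ID tag was written
	Message  string
}

// encode tags the entry only when the context actually carried a tenant ID.
// An empty ctxTenantID models a context without the annotation.
func encode(ctxTenantID, msg string) entry {
	return entry{TenantID: ctxTenantID, Message: msg}
}

// decodeTenantID applies the read-side default: an untagged entry is treated
// as visible to the system tenant only, because its owner is ambiguous.
func decodeTenantID(e entry) string {
	if e.TenantID == "" {
		return systemTenantID
	}
	return e.TenantID
}

func main() {
	untagged := encode("", "starting up")
	tagged := encode("10", "query executed")
	fmt.Printf("written tag=%q, decoded as tenant %s\n", untagged.TenantID, decodeTenantID(untagged))
	fmt.Printf("written tag=%q, decoded as tenant %s\n", tagged.TenantID, decodeTenantID(tagged))
}
```

With this split, an empty tag in the written entry stays visible as "annotation missing", while readers still get a conservative owner.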

Release note: none

Epic CRDB-14486

98175: cdc: show all changefeed jobs in `SHOW CHANGEFEED JOBS` r=HonoreDB a=jayshrivastava

### cdc: show all changefeed jobs in SHOW CHANGEFEED JOBS

Release note (general change): Previously, the output of `SHOW CHANGEFEED JOBS` was limited to unfinished jobs and jobs that finished within the last 14 days. This change makes the command show all changefeed jobs, regardless of whether or when they finished. Note that jobs still obey the cluster setting `jobs.retention_time`: completed jobs older than that time are deleted.

Fixes: #97883

### jobs: add virtual index for job_type in crdb_internal.jobs

This change adds a virtual index on the `job_type` column
of `crdb_internal.jobs`. This change should make queries
on that table which filter on job type (such as `SHOW
CHANGEFEED JOBS`) more efficient.

Release note: None

Epic: None

98515: kvserver: deflake test store capacity after split r=andrewbaptist a=kvoli

This commit deflakes `TestStoreCapacityAfterSplit`. Previously it was possible for the replica load stats which underpin Capacity to be reset. The reset caused the recording duration to fall short of min stats duration, which led to a 0 value being reported for writes in store capacity.

This commit bumps the manual clock twice and removes redundant leaseholder checks within a retry loop. The combination of these two changes makes the test much less likely to flake.

The test is now unskipped.

```
dev test pkg/kv/kvserver -f TestStoreCapacityAfterSplit -v --stress
...
4410 runs so far, 0 failures, over 6m10s
```

Resolves: #92677

Release note: None

98521: ui: don't continue polling endpoints that return 403 errors r=dhartunian a=abarganier

It was brought to our attention that endpoints such as `v1/settings` would continue to be polled by DB Console even if they returned 403 errors.

If an endpoint returns 403 errors, we should not continue to poll it since the required access is not present for the current user.

This patch updates the polling mechanism to short-circuit the `refresh` process if a 403 error is encountered throughout the lifecycle of the poller.
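The short-circuit can be pictured with the sketch below. It is written in Go for consistency with the rest of this thread rather than in the DB Console's own language, and the `poller` type and its fields are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// poller is a hypothetical stand-in for the DB Console refresh loop.
type poller struct {
	fetch     func() int // returns an HTTP status code
	forbidden bool       // set once a 403 has been seen
	calls     int        // number of requests actually issued
}

// refresh performs one poll unless an earlier poll was rejected with 403.
// It reports whether a request was issued.
func (p *poller) refresh() bool {
	if p.forbidden {
		return false
	}
	p.calls++
	if p.fetch() == http.StatusForbidden {
		p.forbidden = true
	}
	return true
}

func main() {
	p := &poller{fetch: func() int { return http.StatusForbidden }}
	for i := 0; i < 3; i++ {
		fmt.Printf("tick %d: issued request = %v\n", i, p.refresh())
	}
	fmt.Println("total requests:", p.calls)
}
```

Only the first tick reaches the endpoint; later ticks return without issuing a request.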

Release note: none

Fixes: #98356

98536: kvserver: deflake learner joint cfg relocate range r=andrewbaptist a=kvoli

Previously, in `TestLearnerOrJointConfigAdminRelocateRange`, an in-flight snapshot towards a learner could exist before the `AdminRelocateRange` command was sent. When this occurred, the test would fail, because `AdminRelocateRange` returns an error when it finds any in-flight snapshots to learners. This situation occurred infrequently, causing the test to flake.

This commit updates the `TestLearnerOrJointConfigAdminRelocateRange` test to first assert that there are the expected number of learners, then assert that there are no in-flight snapshots towards learners before beginning the main testing logic. The test is now unskipped.

```
dev test pkg/kv/kvserver \
      -f TestLearnerOrJointConfigAdminRelocateRange \
      -v --stress
...
5652 runs so far, 0 failures, over 12m30s
```

Resolves: #95500

Release note: None

98542: storage: remove MVCCIterator.Key method r=jbowens a=jbowens

The MVCCIterator interface previously exposed two methods for accessing the current iterator position as an MVCC key: UnsafeKey and Key. Key() was equivalent to UnsafeKey().Clone().

This commit removes the Key() variant, pushing the onus of key copying onto the caller. This reduces the interface surface area, avoids accidental key copying (some of which is addressed within this commit), and does not impose any unreasonable burden on callers.
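What "pushing the onus of key copying onto the caller" means in practice can be shown with a toy iterator. The `sliceIter` type below is a hypothetical illustration that uses plain byte slices; it is not the storage package's `MVCCIterator` or `MVCCKey`.

```go
package main

import "fmt"

// sliceIter is a hypothetical iterator that reuses one buffer for the current
// key, mimicking the "unsafe" contract: the returned slice is only valid
// until the iterator is repositioned and UnsafeKey is called again.
type sliceIter struct {
	keys [][]byte
	pos  int
	buf  []byte
}

func (it *sliceIter) Valid() bool { return it.pos < len(it.keys) }

func (it *sliceIter) Next() { it.pos++ }

// UnsafeKey returns a view into the iterator's internal buffer.
func (it *sliceIter) UnsafeKey() []byte {
	it.buf = append(it.buf[:0], it.keys[it.pos]...)
	return it.buf
}

// collect retains every key, so it must copy each one before advancing.
func collect(it *sliceIter) [][]byte {
	var out [][]byte
	for ; it.Valid(); it.Next() {
		k := it.UnsafeKey()
		out = append(out, append([]byte(nil), k...)) // explicit clone by the caller
	}
	return out
}

func main() {
	it := &sliceIter{keys: [][]byte{[]byte("a"), []byte("b")}}
	for _, k := range collect(it) {
		fmt.Println(string(k))
	}
}
```

A caller that only inspects the key before moving on can skip the clone entirely, which is the accidental copying the change aims to avoid.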

Epic: None
Informs #82589.
Release note: None

98543: allocator: fix lease io enforcement setting typo r=andrewbaptist a=kvoli

This commit updates the "do nothing" lease IO overload enforcement (`kv.allocator.lease_io_overload_threshold_enforcement`) setting to be correctly spelled "ignore" rather than "ingore".

Part of: #96508

Release note (ops change): The
`kv.allocator.lease_io_overload_threshold_enforcement` setting value which disables enforcement is updated to be spelled correctly as "ignore" rather than "ingore".

98600: server: change conn close error to warning r=knz,abarganier a=dhartunian

Resolves: #98523
Epic: None
Release note: None

Co-authored-by: Alex Barganier
Co-authored-by: Jayant Shrivastava
Co-authored-by: Austen McClernon
Co-authored-by: Jackson Owens
Co-authored-by: David Hartunian
@craig craig bot closed this as completed in 0d35a08 Mar 14, 2023
kvoli added a commit to kvoli/cockroach that referenced this issue Oct 13, 2023
This commit deflakes `TestStoreCapacityAfterSplit`. Previously it was
possible for the replica load stats which underpin Capacity to be
reset. The reset caused the recording duration to fall short of min
stats duration, which led to a 0 value being reported for writes in
store capacity.

This commit bumps the manual clock twice and removes redundant
leaseholder checks within a retry loop. The combination of these two
changes makes the test much less likely to flake.

The test is now unskipped.

```
dev test pkg/kv/kvserver -f TestStoreCapacityAfterSplit -v --stress
...
4410 runs so far, 0 failures, over 6m10s
```

Resolves: cockroachdb#92677

Release note: None