Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestSnapshotsToDrainingNodes failed #133517

Closed
cockroach-teamcity opened this issue Oct 26, 2024 · 4 comments · Fixed by #133693
Closed

kv/kvserver: TestSnapshotsToDrainingNodes failed #133517

cockroach-teamcity opened this issue Oct 26, 2024 · 4 comments · Fixed by #133693
Assignees
Labels
A-testing Testing tools and infrastructure branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Oct 26, 2024

kv/kvserver.TestSnapshotsToDrainingNodes failed on release-24.3 @ 3839fb143f98e5009a994ad58fb6b82dc7badc55:

=== RUN   TestSnapshotsToDrainingNodes
    test_log_scope.go:165: test logs captured to: outputs.zip/logTestSnapshotsToDrainingNodes544189026
    test_log_scope.go:76: use -show-logs to present logs inline
    replica_learner_test.go:1100: -- test log scope end --
test logs left over in: outputs.zip/logTestSnapshotsToDrainingNodes544189026
--- FAIL: TestSnapshotsToDrainingNodes (250.53s)
=== RUN   TestSnapshotsToDrainingNodes/raft_snapshots
    replica_learner_test.go:1040: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/replica_learner_test.go:1040
        	            				github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/replica_learner_test.go:1098
        	Error:      	Not equal: 
        	            	expected: map[string]kvserver_test.snapshotBytesMetrics{"":kvserver_test.snapshotBytesMetrics{sentBytes:437, rcvdBytes:0}, ".rebalancing":kvserver_test.snapshotBytesMetrics{sentBytes:0, rcvdBytes:0}, ".recovery":kvserver_test.snapshotBytesMetrics{sentBytes:437, rcvdBytes:0}}
        	            	actual  : map[string]kvserver_test.snapshotBytesMetrics{"":kvserver_test.snapshotBytesMetrics{sentBytes:431, rcvdBytes:0}, ".rebalancing":kvserver_test.snapshotBytesMetrics{sentBytes:0, rcvdBytes:0}, ".recovery":kvserver_test.snapshotBytesMetrics{sentBytes:431, rcvdBytes:0}}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -2,3 +2,3 @@
        	            	  (string) "": (kvserver_test.snapshotBytesMetrics) {
        	            	-  sentBytes: (int64) 437,
        	            	+  sentBytes: (int64) 431,
        	            	   rcvdBytes: (int64) 0
        	            	@@ -10,3 +10,3 @@
        	            	  (string) (len=9) ".recovery": (kvserver_test.snapshotBytesMetrics) {
        	            	-  sentBytes: (int64) 437,
        	            	+  sentBytes: (int64) 431,
        	            	   rcvdBytes: (int64) 0
        	Test:       	TestSnapshotsToDrainingNodes/raft_snapshots
    --- FAIL: TestSnapshotsToDrainingNodes/raft_snapshots (17.62s)

Parameters:

  • attempt=1
  • race=true
  • run=3
  • shard=26
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-43654

@cockroach-teamcity cockroach-teamcity added branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Oct 26, 2024
@miraradeva
Copy link
Contributor

Looks similar to #117064.

@arulajmani
Copy link
Collaborator

Yup, seems like we're off by 6 like we were in #117064. I'll try and stress this to see if it reproduces, but I'll remove the release blocker from this one.

@arulajmani arulajmani added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 29, 2024
@arulajmani arulajmani self-assigned this Oct 29, 2024
@arulajmani
Copy link
Collaborator

No failures after stressing for 1000 runs, ~2 hours. I'm going to change this test to not make these tight assertions in the hopes that it deflakes things.

arulajmani added a commit to arulajmani/cockroach that referenced this issue Oct 29, 2024
This test was making tight assertions about the size of the snapshot
that was sent. To do so, it was trying to reimplement the actual
snapshot sending logic in `kvBatchSnapshotStrategy.Send()`. So these
tight assertions weren't of much use -- they were asserting that we
were correctly re-implementing `kvBatchSnapshotStrategy.Send()` in
`getExpectedSnapshotSizeBytes`. We weren't, as evidenced by some rare
flakes.

This patch loosens assertions to deflake the test.

Closes cockroachdb#133517

<what was there before: Previously, ...>
<why it needed to change: This was inadequate because ...>
<what you did about it: To address this, this patch ...>

Release note: None
craig bot pushed a commit that referenced this issue Oct 29, 2024
…133690 #133693

133234: workload: tpcc consistency check added flag as-of. r=srosenberg,nameisbhaskar,vidit-bhat a=shailendra-patel

While running the consistency checker on the tpcc database with an active tpcc workload, the consistency check fails with a retryable error, such as restart transaction:`TransactionRetryWithProtoRefreshError: ReadWithinUncertaintyIntervalError:`
To fix this, added a new flag `as-of` which allows to run consistency check using `AS OF SYSTEM TIME`.

Epic: none
Release note: None

133347: crossclsuter/logical: add settings/stats to ldr ingest chunking r=dt a=dt



133607: sql: check object type when revoking privilege r=rafiss a=rafiss

fixes #131157
Release note (bug fix): Fix an unhandled error that could occur when using `REVOKE ... ON SEQUENCE FROM ... user` on an object that is not a sequence.

133608: schemachanger: force prod values in expensive test r=rafiss a=rafiss

fixes #133437
Release note: None

133616: roachtest: validate token return in perturbation/* tests r=kvoli a=andrewbaptist

This commit adds validation that all RAC tokens are returned on all stable nodes at the end of the test.

Fixes: #133410

Release note: None

133681: roachtest: minor fixes in rebalance/by-load test r=arulajmani a=kvoli

`%` was not escaped, causing it to be substituted with values which
were meant to go later.

e.g., from:

```
node 0 has core count normalized CPU utilization ts datapoint not in [0%!,(float64=1.4920845083839689)100{[{{%!](string=cr.node.sys.cpu.combined.percent-normalized) %!]
...
```

To

```
node idx 0 has core count normalized CPU utilization ts datapoint not in [0%,100%]
...
```

---

The `rebalance/by-load/*` roachtests compare the CPU of nodes and assert
that the distribution of node cpu is bounded +- 20%. The previous metric:

```
sys.cpu.combined.percent_normalized
```

Would occasionally over-report the CPU, as greater than 100% (>1.0),
which is impossible. Use the host CPU instead, which will look at the
machines CPU utilization, rather than any cockroach processes.

```
sys.cpu.host.combined.percent_normalized
```

Part of: #133004
Part of: #133054
Part of: #132019
Part of: #133223
Part of: #132633
Release note: None

133683: license: don't hit EnvOrDefaultInt64 in hot path r=fqazi,mgartner a=tbg

Saves 0.3%cpu on sysbench.

Fixes #133088.

Release note: None
Epic: None


133686: rac2: order testingRCRange.mu before RaftMu in tests r=sumeerbhola a=kvoli

`testingRCRange.mu` was being acquired, and held before acquiring `RaftMu` in `testingRCRange.admit()`, which conflicted with different ordering (reversed). This was a test only issue with `TestRangeController`.

Order `testingRCRange.mu` before `RaftMu` in `admit()`.

Fixes: #133650
Release note: None

133690: roachtest: always pass a Context to queries r=kvoli a=andrewbaptist

Queries can hang if there is no context passed to them. In roachtests, a context can be cancelled if there is a VM preemption. It is always better to use the test context and avoid this risk. This change updates the perturbation/* tests to always pass a context.

Fixes: #133625

Release note: None

133693: kvserver: deflake TestSnapshotsToDrainingNodes r=kvoli a=arulajmani

This test was making tight assertions about the size of the snapshot that was sent. To do so, it was trying to reimplement the actual snapshot sending logic in `kvBatchSnapshotStrategy.Send()`. So these tight assertions weren't of much use -- they were asserting that we were correctly re-implementing `kvBatchSnapshotStrategy.Send()` in `getExpectedSnapshotSizeBytes`. We weren't, as evidenced by some rare flakes.

This patch loosens assertions to deflake the test.

Closes #133517
Release note: None

Co-authored-by: Shailendra Patel <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
craig bot pushed a commit that referenced this issue Oct 29, 2024
133347: crossclsuter/logical: add settings/stats to ldr ingest chunking r=dt a=dt



133608: schemachanger: force prod values in expensive test r=rafiss a=rafiss

fixes #133437
Release note: None

133681: roachtest: minor fixes in rebalance/by-load test r=arulajmani a=kvoli

`%` was not escaped, causing it to be substituted with values which
were meant to go later.

e.g., from:

```
node 0 has core count normalized CPU utilization ts datapoint not in [0%!,(float64=1.4920845083839689)100{[{{%!](string=cr.node.sys.cpu.combined.percent-normalized) %!]
...
```

To

```
node idx 0 has core count normalized CPU utilization ts datapoint not in [0%,100%]
...
```

---

The `rebalance/by-load/*` roachtests compare the CPU of nodes and assert
that the distribution of node cpu is bounded +- 20%. The previous metric:

```
sys.cpu.combined.percent_normalized
```

Would occasionally over-report the CPU, as greater than 100% (>1.0),
which is impossible. Use the host CPU instead, which will look at the
machines CPU utilization, rather than any cockroach processes.

```
sys.cpu.host.combined.percent_normalized
```

Part of: #133004
Part of: #133054
Part of: #132019
Part of: #133223
Part of: #132633
Release note: None

133683: license: don't hit EnvOrDefaultInt64 in hot path r=fqazi,mgartner a=tbg

Saves 0.3%cpu on sysbench.

Fixes #133088.

Release note: None
Epic: None


133693: kvserver: deflake TestSnapshotsToDrainingNodes r=kvoli a=arulajmani

This test was making tight assertions about the size of the snapshot that was sent. To do so, it was trying to reimplement the actual snapshot sending logic in `kvBatchSnapshotStrategy.Send()`. So these tight assertions weren't of much use -- they were asserting that we were correctly re-implementing `kvBatchSnapshotStrategy.Send()` in `getExpectedSnapshotSizeBytes`. We weren't, as evidenced by some rare flakes.

This patch loosens assertions to deflake the test.

Closes #133517
Release note: None

Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
@craig craig bot closed this as completed in e38cf40 Oct 29, 2024
Copy link

blathers-crl bot commented Oct 29, 2024

Based on the specified backports for linked PR #133693, I applied the following new label(s) to this issue: branch-release-23.1, branch-release-23.2, branch-release-24.1, branch-release-24.2. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 labels Oct 29, 2024
blathers-crl bot pushed a commit that referenced this issue Oct 29, 2024
This test was making tight assertions about the size of the snapshot
that was sent. To do so, it was trying to reimplement the actual
snapshot sending logic in `kvBatchSnapshotStrategy.Send()`. So these
tight assertions weren't of much use -- they were asserting that we
were correctly re-implementing `kvBatchSnapshotStrategy.Send()` in
`getExpectedSnapshotSizeBytes`. We weren't, as evidenced by some rare
flakes.

This patch loosens assertions to deflake the test.

Closes #133517
Release note: None
blathers-crl bot pushed a commit that referenced this issue Oct 29, 2024
This test was making tight assertions about the size of the snapshot
that was sent. To do so, it was trying to reimplement the actual
snapshot sending logic in `kvBatchSnapshotStrategy.Send()`. So these
tight assertions weren't of much use -- they were asserting that we
were correctly re-implementing `kvBatchSnapshotStrategy.Send()` in
`getExpectedSnapshotSizeBytes`. We weren't, as evidenced by some rare
flakes.

This patch loosens assertions to deflake the test.

Closes #133517
Release note: None
blathers-crl bot pushed a commit that referenced this issue Oct 29, 2024
This test was making tight assertions about the size of the snapshot
that was sent. To do so, it was trying to reimplement the actual
snapshot sending logic in `kvBatchSnapshotStrategy.Send()`. So these
tight assertions weren't of much use -- they were asserting that we
were correctly re-implementing `kvBatchSnapshotStrategy.Send()` in
`getExpectedSnapshotSizeBytes`. We weren't, as evidenced by some rare
flakes.

This patch loosens assertions to deflake the test.

Closes #133517
Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants