Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestFlowControlRaftMembership failed #125736

Closed
cockroach-teamcity opened this issue Jun 15, 2024 · 3 comments · Fixed by #126109
Closed

kv/kvserver: TestFlowControlRaftMembership failed #125736

cockroach-teamcity opened this issue Jun 15, 2024 · 3 comments · Fixed by #126109
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-storage Storage Team

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jun 15, 2024

kv/kvserver.TestFlowControlRaftMembership failed on master @ 3aa5bdf40527e2f60b179094403b4302a1c2cbe1:

Fatal error:

panic: Counters should not decrease [recovered]
	panic: Counters should not decrease

Stack:

goroutine 28869 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).recover(0xc00bb3c290?, {0xe5f0db8, 0xc00b73fce0})
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:231 +0x74
panic({0x9a03bc0?, 0xe584900?})
	GOROOT/src/runtime/panic.go:770 +0x132
github.com/cockroachdb/cockroach/pkg/util/metric.(*Counter).Update(0xc008684120, 0x29)
	github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/metric.go:755 +0x6d
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreMetrics).updateEngineMetrics(_, {0xc0043a9b08, {0x0, 0x0, 0x0, 0x2e, 0x0, 0x2e, 0x0}, {0x7, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics.go:3789 +0x19cd
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).computeMetrics(_, {_, _})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3478 +0x1d8
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).ComputeMetricsPeriodically(_, {_, _}, _, _)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3503 +0xb0
github.com/cockroachdb/cockroach/pkg/server.(*Node).computeMetricsPeriodically.func1(0xc0034c5c08)
	github.com/cockroachdb/cockroach/pkg/server/node.go:1036 +0xe6
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores.func1(0x9c06660?, 0xc0034c5c08)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:150 +0x58
github.com/cockroachdb/cockroach/pkg/util/syncutil.(*IntMap).Range(0xc007b6ef68, 0xc00eff1c00)
	github.com/cockroachdb/cockroach/pkg/util/syncutil/int_map.go:385 +0x19d
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores(0xc007b6ef30, 0xc009e24c48)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:149 +0x8e
github.com/cockroachdb/cockroach/pkg/server.(*Node).computeMetricsPeriodically(0xc007b40808, {0xe5f0db8, 0xc00b73fce0}, 0xc009e24e40, 0x0)
	github.com/cockroachdb/cockroach/pkg/server/node.go:1035 +0xcf
github.com/cockroachdb/cockroach/pkg/server.(*Node).startComputePeriodicMetrics.func2({0xe5f0db8, 0xc00b73fce0})
	github.com/cockroachdb/cockroach/pkg/server/node.go:1008 +0x273
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:485 +0x263
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx in goroutine 23658
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:476 +0x69d
Log preceding fatal error

=== RUN   TestFlowControlRaftMembership
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestFlowControlRaftMembership224602738
    test_log_scope.go:81: use -show-logs to present logs inline
*
* ERROR: a panic has occurred!
* panic: Counters should not decrease
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   | 	GOROOT/src/runtime/panic.go:770
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Counter).Update
*   | 	github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/metric.go:755
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreMetrics).updateEngineMetrics
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics.go:3789
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).computeMetrics
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3478
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).ComputeMetricsPeriodically
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3503
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).computeMetricsPeriodically.func1
*   | 	github.com/cockroachdb/cockroach/pkg/server/node.go:1036
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores.func1
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:150
*   | github.com/cockroachdb/cockroach/pkg/util/syncutil.(*IntMap).Range
*   | 	github.com/cockroachdb/cockroach/pkg/util/syncutil/int_map.go:385
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:149
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).computeMetricsPeriodically
*   | 	github.com/cockroachdb/cockroach/pkg/server/node.go:1035
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).startComputePeriodicMetrics.func2
*   | 	github.com/cockroachdb/cockroach/pkg/server/node.go:1008
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
*   | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:485
*   | runtime.goexit
*   | 	src/runtime/asm_amd64.s:1695
* Wraps: (2) panic: Counters should not decrease
* Error types: (1) *withstack.withStack (2) *errutil.leafError
*

Parameters:

  • attempt=1
  • race=true
  • run=1
  • shard=3
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-39580

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jun 15, 2024
@aadityasondhi
Copy link
Collaborator

This seems to be a side-effect of the changes here: #125613.

Specifically coming from WALBytesWritten metric. It should not be a release blocker as the metric should work normally in release builds and only panics in test builds.

As storage on-call, I can take this one.

@aadityasondhi aadityasondhi self-assigned this Jun 17, 2024
@aadityasondhi aadityasondhi added the T-storage Storage Team label Jun 17, 2024
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Jun 17, 2024
@aadityasondhi aadityasondhi removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jun 17, 2024
@exalate-issue-sync exalate-issue-sync bot added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. and removed A-storage Relating to our storage engine (Pebble) on-disk storage. T-kv KV Team labels Jun 17, 2024
@aadityasondhi aadityasondhi removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jun 17, 2024
@aadityasondhi
Copy link
Collaborator

aadityasondhi commented Jun 18, 2024

We saw this before too and it was fixed: cockroachdb/pebble#3555.

Unable to reproduce the non-monotonicity inside pebble using go test . -run TestMetricsWALBytesWrittenMonotonicity --exec 'stress -p 300' -race -v. And TestFlowControlRaftMembership run locally with stressrace doesn't fail either.

@nicktrav nicktrav moved this from Incoming to Tests (failures, skipped, flakes) in [Deprecated] Storage Jun 18, 2024
@aadityasondhi
Copy link
Collaborator

Seems like we are doing a flushable ingest at the time of this failure:

I240615 10:52:49.021841 27872 3@pebble/event.go:828 ⋮ [n4,s4,pebble] 183  [JOB 12] WAL created 000014 (recycled 000002)
I240615 10:52:49.542707 27874 3@pebble/event.go:828 ⋮ [n4,s4,pebble] 184  [JOB 13] WAL created 000015 (recycled 000010)
I240615 10:52:50.542960 31717 3@pebble/event.go:808 ⋮ [n5,s5,pebble] 185  [JOB 3] ingesting: sstable created 000004
I240615 10:52:50.543200 31717 3@pebble/event.go:808 ⋮ [n5,s5,pebble] 186  [JOB 3] ingesting: sstable created 000009
I240615 10:52:50.543352 31717 3@pebble/event.go:808 ⋮ [n5,s5,pebble] 187  [JOB 3] ingesting: sstable created 000005
I240615 10:52:50.543485 31717 3@pebble/event.go:808 ⋮ [n5,s5,pebble] 188  [JOB 3] ingesting: sstable created 000006
I240615 10:52:50.543613 31717 3@pebble/event.go:808 ⋮ [n5,s5,pebble] 189  [JOB 3] ingesting: sstable created 000007
I240615 10:52:50.543740 31717 3@pebble/event.go:808 ⋮ [n5,s5,pebble] 190  [JOB 3] ingesting: sstable created 000008
I240615 10:52:50.544959 31717 3@pebble/event.go:828 ⋮ [n5,s5,pebble] 191  [JOB 4] WAL created 000010
I240615 10:52:50.546531 31717 3@pebble/event.go:828 ⋮ [n5,s5,pebble] 192  [JOB 5] WAL created 000011
I240615 10:52:50.547033 31717 3@pebble/event.go:816 ⋮ [n5,s5,pebble] 193  [JOB 3] ingested as flushable 000004 (1.2KB), 000009 (921B), 000005 (1.9KB), 000006 (958B), 000007 (934B), 000008 (2.0MB)

craig bot pushed a commit that referenced this issue Jun 24, 2024
125571: roachtest: grafana annotations read creds from cloud storage r=herkolategan,renatolabs a=DarrylWong

As of #124099 we now store the service account creds in cloud storage. We already use this to access prometheus when generating dynamic configs. This change does the same for Grafana annotations by extracting the common logic into a helper.

This will allow users to have access to Grafana annotations out of the box locally, and limit the amount of benign but potentially confusing warnings about invalid credentials.

Release note: none
Fixes: none
Epic: none

126084: jobs: limit number of retained dsp-diag-url info rows r=dt a=dt

Fixes #126083.

126109: kvserver: deflake `WALBytesWritten` metric r=raduberinde a=aadityasondhi

There is a race condition in Pebble metrics where sometimes the WAL is rotated prior to updated the BytesIn metric to account for the previous WAL. The metrics collection call happens async so it can sometimes cause this metric to decrease for a scrape window.

Fixes: #125736.

Release note: None

126114: sql: avoid slow lock verification in TestSchemaChangeAfterCreateInTxn r=rafiss a=rafiss

The addition of test-only verification pushed this test over the timeout sometimes, such that running it under the deadlock detector would cause spurious failures. We avoid this by making the test smaller under deadlock, like we do for race builds.

fixes #126075
Release justification: test only change
Release note: None

Co-authored-by: DarrylWong <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@craig craig bot closed this as completed in 82cf310 Jun 24, 2024
@github-project-automation github-project-automation bot moved this from Tests (failures, skipped, flakes) to Done in [Deprecated] Storage Jun 24, 2024
asg0451 pushed a commit to asg0451/cockroach that referenced this issue Jun 25, 2024
There is a race condition in Pebble metrics where sometimes the WAL is
rotated prior to updated the BytesIn metric to account for the previous
WAL. The metrics collection call happens async so it can sometimes cause
this metric to decrease for a scrape window.

Fixes: cockroachdb#125736.

Release note: None
@github-project-automation github-project-automation bot moved this to roachtest/unit test backlog in KV Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-storage Storage Team
Projects
Archived in project
Status: roachtest/unit test backlog
Development

Successfully merging a pull request may close this issue.

2 participants