
roachtest: kv/restart/nodes=12 failed #98928

Closed
cockroach-teamcity opened this issue Mar 18, 2023 · 6 comments
Labels: branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). no-test-failure-activity O-roachtest O-robot Originated from a bot. T-kv KV Team X-stale


cockroach-teamcity commented Mar 18, 2023

roachtest.kv/restart/nodes=12 failed with artifacts on master @ 6c99966f604f3521acdb925b9f689529ffd46df3:

test artifacts and logs in: /artifacts/kv/restart/nodes=12/run_1
(assertions.go:262).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:1021
	            				main/pkg/cmd/roachtest/test_runner.go:993
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	"0" is not greater than "0.5"
	Test:       	kv/restart/nodes=12
(cluster.go:1969).Run: output in run_060156.319224414_n13_cockroach-workload-r: ./cockroach workload run kv --min-block-bytes=8192 --max-block-bytes=8192 --tolerate-errors --read-percent=50  --duration=30m0s --concurrency=64 {pgurl:1-11} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0
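The failing assertion compares post-restart throughput against a fraction of the pre-restart baseline. Below is a hypothetical Go reconstruction of that check, for illustration only: the function name, parameters, and the 0.5 threshold are assumptions inferred from the error message, not the actual roachtest code at kv.go:1021.

```go
package main

import "fmt"

// checkRestartThroughput is a hypothetical sketch of the failing check:
// QPS measured after the node restarts must stay above half of the
// pre-restart QPS. Names and threshold are assumptions for illustration.
func checkRestartThroughput(qpsBefore, qpsAfter float64) error {
	ratio := qpsAfter / qpsBefore
	if !(ratio > 0.5) {
		// Mirrors the reported failure: "0" is not greater than "0.5".
		return fmt.Errorf("%q is not greater than %q", fmt.Sprint(ratio), "0.5")
	}
	return nil
}

func main() {
	// Throughput collapses to zero while the cluster stalls on the
	// restarted node, so the check fires.
	fmt.Println(checkRestartThroughput(10000, 0))
}
```

With `qpsAfter` at zero the ratio is 0, matching the `"0" is not greater than "0.5"` failure above.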

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash.

Jira issue: CRDB-25597

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 18, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 18, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 18, 2023

aliher1911 commented Mar 20, 2023

The test was re-enabled last week after some fixes (#98271), but the underlying overload behaviour appears unchanged.
After a node is added back following downtime, SQL throughput tanks while the node tries to catch up. The number of leaseholders on the overloaded node is 0, so the change mentioned in the PR works. Since throughput still drops, the overload must be affecting something else.

@aliher1911

It seems my interpretation of the lease stats was not correct. The node doesn't hold any leases, but other nodes still try to move leases to n12.

Queries are stuck with:

W230318 06:26:27.536186 369697 kv/kvclient/kvcoord/dist_sender.go:1675 ⋮ [T1,n10,client=10.128.0.71:50218,user=root] 12986  slow range RPC: have been waiting 145.10s (1 attempts) for RPC Put [/Table/106/1/‹-112334951631669349›/‹0›,/Min), EndTxn(parallel commit) [/Table/106/1/‹-112334951631669349›/‹0›], [txn: d66aaf7d], [can-forward-ts] to r4816:‹/Table/106/1/-112{980658418560248-058367329429228}› [(n12,s12):5, (n7,s7):4, (n9,s9):3, next=6, gen=64, sticky=9223372036.854775807,2147483647]; resp: result is ambiguous: context done during DistSender.Send: ba: ‹Put [/Table/106/1/-112334951631669349/0,/Min), EndTxn(parallel commit) [/Table/106/1/-112334951631669349/0], [txn: d66aaf7d], [can-forward-ts]› RPC error: grpc: ‹context canceled› [code 1/Canceled]

While we try to move some leases right after node comes back:

I230318 06:23:27.300854 3410055 13@kv/kvserver/replicate_queue.go:2033 ⋮ [T1,n10,replicate,s10,r18220/6:‹/Table/106/1/-88{6091…-5998…}›] 33251  transferring lease to s12
I230318 06:23:42.441713 3429253 13@kv/kvserver/replicate_queue.go:2033 ⋮ [T1,n10,replicate,s10,r6461/1:‹/Table/106/1/91{49588…-50511…}›] 33252  transferring lease to s12
I230318 06:23:56.174176 3442526 13@kv/kvserver/replicate_queue.go:2033 ⋮ [T1,n10,replicate,s10,r13280/2:‹/Table/106/1/617{4277…-5199…}›] 33253  transferring lease to s12
I230318 06:24:03.337125 3449366 13@kv/kvserver/replicate_queue.go:2033 ⋮ [T1,n10,replicate,s10,r16913/4:‹/Table/106/1/-6830{948…-026…}›] 33254  transferring lease to s12

@aliher1911 aliher1911 added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 20, 2023
@aliher1911

When you're looking at this, maybe you can also change the formatter from %x to %s in

t.Status(fmt.Sprintf("waiting %x after stopping node to allow the node to fall behind", duration))

It currently prints the wait duration as hex. I don't want to raise a PR for a single character.


kvoli commented Mar 20, 2023

The restarting node (n12) should be treated as suspect and not be receiving any leases for 30s after restarting.

n12 rejoins at 06:23:18.71.

I230318 06:23:18.715307 400 1@gossip/gossip.go:1420 ⋮ [T1,n12] 26  node has connected to cluster via gossip

The earliest lease transfer I see towards n12 is at 06:23:25.54, which is 7 seconds after n12 connects to gossip.

2283:I230318 06:23:25.542903 3477207 13@kv/kvserver/replicate_queue.go:2033 ⋮ [T1,n9,replicate,s9,r18364/5:‹/Table/106/1/720{0787…-1709…}›] 35975  transferring lease to s12
...

So it appears the node isn't being treated as suspect immediately after rejoining, and is receiving leases as a result.

I'll look more into why that is the case.

@kvoli kvoli added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 20, 2023
kvoli added a commit to kvoli/cockroach that referenced this issue Mar 20, 2023
Previously, the `LastUnavailable` time was set in most parts of the
storepool when a store was considered either `Unavailable`, `Dead`,
`Decommissioned` or `Draining`. When `LastUnavailable` is within the
last suspect duration (30s default), the node is treated as suspect by
other nodes in the cluster.

`LastUnavailable` was not being set when a store was considered dead due
to the store not gossiping its store descriptor. This commit updates the
`status` storepool function to do just that.

Informs: cockroachdb#98928

Release note: None

kvoli commented Mar 20, 2023

With the fix in #99033 to ensure restarted stores are marked suspect, this test should be less flaky. After the 30s suspect window expires and before the restarting node hits overload, it may still receive a few leases and cause this test to flake. Removing the GA blocker.

@kvoli kvoli removed the GA-blocker label Mar 20, 2023
craig bot pushed a commit that referenced this issue Mar 20, 2023
98792: kvserver: unskip `TestNewTruncateDecision` r=erikgrinaker a=erikgrinaker

Passed after 10k stress runs. It has been skipped since 2019; the issue seems to have been fixed in the meantime.

Resolves #38584.

Epic: none
Release note: None

98855: roachtest: enable schema changes in acceptance/version-upgrade r=fqazi a=fqazi

Previously, due to flakes, we disabled schema changes inside the version-upgrade test. This patch re-enables them, since we are confident that the workload itself is now stable in a mixed-version state.

Fixes: #58489
Release note: None

99023: kv: add log scope to BenchmarkSingleRoundtripWithLatency r=arulajmani a=nvanbenschoten

Informs #98887.

Avoids mixing logs with benchmark results, which breaks benchdiff.

Release note: None

99033: storepool: set last unavailable on gossip dead r=andrewbaptist a=kvoli

Previously, the `LastUnavailable` time was set in most parts of the storepool when a store was considered either `Unavailable`, `Dead`, `Decommissioned` or `Draining`. When `LastUnavailable` is within the last suspect duration (30s default), the node is treated as suspect by other nodes in the cluster.

`LastUnavailable` was not being set when a store was considered dead due to the store not gossiping its store descriptor. This commit updates the `status` storepool function to do just that.

Informs: #98928

Release note: None

99039: pkg/ccl/backupccl: Remove TestBackupRestoreControlJob r=benbardin a=benbardin

This test was marked skipped for flakiness in 2018.

Fixes: #24136

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Ben Bardin <[email protected]>
@github-actions

We have marked this test failure issue as stale because it has been
inactive for 1 month. If this failure is still relevant, removing the
stale label or adding a comment will keep it active. Otherwise,
we'll close it in 5 days to keep the test failure queue tidy.
