roachtest: kv/gracefuldraining/nodes=3 failed [qps dropped] #59094
(roachtest).kv/gracefuldraining/nodes=3 failed on master@fbf596c3e17fbb9ec0935b732f4b84469a5399e8:
Artifacts: /kv/gracefuldraining/nodes=3
See this test on roachdash
cc @knz does this qualify as an alpha release blocker?
No, it does not qualify.
(roachtest).kv/gracefuldraining/nodes=3 failed on master@100c09f4f6eb3f5b18a67ec4bbfdfe989e0d6ce2:
Artifacts: /kv/gracefuldraining/nodes=3
See this test on roachdash
(roachtest).kv/gracefuldraining/nodes=3 failed on master@c584f62067a45aa540c26fc9081a83e460bfe37a:
Artifacts: /kv/gracefuldraining/nodes=3
See this test on roachdash
(roachtest).kv/gracefuldraining/nodes=3 failed on master@81a2c26a104fa8cc7e8b530b837ffb6ff85ddc5a:
Artifacts: /kv/gracefuldraining/nodes=3
See this test on roachdash
(roachtest).kv/gracefuldraining/nodes=3 failed on master@0e98670c8fcec566937e899fdf77d2a68c702d62:
Artifacts: /kv/gracefuldraining/nodes=3
See this test on roachdash
(roachtest).kv/gracefuldraining/nodes=3 failed on master@5b33e6dfc47000de831745a851e3bf9e2cf7fd95:
Artifacts: /kv/gracefuldraining/nodes=3
See this test on roachdash
(roachtest).kv/gracefuldraining/nodes=3 failed on master@7853fd32de8b6dea869f2a2a92dcd7506f4a8998:
Artifacts: /kv/gracefuldraining/nodes=3
See this test on roachdash
(roachtest).kv/gracefuldraining/nodes=3 failed on master@e9e372122a2e3db7090b5705da07128f828e2441:
Artifacts: /kv/gracefuldraining/nodes=3
See this test on roachdash
(roachtest).kv/gracefuldraining/nodes=3 failed on master@83e70ce84b740e27e721c3b73c38a4b8b515094a:
Artifacts: /kv/gracefuldraining/nodes=3
See this test on roachdash
(roachtest).kv/gracefuldraining/nodes=3 failed on master@bdff5338ca725bf1cfddf7e3f648bbf02ab42999:
Artifacts: /kv/gracefuldraining/nodes=3
See this test on roachdash
(roachtest).kv/gracefuldraining/nodes=3 failed on master@ee9f47b9ec9476a693464e2dcd09a01bf9d39ad2:
Artifacts: /kv/gracefuldraining/nodes=3
See this test on roachdash
(roachtest).kv/gracefuldraining/nodes=3 failed on master@3d19b2cf6b290a152b23722fc32e995eed3b437b:
Artifacts: /kv/gracefuldraining/nodes=3
See this test on roachdash
last failure
cc #62946
roachtest.kv/gracefuldraining/nodes=3 failed with artifacts on master @ 69308cce3bb7e660908cb3e2724eedd271ce5585:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh kv/gracefuldraining/nodes=3
Same failure on other branches
roachtest.kv/gracefuldraining/nodes=3 failed with artifacts on master @ 6b9d2a15f0c223c8dda04c5b2a39abe784b58bdd:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh kv/gracefuldraining/nodes=3
Same failure on other branches
roachtest.kv/gracefuldraining/nodes=3 failed with artifacts on master @ d43d9fddbebac7eff03804a7d86f7b6af119f24f:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh kv/gracefuldraining/nodes=3
Same failure on other branches
roachtest.kv/gracefuldraining/nodes=3 failed with artifacts on master @ 84ec89c77841016da0b9c4c71772a4304bad45a5:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh kv/gracefuldraining/nodes=3
Same failure on other branches
roachtest.kv/gracefuldraining/nodes=3 failed with artifacts on master @ 7d0fd136a538b22cbf9bfff03b2885b7783711aa:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh kv/gracefuldraining/nodes=3
Same failure on other branches
roachtest.kv/gracefuldraining/nodes=3 failed with artifacts on master @ 0a48b0b74b0a6057f1d418875b97830359a52ec6:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh kv/gracefuldraining/nodes=3
Same failure on other branches
roachtest.kv/gracefuldraining/nodes=3 failed with artifacts on master @ 5a5b3dc446fcfc2d3e28b6775ae9bb1a63376210:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh kv/gracefuldraining/nodes=3
Same failure on other branches
Ran this on master and it is passing. I'm going to stress overnight and see if there are any failures to debug. If none pop up, I'll unskip this test, as it seems useful to have.
Fails 4/57 runs on master. The test itself seems valuable, so I'll see if there are any quick changes to make it less flaky.
Looking at one of the test failures.
EDIT: The time period on the graph was incorrect. I forgot that my Grafana is configured in local time, which is offset (-3hr) from the roachprod timestamps. The actual graph is very similar, but you can see from the workload runner below that we do hit an issue. The graphs in Prometheus are rated over 30s, which could explain why:
Enable `kv/gracefuldraining/nodes=3`.

The test was skipped in cockroachdb#67798 due to flakes. This commit updates the test slightly to prevent future flakes.

Before the changes in this commit, the test failed in about 7% of sampled test runs (/50). The failures were caused by the QPS metric dropping below the target threshold during drains/restarts. When QPS dropped, the QPS metric the test used (from the internal time series) did not match the metric scraped from Prometheus or the rate reported by the workload.

This commit updates the test to gather the QPS metric from Prometheus rather than from the internal time series. There were 0 failures over 20 runs.

Resolves: cockroachdb#59094

Release note: None
It seems that the workload rate did indeed drop for 5 seconds from the baseline of 1k ops/s in one of the test failures. So the metrics are correct, and the test could use the workload runner's instantaneous ops/s for a better signal. I think the value must have been smoothed out over 30s, but there definitely was an impact, which probably resulted from draining/restarting.
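To illustrate why a 30s rate can hide a 5-second drop, here is a minimal sketch in Go. It is a simplified model, not the test's code: it treats the 30s rate as a plain mean of per-second samples, and it assumes the rate fell all the way to 0 during the drop (the thread does not say how far it fell). `syntheticRun` and `lowestWindowMean` are hypothetical helper names.

```go
package main

import "fmt"

// syntheticRun returns one ops/s sample per second. Every sample is
// `baseline`, except for dropLen seconds starting at dropStart, which are
// `dropped`. The drop depth is an assumption made for illustration.
func syntheticRun(total, dropStart, dropLen int, baseline, dropped float64) []float64 {
	samples := make([]float64, total)
	for i := range samples {
		if i >= dropStart && i < dropStart+dropLen {
			samples[i] = dropped
		} else {
			samples[i] = baseline
		}
	}
	return samples
}

// lowestWindowMean returns the smallest mean over any run of `window`
// consecutive samples, computed with a sliding sum.
func lowestWindowMean(samples []float64, window int) float64 {
	if window <= 0 || window > len(samples) {
		panic("window must be in [1, len(samples)]")
	}
	sum := 0.0
	for _, s := range samples[:window] {
		sum += s
	}
	lowest := sum / float64(window)
	for i := window; i < len(samples); i++ {
		sum += samples[i] - samples[i-window]
		if avg := sum / float64(window); avg < lowest {
			lowest = avg
		}
	}
	return lowest
}

func main() {
	// 120 seconds at 1000 ops/s, with a 5-second drop to 0 starting at t=60.
	samples := syntheticRun(120, 60, 5, 1000, 0)
	fmt.Printf("lowest 1s sample:   %.1f ops/s\n", lowestWindowMean(samples, 1))  // 0.0
	fmt.Printf("lowest 30s average: %.1f ops/s\n", lowestWindowMean(samples, 30)) // 833.3
}
```

Under these assumptions the 30s view never falls below about 833 ops/s even though the workload stalled completely for 5 seconds, so whether a threshold check fires depends heavily on which signal it reads.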
This appears to be a legitimate issue, though I'm not certain of the cause. The cluster does show a very undesirable symptom, however: leases are thrashing between n1 and n2 due to stale/incorrect data and racing between the replicate queue and the store rebalancer. This could be due to the more frequent gossiping that happens when many lease transfers occur, which is what we do during a drain. If gossip frequently arrives at a bad time and overwrites the storepool estimates with the descriptors of target stores that recently received a lease, the lease load won't be included for 5 seconds, and this will cause thrashing. I believe I've seen this elsewhere, in the ycsb test, and lowered the frequency of gossip triggered by capacity changes as a result. In any case, the failure is legitimate, and I'm investigating some updates to the store rebalancer to prevent the thrashing, which could possibly be the culprit.
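The thrashing mechanism described above can be sketched with a toy model in Go. This is not CockroachDB's allocator or store rebalancer: the two-store setup, the starting loads, the `step` and `threshold` values, and the one-decision-per-tick loop are all assumptions chosen only to show how acting on a view that lags by 5 ticks leads to overshoot and reversal.

```go
package main

import "fmt"

// result counts how many transfers happened and how many of them reversed
// the direction of the previous transfer (a sign of thrashing).
type result struct{ transfers, reversals int }

// simulate runs a toy two-store rebalancer for `ticks` steps. Decisions are
// made from `view`, which is only refreshed from the true load every
// refreshEvery ticks. All constants are hypothetical.
func simulate(ticks, refreshEvery int) result {
	load := [2]float64{60, 40} // true load on n1 and n2
	view := load               // possibly stale copy used for decisions
	const step, threshold = 10.0, 5.0
	lastDir := 0
	var r result
	for t := 0; t < ticks; t++ {
		if t%refreshEvery == 0 {
			view = load
		}
		dir := 0 // +1 moves load n1 -> n2, -1 moves load n2 -> n1
		switch diff := view[0] - view[1]; {
		case diff > threshold:
			dir = 1
		case diff < -threshold:
			dir = -1
		}
		if dir == 0 {
			continue
		}
		load[0] -= float64(dir) * step
		load[1] += float64(dir) * step
		r.transfers++
		if lastDir != 0 && dir != lastDir {
			r.reversals++
		}
		lastDir = dir
	}
	return r
}

func main() {
	fmt.Printf("fresh view: %+v\n", simulate(20, 1)) // {transfers:1 reversals:0}
	fmt.Printf("stale view: %+v\n", simulate(20, 5)) // {transfers:20 reversals:3}
}
```

With a fresh view the model settles after a single transfer. With a view that is 5 ticks stale, it keeps moving load in the same direction until the next refresh, overshoots, and then sends the load back.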
Removing the GA blocker. The failure rate appears lower than when this test was skipped. We do want to re-enable the test, as it is useful. However, we need to fix the cause of the failures.
This commit re-enables the `kv/gracefuldraining/nodes=3` roachtest. The test is still likely to fail occasionally; however, it has already produced interesting findings during the testing done to re-enable it.

Informs: cockroachdb#59094

Release note: None
98720: roachtest: enable kv/gracefuldraining/nodes=3 r=andrewbaptist a=kvoli

This commit re-enables the `kv/gracefuldraining/nodes=3` roachtest. The test is still likely to fail occasionally; however, it has already produced interesting findings during the testing done to re-enable it.

Informs: #59094

Release note: None

101729: streamingccl: don't require TLS certificates r=dt a=stevendanna

Users may want to use password auth to simplify their replication setup. While we may recommend TLS certificate auth, I don't see a strong reason to _require_ it.

Epic: none

Release note: None

102825: kvserver,storepool: misc rebalance logging improvements r=andrewbaptist a=kvoli

The store list string returned the mean leases, ranges and queries-per-second float values without limiting the number of decimal places. This led to log lines with needlessly long decimals:

`avg-ranges=40.66666666666667... avg-leases=10.166666666666666...`

This PR updates the store list string formatting to 2 decimal places for float values.

Previously, the easiest method of determining the current rebalance objective from logs was to view the cluster setting and check for logging indicating a mixed version cluster. This was cumbersome.

This PR annotates the ctx in the store rebalancer loop with an additional tag: `obj`. The `obj` tag indicates the current rebalance objective, currently either `cpu` or `qps`.

resolves: #102812

Release note: None

Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
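As a rough illustration of the 2-decimal formatting change described in 102825, the sketch below uses Go's standard `%.2f` verb. The function name `formatStoreMeans` and the exact layout of the string are hypothetical; only the `avg-ranges` and `avg-leases` labels and the sample values come from the log line quoted above.

```go
package main

import "fmt"

// formatStoreMeans is a hypothetical stand-in for the store list string
// formatting. It limits the float means to 2 decimal places.
func formatStoreMeans(avgRanges, avgLeases float64) string {
	return fmt.Sprintf("avg-ranges=%.2f avg-leases=%.2f", avgRanges, avgLeases)
}

func main() {
	// Values taken from the long-decimal log line quoted in the PR text.
	fmt.Println(formatStoreMeans(40.66666666666667, 10.166666666666666))
	// prints: avg-ranges=40.67 avg-leases=10.17
}
```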
Going to consolidate test discussion on #103270, since any future failures will show up there due to the test naming format changing.
(roachtest).kv/gracefuldraining/nodes=3 failed on master@7b0ccdda99b81613e70f421c9374483c3feddff3:
Artifacts: /kv/gracefuldraining/nodes=3
See this test on roachdash
powered by pkg/cmd/internal/issues
Jira issue: CRDB-3333
Epic CRDB-18656