roachtest: c2c/BulkOps/full failed #114341
Cutover failed on a timeout because replication never caught up during the test. I wonder why the auto replanner didn't help.
For some reason, node lag replanning was never executed:
If it had executed, we would have seen this log line. So, either all nodes were equally far behind (i.e., no node was more than 10 minutes behind every other node), or there's a bug in our replan policy.
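For readers unfamiliar with the policy: roughly, the stream ingestion job should replan when one node's frontier trails the others by more than a threshold (10 minutes in this test). The sketch below illustrates that predicate only; the function and variable names (`shouldReplanForLag`, `nodeFrontiers`, `maxLagAllowed`) are hypothetical and not the identifiers used in the actual CockroachDB code.

```go
// Minimal, hypothetical sketch of the lag-based replan predicate discussed above.
package main

import (
	"fmt"
	"time"
)

// shouldReplanForLag returns true if the slowest node's frontier trails the
// fastest node's frontier by more than maxLagAllowed. If all nodes are equally
// far behind, no single node stands out as lagging and we do not replan.
func shouldReplanForLag(nodeFrontiers map[int]time.Time, maxLagAllowed time.Duration) bool {
	var minFrontier, maxFrontier time.Time
	first := true
	for _, f := range nodeFrontiers {
		if first {
			minFrontier, maxFrontier, first = f, f, false
			continue
		}
		if f.Before(minFrontier) {
			minFrontier = f
		}
		if f.After(maxFrontier) {
			maxFrontier = f
		}
	}
	return !first && maxFrontier.Sub(minFrontier) > maxLagAllowed
}

func main() {
	now := time.Now()
	frontiers := map[int]time.Time{
		1: now.Add(-2 * time.Minute),
		2: now.Add(-25 * time.Minute), // node 2 stuck on catchup scans
		3: now.Add(-1 * time.Minute),
	}
	fmt.Println(shouldReplanForLag(frontiers, 10*time.Minute)) // true: node 2 trails by >10m
}
```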
@adityamaru One node (node 2) is lagging behind on catchup scans here, which suggests that its high-water mark should be far behind all other nodes. To confirm, it would be great to see replication_frontier.txt, but it's not in the debug zip or artifacts. If I enabled periodic collection of the replication frontier in all c2c roachtests, would the debug zip machinery collect it?
roachtest.c2c/BulkOps/full failed with artifacts on master @ 7f96bac8ce44a128b1f98f5462369a9910c802e5:
Parameters:
I was able to repro the failure. For some reason, the replication-frontier.txt file is empty...
Was the cluster setting to dump the frontier set to something other than 0?
oh wait
OK, the replication frontier looks fine. We should replan on node 2's lag, but we don't:
Aha, I think I figured out the problem after rerunning the test with some extra logging: when we consider checking for lagging nodes, we always update lastNodeLagCheck, even when we don't actually run the check.
This patch prevents the lastNodeLagCheck time from updating every time the frontier processor receives a checkpoint, which can happen every few seconds. Previously, this prevented the node lag replanning check from triggering, because that timestamp needed to be more than 10 minutes old. Instead, the timestamp should only update when we actually run the lag check. Fixes cockroachdb#114341 Release note: none
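To make the starvation concrete, here is a minimal sketch of the bug and the fix as described in the patch; the identifiers (`lagCheckState`, `onCheckpointBuggy`, `onCheckpointFixed`, `freq`) are illustrative, not the actual frontier processor code.

```go
// Hypothetical sketch of the node lag check starvation described in the patch.
package main

import (
	"fmt"
	"time"
)

type lagCheckState struct {
	lastNodeLagCheck time.Time
}

// Buggy behavior: lastNodeLagCheck is refreshed on every checkpoint (which can
// arrive every few seconds), so the elapsed time since the "last check" never
// exceeds the 10-minute frequency and the lag check is starved after its first run.
func (s *lagCheckState) onCheckpointBuggy(now time.Time, freq time.Duration, check func()) {
	if now.Sub(s.lastNodeLagCheck) > freq {
		check()
	}
	s.lastNodeLagCheck = now // updated unconditionally: the bug
}

// Fixed behavior: the timestamp only advances when the lag check actually runs.
func (s *lagCheckState) onCheckpointFixed(now time.Time, freq time.Duration, check func()) {
	if now.Sub(s.lastNodeLagCheck) > freq {
		check()
		s.lastNodeLagCheck = now
	}
}

func main() {
	var buggy, fixed lagCheckState
	start := time.Now()
	buggyRuns, fixedRuns := 0, 0
	// Simulate a checkpoint arriving every 5 seconds for one hour.
	for t := start; t.Before(start.Add(time.Hour)); t = t.Add(5 * time.Second) {
		buggy.onCheckpointBuggy(t, 10*time.Minute, func() { buggyRuns++ })
		fixed.onCheckpointFixed(t, 10*time.Minute, func() { fixedRuns++ })
	}
	// The buggy version runs the check once and then never again; the fixed
	// version runs roughly every 10 minutes.
	fmt.Printf("buggy: %d lag checks, fixed: %d lag checks\n", buggyRuns, fixedRuns)
}
```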
roachtest.c2c/BulkOps/full failed with artifacts on master @ e19c24fb62d24595e74c0bae0aaad0a736c2bdc7:
Parameters:
roachtest.c2c/BulkOps/full failed with artifacts on master @ aa812a63b8023d35b7bd62d12c6c47708f48a817:
Parameters:
114525: streamingccl: prevent node lag replanning starvation r=stevendanna a=msbutler This patch prevents the lastNodeLagCheck time from updating every time the frontier processor receives a checkpoint, which can happen every few seconds. Previously, this prevented the node lag replanning check from triggering, because that timestamp needed to be more than 10 minutes old. Instead, the timestamp should only update when we actually run the lag check. Fixes #114341 Release note: none Co-authored-by: Michael Butler <[email protected]>
roachtest.c2c/BulkOps/full failed with artifacts on master @ 063fa2b8930019a16adc0bca5b8363d8e80f3132:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=false
ROACHTEST_metamorphicBuild=true
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
This test on roachdash | Improve this report!
Jira issue: CRDB-33470