roachtest: c2c/BulkOps/full failed #114706
well this time we did replan:
I'll need to figure out if we then redistributed the catchup scans.
Ohhh, we never replanned because of this code snippet: cockroach/pkg/ccl/streamingccl/streamingest/stream_ingestion_frontier_processor.go (line 547 in b61cbb9)
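For context, here is a minimal, self-contained Go sketch of the pre-fix guard described in the commit message below; `maybeReplanForLag` and its parameters are illustrative names and do not mirror the actual code at line 547:

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeLagging stands in for the replanning error the frontier
// processor can return to force the flow to be re-planned.
var errNodeLagging = errors.New("node lagging; replan the flow")

// maybeReplanForLag sketches the pre-fix guard: a lagging node only
// produces a replanning error once the high-water mark (hwm) has
// advanced during the flow. While the destination is still running
// catchup scans the hwm has not moved, so no replan fires, which
// matches what was observed in this issue.
func maybeReplanForLag(laggingNodeDetected, hwmAdvancedDuringFlow bool) error {
	if !hwmAdvancedDuringFlow {
		// The hwm never advanced: skip replanning even if a node is far behind.
		return nil
	}
	if laggingNodeDetected {
		return errNodeLagging
	}
	return nil
}

func main() {
	// During catchup scans the hwm has not advanced, so no replan happens.
	fmt.Println(maybeReplanForLag(true, false)) // <nil>
	// After the hwm advances, the same lag check triggers a replan.
	fmt.Println(maybeReplanForLag(true, true)) // node lagging; replan the flow
}
```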
roachtest.c2c/BulkOps/full failed with artifacts on release-23.2 @ f47ba2eda81179ea16b868df55986614839d17e9:

Parameters:
streamingccl: only return lag replanning error if lagging node has not advanced

Previously, the frontier processor would return a lag replanning error if it detected a lagging node after the hwm had advanced during the flow. This meant the frontier processor could replan as soon as a lagging node finished its catchup scan and bumped the hwm but was still far behind the other nodes, as we observed in #114706. Ideally, the frontier processor should not throw this replanning error, because the lagging node is making progress and because replanning can cause repeated work. This patch prevents this scenario by teaching the frontier processor to only throw a replanning error if:
- the hwm has advanced during the flow, and
- two consecutive lagging node checks detected a lagging node and the hwm did not advance between those two checks.

Informs #114706

Release note: none
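A minimal Go sketch of the new condition from the commit message above; the `lagChecker` type and its fields are illustrative and do not correspond to the actual frontier processor code:

```go
package main

import (
	"errors"
	"fmt"
)

var errNodeLagging = errors.New("node lagging; replan the flow")

// lagChecker sketches the patched behaviour: a replanning error is only
// returned when (1) the hwm has advanced at some point during the flow
// and (2) two consecutive lag checks both saw a lagging node while the
// hwm did not advance between them.
type lagChecker struct {
	hwmAdvancedDuringFlow bool  // hwm moved at least once since the flow started
	lastHWM               int64 // hwm observed at the previous lag check
	sawLaggingLastCheck   bool  // previous check detected a lagging node
}

func newLagChecker(initialHWM int64) *lagChecker {
	return &lagChecker{lastHWM: initialHWM}
}

// check is invoked on each lag-check tick with the current hwm and
// whether a lagging node was detected on this tick.
func (c *lagChecker) check(currentHWM int64, laggingNodeDetected bool) error {
	hwmAdvanced := currentHWM > c.lastHWM
	if hwmAdvanced {
		c.hwmAdvancedDuringFlow = true
	}
	shouldReplan := c.hwmAdvancedDuringFlow && // (1) hwm advanced during the flow
		laggingNodeDetected && c.sawLaggingLastCheck && // lagging on two consecutive checks
		!hwmAdvanced // (2) and the hwm did not advance between those checks
	c.lastHWM = currentHWM
	c.sawLaggingLastCheck = laggingNodeDetected
	if shouldReplan {
		return errNodeLagging
	}
	return nil
}

func main() {
	c := newLagChecker(10)
	// Catchup scans: the hwm never advances, so a lagging node never
	// triggers a replan (the scenario from this issue).
	fmt.Println(c.check(10, true), c.check(10, true)) // <nil> <nil>
	// Once the hwm has advanced, two consecutive lagging checks without
	// further hwm movement return the replanning error.
	fmt.Println(c.check(20, true), c.check(20, true)) // <nil> node lagging; replan the flow
}
```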
114861: server: remove admin checks from server-v2 r=rafiss a=rafiss

These are now replaced by checks for the VIEWCLUSTERMETADATA privilege. The equivalent change was already made on the v1 server. The following endpoints are affected (these endpoints are not used in the frontend, so this change does not have any user-facing impact):
- /sessions/
- /nodes/
- /nodes/{node_id}/ranges/
- /ranges/{range_id:[0-9]+}/
- /events/

informs #114384
informs #79571
informs #109814

Release note: None

115000: streamingccl: only return lag replanning error if lagging node has not advanced r=stevendanna a=msbutler

Previously, the frontier processor would return a lag replanning error if it detected a lagging node after the hwm had advanced during the flow. This meant the frontier processor could replan as soon as a lagging node finished its catchup scan and bumped the hwm but was still far behind the other nodes, as we observed in #114706. Ideally, the frontier processor should not throw this replanning error, because the lagging node is making progress and because replanning can cause repeated work. This patch prevents this scenario by teaching the frontier processor to only throw a replanning error if:
- the hwm has advanced during the flow, and
- two consecutive lagging node checks detected a lagging node and the hwm did not advance between those two checks.

Informs #114706

Release note: none

115046: catalog/lease: fix flake for TestTableCreationPushesTxnsInRecentPast r=fqazi a=fqazi

Previously, TestTableCreationPushesTxnsInRecentPast could flake because we attempted to increase the odds of hitting the uncertainty interval error by adding a delay on KV RPC calls. This wasn't effective and could still allow intermittent failures, so instead we directly widen the uncertainty interval by setting a large MaxOffset on the clock. This causes the desired behaviour in a more deterministic way.

Fixes: #114366

Release note: None

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
…tting

This patch adds the default-off physical_replication.consumer.split_on_job_retry setting which, when enabled, issues admin splits over the topology after a job-level retry triggered by distSQL replanning. This setting may help c2c catch up more quickly after a replanning event, as it would prevent the destination side from issuing admin splits during catchup scans.

Informs cockroachdb#114706

Release note: none
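A minimal, self-contained Go sketch of the behaviour the setting gates, based only on the commit message above; `maybeSplitOnRetry`, `topologySpan`, and the plain bool standing in for the cluster setting are all illustrative names, not the actual stream ingestion code:

```go
package main

import "fmt"

// splitOnJobRetryEnabled stands in for the default-off
// physical_replication.consumer.split_on_job_retry cluster setting;
// a plain bool keeps the sketch self-contained.
var splitOnJobRetryEnabled = false

// topologySpan is a stand-in for one source-side span in the
// replication topology that the destination could pre-split on.
type topologySpan struct{ startKey, endKey string }

// maybeSplitOnRetry models the behaviour the setting enables: after a
// job-level retry triggered by distSQL replanning, issue admin splits
// across the topology up front instead of letting splits happen lazily
// during the catchup scans.
func maybeSplitOnRetry(topology []topologySpan, adminSplit func(key string) error) error {
	if !splitOnJobRetryEnabled {
		return nil // default-off: keep the existing behaviour
	}
	for _, sp := range topology {
		if err := adminSplit(sp.startKey); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	topology := []topologySpan{{"a", "m"}, {"m", "z"}}
	split := func(key string) error {
		fmt.Println("admin split at", key)
		return nil
	}
	// With the setting off (the default), nothing happens on retry.
	_ = maybeSplitOnRetry(topology, split)
	// With the setting on, the destination splits before catchup scans start.
	splitOnJobRetryEnabled = true
	_ = maybeSplitOnRetry(topology, split)
}
```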
fixed by #115101
argh, closed this on the wrong backport. we're waiting on #115103 to merge
roachtest.c2c/BulkOps/full failed with artifacts on release-23.2 @ 48d5dc3efefacae5cebd23ac81c46f4248eb39f0:

Parameters:
closing this in favor of #115415
roachtest.c2c/BulkOps/full failed with artifacts on release-23.2 @ c9f6f30496fe34af8d17410e5307f359458aaf52:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=false
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
This test on roachdash | Improve this report!
Jira issue: CRDB-33644