roachprod: DNS infra flakes #110884
Comments
herkolategan added labels on Sep 19, 2023:
- C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
- A-testing: Testing tools and infrastructure
- T-testeng: TestEng Team
cc @cockroachdb/test-eng
This was referenced Sep 19, 2023
herkolategan added a commit to herkolategan/cockroach that referenced this issue on Sep 19, 2023:

Previously we had failures in `net.LookupSRV` that appear to be network flakes. It would be preferable to not fail a whole `roachtest` on a single DNS network flake. This change wraps a retry around the DNS lookup in order to prevent flakiness and have a chance of recovery.

The `waitForRecordsAvailable` function already has a retry mechanism and thus sets the attempts to 1 when performing a lookup to confirm the records are available.

Fixes: cockroachdb#110884
Epic: None
Release Note: None
craig bot pushed a commit that referenced this issue on Sep 20, 2023:

110832: roachprod: retry dns lookup on network failures r=renatolabs a=herkolategan

Previously we had failures in `net.LookupSRV` that appear to be network flakes. It would be preferable to not fail a whole roachtest on a single DNS network flake. This change wraps a retry around the DNS lookup in order to prevent flakiness and have a chance of recovery.

Fixes: #110884
Epic: None
Release Note: None

110918: changefeedccl: deflake TestAlterChangefeedAddTargetsDuringBackfill r=miretskiy a=jayshrivastava

Note this commit is similar to 1295da9, but applies to a different test.

Previously, this test failed because of the maximum allowed checkpoint frequency being too low (every 10ms). This test fails consistently when the frequency is lower (e.g. every 500ms) and fails very, very rarely when the frequency is 10ms. To fix these rare flakes, this change sets the frequency to once every nanosecond, which is the highest possible frequency value, since setting 0 will disable checkpointing.

The reason the test fails with a large frequency is as follows: The test waits to observe a checkpoint during a schema change backfill. Because the changefeed is running normally before the schema change and backfill occur, it is regularly checkpointing the highwater. Thus, it's possible for the changefeed to checkpoint the highwater, then complete the entire backfill without checkpointing within 10ms. No checkpoints will be written during the backfill in that scenario because 10ms have not passed since the last checkpoint. In this scenario, the test fails to see a checkpoint written during the backfill and times out.

Fixes: #110796
Release note: None
Epic: None

Co-authored-by: Herko Lategan
Co-authored-by: Jayant Shrivastava
herkolategan added a commit to herkolategan/cockroach that referenced this issue on Oct 4, 2023:

Previously we had failures in `net.LookupSRV` that appear to be network flakes. It would be preferable to not fail a whole `roachtest` on a single DNS network flake. This change wraps a retry around the DNS lookup in order to prevent flakiness and have a chance of recovery.

Fixes: cockroachdb#110884
Epic: None
Release Note: None
Infra flake related issues have started to appear after DNS services were introduced to `roachprod`. Intermittent network problems cause the DNS lookup to fail, which appears in the logs as a "lookup" error.

These intermittent network problems should preferably not cause a `roachtest` to fail, and there should be at least a few retries in the event of network flakiness.

Jira issue: CRDB-31661