roachprod: DNS infra flakes #110884

Closed
herkolategan opened this issue Sep 19, 2023 · 1 comment · Fixed by #110832
Assignees
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-testeng TestEng Team

Comments

@herkolategan
Collaborator

herkolategan commented Sep 19, 2023

Infra-flake-related issues have started to appear since DNS services were introduced to roachprod.
Intermittent network problems cause the DNS lookup to fail, which shows up in the logs as a "lookup" error.

These intermittent network problems should preferably not cause a roachtest to fail; at a minimum, the lookup should be retried a few times when the network is flaky.

Jira issue: CRDB-31661

@herkolategan herkolategan added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure T-testeng TestEng Team labels Sep 19, 2023
@herkolategan herkolategan self-assigned this Sep 19, 2023
@blathers-crl

blathers-crl bot commented Sep 19, 2023

cc @cockroachdb/test-eng

herkolategan added a commit to herkolategan/cockroach that referenced this issue Sep 19, 2023
Previously, we had failures in `net.LookupSRV` that appear to be network flakes. It
would be preferable not to fail a whole `roachtest` on a single DNS network flake.
This change wraps a retry around the DNS lookup to reduce flakiness and give the
lookup a chance to recover.

`waitForRecordsAvailable` already has its own retry mechanism, so it sets the number
of attempts to 1 when performing a lookup to confirm that the records are available.

Fixes: cockroachdb#110884

Epic: None
Release Note: None
craig bot pushed a commit that referenced this issue Sep 20, 2023
110832: roachprod: retry dns lookup on network failures r=renatolabs a=herkolategan

Previously, we had failures in `net.LookupSRV` that appear to be network flakes. It would be preferable not to fail a whole `roachtest` on a single DNS network flake. This change wraps a retry around the DNS lookup to reduce flakiness and give the lookup a chance to recover.

Fixes: #110884

Epic: None
Release Note: None

110918: changefeedccl: deflake TestAlterChangefeedAddTargetsDuringBackfill r=miretskiy a=jayshrivastava

Note this commit is similar to 1295da9, but applies to a different test.

Previously, this test failed because the maximum allowed checkpoint frequency was too low (every 10ms). The test fails consistently when the frequency is lower (e.g., every 500ms) and fails very, very rarely when the frequency is 10ms. To fix these rare flakes, this change sets the frequency to once every nanosecond, which is the highest possible frequency value, since setting 0 disables checkpointing.

The reason the test fails with a low checkpoint frequency (a long interval between checkpoints) is as follows: The test waits to observe a checkpoint during a schema change backfill. Because the changefeed is running normally before the schema change and backfill occur, it is regularly checkpointing the highwater. Thus, it's possible for the changefeed to checkpoint the highwater, then complete the entire backfill without checkpointing within 10ms. No checkpoints will be written during the backfill in that scenario because 10ms have not passed since the last checkpoint. In this scenario, the test fails to see a checkpoint written during the backfill and times out.

Fixes: #110796
Release note: None
Epic: None


Co-authored-by: Herko Lategan <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
@craig craig bot closed this as completed in b727e63 Sep 20, 2023
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 4, 2023
Previously, we had failures in `net.LookupSRV` that appear to be network flakes. It
would be preferable not to fail a whole `roachtest` on a single DNS network flake.
This change wraps a retry around the DNS lookup to reduce flakiness and give the
lookup a chance to recover.

Fixes: cockroachdb#110884

Epic: None
Release Note: None