Skip to content

Commit

Permalink
Browse files Browse the repository at this point in the history
110832: roachprod: retry dns lookup on network failures r=renatolabs a=herkolategan

Previously we had failures in net.LookupSRV that appear to be network flakes. It would be preferable to not fail a whole roachtest on a single DNS network flake. This change wraps a retry around the dns lookup in order to prevent flakiness and have a chance of recovery.

Fixes: #110884

Epic: None
Release Note: None

110918: changefeedccl: deflake TestAlterChangefeedAddTargetsDuringBackfill r=miretskiy a=jayshrivastava

Note this commit is similar to 1295da9, but applies to a different test.

Previously, this test failed because of the maximum allowed checkpoint frequency being too low (every 10ms). This test fails consistently when the frequency lower (ex. every 500ms) and fails very, very rarely when the frequency is 10ms. To fix these rare flakes, this change sets the frequency to once every nanosecond, which is the higest possible frequency value since setting 0 will disable checkpointing.

The reason the test fails with a large frequency is as follows: The test waits to observe a checkpoint during a schema change backfill. Because the changefeed is running normally before the schema change and backfill occurs, it is regularly checkpointing the highwater. Thus, it's possible for the changefeed to checkpoint the highwater, then complete the entire backfill without checkpointing within 10ms. No checkpoints will be written during the backfill in that scenario because 10ms have not passed since the last checkpoint. In this scenario, the test fails to see a checkpoint written during the backfill and times out.

Fixes: #110796
Release note: None
Epic: None

 Please enter a valid issue or epic reference:

Co-authored-by: Herko Lategan <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
  • Loading branch information
3 people committed Sep 20, 2023
3 parents e35052b + b727e63 + 34752b0 commit 39dc158
Show file tree
Hide file tree
Showing 2 changed files with 32 additions and 21 deletions.
6 changes: 4 additions & 2 deletions pkg/ccl/changefeedccl/alter_changefeed_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -1170,9 +1170,11 @@ func TestAlterChangefeedAddTargetsDuringSchemaChangeError(t *testing.T) {
require.NoError(t, jobFeed.Pause())

var maxCheckpointSize int64 = 100 << 20
// Checkpoint progress frequently, and set the checkpoint size limit.
// Ensure that checkpoints happen every time by setting a large checkpoint size.
// Because setting 0 for the FrontierCheckpointFrequency disables checkpointing,
// setting 1 nanosecond is the smallest possible value.
changefeedbase.FrontierCheckpointFrequency.Override(
context.Background(), &s.Server.ClusterSettings().SV, 10*time.Millisecond)
context.Background(), &s.Server.ClusterSettings().SV, 1*time.Nanosecond)
changefeedbase.FrontierCheckpointMaxBytes.Override(
context.Background(), &s.Server.ClusterSettings().SV, maxCheckpointSize)

Expand Down
47 changes: 28 additions & 19 deletions pkg/roachprod/vm/gce/dns.go
Original file line number Diff line number Diff line change
Expand Up @@ -151,22 +151,33 @@ func (n dnsProvider) Domain() string {
return dnsDomain
}

// lookupSRVRecords uses standard net tools to perform a DNS lookup. For
// lookups, we prefer this to using the gcloud command as it is faster, and
// preferable when service information is being queried regularly.
// lookupSRVRecords uses standard net tools to perform a DNS lookup. This
// function will retry the lookup several times if there are any intermittent
// network problems. For lookups, we prefer this to using the gcloud command as
// it is faster, and preferable when service information is being queried
// regularly.
func (n dnsProvider) lookupSRVRecords(
ctx context.Context, service, proto, name string,
) ([]vm.DNSRecord, error) {
cName, srvRecords, err := n.resolver.LookupSRV(ctx, service, proto, name)
if dnsError := (*net.DNSError)(nil); errors.As(err, &dnsError) {
// We ignore some errors here as they are likely due to the record name not
// existing. The net.LookupSRV function tends to return "server misbehaving"
// and "no such host" errors when no record entries are found. Hence, making
// the errors ambiguous and not useful. The errors are not exported, so we
// have to check the error message.
if dnsError.Err != "server misbehaving" && dnsError.Err != "no such host" && !dnsError.IsNotFound {
return nil, err
var err error
var cName string
var srvRecords []*net.SRV
err = retry.WithMaxAttempts(ctx, retry.Options{}, 10, func() error {
cName, srvRecords, err = n.resolver.LookupSRV(ctx, service, proto, name)
if dnsError := (*net.DNSError)(nil); errors.As(err, &dnsError) {
// We ignore some errors here as they are likely due to the record name not
// existing. The net.LookupSRV function tends to return "server misbehaving"
// and "no such host" errors when no record entries are found. Hence, making
// the errors ambiguous and not useful. The errors are not exported, so we
// have to check the error message.
if dnsError.Err != "server misbehaving" && dnsError.Err != "no such host" && !dnsError.IsNotFound {
return dnsError
}
}
return nil
})
if err != nil {
return nil, err
}
records := make([]vm.DNSRecord, len(srvRecords))
for i, srvRecord := range srvRecords {
Expand Down Expand Up @@ -237,10 +248,7 @@ func (n dnsProvider) waitForRecordsAvailable(ctx context.Context, records ...vm.
}] = struct{}{}
}

return retry.WithMaxAttempts(ctx, retry.Options{
InitialBackoff: 5 * time.Second,
Multiplier: 1,
}, 20, func() error {
for attempts := 0; attempts < 30; attempts++ {
for key := range notAvailable {
foundRecords, err := n.lookupSRVRecords(ctx, "", "", key.name)
if err != nil {
Expand All @@ -256,7 +264,8 @@ func (n dnsProvider) waitForRecordsAvailable(ctx context.Context, records ...vm.
if len(notAvailable) == 0 {
return nil
}
return errors.Newf("waiting for DNS records to become available: %d out of %d records not available",
len(notAvailable), len(records))
})
time.Sleep(2 * time.Second)
}
return errors.Newf("waiting for DNS records to become available: %d out of %d records not available",
len(notAvailable), len(records))
}

0 comments on commit 39dc158

Please sign in to comment.