
roachtest: don't inhibit cluster reuse on DNS deletion errors #124678

Merged 1 commit into cockroachdb:master on May 24, 2024

Conversation

andreimatei
Contributor

Before this patch, a failure to delete a cluster's DNS records resulted in roachtest refusing to reuse that cluster for other tests. In general, refusing to reuse a cluster that has not been completely wiped is a sane policy (on the argument that the next test running on that cluster might be impacted by the cluster's dirty state), but DNS records in particular don't matter. So, let's be more tolerant of such errors.

For people outside of CRL, that DNS deletion seems to always fail (probably because no DNS record was created in the first place) -- so this patch helps me in particular.

Epic: None
Release note: None

@andreimatei andreimatei requested a review from a team as a code owner May 24, 2024 20:27
@andreimatei andreimatei requested review from DarrylWong and renatolabs and removed request for a team May 24, 2024 20:27

blathers-crl bot commented May 24, 2024

Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the O-community Originated from the community label May 24, 2024
@cockroach-teamcity
Member

This change is Reviewable

@andreimatei
Contributor Author

cc @RaduBerinde @srosenberg - a one-liner for your kind consideration

Member

@RaduBerinde RaduBerinde left a comment


bors r+

@craig craig bot merged commit 9fe2ea7 into cockroachdb:master May 24, 2024
20 of 22 checks passed
@srosenberg
Member

but DNS records in particular don't matter. So, let's be more tolerant of such errors.

They do now since DNS is also used as a discovery service for multi-tenant roachprod clusters. @herkolategan I seem to recall we had some issue(s) around cluster reuse and stale DNS service record(s)?

@renatolabs
Contributor

Yes, I believe reusing the cluster with stale records may cause the next test's call to c.Start to fail (multi-tenant or not). I think the better fix would have been to skip deleting DNS records if none were created.
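That idea can be sketched as follows. This is a simplified illustration with hypothetical types and method names, not the actual roachprod API: the cluster records whether DNS records were ever created, and wipe skips `DestroyDNS` otherwise.

```go
package main

import "fmt"

// Cluster is a hypothetical stand-in for a roachprod cluster that tracks
// whether DNS records were ever created for it.
type Cluster struct {
	name           string
	dnsRecordsMade bool
}

// RegisterServices stands in for the real service-registration hook;
// it notes that DNS records now exist.
func (c *Cluster) RegisterServices() {
	c.dnsRecordsMade = true
}

// Wipe only attempts DNS destruction when records were actually created,
// avoiding the deletion error that otherwise marks the cluster as
// non-reusable.
func (c *Cluster) Wipe() error {
	if !c.dnsRecordsMade {
		fmt.Printf("%s: no DNS records created; skipping DestroyDNS\n", c.name)
		return nil
	}
	return c.destroyDNS()
}

func (c *Cluster) destroyDNS() error {
	fmt.Printf("%s: deleting DNS records\n", c.name)
	c.dnsRecordsMade = false
	return nil
}

func main() {
	c := &Cluster{name: "test-cluster"}
	if err := c.Wipe(); err != nil { // no records yet: deletion skipped
		panic(err)
	}
	c.RegisterServices()
	if err := c.Wipe(); err != nil { // records exist: deletion attempted
		panic(err)
	}
}
```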

@andreimatei andreimatei deleted the roachtest-tolerate-dns branch May 28, 2024 17:41
@andreimatei
Contributor Author

Yes, I believe reusing the cluster with stale records may cause the next test's call to c.Start to fail (multi-tenant or not)

I have not observed that, fwiw.
But in any case, I'll send a revert; I didn't understand the implications.

@RaduBerinde
Member

Sorry for jumping the gun on merging.

@srosenberg
Member

I have not observed that, fwiw. But in any case, I'll send a revert; I didn't understand the implications.

No worries. If it's a real issue for you, I can send a PR to do what Renato suggested. We already have a hook in RegisterServices where we can store the fact that records were created, and use it to conditionally opt out of DestroyDNS on wipe. Do you happen to have a stack trace? It's not immediately clear which of the gcloud CLI commands (used by DestroyDNS) would fail if no records were created.

Sorry for jumping the gun on merging.

No problem. Unfortunately, even seemingly trivial changes to roachtest/roachprod may have subtle side-effects. That's one of the reasons we usually kick off a run in CI with SELECT_PROBABILITY for smoke testing. Except in this case, it wouldn't have helped, since the issue is non-deterministic. Quoting from an internal Slack thread, @DarrylWong observed this rather subtle case of cluster reuse "interference" as a result of not invoking DestroyDNS:

I'm running into an issue with removing default port assumptions when two tests run one after the other on the same reused cluster. The first test (i.e. pgx) will randomly find a port (29000) and run fine. The second test (i.e. pop) relies on 26257, so it is explicitly specified. Run alone, pop passes, but when run after pgx, it fails.
After some testing, it seems that what's happening is that in c.Start() we don't immediately create a new service at the specified port. Instead, we first call DiscoverService to see if that node already has a service. If it does, we forgo creating the service. In the above example, this means pop's call to c.Start() never creates a service on port 26257 because it sees we already have one on port 29000.
Long story short, my question here is whether always using the found port, regardless of what startOpts specifies, is intended behavior. If so, is the only way around this to disable cluster reuse when a port is specified? If not, I'm thinking we have to check whether the discovered port is the same as the one specified and, if it isn't, delete the records?
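The interference described in that quote can be reproduced with a toy model. This is a hypothetical sketch, not the real roachprod discovery code: a stale service record from a previous test silently wins over the port the next test explicitly requested.

```go
package main

import "fmt"

// Service is a simplified stand-in for a roachprod SRV-record entry.
type Service struct {
	Name string
	Port int
}

// registry maps node -> discovered service, mimicking stale SRV records
// left behind on a reused cluster whose DNS records were not destroyed.
var registry = map[string]*Service{}

// startService mimics the c.Start behavior described above: if a service
// already exists for the node, the discovered port wins and the requested
// port is silently ignored.
func startService(node string, wantPort int) int {
	if svc, ok := registry[node]; ok {
		return svc.Port // stale record wins; wantPort is dropped
	}
	registry[node] = &Service{Name: "system-sql", Port: wantPort}
	return wantPort
}

func main() {
	// First test ("pgx") picks a random port and runs fine.
	fmt.Println("pgx got port", startService("n1", 29000)) // 29000
	// Second test ("pop") asks for 26257 but inherits the stale 29000.
	fmt.Println("pop got port", startService("n1", 26257)) // 29000, not 26257
}
```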

andreimatei added a commit to andreimatei/cockroach that referenced this pull request May 28, 2024
…tion errors"

This reverts commit 596a3a2 (PR cockroachdb#124678).
That commit made roachtest tolerate DNS record deletion errors, so that
clusters can be reused even though their DNS records failed to be
cleaned up. This seems to have been a bad idea, though, since stale DNS
records can be a problem for reused clusters; see
cockroachdb#124678 (comment)

I now understand that there are two types of DNS records - normal ones
(`A` records?) and `SRV` records. The former are associated with
roachprod VMs. The latter are associated with cockroach nodes from host
or virtual clusters, and are used for some sort of service discovery. It
is the destruction of these SRV records that the original patch dealt
with. Failure to delete these records might have consequences for the
future tests using the cluster.

Epic: none
Release note: None
@andreimatei
Contributor Author

Reverting in #124768

If it's a real issue for you, I can send a PR to do what Renato suggested. We already have a hook in RegisterServices where we can store the fact, and use it to conditionally opt out from DestroyDNS on wipe. Do you happen to have a stack trace? It's not immediately clear which of the gcloud CLI commands (used by DestroyDNS) would fail if there were no records created.

Let me look into it more, and understand exactly what wasn't working for me.

@andreimatei
Contributor Author

andreimatei commented May 28, 2024

I've figured out my problem -- in #120340 there was confusion between the public DNS zone used for the A records, and the private DNS zone used for the SRV records. The DNS record creation (in particular, the SRV records creation) worked fine, but then the deletion tried to delete them from the public zone, instead of the private one, and failed.
This manifested in the smoke test that you ran as failures to reuse clusters. So the smoke test succeeded, I believe, but took longer than it should have. In the _runner_logs files, you can see errors like:

[w7] 21:49:51 test_runner.go:641: Unable to reuse cluster: teamcity-15383971-1716500320-21-n1cpu4 due to: TRANSIENT_ERROR(dns_problem): output: ERROR: (gcloud.dns.record-sets.delete) HTTPError 403: Forbidden: exit status 1. Will attempt to create a fresh one

I'll push a fixed version of #120340, and this time it'll be good™.


We already have a hook in RegisterServices where we can store the fact, and use it to conditionally opt out from DestroyDNS on wipe. Do you happen to have a stack trace? It's not immediately clear which of the gcloud CLI commands (used by DestroyDNS) would fail if there were no records created.

I won't touch this any more myself :), but I think that there is a problem here that should probably be addressed. The SRV records are conditionally created (here), but unconditionally deleted (here). The deletion fails when the record doesn't exist ([1]). And when deletion fails, clusters are not reused.

Specifically, the conditional creation is inhibited when the test does not run on the default "provider" (i.e. GCE) or on the default GCE project.

[1]: The error I was getting locally was:

Unable to reuse cluster: andrei-1716924081-01-n4cpu4 due to: TRANSIENT_ERROR(dns_problem): output: ERROR: (gcloud.dns.record-sets.delete) HTTPError 404: The 'parameters.name' resource named '_system-sql._tcp.andrei-1716924081-01-n4cpu4.roachprod-managed.dataexmachina.dev.' does not exist.: exit status 1. Will attempt to create a fresh one

Notice the "does not exist". I'm not entirely sure why your smoke-test run showed a different error (Forbidden); maybe it has to do with how users/IAM are set up differently for the CRL GCP project.

craig bot pushed a commit that referenced this pull request May 28, 2024
124768: roachtest: Revert "roachtest: don't inhibit cluster reuse on DNS deletion errors" r=srosenberg a=andreimatei


Co-authored-by: Andrei Matei <[email protected]>
@srosenberg
Member

I've figured out my problem -- in #120340 there was confusion between the public DNS zone used for the A records, and the private DNS zone used for the SRV records.

Nice! Both zones are actually public but mutually exclusive. I am guessing Forbidden in our case was because of the wrong zone name.

I won't touch this any more myself :), but I think that there is a problem here that should probably be addressed. The SRV records are conditionally created (here), but unconditionally deleted (here). The deletion fails when the record doesn't exist ([1]). And when deletion fails, clusters are not reused.

I'll follow up with another PR. Let's first merge #120340, after the smoke test clears it.

@herkolategan
Collaborator

If it does, then we forgo creating the service. In the above example, this means pop's call to c.Start() never creates a service on port 26257 because it sees we already have one on port 29000.

@srosenberg Getting caught up, but you are correct: this is the main reason we destroy the records. In theory, we should throw an error if we try to register the same service with a different port, or, as @DarrylWong mentioned, possibly re-register the service under the new port.
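That register-time check could be sketched like this (hypothetical types and names; the real roachprod service registry differs): refuse to silently keep a stale registration when the same service is re-registered under a different port.

```go
package main

import "fmt"

// services maps "node/service-name" to the registered port, standing in
// for the per-cluster SRV records roachprod keeps.
var services = map[string]int{}

// register refuses to silently reuse a stale registration: re-registering
// the same service with a different port is an error, while re-registering
// with the same port is idempotent.
func register(node, name string, port int) error {
	key := node + "/" + name
	if existing, ok := services[key]; ok {
		if existing != port {
			return fmt.Errorf("service %s already registered on port %d, requested %d",
				key, existing, port)
		}
		return nil // same port: nothing to do
	}
	services[key] = port
	return nil
}

func main() {
	_ = register("n1", "system-sql", 29000) // first test's random port
	// The second test's explicit port now surfaces a loud error instead of
	// being silently ignored.
	if err := register("n1", "system-sql", 26257); err != nil {
		fmt.Println("caught mismatch:", err)
	}
}
```

The alternative Darryl mentions (re-registering under the new port) would replace the error branch with an update of the map entry plus a DNS record rewrite.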

I won't touch this any more myself :), but I think that there is a problem here that should probably be addressed.

@andreimatei I agree this is a bug that should be addressed.
