roachtest: don't inhibit cluster reuse on DNS deletion errors #124678
Before this patch, a failure to delete a cluster's DNS records resulted in roachtest refusing to reuse that cluster for other tests. In general, refusing to reuse a cluster that has not been completely wiped is a sane policy (on the argument that the next test running on that cluster might be impacted by the cluster's dirty state), but DNS records in particular don't matter. So, let's be more tolerant of such errors.
For people outside of CRL, that DNS deletion seems to always fail (probably because no DNS record was created in the first place), so this patch helps me in particular.
Epic: None
Release note: None
Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
cc @RaduBerinde @srosenberg - a one liner for your kind consideration
bors r+
They do now, since DNS is also used as a discovery service for multi-tenant roachprod clusters. @herkolategan I seem to recall we had some issue(s) around cluster reuse and stale DNS service record(s)?
Yes, I believe reusing the cluster with stale records may cause the next test's call to |
I have not observed that, fwiw.
Sorry for jumping the gun on merging.
No worries. If it's a real issue for you, I can send a PR to do what Renato suggested. We already have a hook in
No problem. Unfortunately, even seemingly trivial changes to roachtest/roachprod may have subtle side-effects. That's one of the reasons we usually kick off a run in CI with
…tion errors" This reverts commit 596a3a2 (PR cockroachdb#124678). That commit made roachtest tolerate DNS record deletion errors, so that clusters can be reused even though their DNS records failed to be cleaned up. This seems to have been a bad idea, though, since stale DNS records can be a problem for reused clusters; see cockroachdb#124678 (comment) I now understand that there are two types of DNS records - normal ones (`A` records?) and `SRV` records. The former are associated with roachprod VMs. The latter are associated with cockroach nodes from host or virtual clusters, and are used for some sort of service discovery. It is the destruction of these SRV records that the original patch dealt with. Failure to delete these records might have consequences for the future tests using the cluster. Epic: none Release note: None
Reverting in #124768
Let me look into it more, and understand exactly what wasn't working for me.
I've figured out my problem -- in #120340 there was confusion between the public DNS zone used for the A records, and the private DNS zone used for the SRV records. The DNS record creation (in particular, the SRV records creation) worked fine, but then the deletion tried to delete them from the public zone, instead of the private one, and failed.
I'll push a fixed version of #120340, and this time it'll be good™.
I won't touch this any more myself :), but I think that there is a problem here that should probably be addressed. The SRV records are conditionally created (here), but unconditionally deleted (here). The deletion fails when the record doesn't exist ([1]). And when deletion fails, clusters are not reused.
Specifically, the creation is skipped when the test does not run on the default "provider" (i.e. GCE) or on the default GCE project.
[1]: The error I was getting locally was:
Notice the |
124768: roachtest: Revert "roachtest: don't inhibit cluster reuse on DNS deletion errors" r=srosenberg a=andreimatei This reverts commit 596a3a2 (PR #124678). That commit made roachtest tolerate DNS record deletion errors, so that clusters can be reused even though their DNS records failed to be cleaned up. This seems to have been a bad idea, though, since stale DNS records can be a problem for reused clusters; see #124678 (comment) I now understand that there are two types of DNS records - normal ones (`A` records?) and `SRV` records. The former are associated with roachprod VMs. The latter are associated with cockroach nodes from host or virtual clusters, and are used for some sort of service discovery. It is the destruction of these SRV records that the original patch dealt with. Failure to delete these records might have consequences for the future tests using the cluster. Epic: none Release note: None Co-authored-by: Andrei Matei <[email protected]>
Nice! Both zones are actually public but mutually exclusive. I am guessing
I'll follow up with another PR. Let's first merge 102340, after the smoke test clears it.
@srosenberg Getting caught up, but you are correct this is the main reason we destroy the records. In theory we should be throwing an error if we try to register the same service with a different port, or as @DarrylWong mentioned possibly re-register the service under the new port.
@andreimatei I agree this is a bug that should be addressed.