roachprod: provide workaround for long-running AWS clusters #98682
In #98076, we started validating hostnames before running any commands to avoid situations where a stale cache could lead to unintended interference with other clusters due to public IP reuse. The check relies on the VM's `hostname` matching the expected cluster name in the cache. GCP and Azure clusters set the hostname to the instance name by default, but that is not the case for AWS; the aforementioned PR explicitly sets the hostname when the instance is created.

However, in the case of long-running AWS clusters (created before host validation was introduced) or clusters that are created with an outdated version of `roachprod`, the hostname will still be the default AWS hostname, and any interaction with that cluster will fail if using a recent `roachprod` version. To remedy this situation, this commit includes:

- better error reporting. When we attempt to run a command on an AWS cluster and host validation fails, we display a message to the user explaining that their hostnames may need fixing.
- if the user confirms that the cluster still exists (by running `roachprod list`), they are able to automatically fix the hostnames to the expected value by running a new `fix-long-running-aws-hostnames` command. This is a temporary workaround that should be removed once we no longer have clusters that would be affected by this issue.

This commit will be reverted once we no longer have clusters created with the default hostnames; this will be easier to achieve once we have an easy way for everyone to upgrade their `roachprod` (see #97311).

Epic: none
Release note: None
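
For illustration only, the hostname check and the AWS-specific hint described above could look roughly like the sketch below. It is not the code from this PR: `expectedHostname`, `validateHostname`, and the `<cluster>-<NNNN>` naming scheme are assumptions made for the example, and the only AWS detail relied on is that default hostnames are derived from the private IPv4 address (for example `ip-10-0-1-23`).

```go
package main

import (
	"fmt"
	"regexp"
)

// awsDefaultHostname matches the default hostname AWS assigns to an
// instance, which is derived from its private IPv4 address
// (for example "ip-10-0-1-23").
var awsDefaultHostname = regexp.MustCompile(`^ip-\d{1,3}-\d{1,3}-\d{1,3}-\d{1,3}$`)

// expectedHostname is a hypothetical helper: it assumes VMs are named
// "<cluster>-<4-digit node index>". The real naming logic may differ.
func expectedHostname(cluster string, node int) string {
	return fmt.Sprintf("%s-%04d", cluster, node)
}

// validateHostname is a hypothetical sketch of the validation step. It
// returns nil when the hostname reported by the VM matches the cached
// expectation, and otherwise an error whose message depends on whether
// the hostname looks like an AWS default.
func validateHostname(actual, cluster string, node int) error {
	want := expectedHostname(cluster, node)
	if actual == want {
		return nil
	}
	if awsDefaultHostname.MatchString(actual) {
		// Likely a cluster created before host validation existed, or
		// created by an outdated roachprod: point at the workaround.
		return fmt.Errorf(
			"hostname %q looks like an AWS default (expected %q); "+
				"if `roachprod list` still shows the cluster, "+
				"consider running fix-long-running-aws-hostnames",
			actual, want)
	}
	// Any other mismatch suggests a stale cache and possible IP reuse.
	return fmt.Errorf("hostname mismatch: got %q, expected %q; the cluster cache may be stale", actual, want)
}
```

In this sketch, a generic mismatch and an AWS-default mismatch produce different errors, so only users who are plausibly affected by the long-running cluster issue are pointed at the temporary command.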