roachprod: provide workaround for long-running AWS clusters #98682

Merged


renatolabs
Contributor

In #98076, we started validating hostnames before running any commands to avoid situations where a stale cache could lead to unintended interference with other clusters due to public IP reuse. The check relies on the VM's hostname matching the expected cluster name in the cache. GCP and Azure clusters set the hostname to the instance name by default, but that is not the case for AWS; the aforementioned PR explicitly sets the hostname when the instance is created.
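The check described above can be pictured with a minimal Go sketch. This is not roachprod's actual code: `validateHostname`, the node name `mycluster-0001`, and the AWS-style default hostname `ip-10-0-0-12` are hypothetical, and the sketch assumes an exact string comparison, which the description does not specify.

```go
package main

import (
	"fmt"
	"strings"
)

// validateHostname is a hypothetical helper, not roachprod's implementation.
// It compares the hostname a VM reports with the name the local cache
// expects, and returns an error when they differ (for example, when a
// stale cache entry points at an IP that now belongs to another cluster).
func validateHostname(expected, reported string) error {
	// Output of a remote `hostname` call usually ends in a newline.
	got := strings.TrimSpace(reported)
	if got != expected {
		return fmt.Errorf("host validation failed: expected hostname %q, got %q", expected, got)
	}
	return nil
}

func main() {
	// Illustrative values only.
	fmt.Println(validateHostname("mycluster-0001", "mycluster-0001\n"))
	fmt.Println(validateHostname("mycluster-0001", "ip-10-0-0-12\n"))
}
```

Under this assumption, an AWS VM that still carries its default hostname fails the check even though the cache entry is correct, which is the situation the rest of this description addresses.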

However, for long-running AWS clusters (created before host validation was introduced) or clusters created with an outdated version of roachprod, the hostname will still be the default AWS hostname, and any interaction with such a cluster will fail when using a recent roachprod version. To remedy this situation, this commit includes:

  • Better error reporting. When we attempt to run a command on an AWS cluster and host validation fails, we display a message to the user explaining that their hostnames may need fixing.

  • A new `fix-long-running-aws-hostnames` command. Once the user confirms that the cluster still exists (by running `roachprod list`), they can run this command to automatically set the hostnames to the expected value. This is a temporary workaround that should be removed once we no longer have clusters that would be affected by this issue.

This commit will be reverted once we no longer have clusters created with the default hostnames; this will be easier to achieve once we have an easy way for everyone to upgrade their roachprod (see #97311).

Epic: none

Release note: None

@renatolabs renatolabs requested a review from a team as a code owner March 15, 2023 15:42
@renatolabs renatolabs requested review from srosenberg and smg260 and removed request for a team March 15, 2023 15:42
@cockroach-teamcity
Member

This change is Reviewable

@renatolabs
Contributor Author

bors r=srosenberg

TFTR!

@craig
Contributor

craig bot commented Mar 16, 2023

Build succeeded:

@craig craig bot merged commit 683545c into cockroachdb:master Mar 16, 2023