Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachprod: only use cluster with positive lifetime #101434

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

aliher1911
Copy link
Contributor

Previously it was possible to use locally cached cluster which is already expired and was GCd. That could lead to operations affecting VMs that belong to different clusters that is now reusing old cluster IPs.
This PR adds an expiration check when reading cluster info from local file cache.

Epic: none

Release note: None

Previously it was possible to use locally cached cluster which is
already expired and was GCd. That could lead to operations affecting
VMs that belong to different clusters that is now reusing old cluster
IPs.
This PR adds an expiration check when reading cluster info from local
file cache.

Epic: none

Release note: None
@aliher1911 aliher1911 requested a review from a team as a code owner April 13, 2023 11:25
@aliher1911 aliher1911 requested review from srosenberg and smg260 and removed request for a team April 13, 2023 11:25
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@aliher1911
Copy link
Contributor Author

aliher1911 commented Apr 13, 2023

Not needed I guess. Because we should check cluster name.

@aliher1911 aliher1911 closed this Apr 13, 2023
@aliher1911
Copy link
Contributor Author

After testing existing fix to check cluster name on target node I got this:

~/g/s/g/c/cockroach ❯❯❯ roachprod start oleg-backup                                                                                                                                                                                                                                            master ✭ ◼
17:24:42 cockroach.go:184: oleg-backup: starting nodes
17:24:47 cluster_synced.go:2446: 0: failed to upload start script: : SSH_PROBLEM: exit status 255
(1) secondary error attachment
  | failed to upload start script: : SSH_PROBLEM: exit status 255
  | (1) attached stack trace
  |   -- stack trace:
  |   | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).startNode.func1
  |   | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:394
  |   | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).startNode
  |   | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:398
  |   | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Start.func1
  |   | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:192
  |   | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).ParallelE.func1.1.1
  |   | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:2498
  |   | github.com/cockroachdb/cockroach/pkg/roachprod/install.runWithMaybeRetry
  |   | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:162
  |   | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).ParallelE.func1.1
  |   | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:2498
  |   | runtime.goexit
  |   | 	GOROOT/src/runtime/asm_amd64.s:1594
  | Wraps: (2) failed to upload start script:
  | Wraps: (3) SSH_PROBLEM
  | Wraps: (4) forced error mark
  |   | "SSH error occurred with exit code 255"
  |   | github.com/cockroachdb/errors/withstack/*withstack.withStack::
  | Wraps: (5) exit status 255
  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) errors.SSH (4) *markers.withMark (5) *exec.ExitError
Wraps: (2) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).startNode.func1
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:394
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).startNode
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:398
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Start.func1
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:192
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).ParallelE.func1.1.1
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:2498
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.runWithMaybeRetry
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:162
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).ParallelE.func1.1
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:2498
  | runtime.goexit
  | 	GOROOT/src/runtime/asm_amd64.s:1594
Wraps: (3) failed to upload start script:
Wraps: (4) SSH_PROBLEM
Wraps: (5) forced error mark
  | "SSH error occurred with exit code 255"
  | github.com/cockroachdb/errors/withstack/*withstack.withStack::
Wraps: (6) exit status 255
Error types: (1) *secondary.withSecondaryError (2) *withstack.withStack (3) *errutil.withPrefix (4) errors.SSH (5) *markers.withMark (6) *exec.ExitError:
Error: parallel execution failure: failed to upload start script: : SSH_PROBLEM: exit status 255
(1) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).ParallelE
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:2547
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Parallel
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:2444
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Start
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:187
  | github.com/cockroachdb/cockroach/pkg/roachprod.Start
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/roachprod.go:695
  | main.glob..func9
  | 	main/pkg/cmd/roachprod/main.go:465
  | main.wrap.func1
  | 	main/pkg/cmd/roachprod/main.go:73
  | github.com/spf13/cobra.(*Command).execute
  | 	github.com/spf13/cobra/external/com_github_spf13_cobra/command.go:860
  | github.com/spf13/cobra.(*Command).ExecuteC
  | 	github.com/spf13/cobra/external/com_github_spf13_cobra/command.go:974
  | github.com/spf13/cobra.(*Command).Execute
  | 	github.com/spf13/cobra/external/com_github_spf13_cobra/command.go:902
  | main.main
  | 	main/pkg/cmd/roachprod/main.go:1226
  | [...repeated from below...]
Wraps: (2) parallel execution failure
Wraps: (3) secondary error attachment
  | failed to upload start script: : SSH_PROBLEM: exit status 255
  | (1) attached stack trace
  |   -- stack trace:
  |   | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).startNode.func1
  |   | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:394
  |   | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).startNode
  |   | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:398
  |   | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Start.func1
  |   | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:192
  |   | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).ParallelE.func1.1.1
  |   | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:2498
  |   | github.com/cockroachdb/cockroach/pkg/roachprod/install.runWithMaybeRetry
  |   | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:162
  |   | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).ParallelE.func1.1
  |   | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:2498
  |   | runtime.goexit
  |   | 	GOROOT/src/runtime/asm_amd64.s:1594
  | Wraps: (2) failed to upload start script:
  | Wraps: (3) SSH_PROBLEM
  | Wraps: (4) forced error mark
  |   | "SSH error occurred with exit code 255"
  |   | github.com/cockroachdb/errors/withstack/*withstack.withStack::
  | Wraps: (5) exit status 255
  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) errors.SSH (4) *markers.withMark (5) *exec.ExitError
Wraps: (4) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).startNode.func1
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:394
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).startNode
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:398
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Start.func1
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:192
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).ParallelE.func1.1.1
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:2498
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.runWithMaybeRetry
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:162
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).ParallelE.func1.1
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:2498
  | runtime.goexit
  | 	GOROOT/src/runtime/asm_amd64.s:1594
Wraps: (5) failed to upload start script:
Wraps: (6) SSH_PROBLEM
Wraps: (7) forced error mark
  | "SSH error occurred with exit code 255"
  | github.com/cockroachdb/errors/withstack/*withstack.withStack::
Wraps: (8) exit status 255
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *secondary.withSecondaryError (4) *withstack.withStack (5) *errutil.withPrefix (6) errors.SSH (7) *markers.withMark (8) *exec.ExitError

Which just states SSH error because I think IP is not not used for other cluster.
I think making explicit checks of lifetime is still helpful and user friendly.

We should also improve error messages that roachprod outputs.

@aliher1911
Copy link
Contributor Author

Reopening as general improvement request.

@aliher1911 aliher1911 reopened this Apr 14, 2023
@aliher1911 aliher1911 self-assigned this Apr 14, 2023
@renatolabs
Copy link
Contributor

Checking the lifetime is a possibility, but IMO a more long term solution would be to surface authentication errors more explicitly (and don't retry them), instead of the generic SSH_PROBLEM exit stats 255 that we have at the moment.

There are some thoughts in #100875.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants