This repository has been archived by the owner on Feb 9, 2024. It is now read-only.

(5.5) Background plan resume / plan tail #1899

Merged: 4 commits merged into version/5.5.x from roman/5.5/resume on Jul 23, 2020

Conversation

r0mant (Contributor) commented Jul 22, 2020

Description

This pull request implements a couple of resiliency improvements for the plan resume command. The main issue was that users would often launch it and then lose their SSH connection (especially relevant when launching from our web UI's terminal), which would effectively terminate the resume since it executed in the foreground.

  • Update gravity plan resume to launch the resume from a one-shot systemd service by default (a rough sketch of this flow follows the list below).
    • It also accepts a --block flag to get the old behavior of running in the foreground.
  • Add gravity plan --tail to monitor plan progress.
    • This is useful not just for resume but for watching the plan in general.
    • Tail exits with 0 if the plan completes successfully, and non-0 if a phase completes with an error.
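
A simplified sketch of the new default flow (not the actual code in this PR; executePlanLocally is a hypothetical stand-in, while launchOneshotService and the appended --debug/--block args come from the change itself):

func resumeOperation(operationID string, block bool) error {
	if block {
		// Old behavior: execute the remaining plan phases in the foreground.
		return executePlanLocally(operationID) // hypothetical helper
	}
	// New default: re-invoke the same command with --block inside a
	// one-shot systemd unit so a dropped SSH session doesn't kill it.
	args := append(os.Args[1:], "--debug", "--block")
	if err := launchOneshotService("gravity-resume.service", args); err != nil {
		return trace.Wrap(err)
	}
	fmt.Printf("To monitor the operation progress:\n\n  sudo gravity plan --operation-id=%v --tail\n", operationID)
	return nil
}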

Type of change

  • New feature (non-breaking change which adds functionality)

Linked tickets and other PRs

TODOs

  • Self-review the change
  • Write tests
  • Perform manual testing
  • Address review feedback

Testing done

Resume in background (now default)

ubuntu@node-1:~/upgrade$ sudo ./upload
Wed Jul 22 01:10:36 UTC	Importing cluster image telekube v5.5.50-dev.8
Wed Jul 22 01:11:15 UTC	Synchronizing application with Docker registry 192.168.99.102:5000
Wed Jul 22 01:11:35 UTC	Verifying cluster health
Wed Jul 22 01:11:35 UTC	Cluster image has been uploaded
ubuntu@node-1:~/upgrade$ sudo ./gravity upgrade --manual
Wed Jul 22 01:12:04 UTC	Upgrading cluster from 5.5.40 to 5.5.50-dev.8
Wed Jul 22 01:12:04 UTC	Deploying agents on cluster nodes
Wed Jul 22 01:12:08 UTC	Deployed agent on node-1 (192.168.99.102)
The operation has been created in manual mode.

See https://gravitational.com/gravity/docs/cluster/#managing-an-ongoing-operation for details on working with operation plan.
ubuntu@node-1:~/upgrade$ sudo ./gravity plan resume --help
usage: gravity plan resume [<flags>]

Resume last aborted operation

Flags:
      --help                 Show context-sensitive help (also try --help-long and --help-man).
      --debug                Enable debug mode
  -q, --quiet                Suppress any extra output to stdout
      --insecure             Skip TLS verification
      --state-dir=STATE-DIR  Directory for local state
      --log-file="/var/log/gravity-install.log"
                             log file with diagnostic information
      --operation-id=OPERATION-ID
                             ID of the active operation. If not specified, the last operation will be used
      --block                Launch plan resume in foreground instead of a systemd unit

ubuntu@node-1:~/upgrade$ sudo ./gravity plan resume
Wed Jul 22 01:12:23 UTC	Starting gravity-resume.service service
Wed Jul 22 01:12:23 UTC	Service gravity-resume.service has been launched.

To monitor the operation progress:

  sudo gravity plan --operation-id=d6d43cae-edcd-49e7-b056-cb1a9f50ffed --tail

To monitor the service logs:

  sudo journalctl -u gravity-resume.service -f

ubuntu@node-1:~/upgrade$ sudo ./gravity plan --operation-id=d6d43cae-edcd-49e7-b056-cb1a9f50ffed --tail
Wed Jul 22 01:12:25 UTC	[  1/ 26] Phase /init/node-1 is completed
Wed Jul 22 01:12:26 UTC	[  2/ 26] Phase /checks is completed
Wed Jul 22 01:12:27 UTC	[  3/ 26] Phase /pre-update is in_progress
Wed Jul 22 01:12:33 UTC	[  3/ 26] Phase /pre-update is completed
Wed Jul 22 01:12:33 UTC	[  4/ 26] Phase /bootstrap/node-1 is in_progress
...
Wed Jul 22 01:18:07 UTC	[ 26/ 26] Phase /gc/node-1 is in_progress
Wed Jul 22 01:18:08 UTC	[ 26/ 26] Phase /gc/node-1 is completed
Wed Jul 22 01:18:09 UTC	Operation plan is completed

Resume blocking (old behavior)

ubuntu@node-1:~/upgrade$ sudo ./upload
Wed Jul 22 01:31:33 UTC	Importing cluster image telekube v5.5.50-dev.8
Wed Jul 22 01:32:18 UTC	Synchronizing application with Docker registry 192.168.99.102:5000
Wed Jul 22 01:32:36 UTC	Verifying cluster health
Wed Jul 22 01:32:36 UTC	Cluster image has been uploaded
ubuntu@node-1:~/upgrade$ sudo ./gravity upgrade --manual
Wed Jul 22 01:32:41 UTC	Upgrading cluster from 5.5.40 to 5.5.50-dev.8
Wed Jul 22 01:32:41 UTC	Deploying agents on cluster nodes
Wed Jul 22 01:32:45 UTC	Deployed agent on node-1 (192.168.99.102)
The operation has been created in manual mode.

See https://gravitational.com/gravity/docs/cluster/#managing-an-ongoing-operation for details on working with operation plan.
ubuntu@node-1:~/upgrade$ sudo ./gravity plan resume --block
Wed Jul 22 01:32:52 UTC	Executing "/init/node-1" locally
Wed Jul 22 01:32:53 UTC	initializing the operation
Wed Jul 22 01:32:53 UTC	Executing "/checks" locally
Wed Jul 22 01:32:55 UTC	Executing "/pre-update" locally
Wed Jul 22 01:33:01 UTC	Executing "/bootstrap/node-1" locally
	Still executing "/bootstrap/node-1" locally (10 seconds elapsed)
...
Wed Jul 22 01:37:57 UTC	Executing "/migration/labels" locally
Wed Jul 22 01:37:58 UTC	Executing "/app/telekube" locally
Wed Jul 22 01:37:59 UTC	Executing "/gc/node-1" locally
Wed Jul 22 01:38:00 UTC	operation(update(b95dcc7c-29c6-48f8-8f4e-a47a14abd018), cluster=test, created=2020-07-22 01:32) finished in 5 minutes

@r0mant r0mant requested a review from a team July 22, 2020 01:50
@r0mant r0mant self-assigned this Jul 22, 2020
lib/fsm/follow.go (resolved review threads)
}
diff, err := DiffPlan(*previousPlan, *newPlan)
if err != nil {
logrus.WithError(err).Error("Failed to diff plans.")
Contributor:

Hmm, I wonder what this error would mean to the user? Also, what could it mean to the loop itself - will it be able to terminate if this continues? Shouldn't diffing always succeed (w/o looking at the implementation of DiffPlan) - i.e. either return a diff or no diff?

Contributor (author):

This is a little tricky: yes, the diff should always succeed as long as getPlan() always returns the same plan (which it should). Otherwise it would likely mean a programming error on our part, so there isn't much a user can do.

lib/fsm/follow_test.go, lib/fsm/testhelpers.go, lib/fsm/utils.go (resolved review threads)
return trace.Wrap(err)
}
// Make sure to launch the unit command with the --block flag.
args := append(os.Args[1:], "--debug", "--block")
Contributor:

I think this is susceptible to the same issue as fixed here - something to watch for and optionally rewrite with changes from #1895.

Contributor (author):

I don't think it's susceptible to the same issue. In the cases you're referring to, we had to either replace flags (remove/add) or update their values. In this case we just need to append the --block flag. Worst case, it'll end up with two --debug flags, which is fine. So unless I'm missing something, I'd like to keep this simple.


// launchOneshotService launches the specified command as a one-shot systemd
// service with the specified name.
func launchOneshotService(name string, args []string) error {
Contributor (author):

Yeah, we have a couple of similar methods that I originally tried to reuse for this use case, but they differ in a couple of ways. This method, for example, makes sure no other instance of the service is running, to prevent the resume from being launched multiple times simultaneously.
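
For illustration, a minimal sketch of that idea, assuming systemctl and systemd-run are available (this is not the PR's actual implementation, which may construct and start the unit differently):

func launchOneshotServiceSketch(name string, args []string) error {
	// Refuse to start if a unit with the same name is already active, so the
	// resume can't be launched multiple times simultaneously.
	if err := exec.Command("systemctl", "is-active", "--quiet", name).Run(); err == nil {
		return fmt.Errorf("service %v is already running", name)
	}
	// Run the command as a transient one-shot unit; --no-block returns
	// control immediately instead of waiting for the command to finish.
	runArgs := append([]string{"--unit=" + name, "--service-type=oneshot", "--no-block"}, args...)
	out, err := exec.Command("systemd-run", runArgs...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("failed to launch %v: %v (%s)", name, err, out)
	}
	return nil
}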

if err != nil {
return trace.Wrap(err)
func displayExpandOperationPlan(localEnv *localenv.LocalEnvironment, environ LocalEnvironmentFactory, opKey ops.SiteOperationKey, opts displayPlanOptions) error {
return outputOrFollowPlan(localEnv, getExpandOperationPlanFunc(environ, opKey), opts)
Contributor:

I guess you could move getXXXPlanFunc inside the respective displayXXXPlan variant since they're not used elsewhere.

Contributor (author):

That's how I did it originally, but it was a bit messy. I also think having those methods in global scope might be a bit more convenient if something or someone ever needs to use them.

lib/fsm/fsm.go, lib/fsm/follow.go (resolved review threads)
@@ -139,6 +140,64 @@ func ResolvePlan(plan storage.OperationPlan, changelog storage.PlanChangelog) *s
return &plan
}

// DiffPlan returns the difference between the previous and the next plans in the
// form of a changelog.
func DiffPlan(prevPlan *storage.OperationPlan, nextPlan storage.OperationPlan) (diff []storage.PlanChange, err error) {
Contributor:

nit: pass both plans as pointers. Or is there a reason for passing nextPlan as a struct?

Contributor (author):

Yeah, it's actually on purpose: it indicates that prevPlan is optional (nil can be passed), while nextPlan must always be provided.
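
To illustrate that convention with simplified stand-in types (not the real storage package), a nil previous plan just means every phase state in the next plan counts as a change:

// phase and plan are simplified stand-ins for the storage types, used only
// for this illustration.
type phase struct {
	ID    string
	State string
}

type plan struct {
	Phases []phase
}

// diffPlan returns the phases whose state changed between prev and next.
// prev may be nil, e.g. on the first iteration of a follow loop.
func diffPlan(prev *plan, next plan) (changed []phase) {
	previousState := map[string]string{}
	if prev != nil {
		for _, p := range prev.Phases {
			previousState[p.ID] = p.State
		}
	}
	for _, p := range next.Phases {
		if previousState[p.ID] != p.State {
			changed = append(changed, p)
		}
	}
	return changed
}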

lib/fsm/follow.go, lib/fsm/follow_test.go (several resolved review threads)
@@ -202,6 +202,10 @@ func outputOrFollowPlan(localEnv *localenv.LocalEnvironment, getPlan fsm.GetPlan
}

func followPlan(localEnv *localenv.LocalEnvironment, getPlan fsm.GetPlanFunc) error {
plan, err := getPlan()
Contributor:

nit: this now makes the follow command half-resilient: before, it could tolerate an initial transient failure, but now it will fail immediately, forcing the user to retry. Not sure which is better, though.
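
Continuing the simplified sketch from the DiffPlan discussion above (this is not the real lib/fsm/follow.go, the state names are assumed, and it retries transient fetch errors on every iteration, which is one of the two behaviors weighed in the nit): the follow loop polls the plan, prints phase-state changes, and exits once the plan completes or a phase fails, which is what lets gravity plan --tail return 0 on success and non-0 on error.

func followPlanSketch(getPlan func() (*plan, error), interval time.Duration) error {
	var previous *plan
	for {
		next, err := getPlan()
		if err != nil {
			// Tolerate transient failures and retry on the next tick.
			log.Printf("failed to fetch plan: %v", err)
			time.Sleep(interval)
			continue
		}
		// Print only the phases whose state changed since the last poll.
		for _, change := range diffPlan(previous, *next) {
			fmt.Printf("Phase %v is %v\n", change.ID, change.State)
			if change.State == "failed" {
				return fmt.Errorf("phase %v failed", change.ID)
			}
		}
		if allCompleted(next.Phases) {
			fmt.Println("Operation plan is completed")
			return nil
		}
		previous = next
		time.Sleep(interval)
	}
}

func allCompleted(phases []phase) bool {
	for _, p := range phases {
		if p.State != "completed" {
			return false
		}
	}
	return true
}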

@r0mant r0mant merged commit 5a8b0d8 into version/5.5.x Jul 23, 2020
@r0mant r0mant deleted the roman/5.5/resume branch July 23, 2020 16:00
r0mant added 3 commits that referenced this pull request on Jul 28, 2020