
Run script before workspace stop #4677

Closed · Tracked by #5318

bpmct opened this issue Oct 21, 2022 · 9 comments · Fixed by #6139
bpmct commented Oct 21, 2022

The Coder agent has a startup_script to run arbitrary actions when it is started.

A shutdown_script could be added to run clean-up operations, collect stats, etc. before Coder actually changes the workspace state. A configurable timeout would ensure the stop still proceeds even if the script fails or hangs.
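A minimal sketch of what this could look like in a template, assuming hypothetical shutdown_script and shutdown_script_timeout arguments on the coder_agent resource (neither exists today; the names are illustrative only):

resource "coder_agent" "main" {
  os   = "linux"
  arch = "amd64"

  # Existing behavior: runs when the workspace starts.
  startup_script = "code-server --port 13337 &"

  # Hypothetical: runs before Coder changes the workspace state.
  shutdown_script = <<-EOT
    ./collect-stats.sh
    sync
  EOT

  # Hypothetical: proceed with the stop even if the script hangs (seconds).
  shutdown_script_timeout = 15
}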

@bpmct bpmct changed the title Runs script before workspace stop Run script before workspace stop Oct 25, 2022
@mtojek mtojek self-assigned this Nov 24, 2022
mtojek commented Nov 24, 2022

Looking at it

mtojek commented Nov 30, 2022

Hi @mafredri! I spent some time researching possible options for implementing this feature. For those unfamiliar with the challenge: the solution submitted so far won't solve the problem because of the way Terraform operates.

Proposed solution:

The idea is to introduce a new feature to remotely execute predefined commands (startup and shutdown scripts) and to let provisioners report the exact stage in the lifecycle (before and after terraform apply).

In the future, we can allow for custom shell commands.

ProvisionerDaemon checks in twice to notify ProvisionerServer about lifecycle moments: pre-provision and post-provision.

Starting a workspace:

On pre-provision, ProvisionerServer will clear all queued commands for the specific agent.

On post-provision, ProvisionerServer will insert an entry into the workspace_agents_commands table to order agents to call the "startup_script", one row per agent. An agent periodically checks the healthcheck endpoint and fetches incoming commands. When it receives a command, it executes it and confirms its status with the Coderd API (POST .../commands).

Stopping a workspace:

On pre-provision, ProvisionerServer will insert an entry into the workspace_agents_commands table to order agents to call the "shutdown_script".
Follow-up actions are the same as in "Starting a workspace".

On post-provision, ProvisionerServer will clear all queued commands for the specific agent.

References:

service ProvisionerDaemon {
    rpc AcquireJob(Empty) returns (AcquiredJob);
    rpc CommitQuota(CommitQuotaRequest) returns (CommitQuotaResponse);
    rpc UpdateJob(UpdateJobRequest) returns (UpdateJobResponse);

    // OnPreProvision notifies the ProvisionerServer that the daemon is ready for starting the provisioning (tf apply).
    // ProvisionerServer will respond with OK status, if all actions on the agents' side are done, for instance, shutdown_script has been executed.
    //
    // Daemon should keep repeating the call until receiving the OK status.
    rpc OnPreProvision(OnPreProvisionRequest) returns (OnPreProvisionResponse);

    // OnPostProvisionDone notifies the ProvisionerServer that the daemon finished the provisioning (tf apply).
    // ProvisionerServer will respond with OK status, if all actions on the agents' side are done, for instance, startup_script has been executed.
    //
    // Daemon should keep repeating the call until receiving the OK status.
    rpc OnPostProvision(OnPostProvisionRequest) returns (OnPostProvisionResponse);

    rpc FailJob(FailedJob) returns (Empty);

    // CompleteJob should return "bad state" response if the lifecycle order is not correct: PreProvision - PostProvision - Complete.
    rpc CompleteJob(CompletedJob) returns (Empty);
}

Database:

Table workspace_agents_commands:

command_id
workspace_agent_id
timestamp
command (startup_script, shutdown_script)
command_status (ok, error, <empty>)

Coderd API:

  1. Extend /workspaceagents/<agent>/app-health to return queued commands. Alternatively, we can add a new endpoint GET /workspaceagents/<agent>/commands
  2. POST /workspaceagents/<agent>/commands/<command_id>: report the command progress

We can discuss it here or offline. Let me know what you think about the concept.

mafredri commented Dec 1, 2022

@mtojek I like your suggestion for how to deal with startup/stop scripts (I could imagine this format being extended to support cron/scheduled commands as well). I do, however, worry about making start/stop scripts part of the provisioning stage.

Consider that there are a limited number of provisioners and we're creating jobs for workspaces that could take 30 minutes to start up (startup script preparation). That's a long time to occupy a provisioning slot.

In this sense, it might be best to keep the provisioning part as short as possible (as it is currently); maybe we just introduce new states for the workspace(s)/agents.

Let's consider a workspace with two agents: one has a startup script that takes 30 minutes, the other takes only 5 seconds. With pre/post provision, I guess this would be represented as the workspace provisioning for 30 minutes, with the 5-second agent not being ready until then? It's not incorrect, but a "partially started up" state feels more appropriate for the workspace in this case.

(I focused on startup here, but the same applies for shutdown.)

In summary, it might be best to complete the provision quickly, and handle workspace/agent status differently. I don't have a concrete suggestion for how to accomplish this, however.

Maybe we could represent this with new workspace/agent states:

  • Agent
    • Add starting / stopping states (normal connections wait/block in this state, but something like ssh --debug can bypass)
  • Workspace
    • Current starting / stopping becomes provisioning / deprovisioning
    • "New" starting / stopping whilst any one agent is still starting/executing the script
      • E.g. after agent(s) have stopped, we move to deprovisioning

I haven't fully thought this through, though. There may be better ways to accomplish this too. (I would've liked for provisioning / deprovisioning to be called creating / deleting, but this conflicts with the act of creating and deleting workspaces. 😅)

mtojek commented Dec 1, 2022

Consider that there are a limited number of provisioners and we're creating jobs for workspaces that could take 30 minutes to start up (startup script preparation). That's a long time to occupy a provisioning slot.

This is a valid concern, but it rather suggests making provisioners asynchronous. That will most likely increase the complexity of the solution, but it won't tie up individual provisioners.

It's not incorrect, but it feels like a "partially started up" state is more appropriate for the workspace in this case.

Well, it depends on the use case. It's similar to having a Dockerfile HEALTHCHECK defined: without one, the container is simply considered ready to use, but with one you can defer the "healthy" state until the real health condition is verified.

In summary, it might be best to complete the provision quickly, and handle workspace/agent status differently. I don't have a concrete suggestion for how to accomplish this, however.

I understand your concerns. I'm afraid this means we will still need a wrapper around provisioners that watches environments and reports health. Alternatively, we can leave it to agents to just notify coderd when they are ready to be used by customers, which is roughly what we have now, only without a more accurate health report.

Let's say we accept this assumption and just depend on the agent to report the "100% healthy" condition. We will still need a mechanism/logic/thread that calls the shutdown_script on the agent side and waits until it has been executed.

Maybe it's the right moment to distinguish two kinds of provisioners:

  1. Infrastructure - covered by Terraform.
  2. Dev Environment - covered by ?, responsible for installing software on top of the infra, with a well-defined lifecycle: Starting, Started, Running, Stopping, Stopped. Unfortunately, this means that agents will start hitting the Coderd API more actively.

Summing up:

I see two options:

  1. Elongate provisioning time by including the wait time needed to fully bootstrap dev environments.
  2. Introduce a new virtual provisioner type that operates on the agent level, just to prepare the software layer. I'm pretty sure that's how AWS CodeDeploy works internally, or its internal ancestor, Apollo :)

Let me know your thoughts.

mtojek commented Dec 1, 2022

Hey @kylecarbs @bpmct! It would be great to hear your take on these design concepts so that we don't end up with a half-year project :)

bpmct commented Dec 1, 2022

I'm not the best person to comment on how we want to refactor our provisioner, but I had a few thoughts while reading this thread.

Infrastructure - covered by Terraform.
Dev Environment - covered by ?, responsible for installing software on top of infra, with a well-defined lifecycle: Starting, Started, Running, Stopping, Stopped. Unfortunately, it means that agents will start hitting Coderd API more actively.

This sounds nice and may solve a similar problem. I've encountered a few scenarios where I'd rather not use Terraform for workspace state changes. For example, we have to use a user-data hack to stop AWS instances. Instead of running a terraform apply, I'd rather have Coder run aws instance <start/stop> or similar.

I was also working on a vcluster template where I'd rather have Coder run vcluster <pause/resume> on state changes. Trying to do this via Terraform is difficult because the Kubernetes provider always expects a valid address (which fails when the cluster is paused).

On the other hand, having Terraform cover both "infrastructure + dev environment" works well in some scenarios, such as the Kubernetes and GCP templates. Using count to simply remove the compute but keep the PVC (which always has count = 1) is quite elegant, as in the sketch below. So I imagine we'd want to allow Terraform to cover both cases.
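For reference, a minimal sketch of that count pattern from a typical Kubernetes template (resource names and the image are illustrative):

data "coder_workspace" "me" {}

# The compute is destroyed on stop and recreated on start:
# start_count is 1 when the workspace is started, 0 when stopped.
resource "kubernetes_pod" "workspace" {
  count = data.coder_workspace.me.start_count
  metadata {
    name = "coder-workspace"
  }
  spec {
    container {
      name  = "dev"
      image = "codercom/enterprise-base:ubuntu"
    }
  }
}

# The volume always exists, so home-directory data survives restarts.
resource "kubernetes_persistent_volume_claim" "home" {
  count = 1
  metadata {
    name = "coder-home"
  }
  spec {
    access_modes = ["ReadWriteOnce"]
    resources {
      requests = {
        storage = "10Gi"
      }
    }
  }
}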

--

I always imagined the agent would run the shutdown script prior to any terraform apply operations. Is this the desired behavior for both of the options you proposed @mtojek?

e.g., these things happen synchronously:

user clicks "stop" ->
coder agent runs shutdown script ->
the script exits or times out (e.g. 15 seconds) ->
terraform apply runs

If we were to allow users to completely separate the concepts of infrastructure and the dev environment, we could also support:

user clicks "stop" ->
coder agent runs shutdown script ->
the script exits or times out (e.g. 15 seconds) ->
terraform apply runs || custom provisioner runs || nothing happens

Were you also thinking about conditionally running the infra provisioner based on the script?

user clicks "stop" ->
coder agent runs shutdown script ->
the script exits or times out (e.g. 15 seconds) ->
if exit 0:
  terraform apply || other provisioner || do nothing
if exit 1 or time out:
  error state

mtojek commented Dec 2, 2022

For example, we have to use a user-data hack (hashicorp/terraform-provider-aws#22 (comment)) to stop AWS instances. Instead of running a terraform apply, I'd rather have Coder run aws instance <start/stop> or similar.

Yes, it sounds like a more natural way to solve this problem 😄

Using count to simply remove the compute but keep the PVC (which always has count = 1) is quite elegant

From an architecture design perspective, it seems to me like a hack, so in this case a lifecycle similar to CodeDeploy's would be more organized and flexible. Users could even run some smoke tests to verify that the workspace is ready.

I always imagined the agent would run the shutdown script prior to any terraform apply operations. Is this the desired behavior for both of the options you proposed @mtojek?

Yes, it's a hard requirement. To be honest, it depends on deciding which entity should trigger it and how to make sure it has been executed. I can prepare a more detailed design doc if you want.

bpmct mentioned this issue Dec 6, 2022
kconley-sq commented:
I am interested in using this shutdown_script feature to make the Coder agent run a command like sync and/or fsfreeze to ensure that any pending data writes held in RAM are written out to storage before Terraform runs to change the state of the workspace's storage volumes.

I am using AWS EBS snapshots in my workspace template to persist data in Coder workspaces across workspace restarts. If data in an EC2 instance's RAM has not been written out to storage before Coder runs Terraform, Terraform may create an EBS snapshot that is missing some of the workspace's data that should be persisted, and that unwritten data is lost when the EBS volume is destroyed on workspace stop.

Unfortunately, keeping the EBS volume around across workspace restarts to avoid that issue would be too expensive for my use case. 😞

As a workaround today, I am setting stop_instance_before_detaching = true on the EBS volume attachment in my workspace template, which stops the EC2 instance before the EBS snapshot of the volume is taken. But similar to what @bpmct mentioned, I feel this use case would be better handled by a shutdown_script instead of relying on the implicit behavior of a stopping EC2 instance syncing RAM to storage as part of Linux shutdown.
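With the proposed feature, that could look something like the sketch below (shutdown_script is the hypothetical argument discussed earlier in this thread; the mount point is illustrative):

resource "coder_agent" "main" {
  os   = "linux"
  arch = "amd64"

  # Hypothetical: flush pending writes before Terraform takes the
  # EBS snapshot on workspace stop.
  shutdown_script = <<-EOT
    # Write dirty pages in RAM out to storage.
    sync
    # Freeze the filesystem so no new writes land mid-snapshot.
    fsfreeze --freeze /home/coder
  EOT
}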

mafredri added a commit that referenced this issue Jan 30, 2023
This change allows the agent to handle common shutdown signals like
interrupt, hangup and terminate and initiate a graceful shutdown.

As long as terraform providers initiate graceful shutdowns via the
aforementioned signals, things like SSH connections will be closed
immediately on shutdown instead of being left hanging/timing out due to
the agent being abruptly killed.

Refs: #4677, #5901
@ammario ammario assigned mafredri and unassigned mtojek Feb 6, 2023
mafredri commented:
A lot of topics related to shutdown are touched upon in this issue, so I've decided to break it up into smaller parts.

This issue will be considered fixed by #5914, and future improvements to graceful shutdown will be tracked in #6180.
