Run script before workspace stop #4677
Looking at it |
Hi @mafredri! I spent some time researching possible options to implement the feature. For people unfamiliar with the challenge, the submitted solution won't solve the problem due to the way Terraform operates. Proposed solution: The idea is to introduce a new feature to remotely execute predefined commands - startup and shutdown scripts, and let provisioners report the exact stage in the lifecycle (before and after In the future, we can allow for custom shell commands.
Starting a workspace:
On pre-provision,
On post-provision,

Stopping a workspace:
On pre-provision,
On post-provision,

References:

service ProvisionerDaemon {
rpc AcquireJob(Empty) returns (AcquiredJob);
rpc CommitQuota(CommitQuotaRequest) returns (CommitQuotaResponse);
rpc UpdateJob(UpdateJobRequest) returns (UpdateJobResponse);
// OnPreProvision notifies the ProvisionerServer that the daemon is ready for starting the provisioning (tf apply).
// ProvisionerServer will respond with OK status, if all actions on the agents' side are done, for instance, shutdown_script has been executed.
//
// Daemon should keep repeating the call until receiving the OK status.
rpc OnPreProvision(OnPreProvisionRequest) returns (OnPreProvisionResponse);
// OnPostProvision notifies the ProvisionerServer that the daemon has finished the provisioning (tf apply).
// ProvisionerServer will respond with OK status, if all actions on the agents' side are done, for instance, startup_script has been executed.
//
// Daemon should keep repeating the call until receiving the OK status.
rpc OnPostProvision(OnPostProvisionRequest) returns (OnPostProvisionResponse);
rpc FailJob(FailedJob) returns (Empty);
// CompleteJob should return "bad state" response if the lifecycle order is not correct: PreProvision - PostProvision - Complete.
rpc CompleteJob(CompletedJob) returns (Empty);
}

Database: Table
Coderd API:
We can discuss it here or offline. Let me know what you think about the concept. |
@mtojek I like your suggestion for how to deal with startup/stop scripts (I could imagine this format being extended to support cron/scheduled commands as well). I do, however, worry about making start/stop scripts a part of the provisioning stage. There are a limited number of provisioners, and we would be creating jobs for workspaces that could take 30 minutes to start up (startup script preparation). That's a long time to occupy a provisioning slot. In this sense, it might be best to keep the provisioning part as short as possible (as it is currently) and instead introduce new states for the workspace(s) / agents.

Let's say we have a workspace with two agents: one has a startup script that takes 30 minutes, the other a script that takes only 5 seconds. With pre/post provision, I guess this would be represented as the workspace provisioning for 30 minutes, with the 5-second agent not being ready until then? It's not incorrect, but a "partially started up" state feels more appropriate for the workspace in this case. (I focused on startup here, but the same applies to shutdown.)

In summary, it might be best to complete the provision quickly and handle workspace/agent status differently. I don't have a concrete suggestion for how to accomplish this, however. Maybe we could represent this with new workspace/agent states:
I haven't fully thought this through, though. There may be better ways to accomplish this too. (I would've liked for provisioning / deprovisioning to be called creating / deleting, but this conflicts with the act of creating and deleting workspaces. 😅) |
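The "partially started up" idea from the comment above could look roughly like the following Go sketch, which derives a workspace status from its agents' statuses. The state names (`starting`, `ready`, `partially started`) and the function `deriveWorkspaceStatus` are assumptions made up for this illustration; the thread does not settle on concrete names.

```go
package main

import "fmt"

// AgentStatus and WorkspaceStatus use hypothetical state names.
type AgentStatus string

const (
	AgentStarting AgentStatus = "starting" // startup script still running
	AgentReady    AgentStatus = "ready"    // startup script finished
)

type WorkspaceStatus string

const (
	WorkspaceStarting         WorkspaceStatus = "starting"
	WorkspacePartiallyStarted WorkspaceStatus = "partially started"
	WorkspaceReady            WorkspaceStatus = "ready"
)

// deriveWorkspaceStatus summarizes agent statuses without holding a
// provisioner slot: ready only if every agent is ready, partially started
// if at least one is, and starting otherwise.
func deriveWorkspaceStatus(agents []AgentStatus) WorkspaceStatus {
	ready := 0
	for _, a := range agents {
		if a == AgentReady {
			ready++
		}
	}
	switch {
	case len(agents) > 0 && ready == len(agents):
		return WorkspaceReady
	case ready > 0:
		return WorkspacePartiallyStarted
	default:
		return WorkspaceStarting
	}
}

func main() {
	// The 5-second agent is done, the 30-minute agent is still starting.
	agents := []AgentStatus{AgentReady, AgentStarting}
	fmt.Println(deriveWorkspaceStatus(agents))
}
```

With this shape, the fast agent is usable as soon as it reports ready, and the workspace only shows as fully ready once the slow agent catches up.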
This is a valid concern, but it rather suggests an asynchronous nature of provisioners. Most likely it will increase the complexity of the solution, but won't lock single provisioners.
Well, it depends on the use-case. It's a similar situation to having the Dockerfile healthcheck defined. Without the healthcheck it's ready to use, but you may defer the "healthy" state if you can verify the real health condition.
I understand your concerns. I'm afraid it means that we will still need a wrapper around provisioners that will watch environments and report health. Alternatively, we can leave it to agents to just notify coderd when they are ready to be used by customers, which looks like what we have now, without a more accurate health report. Let's say we accept this assumption and just depend on an agent to report the "100% healthy" condition. We will still need a mechanism/logic/thread that will call the shutdown_script on the agent side and wait until it has been executed. Maybe it's the right moment to distinguish two kinds of provisioners:
Summing up: I see two options:
Let me know your thoughts. |
Hey @kylecarbs @bpmct! It would be great to hear your take on these design concepts so that we don't end up with a half-year project :) |
I'm not the best to comment on how we want to refactor our provisioner but I had a few thoughts as I'm reading this thread.
This sounds nice and may solve a similar problem. I've encountered a few scenarios where I'd rather not use Terraform for workspace state changes. For example, we have to use a user-data hack to stop AWS instances. Instead of running a I was also working on a vcluster template where I'd rather have Coder run On the other hand, having Terraform cover both "infrastructure + dev environment" works well in some scenarios, such as Kubernetes and GCP templates. Using -- I always imagined the agent would run the shutdown script prior to any e.g. these things happen synchronously
If we were to allow users to completely separate the concept of infrastructure and the dev environment, we could also support
were you also thinking about conditionally running the infra provisioner based on the script?
|
Yes, it sounds like a more natural way to solve this problem 😄
From the architecture design perspective, it seems to me like a hack, so in this case, a lifecycle similar to CodeDeploy's would be more organized and flexible. Users could even run some smoke tests to verify that the workspace is ready.
Yes, it's a hard requirement. To be honest, it depends on deciding which entity should trigger it and how to make sure that it has been executed. I can prepare a more detailed design doc if you want. |
I am interested in using this I am using AWS EBS snapshots in my workspace template to persist data in Coder workspaces across workspace restarts, so if data in an EC2 instance's RAM has not been written out to storage before Coder runs Terraform, then Terraform may create an EBS snapshot that does not contain all of the workspace's data that should be persisted which causes that unwritten data to be lost when the EBS volume is destroyed on workspace stop. Unfortunately, keeping the EBS volume around across workspace restarts to avoid that issue would be too expensive for my use case. 😞 As a workaround today I am setting |
This change allows the agent to handle common shutdown signals like interrupt, hangup and terminate and initiate a graceful shutdown. As long as terraform providers initiate graceful shutdowns via the aforementioned signals, things like SSH connections will be closed immediately on shutdown instead of being left hanging/timing out due to the agent being abruptly killed. Refs: #4677, #5901
The Coder agent has a startup_script to run arbitrary actions when it is started.
A shutdown_script could be added to run clean-up operations, collect stats, etc. before Coder actually changes the workspace state. A configurable timeout could be added to ensure a shutdown still occurs even if the script fails/hangs.